This guide shows how to move data from Elasticsearch to Vespa. By the end, you will have exported documents from Elasticsearch, generated a deployable Vespa application package, and tested it with documents and queries.
To get started, sign up to get an endpoint to deploy to. Set the tenant name from the signup:
$ export TENANT_NAME=vespa-team # Replace with your tenant name
Alternatively, test with a local deployment.
This section sets up an index with 1000 sample documents using getting-started-index. Skip this part if you already have an index. Start Elasticsearch and wait for it to respond:
$ docker network create --driver bridge esnet
$ docker run -d --rm --name esnode --network esnet -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.10.2
$ while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:9200)" != "200" ]]; do sleep 1; echo 'waiting ...'; done
Download test data, and feed it to the Elasticsearch instance:
$ curl 'https://raw.githubusercontent.com/elastic/elasticsearch/7.10/docs/src/test/resources/accounts.json' \
  > accounts.json
$ curl -H "Content-Type:application/json" --data-binary @accounts.json 'localhost:9200/bank/_bulk?pretty&refresh'
Verify that the index has 1000 documents:
$ curl 'localhost:9200/_cat/indices?v'
This guide uses ElasticDump to export the index contents and the index mapping. Export the documents and mappings, then delete the Docker network and the Elasticsearch container:
$ docker run --rm --name esdump --network esnet -v "$PWD":/dump -w /dump elasticdump/elasticsearch-dump \
  --input=http://esnode:9200/bank --output=bank_data.json --type=data
$ docker run --rm --name esdump --network esnet -v "$PWD":/dump -w /dump elasticdump/elasticsearch-dump \
  --input=http://esnode:9200/bank --output=bank_mapping.json --type=mapping
$ docker rm -f esnode && docker network remove esnet
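Before converting, it can help to confirm that the export contains the expected number of records. The following is a minimal sketch, assuming the data file holds one JSON record per line with an `_index` key, as in the record shown later in this guide. The function name `count_exported` is hypothetical and not part of ElasticDump or Vespa.

```python
import json


def count_exported(path: str, index: str = "bank") -> int:
    """Count records for one index in a line-delimited JSON export file."""
    count = 0
    with open(path, encoding="utf-8") as export_file:
        for line in export_file:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("_index") == index:
                count += 1
    return count
```

For the sample data in this guide, `count_exported("bank_data.json")` should match the document count reported by Elasticsearch.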
ES_Vespa_parser.py is provided for converting Elasticsearch data and index mappings to Vespa data and configuration. It is a basic script with minimal error checking, designed for a simple export. Modify it as needed for your application. Generate Vespa documents and configuration:
$ curl 'https://raw.githubusercontent.com/vespa-engine/vespa/master/config-model/src/main/python/ES_Vespa_parser.py' \
  > ES_Vespa_parser.py
$ python3 ./ES_Vespa_parser.py --application_name bank bank_data.json bank_mapping.json
This generates documents in documents.json (see JSON format), where each document has an ID like id:bank:_doc::1.
It also generates a bank folder with an application package:
/bank
│
├── documents.json
├── hosts.xml
├── services.xml
└── /schemas
    └── _doc.sd
Enter the application package directory:
$ cd bank
Install Vespa CLI. This example uses Homebrew; you can also download it from GitHub:
$ brew install vespa-cli
Configure for Vespa Cloud deployment, log in and add credentials:
$ vespa config set target cloud
$ vespa config set application $TENANT_NAME.myapp.default
$ vespa auth login
$ vespa auth cert
Also see the getting started guide. Deploy the application package:
$ vespa deploy --wait 300
Index the documents exported from Elasticsearch:
$ vespa feed documents.json
Export all documents:
$ vespa visit
Get a document:
$ vespa document get id:bank:_doc::1
Count documents; look for "totalCount":1000 in the output:
$ vespa query 'select * from _doc where true'
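The count can also be read programmatically from the JSON result. This is a minimal sketch, assuming the query output has been captured as text and that the total sits under `root.fields.totalCount` in the result; the helper name `total_count` is hypothetical.

```python
import json


def total_count(response_text: str) -> int:
    """Extract root.fields.totalCount from a Vespa query result in JSON."""
    result = json.loads(response_text)
    return result["root"]["fields"]["totalCount"]
```

After feeding the sample data, the returned value should equal the number of documents exported from Elasticsearch.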
Run a simple query against the firstname field:
$ vespa query 'select firstname,lastname from _doc where firstname contains "amber"'
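The same query can be sent over HTTP by passing the YQL string as the `yql` parameter to the `/search/` path. The sketch below only builds the URL; the endpoint value is an assumption (a local deployment on port 8080), and a Vespa Cloud endpoint additionally requires the credentials configured earlier. The helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(endpoint: str, yql: str) -> str:
    """Build a query URL with the YQL string URL-encoded in the yql parameter."""
    return f"{endpoint.rstrip('/')}/search/?{urlencode({'yql': yql})}"


# Assumed local endpoint; replace with your own.
url = build_query_url(
    "http://localhost:8080",
    'select firstname,lastname from _doc where firstname contains "amber"',
)
print(url)
```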
Review the differences in document records, Elasticsearch first, then Vespa:
Elasticsearch:
{
    "_index": "bank",
    "_type": "_doc",
    "_id": "1",
    "_score": 1,
    "_source": {
        "account_number": 1,
        "balance": 39225,
        "firstname": "Amber",
        "lastname": "Duke",
        "age": 32,
        "gender": "M",
        "address": "880 Holmes Lane",
        "employer": "Pyrami",
        "email": "amberduke@pyrami.com",
        "city": "Brogan",
        "state": "IL"
    }
}
Vespa:
{
    "put": "id:bank:_doc::1",
    "fields": {
        "account_number": 1,
        "balance": 39225,
        "firstname": "Amber",
        "lastname": "Duke",
        "age": 32,
        "gender": "M",
        "address": "880 Holmes Lane",
        "employer": "Pyrami",
        "email": "amberduke@pyrami.com",
        "city": "Brogan",
        "state": "IL"
    }
}
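The relationship between the two records can be sketched as a small transformation. This is an illustration under stated assumptions, not the implementation in ES_Vespa_parser.py: it assumes each exported record carries `_type`, `_id` and `_source`, and that the namespace is the application name. The function name `to_vespa_put` is hypothetical.

```python
def to_vespa_put(es_record: dict, namespace: str) -> dict:
    """Convert one exported Elasticsearch record to a Vespa put operation."""
    document_type = es_record["_type"]
    user_id = es_record["_id"]
    return {
        # The empty part between the two colons is the unused key/value slot.
        "put": f"id:{namespace}:{document_type}::{user_id}",
        # The Elasticsearch _source object becomes the Vespa fields object.
        "fields": dict(es_record["_source"]),
    }
```

The `_index` and `_score` keys have no counterpart in the Vespa record and are dropped.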
The id field id:bank:_doc::1 is composed of:
namespace: bank
document type: _doc
user-specified id: 1
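To make the structure concrete, here is a minimal sketch that splits a document ID into its parts. It assumes the general layout `id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>`, where the key/value part is empty in this guide's IDs. The names `DocumentId` and `parse_document_id` are hypothetical.

```python
from typing import NamedTuple


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    key_values: str
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split a document ID string into its four parts after the 'id' scheme."""
    # Limit the split so colons inside the user-specified part are preserved.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a valid document id: {doc_id!r}")
    return DocumentId(*parts[1:])


print(parse_document_id("id:bank:_doc::1"))
```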
Read more in Documents and Schemas.
The schema is the key Vespa configuration file where field types and ranking are configured. The schema (found in schemas/_doc.sd) also has indexing settings, for example:
search _doc {
    document _doc {
        field account_number type long {
            indexing: summary | attribute
        }
        field address type string {
            indexing: summary | index
        }
        ...
    }
}
These settings affect both performance and how fields are matched. For example, the account_number field above uses the attribute keyword, which makes the field available for sorting, ranking and grouping, but which by default does not have data structures for fast search. Read more in attributes and the practical search performance guide.
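When adjusting the generated schema by hand, a small helper can keep field declarations consistent. The sketch below is an assumption-laden illustration: the type table `TYPE_MAP` is a hypothetical choice (numeric and keyword fields as attributes, text fields as indexes) and is not taken from ES_Vespa_parser.py; review each field against your own matching and performance needs.

```python
# Hypothetical mapping from Elasticsearch mapping types to
# (Vespa field type, indexing statement). Adjust for your application.
TYPE_MAP = {
    "long": ("long", "summary | attribute"),
    "integer": ("int", "summary | attribute"),
    "keyword": ("string", "summary | attribute"),
    "text": ("string", "summary | index"),
}


def field_declaration(name: str, es_type: str) -> str:
    """Render one schema field declaration for an Elasticsearch mapping type."""
    vespa_type, indexing = TYPE_MAP[es_type]
    return (
        f"field {name} type {vespa_type} {{\n"
        f"    indexing: {indexing}\n"
        f"}}"
    )


print(field_declaration("account_number", "long"))
print(field_declaration("address", "text"))
```

An unknown Elasticsearch type raises `KeyError`, which is deliberate here so that unmapped types are noticed rather than silently guessed.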
To run the steps above using a local deployment, follow the steps in the quickstart to start a local container running Vespa. Then deploy the application package from the bank folder.