Dumping data

This guide explains how to dump data from a Vespa Cloud application, how to copy documents from one application to another, and how to do mass updates or removals.

To get started with a data dump, find the namespace and document type by listing a few IDs. Send a request to /document/v1/ on the application's ENDPOINT, and restrict it to one content cluster with the cluster parameter (CLUSTER below, see content clusters):

$ curl --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER"

To dump document IDs only, use a fieldSet (here [id], URL-encoded as %5Bid%5D):

$ curl --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER&fieldSet=%5Bid%5D"

From an ID, like id:open:doc::open/documentation/schemas.html, extract:

  • NAMESPACE: open
  • DOCTYPE: doc
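
The same extraction can be scripted. The following is a minimal sketch, assuming the ID has the colon-separated layout of the example above, with the namespace in the second field and the document type in the third. The function names doc_id_namespace and doc_id_doctype are hypothetical helpers, not part of any Vespa tool:

```shell
#!/bin/bash

# Hypothetical helper: print the namespace (second colon-separated field) of a document ID.
doc_id_namespace() {
  printf '%s\n' "$1" | cut -d: -f2
}

# Hypothetical helper: print the document type (third colon-separated field) of a document ID.
doc_id_doctype() {
  printf '%s\n' "$1" | cut -d: -f3
}

# Usage:
#   NAMESPACE=$(doc_id_namespace "id:open:doc::open/documentation/schemas.html")
#   DOCTYPE=$(doc_id_doctype "id:open:doc::open/documentation/schemas.html")
```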

Example script:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.global.public.vespa.oath.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.public.vespa.oath.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

continuation=""
idx=0

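while
  ((idx+=1))
  echo "$continuation"
  # Zero-padded chunk number for the output file name, e.g. 00001
  printf -v out "%05d" "$idx"
  filename="${NAMESPACE}-${DOCTYPE}-${out}.data.gz"
  echo "Fetching data..."
  # Save the full response gzipped, and extract the continuation token for the next request.
  # jq -e exits non-zero when the response has no continuation, which ends the loop.
  token=$( curl -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=1000&concurrency=4&cluster=${CLUSTER}&${continuation}" \
           | tee >( gzip > "${filename}" ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done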

If each response returns only a few documents, set wantedDocumentCount (default 1, max 1024) to request a lower bound on the number of documents per response. The lower bound applies only while that many documents still remain.

Specifying concurrency (default 1, max 100) increases throughput, at the cost of resource usage. This also increases the number of documents per response, and could lead to excessive memory usage in the HTTP container when many large documents are buffered to be returned in the same response.
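
To experiment with these parameters without editing the long URL by hand, the request URL can be assembled from its parts. This is a sketch only: visit_url is a hypothetical helper name, its defaults of 1000 and 4 simply mirror the example script above, and it leaves out the continuation parameter when no token is given:

```shell
#!/bin/bash

# Hypothetical helper: assemble a /document/v1 visit URL from its parts.
# Arguments: endpoint namespace doctype cluster [wantedDocumentCount] [concurrency] [continuation]
visit_url() {
  local endpoint="$1" namespace="$2" doctype="$3" cluster="$4"
  local count="${5:-1000}" concurrency="${6:-4}" continuation="${7:-}"
  local url="${endpoint}/document/v1/${namespace}/${doctype}/docid"
  url+="?cluster=${cluster}&wantedDocumentCount=${count}&concurrency=${concurrency}"
  # Only add the continuation parameter once a token has been received.
  if [ -n "$continuation" ]; then
    url+="&continuation=${continuation}"
  fi
  printf '%s\n' "$url"
}

# Usage:
#   curl -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
#     "$(visit_url "$ENDPOINT" "$NAMESPACE" "$DOCTYPE" "$CLUSTER" 500 2)"
```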

Feed

Use the vespa-http-client to feed documents in batches to an endpoint. Example:

$ gunzip -c open-doc-00001.data.gz | jq '.documents[]' | \
  java -jar $HOME/github/vespa-engine/vespa/vespa-http-client/target/vespa-http-client-jar-with-dependencies.jar \
  --endpoint $ENDPOINT --disable-hostname-verification \
  --useTls --certificate data-plane-public-cert.pem --privateKey data-plane-private-key.pem

Note that the data dump cannot be fed directly, because each response contains extra fields such as the continuation token. The jq command above extracts the document objects only.
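
To see what the jq filter does, it can be run on a small response first. The sketch below uses a made-up, heavily abbreviated response that keeps only a documents array and a continuation field. Real responses contain more fields, and extract_documents is a hypothetical helper name:

```shell
#!/bin/bash

# Made-up, abbreviated visit response: two documents plus a continuation token.
sample_response='{"documents":[{"id":"id:open:doc::a","fields":{"title":"A"}},{"id":"id:open:doc::b","fields":{"title":"B"}}],"continuation":"AAAA"}'

# Hypothetical helper: read a visit response on stdin and print one document
# object per line, dropping the continuation token and any other metadata.
extract_documents() {
  jq -c '.documents[]'
}

# Usage:
#   printf '%s' "$sample_response" | extract_documents
#   gunzip -c open-doc-00001.data.gz | extract_documents
```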

It is also possible to feed the stream of documents from one deployment directly to another, by replacing the gzip > ${filename} inside the tee >( ... ) of the sample script with the compound parse/feed command above.

Delete

To remove all documents in a Vespa deployment—or a selection of them—perform a deletion visit. Use the DELETE HTTP method, and fetch only the continuation token from the response:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.global.public.vespa.oath.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.public.vespa.oath.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation
SELECTION='doc.path%3D~%22%5E%2Fold%2F%22'  # doc.path =~ "^/old/" -- all documents under the /old/ directory 

continuation=""

while
  token=$( curl -X DELETE -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
           | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done

Each request returns a response after roughly one minute. Change this by specifying timeChunk (default 60 seconds).
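
The SELECTION value in the script is URL-encoded by hand. For other selection expressions, the encoding can be produced with a small function. This is a minimal sketch in plain bash, assuming the expression contains ASCII characters only; urlencode is a hypothetical helper, not a Vespa or curl feature:

```shell
#!/bin/bash

# Hypothetical helper: percent-encode an ASCII string for use in a URL query parameter.
# Letters, digits and the characters . ~ _ - are kept; everything else becomes %XX.
urlencode() {
  local LC_ALL=C
  local input="$1" output="" char hex i
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:i:1}"
    case "$char" in
      [a-zA-Z0-9.~_-]) output+="$char" ;;
      *) printf -v hex '%%%02X' "'$char"   # "'x" makes printf use the character code of x
         output+="$hex" ;;
    esac
  done
  printf '%s\n' "$output"
}

# Usage:
#   SELECTION=$(urlencode 'doc.path=~"^/old/"')
```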

To purge all documents contained in a data dump (see above), generate a feed with a remove entry for each document ID, like:

$ gunzip -c open-doc-00001.data.gz | jq '[ .documents[] | {remove: .id} ]' | head

[
  {
    "remove": "id:open:doc::open/documentation/schemas.html"
  },
  {
    "remove": "id:open:doc::open/documentation/securing-your-vespa-installation.html"
  },

Complete example for a single chunk:

$ gunzip -c open-doc-00001.data.gz | jq '[ .documents[] | {remove: .id} ]' | \
  java -jar $HOME/github/vespa-engine/vespa/vespa-http-client/target/vespa-http-client-jar-with-dependencies.jar \
  --endpoint $ENDPOINT --disable-hostname-verification \
  --useTls --certificate data-plane-public-cert.pem --privateKey data-plane-private-key.pem

Update

To update all documents in a Vespa deployment—or a selection of them—perform an update visit. Use the PUT HTTP method, and specify a partial update in the request body:

#!/bin/bash

set -x

# The ENDPOINT must be a regional endpoint, do not use '*.global.public.vespa.oath.cloud/'
ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.public.vespa.oath.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation
SELECTION='doc.inlinks%3D%3D%22some-url%22'  # doc.inlinks == "some-url" -- the weightedset<string> inlinks has the key "some-url"

continuation=""

while
  token=$( curl -X PUT -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
           --data '{ "fields": { "inlinks": { "remove": { "some-url": 0 } } } }' \
           "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?selection=${SELECTION}&cluster=${CLUSTER}&${continuation}" \
           | tee >( jq . > /dev/tty ) | jq -re .continuation )
do
  continuation="continuation=${token}"
done

Each request returns a response after roughly one minute. Change this by specifying timeChunk (default 60 seconds).
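
The dump, delete and update scripts in this guide all end their loop the same way: the last command of the while condition is jq -re .continuation, and with -e jq exits with a non-zero status when the selected value is null, which is the case when the response has no continuation field. The sketch below isolates that mechanism; next_token is a hypothetical helper name and the JSON inputs are made up:

```shell
#!/bin/bash

# Hypothetical helper: print the continuation token from a response read on stdin.
# Exit status is 0 when a token is present and non-zero when the field is missing.
next_token() {
  jq -re '.continuation'
}

# Usage:
#   if token=$(printf '%s' '{"documents":[],"continuation":"AAAA"}' | next_token); then
#     echo "more data, continue with continuation=${token}"
#   fi
#   if ! printf '%s' '{"documents":[]}' | next_token >/dev/null; then
#     echo "no continuation, the visit is complete"
#   fi
```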