Data dump

To get started, find the namespace and document type by listing a few IDs from the /document/v1/ API at ENDPOINT. Restrict the request to one CLUSTER, see content clusters (most applications have only one cluster):

$ curl --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER"
For ID dump only, use a fieldset:
$ curl --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "$ENDPOINT/document/v1/?cluster=$CLUSTER&fieldSet=%5Bid%5D"

From an ID, like id:open:doc::open/documentation/schemas.html, extract

  • NAMESPACE: open
  • DOCTYPE: doc
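The same split can be done in the shell. This is a minimal sketch, assuming the ID follows the id:<namespace>:<document type>:<key/values>:<user part> layout of the example ID; the variable names doc_id and rest are only illustrative.

```shell
#!/bin/sh
# Example ID taken from the text above
doc_id="id:open:doc::open/documentation/schemas.html"

# Strip the leading "id:" scheme, then peel off one colon-separated field at a time
rest=${doc_id#id:}
NAMESPACE=${rest%%:*}   # text up to the first remaining colon
rest=${rest#*:}
DOCTYPE=${rest%%:*}     # text up to the next colon

echo "NAMESPACE=$NAMESPACE"
echo "DOCTYPE=$DOCTYPE"
```

The two values can then be set in the dump script that follows.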
Example script:
#!/bin/bash

set -x

ENDPOINT="https://vespacloud-docsearch.vespa-team.aws-us-east-1c.public.vespa.oath.cloud"
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

continuation=""
idx=0

# Loop until a response carries no continuation token
while [ "$continuation" != "continuation=" ]
do
  ((idx+=1))
  echo "$continuation"
  # Zero-padded chunk number for the file name, e.g. 00001
  printf -v out "%05d" "$idx"
  filename=${NAMESPACE}-${DOCTYPE}-${out}.data.gz
  echo "Fetching data..."
  curl -s --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
    "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=1000&concurrency=4&cluster=${CLUSTER}&${continuation}" | gzip > "${filename}"
  # Read the continuation token from the start of the saved response
  continuation="continuation=$(gunzip -c "${filename}" | head -c 1000 | grep '"continuation":"' | cut -d'"' -f4)"
done
On concurrency: a higher concurrency value returns more documents per request, at the cost of higher temporary memory use while the dump runs. wantedDocumentCount is capped at 1000.
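To experiment with these two parameters, the request URL from the loop above can be built by a small helper. This is only a sketch: visit_url is a hypothetical function name, and the ENDPOINT value is a placeholder, not a real endpoint.

```shell
#!/bin/sh
# Hypothetical helper: prints the visit URL used by the dump loop.
# $1: wantedDocumentCount, $2: concurrency, $3: continuation token (may be empty)
visit_url() {
  url="${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=$1&concurrency=$2&cluster=${CLUSTER}"
  # Append the token only when a previous response returned one
  if [ -n "$3" ]; then
    url="${url}&continuation=$3"
  fi
  printf '%s\n' "$url"
}

ENDPOINT="https://example.invalid"   # placeholder, replace with the real endpoint
NAMESPACE=open
DOCTYPE=doc
CLUSTER=documentation

visit_url 1000 4 ""        # first request, no token yet
visit_url 1000 8 "AAAA"    # later request with a made-up token and higher concurrency
```

The printed URL can be passed to the curl command from the script, which keeps the tuning parameters in one place.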

Feed

Use the vespa-http-client to feed documents in batches to the endpoint, for example:

$ gunzip -c open-doc-00001.data.gz | jq '.documents[]' | \
  java -jar $HOME/github/vespa-engine/vespa/vespa-http-client/target/vespa-http-client-jar-with-dependencies.jar \
  --endpoint $ENDPOINT --disable-hostname-verification \
  --useTls --certificate data-plane-public-cert.pem --privateKey data-plane-private-key.pem
Note that the data dump cannot be fed directly, because each response wraps the documents in extra fields such as the continuation token. The jq command extracts the document objects only.
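The difference between the dump wrapper and the extracted documents can be seen on a small sample. The response below is made up for illustration (token, IDs and field values are not real data), and its shape is an assumption based on the fields mentioned above; jq must be installed.

```shell
#!/bin/sh
# Illustrative dump response with two documents and a made-up continuation token
dump='{"pathId":"/document/v1/open/doc/docid","documents":[{"id":"id:open:doc::a","fields":{"title":"A"}},{"id":"id:open:doc::b","fields":{"title":"B"}}],"continuation":"AAAA"}'

# Top-level keys of the response: the wrapper that cannot be fed as-is
printf '%s' "$dump" | jq -c 'keys'

# Only the document objects, one per line, as passed to the feed client
docs=$(printf '%s' "$dump" | jq -c '.documents[]')
printf '%s\n' "$docs"
```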

Delete

To purge all documents, generate a feed with remove-entries for all document IDs, like:

$ gunzip -c open-doc-00001.data.gz | jq '[ .documents[] | {remove: .id} ]' | head

[
  {
    "remove": "id:open:doc::open/documentation/schemas.html"
  },
  {
    "remove": "id:open:doc::open/documentation/securing-your-vespa-installation.html"
  },
Hence, extract all IDs and feed them to the endpoint to remove the documents:
$ gunzip -c open-doc-00001.data.gz | jq '[ .documents[] | {remove: .id} ]' | \
  java -jar $HOME/github/vespa-engine/vespa/vespa-http-client/target/vespa-http-client-jar-with-dependencies.jar \
  --endpoint $ENDPOINT --disable-hostname-verification \
  --useTls --certificate data-plane-public-cert.pem --privateKey data-plane-private-key.pem
Pro Tip: Reduce file size by only dumping IDs using &fieldSet=%5Bid%5D (above).
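The commands above handle one dump file. When the dump spans several files, the IDs can be merged into a single array of remove operations. This is a sketch under assumptions: remove_feed is a hypothetical helper name, the files are named as in the dump script, and jq is installed. Since jq -s reads all input into memory, very large dumps are better processed one file at a time.

```shell
#!/bin/sh
# Hypothetical helper: prints one JSON array of remove operations
# for all gzipped dump files given as arguments.
remove_feed() {
  # gunzip -c concatenates the responses; jq -s collects them into one array.
  # The trailing "?" skips responses without a documents field.
  gunzip -c "$@" | jq -s '[ .[].documents[]? | {remove: .id} ]'
}

# Usage, for example:
#   remove_feed open-doc-*.data.gz > remove-feed.json
```

The resulting array has the same shape as the per-file remove feed shown above.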