Cloning applications and data

This guide describes how to replicate a Vespa application into different environments, with or without data. Use cases for cloning include:

  • Get a copy of the application and (some) data on a laptop to work offline, or attach a debugger.
  • Deploy local experiments to the dev environment to easily cooperate and share.
  • Set up a copy of the application and (some) data to test a new major version of Vespa.
  • Replicate a bug report in a non-production environment.
  • Set up a copy of the application and (some) data in a prod environment to experiment with a CI/CD pipeline, without touching the current production serving.
  • Onboard a new team member by setting up a copy of the application and test data in a dev environment.
  • Clone to a perf environment for load testing.

The examples in this guide clone into separate applications. Cloning into separate instances of the same application also works, but not across Vespa major versions on Vespa Cloud; refer to tenant, applications, instances for details.

Vespa Cloud has the environments dev, perf and prod, with different characteristics; see the Vespa Cloud environment documentation for details. Clone to dev/perf for short-lived experiments and development, and use prod for serving applications with a CI/CD pipeline.

Many steps are shared between the sections below, and details are only spelled out the first time a step appears, so it is a good idea to read through all of them. Examples are based on the album-recommendation sample application.

Cloning - self-hosted to self-hosted

This section creates a copy of one self-hosted application in another. Self-hosted means running Vespa on a laptop or a multinode system.

This example sets up a source app and deploys the application package - use album-recommendation as an example. The application package is then exported from the source and deployed to a new target app. Steps:

Source setup:

$ docker run --detach --name vespa1 --hostname vespa-container1 \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa
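
The container needs some time to start before a deploy against port 19071 can succeed. The following is a minimal sketch of a generic retry helper for waiting on it; the function name `retry` and the attempt/delay values are illustrative and not part of Vespa, and the health URL in the usage comment is an assumption to verify against your setup.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY_SECONDS between attempts.
retry() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Usage (assumed health endpoint on the config server port):
# retry 60 5 curl -sf -o /dev/null http://localhost:19071/state/v1/health
```

The helper returns non-zero when all attempts fail, so it can gate the deploy step in a script.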

$ vespa deploy -t http://localhost:19071

Target setup:

$ docker run --detach --name vespa2 --hostname vespa-container2 \
  --publish 8081:8080 --publish 19072:19071 \
  vespaengine/vespa

Export source application package

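The commands below fetch the deployed application package into /tmp/d inside the container and unpack it in the current directory on the host. If the host running Docker does not have tar, mount /tmp/d out of the container or copy the files by other means. Export files: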

$ docker exec vespa1 sh -c "mkdir -p /tmp/d && cd /tmp/d && /opt/vespa/bin/vespa-deploy fetch"

$ docker exec -w /tmp/d vespa1 tar cvf - . | tar xvf -

$ docker exec vespa1 rm -rf /tmp/d
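
The export relies on a tar pipe: one tar writes the directory tree to stdout and a second tar unpacks it from stdin. The pattern can be tried locally without Docker. In this sketch two temporary directories stand in for the container and the host, and the file contents are placeholders, not a valid application package.

```shell
# Two temporary directories stand in for the container (/tmp/d) and the host.
src=$(mktemp -d)
dst=$(mktemp -d)

# Placeholder files, shaped like an application package.
mkdir -p "$src/schemas"
echo '<services version="1.0"/>' > "$src/services.xml"
echo 'placeholder schema' > "$src/schemas/music.sd"

# Write the tree to stdout on one side, unpack it from stdin on the other.
(cd "$src" && tar cf - .) | (cd "$dst" && tar xf -)

ls -R "$dst"
```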

Deploy application package to target

Before deploying, one can make changes to the application package files as needed. Deploy to target:

$ vespa deploy -t http://localhost:19072

Data copy from source to target

This pipes the source data directly into vespa-feed-client - another option is to save the data to files temporarily and feed these individually:

$ docker exec vespa1 /opt/vespa/bin/vespa-visit | \
  vespa-feed-client-cli/vespa-feed-client --stdin --endpoint http://localhost:8081

Data copy 5%

This is an example of how to use a selection to specify a subset of the documents, here a “random” 5% selection (documents whose ID hash is divisible by 20):

$ docker exec vespa1 /opt/vespa/bin/vespa-visit -s 'id.hash().abs() % 20 = 0' | \
  vespa-feed-client-cli/vespa-feed-client --stdin --endpoint http://localhost:8081
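
Other sample sizes follow the same pattern: a modulus of N selects roughly one in N documents. A small sketch that generates the expression; the helper name `sample_selection` is hypothetical.

```shell
# Print a selection matching roughly one in N documents (N=20 is about 5%).
sample_selection() {
    local n="$1"
    case "$n" in
        ''|*[!0-9]*)
            echo "usage: sample_selection N (positive integer)" >&2
            return 1
            ;;
    esac
    if [ "$n" -le 0 ]; then
        echo "usage: sample_selection N (positive integer)" >&2
        return 1
    fi
    printf 'id.hash().abs() %% %s = 0\n' "$n"
}

sample_selection 20   # prints: id.hash().abs() % 20 = 0
sample_selection 10   # prints: id.hash().abs() % 10 = 0
```

The output can be passed to the visit command, for example `vespa-visit -s "$(sample_selection 10)"`.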

Get access log from source

Get the current query access log from the source application (there might be more files there):

$ docker exec vespa1 cat /opt/vespa/logs/vespa/access/JsonAccessLog.default
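
To replay traffic against the clone, the request URIs can be pulled out of the log. This sketch assumes one JSON object per line with a string field named `uri`; verify the field name against your own log before relying on it. The sample line uses made-up values, and the helper name `extract_uris` is hypothetical.

```shell
# Print the value of the "uri" field from each JSON log line on stdin.
# Plain sed is used to avoid extra dependencies; it does not handle
# escaped quotes inside the URI.
extract_uris() {
    sed -n 's/.*"uri"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Made-up sample line, for illustration only.
printf '%s\n' '{"code":200,"method":"GET","uri":"/search/?query=head","duration":0.012}' \
    | extract_uris
```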

Cloning - self-hosted to Vespa Cloud

Source setup:

$ docker run --detach --name vespa1 --hostname vespa-container1 \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa

$ vespa deploy -t http://localhost:19071

Target setup:

Create a tenant in the Vespa Cloud console, in this guide using “mytenant”.

Export source application package:

$ docker exec vespa1 sh -c "mkdir -p /tmp/d && cd /tmp/d && /opt/vespa/bin/vespa-deploy fetch"

$ docker exec -w /tmp/d vespa1 tar cvf - . | tar xvf -

$ docker exec vespa1 rm -rf /tmp/d

Deploy target application package

The procedure differs slightly depending on whether the target is the dev/perf or the prod environment. The mvn -U clean package step is only needed for applications with custom code. Configure application and instance names and create data plane credentials:

$ vespa config set target cloud && \
  vespa config set application mytenant.myapp.myinstance

$ vespa auth login

$ vespa auth cert -f

$ mvn -U clean package

Then deploy the application. Depending on the use case, deploy to dev/perf or prod:

  • dev/perf:
    $ vespa deploy
    
    Expect something like:
    Uploading application package ... done
    
    Success: Triggered deployment of . with run ID 1
    
    Use vespa status for deployment status, or follow this deployment at
    https://console.vespa-cloud.com/tenant/mytenant/application/myapp/dev/instance/myinstance/job/dev-aws-us-east-1c/run/1
    
  • Deployments to the prod environment require deployment.xml - select which zone to deploy to:
    $ cat <<EOF > deployment.xml
    <deployment version="1.0">
        <prod>
            <region>aws-us-east-1c</region>
        </prod>
    </deployment>
    EOF
    
    prod deployments also require resource specifications in services.xml - use vespa-documentation-search as an example and add/replace nodes elements for container and content clusters. If in doubt, start with a small configuration and change it later:
    <nodes count="2">
        <resources vcpu="2" memory="8Gb" disk="10Gb" />
    </nodes>
    
    Submit the application package:
    $ vespa prod submit
    
    Expect something like:
    Hint: See https://cloud.vespa.ai/en/getting-to-production
    Success: Submitted . for deployment
    See https://console.vespa-cloud.com/tenant/mytenant/application/myapp/prod/deployment for deployment progress
    
    A proper deployment to a prod zone should have automated tests; read more in automated deployments.
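
The prod requirements above can be checked before submitting. This is a rough sketch, not a validation of the package: the function name `check_prod_package` is hypothetical, and the resources check is a plain text search.

```shell
# Check that a package directory has the files a prod submission needs.
# Returns non-zero and prints to stderr if something is missing.
check_prod_package() {
    local dir="${1:-.}"
    local status=0
    local f
    for f in services.xml deployment.xml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    # prod needs explicit resources; this only looks for the element name.
    if [ -f "$dir/services.xml" ] && ! grep -q '<resources' "$dir/services.xml"; then
        echo "no <resources> element found in $dir/services.xml" >&2
        status=1
    fi
    return "$status"
}

# Usage: check_prod_package . && vespa prod submit
```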

Data copy

Get the vespa-feed-client first. Find the endpoint in the Vespa Cloud Console, then:

$ docker exec vespa1 /opt/vespa/bin/vespa-visit | \
  ./vespa-feed-client-cli/vespa-feed-client --stdin --show-errors \
  --certificate /Users/me/.vespa/mytenant.myapp.myinstance/data-plane-public-cert.pem \
  --private-key /Users/me/.vespa/mytenant.myapp.myinstance/data-plane-private-key.pem \
  --endpoint https://myinstance.myapp.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud
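
The endpoint and credential paths above both derive from the tenant.application.instance name. The sketch below makes that relationship explicit. The URL pattern is inferred from the dev examples in this guide only, so treat the endpoint shown in the Vespa Cloud Console as authoritative; the function names are hypothetical.

```shell
# Build a dev/perf endpoint from tenant.application.instance, zone and environment.
# The pattern is inferred from this guide's examples, not from an API contract.
cloud_endpoint() {
    local id="$1"
    local zone="$2"
    local env="$3"
    local tenant="${id%%.*}"
    local rest="${id#*.}"
    local app="${rest%%.*}"
    local instance="${rest#*.}"
    printf 'https://%s.%s.%s.%s.%s.z.vespa-app.cloud\n' \
        "$instance" "$app" "$tenant" "$zone" "$env"
}

# Directory holding the data plane certificate and key for an application.
credentials_dir() {
    printf '%s/.vespa/%s\n' "$HOME" "$1"
}

cloud_endpoint mytenant.myapp.myinstance aws-us-east-1c dev
credentials_dir mytenant.myapp.myinstance
```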

Get access log from source:

$ docker exec vespa1 cat /opt/vespa/logs/vespa/access/JsonAccessLog.default

Cloning - Vespa Cloud to self-hosted

Download application from Vespa Cloud

The application package can be downloaded from the Vespa Cloud Console:

  • dev/perf: Navigate to https://console.vespa-cloud.com/tenant/mytenant/application/myapp/dev/instance/myinstance and click the Application download:

    Application package download from dev environment
  • prod: Navigate to https://console.vespa-cloud.com/tenant/mytenant/application/myapp/prod/deployment?tab=builds and select the version of the application to download:

    Application package download from prod environment

Target setup:

Note the name of the application package .zip file. If changes are needed, unzip it and use vespa deploy -t http://localhost:19071 to deploy from the current directory:

$ docker run --detach --name vespa1 --hostname vespa-container1 \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa

$ vespa config set target local

$ vespa deploy -t http://localhost:19071 mytenant.myapp.myinstance.dev.aws-us-east-1c.zip

Data copy

Modify dump.sh (see the appendix) to use the correct tenant.app.instance names, then start a dump/feed job. The JSON returned by a visit cannot be fed directly, hence the small jq filter that extracts the documents:

$ ./dump.sh | jq '.documents[]' | \
  vespa-feed-client-cli/vespa-feed-client --stdin --show-errors \
  --endpoint http://localhost:8080
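
To see what the jq filter does, it can be run on a small sample. The response below is made up and only shaped like a visit response (a `documents` array plus metadata such as `continuation`); the `-c` flag is an optional choice here to print one compact object per line. The demo is skipped if jq is not installed.

```shell
# Keep only the documents from visit responses read on stdin.
documents_only() {
    jq -c '.documents[]'
}

# Made-up sample response, for illustration only.
response='{"pathId":"/document/v1/mynamespace/music/docid","documents":[{"id":"id:mynamespace:music::1","fields":{"artist":"A"}},{"id":"id:mynamespace:music::2","fields":{"artist":"B"}}],"documentCount":2,"continuation":"AAAA"}'

if command -v jq > /dev/null 2>&1; then
    printf '%s\n' "$response" | documents_only
fi
```

The metadata fields are dropped and each element of the array becomes its own JSON value on the output.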

Data copy - minimal

For use cases requiring only a few documents, fetch them with a single visit request:

$ curl --cert data-plane-public-cert.pem --key data-plane-private-key.pem \
  "https://myinstance.myapp.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/document/v1/?cluster=music&wantedDocumentCount=10"

Get access log from source:

Use the Vespa Cloud Console to get access logs.

Cloning - Vespa Cloud to Vespa Cloud

This is a combination of the procedures above. Download the application package from dev/perf or prod and make note of the source name, like mytenant.myapp.myinstance. Then use vespa deploy or vespa prod submit as above to deploy to dev/perf or prod.

If cloning from dev/perf to prod, pay attention to changes in deployment.xml and services.xml as in cloning to Vespa Cloud.

Data copy

Update dump.sh with the source name, e.g. mytenant.myapp.myinstance, and set the feed client endpoint and credential paths based on the target name, e.g. mytenant.myapp-new.myinstance:

$ ./dump.sh | jq '.documents[]' | \
  vespa-feed-client-cli/vespa-feed-client --stdin --show-errors \
    --certificate /Users/me/.vespa/mytenant.myapp-new.myinstance/data-plane-public-cert.pem \
    --private-key /Users/me/.vespa/mytenant.myapp-new.myinstance/data-plane-private-key.pem \
    --endpoint https://myinstance.myapp-new.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud

Data copy 5%

Set the SELECTION variable in dump.sh to select a subset of the documents.

Appendix: dump.sh

#!/bin/bash

ENDPOINT="https://myinstance.myapp.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud"
NAMESPACE=mynamespace
DOCTYPE=music
CLUSTER=music

unset SELECTION
# Use a selection to visit a subset - example 5% selection: id.hash().abs() % 20 = 0
# SELECTION='&selection=id.hash%28%29.abs%28%29%20%25%2020%20%3D%200'

continuation=""
idx=0

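# Visit in chunks, requesting up to 1000 documents per call and saving each
# response to its own file. jq -e exits non-zero once a response has no
# continuation token, which ends the loop after the last chunk is written.
while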
    ((idx+=1))
    printf -v out "%05g" $idx
    filename=${NAMESPACE}-${DOCTYPE}-${out}.data
    token=$( curl -s \
        --cert /Users/me/.vespa/mytenant.myapp.myinstance/data-plane-public-cert.pem \
        --key  /Users/me/.vespa/mytenant.myapp.myinstance/data-plane-private-key.pem \
        "${ENDPOINT}/document/v1/${NAMESPACE}/${DOCTYPE}/docid?wantedDocumentCount=1000&cluster=${CLUSTER}&${continuation}${SELECTION}" \
        | tee ${filename} | jq -re .continuation )
do
    continuation="continuation=${token}"
done

cat *.data
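
The SELECTION value in dump.sh is the selection expression percent-encoded for use in a URL. A minimal sketch of an encoder for building other values; the function name `urlencode` is hypothetical, and it is only intended for ASCII input such as selection expressions.

```shell
# Percent-encode a string for use as a URL query parameter value.
# Letters, digits and . ~ _ - are kept; every other character becomes %XX.
urlencode() {
    local rest="$1"
    local out=""
    local c
    while [ -n "$rest" ]; do
        c="${rest%"${rest#?}"}"   # first character of the remainder
        rest="${rest#?}"
        case "$c" in
            [a-zA-Z0-9.~_-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
    done
    printf '%s\n' "$out"
}

urlencode 'id.hash().abs() % 20 = 0'
# prints: id.hash%28%29.abs%28%29%20%25%2020%20%3D%200
```

In dump.sh this could be used as `SELECTION="&selection=$(urlencode 'id.hash().abs() % 10 = 0')"`.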