Getting started - Document Processing

Data written to Vespa passes through document processing, of which indexing is one example. Applications can add custom processing, normally run before indexing, by adding a Document Processor. Such processing is synchronous, which is a problem when it depends on other resources with high latency: threads blocked waiting on those resources can saturate the thread pool.
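
To make the saturation problem concrete, here is a minimal sketch using only java.util.concurrent. It is not Vespa code: the class name, the one-thread pool, and the latch standing in for a high-latency resource are all assumptions chosen to make the effect easy to see.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: plain java.util.concurrent, not Vespa's document processing threads.
class BlockingPoolSketch {

    // Returns true if a fast task had to wait behind a task blocked on a slow resource.
    static boolean fastTaskWasStarved() {
        // A deliberately tiny pool (hypothetical size) so one blocked task saturates it.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        // Stands in for a high-latency lookup that has not answered yet.
        CountDownLatch slowResource = new CountDownLatch(1);
        try {
            // Synchronous style: the worker thread blocks until the resource answers.
            Future<Object> slow = pool.submit(() -> {
                slowResource.await();
                return null;
            });
            Future<String> fast = pool.submit(() -> "fast task done");
            // The only thread is occupied by the blocked task, so the fast task cannot have run.
            boolean starved = !fast.isDone();
            slowResource.countDown();   // the slow resource finally answers
            slow.get();
            fast.get();
            return starved;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast task starved: " + fastTaskWasStarved());
    }
}
```

With more threads the same thing happens once enough tasks block at the same time, which is why returning control to the framework while waiting is preferable.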

This application demonstrates how to use Progress.LATER and the asynchronous Document API. Summary:

  • Document Processors: modify / enrich data in the feed pipeline
  • Multiple Schemas: store different kinds of data, like different database tables
  • Enrich data from multiple sources: here, look up data in one schema and add to another
  • Document API: write asynchronous code to fetch data

Flow:

  1. Feed album document with the music schema
  2. Look up in the lyrics schema if album with given ID has lyrics stored
  3. Store album with lyrics in the music schema
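
The three steps above can be sketched in plain Java, with two in-memory maps standing in for the music and lyrics schemas. This is only an illustration of the data flow, not the Vespa Document API: the class name, the map-based stores, and the placeholder values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: in-memory maps stand in for the two Vespa schemas.
class EnrichmentFlowSketch {
    // Hypothetical stand-in for the lyrics schema: album ID -> lyrics text.
    static final Map<String, String> lyricsStore = new HashMap<>();
    // Hypothetical stand-in for the music schema: album ID -> album fields.
    static final Map<String, Map<String, String>> musicStore = new HashMap<>();

    // Step 1: an album arrives. Step 2: look up lyrics by ID. Step 3: store the album.
    static Map<String, String> feedAlbum(String id, Map<String, String> album) {
        Map<String, String> enriched = new HashMap<>(album); // leave the input untouched
        String lyrics = lyricsStore.get(id);                  // step 2: lookup in "lyrics"
        if (lyrics != null) {
            enriched.put("lyrics", lyrics);                   // enrich only when lyrics exist
        }
        musicStore.put(id, enriched);                         // step 3: store in "music"
        return enriched;
    }

    public static void main(String[] args) {
        lyricsStore.put("a-head-full-of-dreams", "placeholder lyrics");
        Map<String, String> album = new HashMap<>();
        album.put("album", "A Head Full of Dreams");
        System.out.println(feedAlbum("a-head-full-of-dreams", album));
    }
}
```

In the sample application the lookup in step 2 is asynchronous, which is what the rest of this guide is about; the sketch keeps it synchronous to show only which data moves where.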


Install the Vespa CLI:

Using Homebrew:

$ brew install vespa-cli

You can also download Vespa CLI for Windows, Linux and macOS.

Create your tenant in the Vespa Cloud:

If you don’t already have a Vespa Cloud tenant, create one at console.vespa.ai. This requires a Google or GitHub account, and will start your free trial.

Initialize the application

Initialize myapp-docproc/ to a copy of a sample application package:

$ vespa clone vespa-cloud/document-processing myapp-docproc
$ cd myapp-docproc

Let pom.xml specify the right tenant and application name:

Set the tenant and application properties in pom.xml to your tenant and application name (“myapp-docproc”).

Tell the Vespa CLI to use Vespa Cloud, with your application:

$ vespa config set target cloud
$ vespa config set application <tenant-name>.myapp-docproc.default

Use the tenant and application name from step 2.

$ export VESPA_CLI_HOME=$PWD/.vespa TMPDIR=$PWD/.tmp
$ mkdir -p $TMPDIR
$ vespa config set target cloud
$ vespa config set application vespa-team.document-processing.my-instance

Create a user API key:

$ vespa api-key

Follow the instructions from the command to register the key.

$ echo "$VESPA_TEAM_API_KEY" | openssl base64 -A -a -d | openssl ec > $VESPA_CLI_HOME/vespa-team.api-key.pem

Create a self-signed certificate for accessing your application:

$ vespa cert

See the security model for more details.

Build and deploy the application:

$ mvn package vespa:deploy

The first deployment may take a few minutes.

$ mvn package vespa:deploy -Dinstance=my-instance -DapiKeyFile=$VESPA_CLI_HOME/vespa-team.api-key.pem

Verify that you can reach your application endpoint:

$ vespa status --wait 300

Feed a lyrics document:

Then get the document to verify that the feed succeeded:

$ vespa document src/test/resources/A-Head-Full-of-Dreams-lyrics.json
$ vespa document get id:mynamespace:lyrics::a-head-full-of-dreams

Feed a music document:

$ vespa document src/test/resources/A-Head-Full-of-Dreams.json

Validate that the Document Processor works

Get the music document and check that it now contains lyrics:

$ vespa document get id:mynamespace:music::a-head-full-of-dreams

Compare with the original document, which has no lyrics. The lyrics field was added by the LyricsDocumentProcessor:

$ cat src/test/resources/A-Head-Full-of-Dreams.json

Review logs:

Use the console to download logs, then inspect what happened:

Container.ai.vespa.example.album.LyricsDocumentProcessor	info	In process
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Added to requests pending: 1
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Request pending ID: 1, Progress.LATER
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	In process
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Request pending ID: 1, Progress.LATER
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	In handleResponse
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Async response to put or get, requestID: 1
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Found lyrics for : document 'id:mynamespace:lyrics::1' of type 'lyrics'
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	In process
Container.ai.vespa.example.album.LyricsDocumentProcessor	info	  Set lyrics, Progress.DONE

  1. In the first invocation of process, an async request is made - set Progress.LATER
  2. In the second invocation of process, the async request has not yet completed (there can be many such invocations) - set Progress.LATER
  3. Then, the handler for the async operation is invoked as the call has completed
  4. In the subsequent process invocation, we see that the async operation has completed - set Progress.DONE
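
The sequence seen in the log can be simulated in plain Java. This is a sketch of the control flow only and does not use the Vespa classes: the Progress enum here is a local stand-in, the CompletableFuture plays the role of the asynchronous Document API response, and all names and the placeholder lyrics are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustration only: simulates the LATER/DONE pattern without any Vespa dependency.
class ProgressLaterSketch {
    enum Progress { LATER, DONE }   // local stand-in for the framework's progress values

    private final CompletableFuture<String> lookup; // completed when the async response arrives
    private boolean requested = false;
    private String lyrics;

    ProgressLaterSketch(CompletableFuture<String> lookup) {
        this.lookup = lookup;
    }

    // Called repeatedly by the (simulated) framework until it returns DONE.
    Progress process() {
        if (!requested) {
            requested = true;           // first invocation: the async request is made
            return Progress.LATER;
        }
        if (!lookup.isDone()) {
            return Progress.LATER;      // still pending: return instead of blocking the thread
        }
        lyrics = lookup.join();         // already completed, so join() does not block
        return Progress.DONE;
    }

    String lyrics() {
        return lyrics;
    }

    // Replays the sequence from the log and records what each process() call returned.
    static List<Progress> run() {
        CompletableFuture<String> lookup = new CompletableFuture<>();
        ProgressLaterSketch processor = new ProgressLaterSketch(lookup);
        List<Progress> trace = new ArrayList<>();
        trace.add(processor.process());         // request made
        trace.add(processor.process());         // request still pending
        lookup.complete("placeholder lyrics");  // the response handler fires
        trace.add(processor.process());         // response available
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run());   // prints [LATER, LATER, DONE]
    }
}
```

The point of the pattern is that process never waits: each invocation either makes progress or returns immediately, so the processing thread is free for other documents while the lookup is in flight.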

Further reading: