When bootstrapping an index, one must consider node resource configuration and number of nodes. The strategy is to iterate:
While doing this, ensure the cluster is never more than 50% full - this gives headroom to later increase/shrink the index and change schema configuration easily using automatic reindexing. It is easy to downscale resources after the bootstrap, and it saves a lot of time keeping the clusters within limits - hence max 50%.
Review the Vespa Overview to understand the difference between container and content clusters before continuing.
The content node resource configuration should not have ranges for index bootstrap, as autoscaling will interfere with the evaluation in this step. This is a good starting point; make sure there are no ranges like [2,3]:
To evaluate how full the content cluster is, use metrics from content nodes - example:
$ curl \
    --cert data-plane-public-cert.pem \
    --key data-plane-private-key.pem \
    https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/prometheus/v1/values | \
    egrep 'disk.util|mem.util' | egrep 'clusterId="content/'
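To turn the raw metric lines into a quick headroom check, the output can be piped through a small filter. This is only a sketch, not part of Vespa: `util_over` is a hypothetical helper name, and it assumes each sample line has the form `name{labels} value`, optionally followed by a timestamp. The limit must be given in the same unit the metrics use, which is worth confirming against real output first.

```shell
# Hypothetical helper: read Prometheus text on stdin and print every
# content-cluster disk/memory util sample whose value exceeds the limit in $1.
# Assumes sample lines look like: name{labels} value [timestamp]
util_over() {
  grep -E 'disk.util|mem.util' | grep -E 'clusterId="content/' |
    awk -v limit="$1" '{
      line = $0
      # Drop the metric name and label set, leaving "value [timestamp]".
      sub(/^.*[}] /, "")
      if ($1 + 0 > limit + 0) print line
    }'
}
```

For example, piping the output of the `curl` command into `util_over 50` lists the samples above 50, assuming util is reported in percent. No output means every matching sample is within the limit.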
Once able to get the metrics above, you are ready to bootstrap the index.
The purpose of this step is to feed a tiny chunk of the corpus to:
Feed a small data set, using
Look at memory util. Say the 1% feed caused 15% memory util: the 10% feed would then take 150%, or 3X the 50% max. There are two options: increase memory/disk, or add more nodes. A good rule of thumb at this stage is that the final 100% feed should fit on 4 or more nodes, and there is a 2-node minimum for redundancy. The default configuration at the start of this document is quite small, so 3X at this stage means tripling the disk and memory, then adding more nodes in later steps.
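The arithmetic in this step can be sketched as two small shell functions. Both names are hypothetical, the inputs are whole-number percentages, and the projection assumes memory util grows linearly with the amount of data fed.

```shell
# Hypothetical helpers for the sizing arithmetic; integer percentages only.

# projected_util SAMPLE_PCT MEASURED_UTIL TARGET_PCT
# Linear extrapolation: util measured after feeding SAMPLE_PCT of the corpus,
# scaled up to TARGET_PCT of the corpus.
projected_util() {
  echo $(( $2 * $3 / $1 ))
}

# scale_factor PROJECTED_UTIL MAX_UTIL
# Smallest whole-number capacity multiplier that brings the projection
# down to MAX_UTIL or below (ceiling division).
scale_factor() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

projected_util 1 15 10   # prints 150
scale_factor 150 50      # prints 3
```

The multiplier can be spent on larger nodes, more nodes, or a mix of both.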
Deploy changes (if needed). Whenever node count increases or resource configuration is modified, new nodes are added, and data is migrated to new nodes. Example: growing from 2 to 3 nodes means each of the 2 current nodes will migrate 33% of their data to the new node. Read more in elasticity. It saves time to let the cluster finish data migration before feeding more data. In this step it will be fast as the data volume is small, but nevertheless check the vds.idealstate.merge_bucket.pending.average metric. Wait for 0 for all nodes - this means data migration is completed:
$ curl \
    --cert ~/.vespa/mytenant.myapp.default/data-plane-public-cert.pem \
    --key ~/.vespa/mytenant.myapp.default/data-plane-private-key.pem \
    'https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/prometheus/v1/values?consumer=Vespa' | \
    egrep 'vds_idealstate_merge_bucket_pending_average'
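Rather than re-running the command by hand, the check can be wrapped in a polling loop. The following is a sketch under stated assumptions: `pending_merges` and `wait_for_migration` are hypothetical names, sample lines are assumed to look like `name{labels} value` with an optional timestamp, and the 60-second interval is arbitrary. An empty response is reported as `unknown` so that a failed request is not mistaken for a finished migration.

```shell
# Hypothetical helper: sum all merge_bucket pending samples read from stdin.
# Prints "unknown" when no sample is found, e.g. because the request failed.
pending_merges() {
  grep -E 'vds_idealstate_merge_bucket_pending_average' |
    awk '/^#/ { next }
         { sub(/^.*[}] /, ""); sum += $1; n++ }
         END { if (n == 0) print "unknown"; else print sum + 0 }'
}

# wait_for_migration FETCH_COMMAND [ARGS...]
# Runs the given metrics-fetching command repeatedly until the summed
# pending count is 0, which means it is 0 on every node.
wait_for_migration() {
  until [ "$("$@" | pending_merges)" = "0" ]; do
    sleep 60
  done
}
```

Here the fetch command would be the `curl` invocation without the trailing `egrep`, passed as arguments to `wait_for_migration`.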
At this point, you can validate that both memory and disk util are less than 5%, so that the 10% feed will fit within the 50% limit.
Feed the 10% corpus, still observing util metrics.
As the content cluster capacity is increased,
it is normal to eventually be CPU bound in the container or content cluster.
A 10% feed is a great baseline for the full capacity requirements.
Fine tune the resource config and number of hosts as needed.
If you deploy changes, wait for the
Again, validate that memory and disk util are less than 5% before the full feed.
Feed the full data set, observing the metrics. You should be able to estimate timing by extrapolation, as feed time is linear at this scale. At feed completion, observe the util metrics for the final fine-tuning.
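As a rough illustration of the timing extrapolation, the duration of a partial feed can be scaled to the full corpus. `full_feed_estimate` is a hypothetical helper, and the 30-minute figure in the example is made up, not a measurement.

```shell
# full_feed_estimate SAMPLE_PCT SAMPLE_SECONDS
# Scale the wall-clock time of a partial feed linearly to 100% of the corpus.
full_feed_estimate() {
  total=$(( $2 * 100 / $1 ))
  printf '%dh %dm\n' $(( total / 3600 )) $(( total % 3600 / 60 ))
}

# Example: a 10% feed that took 1800 seconds (30 minutes).
full_feed_estimate 10 1800   # prints 5h 0m
```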
A great exercise at this point is to add a node, then remove a node, and measure the time to completion.
It can be a good idea to reduce node count to get the memory util closer to 70% at this step, to optimize for cost. However, do not spend too much time optimizing in this step, as the next step is normally sizing for query load. This will again possibly alter resource configuration and node counts / topology, but now you have a good grasp of how to easily bootstrap the index for these experiments.
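A back-of-the-envelope way to pick the reduced node count is sketched here. `nodes_for_target` is a hypothetical helper that assumes data is spread evenly, so that total util (nodes times util per node) stays constant when nodes are removed. It rounds up and keeps the 2-node minimum for redundancy.

```shell
# nodes_for_target CURRENT_NODES CURRENT_UTIL TARGET_UTIL
# Fewest nodes that keep per-node util at or below TARGET_UTIL,
# assuming the same data is spread evenly (integer percentages).
nodes_for_target() {
  n=$(( ($1 * $2 + $3 - 1) / $3 ))
  # Never go below the 2-node minimum for redundancy.
  if [ "$n" -lt 2 ]; then
    n=2
  fi
  echo "$n"
}

# Example: 6 nodes at 40% memory util, aiming for at most 70%.
nodes_for_target 6 40 70   # prints 4
```

In the example, 4 nodes would land at about 60% util, while 3 nodes would give 80% and exceed the target.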
Feeding too much will cause a
feed blocked state.
Add a node to the full content cluster in services.xml, and wait for data migration to complete -
i.e. wait for the
vds.idealstate.merge_bucket.pending.average metric to go to zero.
It is better to add a node than to increase node resources, as data migration is quicker.