Overload Handling

Handling overload comes down to two things:

  • what to do prior to a load surge
  • how to handle an ongoing load surge


  • Keep an up-to-date benchmark report to understand expected behavior at different levels of load, i.e. where the hockey stick is.
  • Some applications are updated from batch jobs like Hadoop / Spark. It is easy to overload the Vespa instance by running too many clients or using too many connections. In these cases, tune parallelism down for correct utilization.
  • Pre-define a light-weight rank profile for emergency use, enabled using a query profile. Test this before it is needed, and update the benchmark report.
  • Run through a case where capacity is increased, then decreased, to get a sense of the time constants involved. Test this regularly - it is done as part of regular operations, so easy to do.
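The parallelism point can be illustrated with a minimal Python sketch: a feeder that caps in-flight operations with a bounded worker pool. Here `send` is a stand-in for whatever callable performs one feed operation against the cluster, and the pool size is the knob to tune down under overload:

```python
from concurrent.futures import ThreadPoolExecutor

def feed(documents, send, max_workers=8):
    """Feed documents with bounded parallelism.

    max_workers caps the number of concurrent feed operations -
    lower it when the cluster shows signs of overload.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order and blocks until all operations finish
        return list(pool.map(send, documents))
```

Dedicated feed clients expose the same knob as connection / stream settings; reducing it trades feed throughput for serving headroom.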


Pre-configuring Vespa for overload makes load surge events easier to handle.

Rate limiting

This guide does not cover how to handle DoS attacks; however, such an attack can also be mitigated using Rate limiting. Rate limiting assumes clients are identified by an ID in requests. If the application does not use IDs, modify the RateLimitingSearcher for the use case.
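As a sketch (chain and searcher id as shipped with Vespa; verify against the current documentation), the searcher is enabled in services.xml:

```
<!-- services.xml, inside the container element -->
<search>
    <chain id="default" inherits="vespa">
        <searcher id="com.yahoo.search.searchers.RateLimitingSearcher"/>
    </chain>
</search>
```

Requests then carry the client ID and its quota as query parameters (e.g. rate.id and rate.quota), so usage can be tracked and capped per client.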

Quality degradation

Many Vespa applications are resource-intensive, using two-phase ranking to spend most resources on the best result candidates. A load surge can make queries time out - in such cases, returning results with less coverage can be a good tradeoff.

Sometimes, some results are better than no results. Soft timeout returns the result set found so far at timeout, and the balance between first-phase and second-phase ranking is both adaptive and configurable.
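Soft timeout is controlled per query or in a query profile. A sketch, with a hypothetical query term, enabling it and lowering the soft timeout factor:

```
/search/?query=shoes&timeout=500ms&ranking.softtimeout.enable=true&ranking.softtimeout.factor=0.5
```

Parameter names are from the Vespa query API; see the query API reference for the exact semantics of the factor.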

Refer to Graceful degradation for a description of how to use these features.

It is also a good idea to pre-define an inexpensive rank profile, and use query profiles to make it the default rank profile. By doing this, there is no need to change queries - just deploy a new application package with the light-weight rank profile as default.
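A minimal sketch, with assumed names: a `lightweight` rank profile defined in the schema, made default via the default query profile:

```
# in the schema, e.g. schemas/doc.sd
rank-profile lightweight {
    first-phase {
        expression: nativeRank
    }
}
```

```
<!-- search/query-profiles/default.xml -->
<query-profile id="default">
    <field name="ranking.profile">lightweight</field>
</query-profile>
```

Deploying a package with this query profile switches every query that does not set ranking.profile explicitly.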

Addressing overload

Overload manifests as increased query latency and/or timeouts. There are some hints in the open documentation. The key is to find which cluster is the bottleneck - a container or content cluster. For this, use CPU metrics.


Container clusters do not have state and are easily scaled by adding resources.

The resource change is deployed like any other change, and production rollout can be managed in the console (e.g. skip tests, deploy in parallel).
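In services.xml this is a small change; a sketch assuming Vespa Cloud-style node counts (ids and resource numbers are illustrative):

```
<!-- services.xml: stateless container cluster -->
<container id="default" version="1.0">
    <!-- search / document-api elements omitted -->
    <nodes count="6">  <!-- e.g. up from 4 -->
        <resources vcpu="8" memory="16Gb" disk="100Gb"/>
    </nodes>
</container>
```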


Content nodes have state (i.e. document data and indexes), so changes to resource allocation are more complex. Data is migrated using Vespa's elasticity features.

A key observation is that during elastic operations, load increases temporarily as data migrates between nodes.

To get a feel for the timing, it is advised to add a content node during normal operations and measure how long data migration takes to complete (use metrics as described in elastic Vespa).
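A sketch of such a capacity change, assuming a Vespa Cloud-style content cluster (ids and resource numbers are illustrative). After deployment, migration progress can be followed by watching the pending-merge metric (e.g. vds.idealstate.merge_bucket.pending, per the elasticity documentation; verify the name for your version) drop toward zero:

```
<!-- services.xml: content cluster, node count increased by one -->
<content id="music" version="1.0">
    <!-- documents element omitted -->
    <redundancy>2</redundancy>
    <nodes count="5">  <!-- e.g. up from 4 -->
        <resources vcpu="16" memory="64Gb" disk="500Gb"/>
    </nodes>
</content>
```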

The Vespa Console lets users control where to route load, which is useful if the overload is confined to a subset of zones. This way, one can keep serving while building more capacity in other zones.