Vespa Cloud: Performance use cases
Vespa is built for very large-scale real-time serving and supports "unlimited" content node (proton) size. Proton is a C++ component and has no memory limitations other than restrictions on attributes; a common use case is running it in 256 GB containers. It uses its own memory allocator, vespa-malloc.
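As a rough sketch of what this kind of node sizing looks like on Vespa Cloud (cluster, document and resource values below are hypothetical, not taken from a real application), memory-heavy content nodes are declared with the nodes/resources elements in services.xml:

```xml
<!-- services.xml fragment, hypothetical names: a content cluster on
     memory-heavy nodes to hold a large attribute/index footprint -->
<content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="mydoc" mode="index"/>
    </documents>
    <nodes count="4">
        <!-- proton has no hard memory limit; size nodes to fit the data -->
        <resources vcpu="32" memory="256Gb" disk="2000Gb"/>
    </nodes>
</content>
```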
Use cases range from applications with tens of billions of documents and a moderate query rate (example: image search) to millions of documents with query rates of 100,000/second (example: ad serving). Vespa supports performance groups for flexible replica placement, enabling this wide range of use cases, and all of them support a sustained, high throughput for updating documents.
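A minimal sketch of grouped replica placement, assuming the Vespa Cloud services.xml syntax (names and sizes are made up): each group holds a full copy of the corpus, so a query can be answered within one group and query capacity scales with the number of groups.

```xml
<!-- services.xml fragment, hypothetical names: 2 groups of 4 nodes each -->
<content id="mycluster" version="1.0">
    <min-redundancy>2</min-redundancy>
    <documents>
        <document type="mydoc" mode="index"/>
    </documents>
    <nodes count="8" groups="2">
        <resources vcpu="16" memory="64Gb" disk="500Gb"/>
    </nodes>
</content>
```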
Vespa supports a wide range of ML models by transforming them to tensor computations, and uses LLVM for high-performance ranking evaluation.
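As an illustration of tensor-based ranking (a sketch; the field, profile and dimension names are invented), a schema rank-profile expresses the model computation over query and document tensors:

```
# schema fragment with hypothetical names; assumes the document type has
# an attribute field: field embedding type tensor<float>(x[128])
rank-profile semantic inherits default {
    inputs {
        query(q_embedding) tensor<float>(x[128])
    }
    first-phase {
        # dot product of the query and document tensors,
        # evaluated as a tensor computation
        expression: sum(query(q_embedding) * attribute(embedding))
    }
}
```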
Read more in Vespa Performance.
It is hard to size an application for the highest possible load peak, and unexpected things happen. Instead of allocating resources that sit idle waiting for peaks that almost never occur, a good tradeoff is to degrade relevance quality by requiring less coverage. This keeps cost under control while still serving useful results during high peaks.
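A minimal sketch of configuring this kind of graceful degradation, assuming the coverage element in services.xml (the values are illustrative): queries may return after searching a minimum fraction of the documents instead of waiting for slow or unavailable nodes.

```xml
<!-- services.xml fragment: accept answers covering 95% of the documents,
     with bounded extra wait for the remaining nodes -->
<content id="mycluster" version="1.0">
    <search>
        <coverage>
            <minimum>0.95</minimum>
            <min-wait-after-coverage-factor>0.2</min-wait-after-coverage-factor>
            <max-wait-after-coverage-factor>0.3</max-wait-after-coverage-factor>
        </coverage>
    </search>
</content>
```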
Most engines are multi-threaded to fully utilize the compute resources. In Vespa, the on-disk data layout is fully orthogonal to the number of threads used per query, so it is easy to increase the number of threads per query without redistributing data. Balance capacity requirements, query latency and throughput by tuning num-threads-per-search.
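As a sketch (the profile name is hypothetical), the per-query thread count can be tuned per rank-profile; it cannot exceed the node-level threads-per-search setting configured for the content cluster in services.xml (tuning/requestthreads/persearch).

```
# schema fragment with a hypothetical profile name: spend more threads per
# query to reduce latency for this profile, bounded by the node-level setting
rank-profile low_latency inherits default {
    num-threads-per-search: 4
}
```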