Vector Search Without the Cloud: memista's SQLite-Backed ANN
Building approximate nearest-neighbour search on SQLite in pure Rust — and why you might not need a dedicated vector database.
The vector database market is projected to reach several billion dollars in the next few years. Pinecone, Weaviate, Qdrant, Milvus, Chroma — the landscape is crowded with products promising to store, index, and query high-dimensional vector embeddings at scale. Every RAG tutorial tells you to pick a vector database. Every AI infrastructure diagram includes one.
We built memista because we suspected that most people do not need one.
memista is a SQLite-backed approximate nearest-neighbour (ANN) search library written in pure Rust. It stores vectors in a SQLite database, indexes them using an IVF-style partitioning scheme, and retrieves approximate nearest neighbours with sub-millisecond latency for collections up to a few million vectors. It has no server process, no network protocol, no cluster management, and no cloud account.
This article explains why we built it, how it works, and when you should (and should not) use it.
The Vector Database Hype Cycle
To be clear: dedicated vector databases solve real problems. If you have billions of vectors, need sub-10ms query latency at high throughput, require horizontal scaling across machines, or need real-time index updates under concurrent writes, a dedicated vector database is the right tool.
But most AI applications do not have these requirements. Consider the typical use case: a RAG (Retrieval-Augmented Generation) system for a company’s internal documentation. The corpus might be 50,000 documents, chunked into 500,000 passages, each embedded into a 768-dimensional vector. The query rate is maybe 10 queries per second. Updates happen when someone adds a document, which is a few times per day.
For this workload, a cloud-hosted vector database introduces:
- Network latency: Every query crosses a network boundary. Even within the same cloud region, this adds 1-5ms per query.
- Operational complexity: Another service to deploy, monitor, scale, and pay for. Another set of credentials to manage.
- Data residency concerns: Your document embeddings — which encode semantic information about your proprietary data — now live on someone else’s servers.
- Cost: Vector database pricing is typically based on vector count and query volume. For a 500,000-vector collection, costs can range from $50 to $500 per month depending on the provider.
An embedded SQLite database adds none of these. It is a file on disk. It is queried in-process, with no network hop. It is backed up by copying a file. It costs nothing to run.
How memista Works
memista implements approximate nearest-neighbour search using a technique called IVF (Inverted File Index). The core idea is simple: divide the vector space into regions (Voronoi cells), assign each vector to its nearest region centroid, and at query time, search only the vectors in the regions closest to the query vector.
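The partitioning idea can be sketched in a few lines of plain Rust. This is not memista's API — the function names here are illustrative — but it shows the core operation: every vector belongs to the Voronoi cell of its nearest centroid.

```rust
// Squared L2 distance between two vectors of equal length.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Index of the centroid nearest to `v`, i.e. the Voronoi cell it falls into.
fn nearest_centroid(v: &[f32], centroids: &[Vec<f32>]) -> usize {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| dist2(v, a).partial_cmp(&dist2(v, b)).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![10.0, 10.0]];
    // A vector near the origin lands in cell 0; one near (10, 10) in cell 1.
    assert_eq!(nearest_centroid(&[1.0, -0.5], &centroids), 0);
    assert_eq!(nearest_centroid(&[9.0, 11.0], &centroids), 1);
    println!("ok");
}
```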
Index Construction
When you build a memista index, the following happens:
- Centroid computation: memista runs k-means clustering on a sample of the vectors to identify k centroid vectors. The number of centroids is configurable and controls the granularity of the partitioning. A typical value for 500,000 vectors is k=1024.
- Vector assignment: Each vector is assigned to its nearest centroid. The vectors are stored in SQLite, grouped by centroid assignment. Each centroid's vectors are stored as a contiguous blob in a single SQLite row, which enables efficient sequential reading.
- Centroid index: The centroids themselves are stored in memory as a flat array. With k=1024 and 768 dimensions, this is approximately 3 MB — trivial for any modern system.
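The contiguous-blob layout from the assignment step can be sketched as follows. The exact on-disk encoding is an assumption here (little-endian f32 components, fixed dimensionality); memista's real format may differ, but the idea — one cell's vectors packed into one row's blob — is the same.

```rust
// Pack one centroid's vectors into a single contiguous blob, as they
// would be stored in one SQLite row (assumed layout: little-endian f32s).
fn pack(vectors: &[Vec<f32>]) -> Vec<u8> {
    let mut blob = Vec::new();
    for v in vectors {
        for x in v {
            blob.extend_from_slice(&x.to_le_bytes());
        }
    }
    blob
}

// Recover the vectors from the blob, given the (fixed) dimensionality.
fn unpack(blob: &[u8], dim: usize) -> Vec<Vec<f32>> {
    blob.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect::<Vec<f32>>()
        .chunks(dim)
        .map(|c| c.to_vec())
        .collect()
}

fn main() {
    let vectors = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let blob = pack(&vectors);
    assert_eq!(blob.len(), 2 * 3 * 4); // 2 vectors × 3 dims × 4 bytes each
    assert_eq!(unpack(&blob, 3), vectors);
    println!("ok");
}
```

Because the blob is contiguous, scanning a cell's candidates is one sequential read rather than one row lookup per vector.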
Query Processing
When you query memista for the k nearest neighbours of a query vector:
- Probe selection: The query vector is compared against all centroids to find the nprobe closest centroids. With 1024 centroids, this takes approximately 50 microseconds.
- Candidate retrieval: The vectors assigned to the nprobe closest centroids are loaded from SQLite and compared against the query vector. The nprobe parameter controls the accuracy-speed trade-off: higher nprobe means more candidates and better recall, at the cost of more comparisons.
- Distance computation: memista computes exact distances (cosine similarity or L2 distance) between the query vector and all candidates. This is implemented using SIMD-accelerated inner product computation in Rust, using the std::simd nightly feature where available and a portable fallback otherwise.
- Top-k selection: The top k results are returned, sorted by distance.
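The four steps above fit in a short, self-contained sketch. Again, this is illustrative Rust rather than memista's API, and it uses a naive sort where the real implementation would use SIMD distances and a heap — but the probe-then-scan structure is the one described.

```rust
// Squared L2 distance.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Step 1: indices of the `nprobe` centroids closest to the query.
fn select_probes(query: &[f32], centroids: &[Vec<f32>], nprobe: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&i, &j| {
        dist2(query, &centroids[i])
            .partial_cmp(&dist2(query, &centroids[j]))
            .unwrap()
    });
    order.truncate(nprobe);
    order
}

// Steps 2-4: scan the probed cells, compute exact distances, return top k ids.
// `cells[c]` holds the (id, vector) pairs assigned to centroid c.
fn search(
    query: &[f32],
    centroids: &[Vec<f32>],
    cells: &[Vec<(u64, Vec<f32>)>],
    nprobe: usize,
    k: usize,
) -> Vec<u64> {
    let mut hits: Vec<(f32, u64)> = Vec::new();
    for c in select_probes(query, centroids, nprobe) {
        for (id, v) in &cells[c] {
            hits.push((dist2(query, v), *id));
        }
    }
    hits.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    hits.into_iter().take(k).map(|(_, id)| id).collect()
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![10.0, 10.0]];
    let cells = vec![
        vec![(1, vec![0.5, 0.5]), (2, vec![-1.0, 0.0])],
        vec![(3, vec![9.0, 9.0])],
    ];
    // With nprobe = 1, only the cell nearest the query is scanned.
    assert_eq!(search(&[0.2, 0.2], &centroids, &cells, 1, 1), vec![1]);
    println!("ok");
}
```

The recall loss of IVF comes from exactly this structure: a true neighbour sitting just across a cell boundary is missed unless nprobe is large enough to include its cell.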
SQLite as Storage Engine
Using SQLite as the storage engine is an unusual choice for a vector search system. Most ANN libraries (FAISS, Annoy, ScaNN) use custom binary file formats optimised for memory-mapped access. SQLite adds overhead: there is a B-tree lookup to find a centroid’s vectors, SQLite’s page cache adds a layer of indirection, and the data is stored in SQLite’s row format rather than a flat binary layout.
We chose SQLite anyway for several reasons:
Atomicity and durability: SQLite provides ACID transactions. If the system crashes during an index update, the database is not corrupted. Custom binary formats require careful crash recovery logic that is easy to get wrong.
Tooling: Every programming language has a SQLite binding. You can inspect a memista database with the sqlite3 command-line tool. You can copy it, back it up, or ship it to another machine by copying a single file.
Metadata co-location: Real applications store metadata alongside vectors — document titles, URLs, timestamps, access control lists. With SQLite, this metadata lives in the same database as the vectors, queryable with SQL. With a custom binary format, you need a separate metadata store and the complexity of keeping them in sync.
Good enough performance: For collections under a few million vectors, SQLite’s overhead is not the bottleneck. The bottleneck is distance computation, which memista handles with SIMD-accelerated code regardless of the storage backend.
How embedcache Eliminates Redundant Computation
memista handles storage and retrieval. But there is a cost upstream: computing the embeddings in the first place. Embedding models are expensive to run — a single embedding call to a cloud API costs fractions of a cent, but at scale, these costs are significant. More importantly, many applications repeatedly embed the same text. A document that appears in multiple queries, or a chunk that is re-embedded after a minor update to a neighbouring chunk, produces redundant computation.
embedcache is our solution. It is a content-addressed cache for embedding vectors, written in Rust. When you request an embedding for a piece of text, embedcache first hashes the text and checks whether the embedding is already cached. If so, it returns the cached embedding with no API call. If not, it calls the embedding model, stores the result, and returns it.
The content-addressing is semantically aware. embedcache does not just hash the raw text — it normalises whitespace, canonicalises unicode, and strips formatting that does not affect the embedding. This means that "Hello, world!" and "Hello,  world!" (with a doubled space after the comma) produce the same cache key, because they will produce the same (or near-identical) embedding.
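A minimal sketch of the idea, assuming only whitespace normalisation (embedcache's real normalisation also canonicalises unicode and strips formatting, and its hash and API are not the ones shown here):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Collapse whitespace runs so trivially different texts share a cache key.
fn normalise(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn cache_key(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    normalise(text).hash(&mut h);
    h.finish()
}

// Return the cached embedding, or compute and store it on a miss.
fn embed_cached(
    cache: &mut HashMap<u64, Vec<f32>>,
    text: &str,
    model: impl Fn(&str) -> Vec<f32>,
    calls: &mut u32,
) -> Vec<f32> {
    let key = cache_key(text);
    cache
        .entry(key)
        .or_insert_with(|| {
            *calls += 1; // only incremented on a cache miss
            model(text)
        })
        .clone()
}

fn main() {
    let mut cache = HashMap::new();
    let mut calls = 0u32;
    let model = |t: &str| vec![t.len() as f32]; // stand-in embedding model
    embed_cached(&mut cache, "Hello, world!", model, &mut calls);
    embed_cached(&mut cache, "Hello,  world!", model, &mut calls); // extra space
    assert_eq!(calls, 1); // the second request hit the cache
    println!("ok");
}
```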
In our testing with polymathy’s document processing pipeline, embedcache reduces the number of embedding API calls by 40-60% during incremental index updates, where most documents have not changed since the last indexing run.
How polymathy Builds on Top
polymathy is our semantic search pipeline. It transforms keyword search into AI-powered question answering by combining document chunking, embedding, retrieval, and generation. Under the hood, it uses memista for vector storage and retrieval, and embedcache for efficient embedding computation.
The architecture is a pipeline:
- Ingestion: Documents are chunked into passages using a configurable chunking strategy (fixed-size, sentence-boundary, or semantic).
- Embedding: Each chunk is embedded through embedcache, which deduplicates redundant computation.
- Indexing: Embeddings are stored in memista’s SQLite-backed ANN index.
- Retrieval: At query time, the query is embedded and the nearest chunks are retrieved from memista.
- Generation: The retrieved chunks are passed to an LLM as context for answer generation.
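The shape of the first four stages can be shown end to end in a toy sketch. Everything here is a stand-in: the chunker is fixed-size, the "embedding model" is a letter histogram, and retrieval is a linear scan rather than memista's ANN index — but the data flow (chunk, embed, index, retrieve) is the pipeline described above.

```rust
// Split a document into fixed-size byte chunks.
fn chunk(doc: &str, size: usize) -> Vec<String> {
    doc.as_bytes()
        .chunks(size)
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

// 26-bin letter histogram as a crude stand-in for a real embedding model.
fn embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 26];
    for b in text.bytes() {
        if b.is_ascii_alphabetic() {
            v[(b.to_ascii_lowercase() - b'a') as usize] += 1.0;
        }
    }
    v
}

// Squared L2 distance.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Return the chunk whose embedding is nearest the query's (linear scan
// standing in for the ANN index).
fn retrieve<'a>(query: &str, index: &'a [(String, Vec<f32>)]) -> &'a str {
    let q = embed(query);
    &index
        .iter()
        .min_by(|a, b| dist2(&q, &a.1).partial_cmp(&dist2(&q, &b.1)).unwrap())
        .unwrap()
        .0
}

fn main() {
    let doc = "rust is fast. sqlite is simple.";
    // Ingestion + embedding + indexing.
    let index: Vec<(String, Vec<f32>)> = chunk(doc, 16)
        .into_iter()
        .map(|c| {
            let e = embed(&c);
            (c, e)
        })
        .collect();
    // Retrieval: the chunk about sqlite being simple should win.
    let hit = retrieve("sqlite simple", &index);
    assert!(hit.contains("simple"));
    println!("ok");
}
```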
All of this runs in a single process, with no external services. The entire stack — chunking, embedding cache, vector index, and retrieval — is compiled into a single Rust binary. This makes deployment trivial and eliminates the operational complexity of managing multiple services.
Benchmarks and Trade-offs
We benchmark memista against FAISS (the most widely used ANN library) on standard benchmark datasets:
SIFT-1M (1 million 128-dimensional vectors): memista achieves 95% recall@10 at approximately 2ms per query. FAISS IVF-Flat achieves the same recall at approximately 0.8ms per query. memista is slower, but both are well under the latency threshold where users notice.
GloVe-1.2M (1.2 million 200-dimensional vectors): memista achieves 92% recall@10 at approximately 3.5ms per query. FAISS achieves approximately 1.2ms.
The pattern is consistent: memista is 2-3x slower than FAISS for raw query latency. This is the cost of using SQLite instead of memory-mapped binary files. For applications where query latency under 10ms is acceptable, this is a non-issue. For applications that need sub-millisecond latency at millions of queries per second, memista is the wrong tool.
Index build time: memista’s index construction is approximately 1.5x slower than FAISS due to the overhead of SQLite writes. For a 1M vector index, this is the difference between 45 seconds and 30 seconds — not usually a concern, since indexing is a batch operation.
Memory usage: memista uses significantly less memory than FAISS in its default configuration, because the vectors are stored on disk in SQLite and loaded on demand. Only the centroids are kept in memory. FAISS IVF-Flat loads all vectors into memory by default.
Disk usage: A memista index for 1M 768-dimensional vectors occupies approximately 3.2 GB on disk, compared to approximately 3.0 GB for the raw vectors. The SQLite overhead is approximately 7%.
When to Use memista (and When Not To)
Use memista when:
- Your collection is under 5 million vectors
- Query latency under 10ms is acceptable
- You want zero operational overhead
- You need ACID guarantees on index updates
- You want metadata co-located with vectors
- Data residency matters
Do not use memista when:
- You have hundreds of millions or billions of vectors
- You need sub-millisecond query latency
- You need horizontal scaling across machines
- You need real-time concurrent writes from multiple processes
We built memista not to compete with Pinecone or Weaviate, but to provide an alternative for the many use cases where a dedicated vector database is overkill. The best infrastructure is often the infrastructure you do not have to manage.