Vector RAG in Production: Why Cosine Similarity Alone Is Costing You Quality
Most vector RAG implementations follow a recipe that fits in a tweet: chunk your documents, embed them with a text-embedding-3 model, store the vectors in a vector DB, query with cosine similarity, and put the top-5 results into the prompt. It works for a demo. It underperforms in production for reasons that are well understood and fixable, but most teams never fix them because the demo 'works' on happy-path queries and nobody measures retrieval quality systematically.
In a multi-tenant AI reporting engine processing thousands of RAG calls per hour, I learned exactly where cosine-similarity-only retrieval breaks and what you need to add to make vector RAG production-grade. Here's the playbook.
Why cosine similarity alone fails
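- Embedding drift: OpenAI releases a new embedding model and you switch your query embedding to it. Vectors produced by different models are not comparable, so unless you re-embed the whole corpus, your old embeddings are misaligned with new queries and cosine similarity silently degrades.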
- Vocabulary mismatch: A user searches 'payment failure rate' and your docs say 'transaction error percentage.' Cosine similarity might catch the semantic overlap, or it might not — it depends on the embedding model's training data.
- The curse of the average: Embedding a long document produces a vector that averages all its topics. A highly specific query matches the 'average' poorly even if the document contains exactly the right paragraph.
- No exact match support: Cosine similarity cannot prioritize a document that contains the exact phrase the user typed. For technical queries (error codes, API names, function signatures), exact match often beats semantic.
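The 'curse of the average' can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a long document's embedding behaves like the mean of its chunk embeddings; real embedding models do not literally average, and the three-dimensional vectors and topic names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean, a toy stand-in for embedding a long document."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 3-dimensional "embeddings", one axis per topic.
chunk_vectors = {
    "refund_policy": [1.0, 0.0, 0.0],
    "api_rate_limits": [0.0, 1.0, 0.0],
    "error_codes": [0.0, 0.0, 1.0],
}
query_vector = [0.0, 0.0, 1.0]  # a query purely about error codes

whole_doc_vector = mean_vector(list(chunk_vectors.values()))
whole_doc_score = cosine(query_vector, whole_doc_vector)
best_chunk_score = max(cosine(query_vector, v) for v in chunk_vectors.values())

print(f"whole document: {whole_doc_score:.3f}")  # about 0.577
print(f"best chunk:     {best_chunk_score:.3f}")  # 1.000
```

Under this assumption, the document that contains a perfect paragraph scores well below that paragraph embedded on its own.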
The fix: hybrid retrieval (lexical + semantic)
Hybrid retrieval runs two independent scoring methods and merges their results. The lexical path uses BM25, a term-frequency-based scoring algorithm that excels at exact matches and rare terms. The semantic path uses cosine similarity on embeddings, which is good at conceptual matches and paraphrases. The two result lists are combined with reciprocal rank fusion (RRF), which scores each document by its rank in each list rather than by its raw score, so you never have to normalize BM25 scores against cosine similarities. In my experience, hybrid retrieval improves recall@10 by 15-25% over pure semantic retrieval on technical content, and it costs almost nothing to add because BM25 is cheap to compute.
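A minimal sketch of the two pieces, assuming whitespace tokenization and an in-memory corpus. The document IDs, the sample texts, and the hard-coded `vector_ranking` (standing in for whatever your vector store returns) are hypothetical; `k=60` is a commonly used RRF constant, and `k1`/`b` are typical BM25 defaults, not tuned values.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokens; real systems use a proper analyzer.
    return text.lower().split()

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return IDs of docs matching the query, ordered by Okapi BM25 score."""
    if not docs:
        return []
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return [d for d in ranked if scores[d] > 0]  # keep only lexical matches

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: each list contributes 1 / (k + rank) per document."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))

docs = {
    "error_codes": "error code E4012 means the card issuer declined the transaction",
    "error_trends": "transaction error percentage rose after the gateway deploy",
    "retries": "how to configure retries for the payment gateway",
}
lexical_ranking = bm25_rank("E4012", docs)
# Hypothetical output of the semantic path, which ranked the exact match last.
vector_ranking = ["error_trends", "retries", "error_codes"]

fused_ranking = reciprocal_rank_fusion([lexical_ranking, vector_ranking])
print(fused_ranking)
```

In this toy run the document containing the literal error code is the only lexical hit, and fusion lifts it from last place in the semantic list to first place overall.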
Reranking: the quality multiplier you're skipping
Here's a pattern I see repeatedly: teams retrieve top-20 chunks with vector search, stuff all 20 into the prompt, and wonder why the answer is inconsistent. The problem is that vector similarity is a coarse filter, not a precision instrument. The 4th-ranked chunk might be more relevant than the 2nd-ranked one, but the embedding model can't tell the difference at that granularity. A reranker can. A cross-encoder reranker (like Cohere's rerank or a fine-tuned BERT variant) takes each query-chunk pair and scores them with full cross-attention — the model sees both the query and the chunk simultaneously, which produces dramatically better relevance judgments. Retrieve 20-30 chunks with fast vector search, rerank to the top 5-8 with a cross-encoder, and your prompt quality improves measurably.
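The retrieve-then-rerank step can be sketched as a function that takes any pair scorer. In a real pipeline `score_pairs` would wrap a cross-encoder (for instance `CrossEncoder.predict` from sentence-transformers, or a hosted rerank API); the `toy_overlap_scorer` below is only a hypothetical stand-in so the example runs without a model, and the sample chunks are made up.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer receives (query, chunk) pairs and returns one relevance score per pair.
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]

def rerank(query: str, candidates: List[str], score_pairs: ScoreFn, top_n: int = 5) -> List[str]:
    """Score every (query, chunk) pair jointly and keep the top_n chunks."""
    if not candidates:
        return []
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]

def toy_overlap_scorer(pairs):
    """Stand-in scorer, NOT a cross-encoder: fraction of query tokens in the chunk."""
    scores = []
    for query, chunk in pairs:
        query_tokens = set(query.lower().split())
        chunk_tokens = set(chunk.lower().split())
        overlap = len(query_tokens & chunk_tokens)
        scores.append(overlap / len(query_tokens) if query_tokens else 0.0)
    return scores

# Hypothetical candidates, in the order a vector search might have returned them.
retrieved = [
    "gateway latency dashboard overview",
    "payment failure rate by region last week",
    "how payment retries work",
]
top_chunks = rerank("payment failure rate", retrieved, toy_overlap_scorer, top_n=2)
print(top_chunks)
```

Keeping the scorer pluggable means the wide retrieval stage and the narrow reranking stage can be tuned and evaluated separately.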
Query rewriting: fix the question before searching
Users don't write queries optimized for vector search — they write questions optimized for humans. 'Why did the payment thing break again?' is a reasonable Slack message and a terrible vector search query. Query rewriting uses a small, fast LLM to expand the user query into one or more search-optimized formulations before retrieval. 'Payment service outage root cause analysis transaction failure last 48 hours' is what the vector search actually needs. One LLM call (cheap, using GPT-4o-mini or similar) before retrieval can improve recall by 20-30%. This is the single highest-ROI optimization for vector RAG and the one most teams skip.
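A rough sketch of the rewriting step, assuming the LLM is exposed as a plain `complete(prompt) -> str` callable. The prompt wording, the `rewrite_query` helper, and the `fake_llm` stub are all hypothetical; swap in your own client call for the small model you use.

```python
import re
from typing import Callable, List

# Hypothetical prompt; tune the wording for your own model and domain.
REWRITE_PROMPT = (
    "Rewrite the user question as {n} short, keyword-rich search queries, "
    "one per line, with no numbering.\n\nQuestion: {question}"
)

def rewrite_query(question: str, complete: Callable[[str], str], n: int = 3) -> List[str]:
    """Return the original question plus up to n search-oriented rewrites."""
    raw = complete(REWRITE_PROMPT.format(n=n, question=question))
    rewrites: List[str] = []
    seen = set()
    for line in raw.splitlines():
        # Strip list markers such as "- ", "* ", "1. " or "2) " that models often add.
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            rewrites.append(cleaned)
    # Keep the original too, so a poor rewrite never replaces the user's wording.
    return [question] + rewrites[:n]

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call, so the example runs offline."""
    return (
        "1. payment service outage root cause\n"
        "- transaction failure last 48 hours\n"
        "\n"
        "payment service outage root cause"
    )

search_queries = rewrite_query("Why did the payment thing break again?", fake_llm)
for q in search_queries:
    print(q)
```

Each returned query can then be sent through retrieval separately and the result lists merged, for example with rank fusion.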
Chunking: the foundation nobody talks about
Chunk size is the most consequential decision in a RAG pipeline, and most teams set it once and forget about it. Too small, and chunks lack context — the LLM sees fragments without the surrounding argument. Too large, and the embedding dilutes the signal — the vector represents too many concepts to match any query precisely. The right answer is content-aware chunking: split on semantic boundaries (sections, paragraphs, code blocks), aim for 256-512 tokens per chunk with 10-20% overlap, and test different strategies on a labeled evaluation set. There is no universal best chunk size — it depends on your content and your queries. The only universal rule is: measure it or get it wrong.
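One way to sketch content-aware chunking is to split on paragraph boundaries and pack whole paragraphs into chunks, carrying trailing paragraphs forward as overlap. The version below assumes blank lines mark paragraphs and approximates tokens by whitespace-separated words; a real pipeline would count tokens with the embedding model's tokenizer. The function names and sample text are hypothetical.

```python
from typing import List

def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())

def chunk_paragraphs(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_tokens, with paragraph overlap.

    A single paragraph longer than max_tokens is kept intact as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        if current and count_tokens(" ".join(current + [para])) > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward while they fit the overlap budget.
            carried: List[str] = []
            budget = overlap_tokens
            for prev in reversed(current):
                if count_tokens(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= count_tokens(prev)
            # Drop the overlap if it would push the next chunk over the limit.
            if count_tokens(" ".join(carried + [para])) > max_tokens:
                carried = []
            current = carried
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "a1 a2 a3\n\nb1 b2 b3\n\nc1 c2 c3\n\nd1 d2 d3"
sample_chunks = chunk_paragraphs(sample, max_tokens=6, overlap_tokens=3)
for chunk in sample_chunks:
    print(repr(chunk))
```

Because `max_tokens` and `overlap_tokens` are plain parameters, it is straightforward to sweep them against a labeled evaluation set instead of picking a size once.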
Vector RAG isn't one technique — it's a pipeline. Cosine similarity is the first stage. Production quality comes from everything you add after it: hybrid retrieval, reranking, query rewriting, and content-aware chunking.
The takeaway
If your vector RAG pipeline is chunk → embed → cosine similarity → top-k → prompt, you have four high-ROI optimizations sitting on the table: add BM25 hybrid retrieval, add a cross-encoder reranker, add query rewriting before the search, and revisit your chunking strategy. Any one of these improves quality; all four together transform the pipeline. Designing and optimizing retrieval pipelines, from chunking strategy to hybrid search to evaluation frameworks, is a core part of the AI infrastructure work I take on. If your RAG answers aren't grounded enough, let's fix the retrieval before you blame the model.