How 3 rewrites of a RAG pipeline — from LLM summarization to Voyage Context 3 — unified 3 content sources, cut costs 450x, and improved AI draft acceptance.
Spent the last few months reworking the retrieval pipeline for a customer support platform I work on. Three complete rewrites in five days before landing somewhere both I and my client were happy with.
The platform has an AI assistant that drafts responses for every incoming customer message. When I started, agents accepted only 0.2% of those drafts without edits. After three rewrites of the retrieval pipeline, that number climbed to 10.9% — with the manual trigger acceptance rate hitting 35%.
Here’s what each version looked like, what broke, and why the final rewrite worked.
The Setup: Three Knowledge Sources
The platform’s AI has three knowledge sources:
- Uploaded documents — product manuals, pricing sheets, policies
- Scraped website pages — the client’s site, crawled and indexed
- Resolved support interactions — real conversations between agents and customers
Documents and web pages were straightforward. Chunk the text, embed with voyage-context-3, store in pgvector, hybrid search at query time. Standard stuff.
Support interactions were a different story.
Version 1: Embed Individual Turns
The first approach was simple: take each agent response, embed it, store it in a conversation_turn_embeddings table. At query time, search across both the KB articles and these individual turns.
Problems showed up fast:
- Low coverage — only 7.7% of conversations had enough messages to be useful
- Fragmented context — a single agent reply like "€120 extra" means nothing without knowing the customer asked about pricing
- Corrections lost — when agents corrected themselves or added follow-up details without a customer prompt, those never got captured
- AI pollution risk — automated and AI responses could leak into the knowledge base, creating a feedback loop
Individual turns lacked the conversation-level context that made them useful in the first place.
Version 2: Summarize Then Embed

The second version tried to solve the context problem by adding an LLM preprocessing step:
- Take resolved conversations
- Slide a 50-message window (10-message overlap) across the messages
- Send each window to Claude Haiku for summarization
- Embed the summaries with voyage-4-large
- Store in a separate conversation_chunks table
- Search with a separate function and a separate query embedding
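For illustration, the windowing step could look roughly like the sketch below. It is not the production code: the function name `slidingWindows` and its parameter names are assumptions, and only the 50-message size and 10-message overlap come from the description above.

```javascript
// Hypothetical helper: split a message list into overlapping windows.
// With size 50 and overlap 10, each new window starts 40 messages later.
function slidingWindows(messages, size = 50, overlap = 10) {
  if (overlap >= size) {
    throw new RangeError("overlap must be smaller than size");
  }
  const stride = size - overlap;
  const windows = [];
  for (let start = 0; start < messages.length; start += stride) {
    windows.push(messages.slice(start, start + size));
    // Stop once this window has reached the end of the conversation
    if (start + size >= messages.length) break;
  }
  return windows;
}

// A 120-message conversation yields three windows
const demoMessages = Array.from({ length: 120 }, (_, i) => `message ${i}`);
console.log(slidingWindows(demoMessages).map((w) => w.length)); // [ 50, 50, 40 ]
```

Each window produced this way became one summarization call, which is where the per-chunk LLM cost came from.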
This actually worked. The summaries captured context properly, and the AI started retrieving relevant past interactions. But the architecture had problems:
- Two embedding models: voyage-context-3 for documents, voyage-4-large for conversations
- Two tables: ai_content and conversation_chunks
- Two search functions, each with its own query embedding
- An LLM call per chunk: Claude Haiku ran on every 50-message window just for preprocessing
The cost to embed 2,818 resolved conversations (59,000+ messages) was roughly $85 — almost entirely from the Haiku summarization step. And every time a conversation was reopened and re-resolved, the entire window had to be re-summarized.
It was working, but it was expensive and fragile. Two parallel pipelines meant twice the surface area for bugs.
The Insight: Voyage Context 3 Already Does This

Turns out the summarization step was solving a problem that Voyage Context 3 already handles natively.
Voyage Context 3 is a contextualized embedding model. When you pass it all chunks from a document in a single API call, it embeds each chunk with awareness of the full document context. This is exactly what the Haiku summarization was trying to achieve — giving each chunk enough context to be meaningful on its own — but Voyage does it at the embedding level, without an LLM call.
From Voyage’s own benchmarks: Context 3 outperforms Anthropic’s contextual retrieval (which uses LLM-prepended context) by 6.76–20.54% on retrieval tasks. It eliminates the LLM preprocessing step entirely.
Once I understood this, the path was clear.
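To make the "all chunks from a document in a single call" idea concrete, here is a small sketch of how the request body could be assembled. The helper name `buildContextualizedRequest` is hypothetical, and the field names (`model`, `input_type`, `inputs` as a list of lists) reflect my understanding of Voyage's contextualized embeddings API, so check them against the current documentation before relying on them. The sample chunks are shortened for readability.

```javascript
// Hypothetical helper: group chunks per document so each inner array
// is embedded with awareness of its sibling chunks.
// Field names are assumptions based on Voyage's documented request shape.
function buildContextualizedRequest(chunksByDocument, inputType = "document") {
  return {
    model: "voyage-context-3",
    input_type: inputType,
    // One inner array per document, chunks kept in document order;
    // documents with no chunks are dropped
    inputs: chunksByDocument.filter((chunks) => chunks.length > 0),
  };
}

const request = buildContextualizedRequest([
  // Chunks of one conversation transcript stay together
  ["[Customer]: How much is the upgrade?", "[Agent]: €120 extra"],
  // Chunks of a pricing document
  ["Upgrade pricing overview", "Upgrades are billed per booking"],
]);
console.log(request.inputs.length); // 2
```

The contrast with a standard embedding request is the nesting: a flat list would embed "€120 extra" in isolation, while the nested form lets the model see the question that preceded it.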
Version 3: One Model, One Pipeline, One Table
The final version unified everything. All three sources go through the same pipeline:
```text
Source text (document / web page / conversation transcript)
  → Format as plain text
  → Recursive character split (2,000 chars, 200 overlap)
  → contextualizedEmbeddings() with voyage-context-3
  → Store in content_chunks table
  → Single hybrid search (Vector + BM25 + RRF + Rerank)
```
For conversations specifically, the formatting step filters to human-only messages (no AI responses, no automated messages), labels each line as [Customer] or [Agent], and includes voice transcripts and image descriptions:
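```javascript
// Build a plain-text transcript from customer and human-agent messages only.
function formatConversation(messages) {
  // Keep incoming messages and human agent replies; drop AI and automated output
  const filtered = messages.filter(
    (m) =>
      m.direction === "incoming" ||
      (m.direction === "outgoing" && m.senderType === "human"),
  );

  const lines = [];
  for (const msg of filtered) {
    const role = msg.direction === "incoming" ? "[Customer]" : "[Agent]";
    // getMessageText is defined elsewhere in the codebase (not shown here)
    const text = getMessageText(msg);
    if (!text) continue; // skip messages with no usable text
    lines.push(`${role}: ${text}`);
  }

  // One message per line, so the splitter can break between messages
  return lines.join("\n");
}
```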
The recursive character splitter tries \n\n → \n → . → space before hard-splitting, which means it naturally breaks between messages since each is on its own line. A 2,000-character chunk fits roughly 10–50 messages.
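A simplified sketch of how such a splitter can work is shown below. It is not the library implementation used in the pipeline: the function names are hypothetical, and the exact separator list (including `". "` and a single space) is an assumption. The idea is to break the text on the coarsest separator that yields pieces under the size limit, then greedily merge pieces into chunks while carrying up to 200 characters of trailing pieces forward as overlap.

```javascript
// Assumed separator order; "" means hard split by character count.
const SEPARATORS = ["\n\n", "\n", ". ", " ", ""];

// Break text into pieces no longer than chunkSize, preferring coarse separators.
function splitToPieces(text, chunkSize, separators = SEPARATORS) {
  if (text.length <= chunkSize) return text ? [text] : [];
  const [sep, ...rest] = separators;
  if (sep === "" || sep === undefined) {
    const out = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      out.push(text.slice(i, i + chunkSize));
    }
    return out;
  }
  // Keep each separator attached to the preceding part so no text is lost
  const parts = text
    .split(sep)
    .map((part, i, all) => (i < all.length - 1 ? part + sep : part));
  return parts.flatMap((part) => splitToPieces(part, chunkSize, rest));
}

// Merge pieces into chunks of at most chunkSize characters with overlap.
function recursiveSplit(text, chunkSize = 2000, overlap = 200) {
  const chunks = [];
  let current = [];
  let length = 0;
  for (const piece of splitToPieces(text, chunkSize)) {
    if (length + piece.length > chunkSize && current.length > 0) {
      chunks.push(current.join(""));
      // Drop leading pieces until the carried tail fits the overlap budget
      // and leaves room for the next piece
      while (
        current.length > 0 &&
        (length > overlap || length + piece.length > chunkSize)
      ) {
        length -= current.shift().length;
      }
    }
    current.push(piece);
    length += piece.length;
  }
  if (current.length > 0) chunks.push(current.join(""));
  return chunks;
}
```

Because a formatted transcript has one message per line and no blank lines, the `\n` separator is the first one that applies, so chunks start and end on message boundaries as long as no single message exceeds the chunk size.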
The single content_chunks table uses a polymorphic foreign key design — each row points to either a document, a connector content item, or a conversation:
```sql
CHECK (
  (CASE WHEN document_id IS NOT NULL THEN 1 ELSE 0 END) +
  (CASE WHEN connector_content_id IS NOT NULL THEN 1 ELSE 0 END) +
  (CASE WHEN conversation_id IS NOT NULL THEN 1 ELSE 0 END) = 1
)
```
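The same "exactly one parent" rule can also be checked in application code before a row is written, which gives a clearer error than a failed constraint. The sketch below is illustrative only: the helper name `sourceOf` and the camelCase property names are assumptions that mirror the three columns in the constraint.

```javascript
// Assumed property names mirroring document_id, connector_content_id, conversation_id
const SOURCE_KEYS = ["documentId", "connectorContentId", "conversationId"];

// Hypothetical helper: return which source a chunk row points to,
// or throw if zero or several source ids are set.
function sourceOf(row) {
  const present = SOURCE_KEYS.filter((key) => row[key] != null);
  if (present.length !== 1) {
    throw new Error(`expected exactly one source id, got ${present.length}`);
  }
  return present[0];
}

console.log(sourceOf({ conversationId: 42, chunkText: "[Customer]: hi" })); // conversationId
```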
Hybrid Search: Vector + BM25 + RRF + Rerank
Search uses Reciprocal Rank Fusion to combine vector similarity and full-text search results:
```sql
-- $1: query embedding (1024 dimensions), $2: raw query text
-- Vector candidates (cosine similarity via pgvector)
WITH vector_candidates AS (
  SELECT id, chunk_text,
         1 - (chunk_embedding <=> $1::vector(1024)) AS similarity
  FROM content_chunks
  WHERE chunk_embedding IS NOT NULL
),
-- Full-text candidates (the BM25 leg, shown here with Postgres ts_rank)
text_candidates AS (
  SELECT id, chunk_text,
         ts_rank(to_tsvector('simple', chunk_text),
                 plainto_tsquery('simple', $2)) AS text_score
  FROM content_chunks
  WHERE to_tsvector('simple', chunk_text)
        @@ plainto_tsquery('simple', $2)
),
-- Rank each list. These two CTEs were not in the original excerpt;
-- their shape is an assumption based on the names used below.
vector_ranked AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY similarity DESC) AS vrank
  FROM vector_candidates
),
text_ranked AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY text_score DESC) AS trank
  FROM text_candidates
),
-- RRF fusion (k=60)
combined AS (
  SELECT COALESCE(v.id, t.id) AS id,
         COALESCE(1.0/(60 + v.vrank), 0) +
         COALESCE(1.0/(60 + t.trank), 0) AS rrf_score
  FROM vector_ranked v
  FULL OUTER JOIN text_ranked t ON v.id = t.id
)
SELECT * FROM combined ORDER BY rrf_score DESC
```
The top candidates from RRF are then reranked with Voyage's rerank-2 model to get the final relevance ordering. This two-stage approach (cheap hybrid search for recall, expensive reranking for precision) works well in practice.
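To see how the RRF fusion behaves, here is the same scoring rule as a small standalone function. It is a sketch for illustration, assuming 1-based ranks and k=60; the function name `rrfFuse` is hypothetical. Each list contributes 1/(k + rank) for every id it contains, and an id missing from a list contributes nothing for that list.

```javascript
// Reciprocal Rank Fusion over two ranked lists of chunk ids (best first).
function rrfFuse(vectorIds, textIds, k = 60) {
  const scores = new Map();
  for (const ids of [vectorIds, textIds]) {
    ids.forEach((id, index) => {
      const rank = index + 1; // 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// "b" appears in both lists, so it outranks "a" despite "a" being the top vector hit
const fused = rrfFuse(["a", "b", "c"], ["b", "d"]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

Because only ranks enter the formula, the cosine similarities and text scores never need to be put on a common scale, which is the main reason RRF is a convenient way to combine the two result lists.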
The Numbers
The acceptance rate here measures drafts accepted without any edits — the strictest possible metric. When agents deliberately trigger the AI (manual mode), the acceptance rate reaches 35%.
Tradeoffs
One worth mentioning: Voyage Context 3 needs all chunks from a document together in a single API call for contextualization to work properly. So when a conversation reopens and re-resolves, I delete all existing chunks and re-embed from scratch. It’s not incremental.
At $0.002 per re-embed, this doesn’t matter in practice. But it’s a design constraint worth knowing about.
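The delete-and-replace behaviour can be sketched as a pure function over chunk rows. This is an illustration, not the actual database code: the function name `replaceConversationChunks` and the row shape are assumptions, and in a real system the same two steps would run inside one transaction.

```javascript
// Hypothetical helper: swap out every chunk of one conversation at once.
// `embeddings` must come from a single contextualized call over `chunks`.
function replaceConversationChunks(rows, conversationId, chunks, embeddings) {
  if (chunks.length !== embeddings.length) {
    throw new Error("expected one embedding per chunk");
  }
  // Drop all existing chunks for this conversation, keep everything else
  const kept = rows.filter((row) => row.conversationId !== conversationId);
  // Insert the freshly embedded chunks in order
  const fresh = chunks.map((chunkText, chunkIndex) => ({
    conversationId,
    chunkIndex,
    chunkText,
    embedding: embeddings[chunkIndex],
  }));
  return [...kept, ...fresh];
}
```

Updating a single chunk in place is not an option here, because each stored embedding depends on the other chunks that were sent alongside it.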
What I Took Away
The RAG landscape has moved fast — agentic RAG, vectorless approaches, GraphRAG for multi-hop reasoning. Each solves different problems. But the principle that guided these three rewrites holds:
Before adding complexity, check if a simpler architecture can do the same job.
In my case, one embedding model replaced an LLM plus a separate pipeline. The retrieval quality improved. The cost dropped 450x. And the client’s team finally started accepting the AI’s suggestions.
Sometimes the right rewrite isn’t adding more — it’s finding the tool that was designed for exactly your problem.