How 3 rewrites of a RAG pipeline — from LLM summarization to Voyage Context 3 — unified 3 content sources, cut costs 450x, and improved AI draft acceptance.

Spent the last few months reworking the retrieval pipeline for a customer support platform I work on. Three complete rewrites in five days before landing somewhere both I and my client were happy with.

The platform has an AI assistant that drafts responses for every incoming customer message. When I started, agents accepted only 0.2% of those drafts without edits. After three rewrites of the retrieval pipeline, that number climbed to 10.9% — with the manual trigger acceptance rate hitting 35%.

Here’s what each version looked like, what broke, and why the final rewrite worked.

The Setup: Three Knowledge Sources

The platform’s AI has three knowledge sources:

  1. Uploaded documents — product manuals, pricing sheets, policies
  2. Scraped website pages — the client’s site, crawled and indexed
  3. Resolved support interactions — real conversations between agents and customers

Documents and web pages were straightforward. Chunk the text, embed with voyage-context-3, store in pgvector, hybrid search at query time. Standard stuff.

Support interactions were a different story.

Version 1: Embed Individual Turns

The first approach was simple: take each agent response, embed it, store it in a conversation_turn_embeddings table. At query time, search across both the KB articles and these individual turns.

Problems showed up fast:

  • Low coverage — only 7.7% of conversations had enough messages to be useful
  • Fragmented context — a single agent reply like "€120 extra" means nothing without knowing the customer asked about pricing
  • Corrections lost — when agents corrected themselves or added follow-up details without a customer prompt, those never got captured
  • AI pollution risk — automated and AI responses could leak into the knowledge base, creating a feedback loop

Individual turns lacked the conversation-level context that made them useful in the first place.

Version 2: Summarize Then Embed

The second version tried to solve the context problem by adding an LLM preprocessing step:

  1. Take resolved conversations
  2. Slide a 50-message window (10-message overlap) across the messages
  3. Send each window to Claude Haiku for summarization
  4. Embed the summaries with voyage-4-large
  5. Store in a separate conversation_chunks table
  6. Search with a separate function and a separate query embedding
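The windowing in step 2 can be sketched as follows. This is a minimal illustration, not the platform's actual code: the function name slidingWindows is hypothetical, and only the window size (50) and overlap (10) come from the description above.

```typescript
// Split a list of messages into overlapping windows.
// size = 50 and overlap = 10 match the numbers in the text; the function name is hypothetical.
function slidingWindows<T>(items: T[], size = 50, overlap = 10): T[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // with the defaults, each window starts 40 items after the previous one
  const windows: T[][] = [];
  for (let start = 0; start < items.length; start += step) {
    windows.push(items.slice(start, start + size));
    if (start + size >= items.length) break; // this window already reached the end
  }
  return windows;
}
```

With these defaults, a 120-message conversation yields three windows starting at messages 0, 40, and 80, so each window costs one summarization call.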

This actually worked. The summaries captured context properly, and the AI started retrieving relevant past interactions. But the architecture had problems:

  • Two embedding models — voyage-context-3 for documents, voyage-4-large for conversations
  • Two tables: ai_content and conversation_chunks
  • Two search functions — each with their own query embedding
  • An LLM call per chunk — Claude Haiku ran on every 50-message window just for preprocessing

The cost to embed 2,818 resolved conversations (59,000+ messages) was roughly $85 — almost entirely from the Haiku summarization step. And every time a conversation was reopened and re-resolved, the entire window had to be re-summarized.

It was working, but it was expensive and fragile. Two parallel pipelines meant twice the surface area for bugs.

The Insight: Voyage Context 3 Already Does This

Turns out the summarization step was solving a problem that Voyage Context 3 already handles natively.

Voyage Context 3 is a contextualized embedding model. When you pass it all chunks from a document in a single API call, it embeds each chunk with awareness of the full document context. This is exactly what the Haiku summarization was trying to achieve — giving each chunk enough context to be meaningful on its own — but Voyage does it at the embedding level, without an LLM call.
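To make the "all chunks in a single call" idea concrete, here is a rough sketch of the request and response handling. The field names (inputs as an array of chunk arrays, input_type, nested data with index and embedding) reflect my understanding of Voyage's contextualized embeddings endpoint and should be treated as assumptions to check against the API reference. The helper names buildDocumentRequest and extractEmbeddings are hypothetical.

```typescript
// Request body: one inner array per document, holding ALL of that document's chunks in order.
// Field names are assumptions about the Voyage API; verify before relying on them.
interface ContextualizedRequest {
  model: string;
  input_type: "document" | "query";
  inputs: string[][];
}

// Assumed response shape: data[i].data[j].embedding is the vector for chunk j of document i.
interface ContextualizedResponse {
  data: { index: number; data: { index: number; embedding: number[] }[] }[];
}

function buildDocumentRequest(chunksByDocument: string[][]): ContextualizedRequest {
  for (const chunks of chunksByDocument) {
    if (chunks.length === 0) throw new Error("each document needs at least one chunk");
  }
  return { model: "voyage-context-3", input_type: "document", inputs: chunksByDocument };
}

function extractEmbeddings(response: ContextualizedResponse): number[][][] {
  // Sort by index so the output order matches the input order.
  return response.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((doc) =>
      doc.data
        .slice()
        .sort((a, b) => a.index - b.index)
        .map((chunk) => chunk.embedding),
    );
}
```

The important design point is the nesting: chunks from the same document travel together in one inner array, which is what lets the model embed each chunk with knowledge of its neighbours.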

From Voyage’s own benchmarks: Context 3 outperforms Anthropic’s contextual retrieval (which uses LLM-prepended context) by 6.76–20.54% on retrieval tasks. It eliminates the LLM preprocessing step entirely.

Once I understood this, the path was clear.

Version 3: One Model, One Pipeline, One Table

The final version unified everything. All three sources go through the same pipeline:

```text
Source text (document / web page / conversation transcript)
  → Format as plain text
  → Recursive character split (2,000 chars, 200 overlap)
  → contextualizedEmbeddings() with voyage-context-3
  → Store in content_chunks table
  → Single hybrid search (Vector + BM25 + RRF + Rerank)
```

For conversations specifically, the formatting step filters to human-only messages (no AI responses, no automated messages), labels each line as [Customer] or [Agent], and includes voice transcripts and image descriptions:

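```typescript
// Minimal message shape assumed for this snippet; the real type has more fields.
interface Message {
  direction: "incoming" | "outgoing";
  senderType?: string;
}

// Defined elsewhere; assumed to return the message body, voice transcript,
// or image description, or null when there is no usable text.
declare function getMessageText(msg: Message): string | null;

function formatConversation(messages: Message[]): string {
  // Keep customer messages and human-written agent replies only.
  const filtered = messages.filter(
    (m) => m.direction === "incoming" ||
           (m.direction === "outgoing" && m.senderType === "human"),
  );

  const lines: string[] = [];
  for (const msg of filtered) {
    const role = msg.direction === "incoming" ? "[Customer]" : "[Agent]";
    const text = getMessageText(msg);
    if (!text) continue; // skip messages with no usable text
    lines.push(`${role}: ${text}`);
  }

  return lines.join("\n");
}
```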

The recursive character splitter tries the separators \n\n, then \n, then "." before hard-splitting, so it naturally breaks between messages, since each message is on its own line. A 2,000-character chunk fits roughly 10–50 messages.
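For readers who haven't used one, a recursive character splitter can be sketched in a few lines. This is a simplified illustration rather than the library implementation used in the pipeline: the separator list is an assumption, the function name splitRecursive is hypothetical, and the 200-character overlap is left out to keep the logic readable.

```typescript
// Separators tried from coarsest to finest (assumed list for this sketch).
const SEPARATORS: string[] = ["\n\n", "\n", ". ", " "];

function splitRecursive(text: string, chunkSize: number, seps: string[] = SEPARATORS): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (seps.length === 0) {
    // No separators left: hard-split at fixed offsets.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) out.push(text.slice(i, i + chunkSize));
    return out;
  }
  const [sep, ...rest] = seps as [string, ...string[]];
  const chunks: string[] = [];
  let current = "";
  for (const piece of text.split(sep)) {
    // Greedily pack pieces into the current chunk while it still fits.
    const candidate = current === "" ? piece : current + sep + piece;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }
    if (current !== "") chunks.push(current);
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // A single piece is still too large: retry with the finer separators.
      chunks.push(...splitRecursive(piece, chunkSize, rest));
      current = "";
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Because a formatted transcript has one message per line and no blank lines, the \n\n pass finds nothing to split on and the \n pass does the real work, so chunk boundaries fall between messages rather than inside them.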

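The single content_chunks table uses a polymorphic foreign key design: each row has three nullable foreign keys, and a CHECK constraint ensures exactly one of them is set, so every chunk points to either a document, a connector content item, or a conversation: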

```sql
CHECK (
  (CASE WHEN document_id IS NOT NULL THEN 1 ELSE 0 END) +
  (CASE WHEN connector_content_id IS NOT NULL THEN 1 ELSE 0 END) +
  (CASE WHEN conversation_id IS NOT NULL THEN 1 ELSE 0 END) = 1
)
```
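The same rule can be mirrored in application code so that a bad row fails fast before it reaches the database. This is a small sketch: the ChunkOwner interface and hasExactlyOneOwner helper are hypothetical, and the string ID type is an assumption.

```typescript
// The three nullable owner columns from content_chunks (ID type assumed to be string).
interface ChunkOwner {
  document_id: string | null;
  connector_content_id: string | null;
  conversation_id: string | null;
}

// Mirrors the CHECK constraint: exactly one owner column must be non-null.
function hasExactlyOneOwner(row: ChunkOwner): boolean {
  const owners = [row.document_id, row.connector_content_id, row.conversation_id];
  return owners.filter((id) => id !== null).length === 1;
}
```

The database constraint remains the source of truth; the application-side check only produces a clearer error earlier.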

Hybrid Search: Vector + BM25 + RRF + Rerank

Search uses Reciprocal Rank Fusion to combine vector similarity and full-text search results:

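```sql
-- $1 = query embedding (1024-dim vector), $2 = raw query text
-- Vector candidates (cosine similarity via pgvector)
WITH vector_candidates AS (
  SELECT id, chunk_text,
    1 - (chunk_embedding <=> $1::vector(1024)) AS similarity
  FROM content_chunks
  WHERE chunk_embedding IS NOT NULL
),
-- Full-text candidates (ts_rank stands in for the BM25 signal here;
-- Postgres's built-in ts_rank is not BM25 itself)
text_candidates AS (
  SELECT id, chunk_text,
    ts_rank(to_tsvector('simple', chunk_text),
            plainto_tsquery('simple', $2)) AS text_score
  FROM content_chunks
  WHERE to_tsvector('simple', chunk_text)
    @@ plainto_tsquery('simple', $2)
),
-- Rank each candidate list. These two CTEs are referenced by the fusion
-- step but were not shown in the original snippet; this is one plausible form.
vector_ranked AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY similarity DESC) AS vrank
  FROM vector_candidates
),
text_ranked AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY text_score DESC) AS trank
  FROM text_candidates
),
-- RRF fusion (k=60)
combined AS (
  SELECT COALESCE(v.id, t.id) AS id,
    COALESCE(1.0/(60 + v.vrank), 0) +
    COALESCE(1.0/(60 + t.trank), 0) AS rrf_score
  FROM vector_ranked v
  FULL OUTER JOIN text_ranked t ON v.id = t.id
)
SELECT * FROM combined ORDER BY rrf_score DESC;
```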

The top candidates from RRF are then reranked with Voyage's rerank-2 model to get the final relevance ordering. This two-stage approach (cheap hybrid search for recall, expensive reranking for precision) works well in practice.
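If the fusion step is easier to follow outside SQL, here is the same Reciprocal Rank Fusion formula as a small function, with k = 60 as in the query. The function name and the sample IDs are made up for illustration.

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for every ID it contains.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: "b" appears in both lists, so it outranks "a" even though "a" is the top vector hit.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // vector ranking
  ["b", "d"],      // full-text ranking
]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

Only ranks enter the formula, never raw scores, which is why cosine similarity and ts_rank values can be combined without any normalization.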

The Numbers

The acceptance rate here measures drafts accepted without any edits — the strictest possible metric. When agents deliberately trigger the AI (manual mode), the acceptance rate reaches 35%.

Tradeoffs

One worth mentioning: Voyage Context 3 needs all chunks from a document together in a single API call for contextualization to work properly. So when a conversation reopens and re-resolves, I delete all existing chunks and re-embed from scratch. It’s not incremental.

At $0.002 per re-embed, this doesn’t matter in practice. But it’s a design constraint worth knowing about.
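The delete-and-re-embed flow might look something like this. It is a sketch under assumptions: ChunkStore, EmbedAll, and reindexConversation are hypothetical names, and the storage and embedding calls are injected so the snippet stays independent of any particular database client or SDK.

```typescript
interface StoredChunk {
  text: string;
  embedding: number[];
}

// Hypothetical storage interface; in practice this would wrap the content_chunks table.
interface ChunkStore {
  deleteByConversation(conversationId: string): Promise<void>;
  insert(conversationId: string, chunks: StoredChunk[]): Promise<void>;
}

// Hypothetical embedder: takes ALL chunks of one conversation, returns one vector per chunk.
type EmbedAll = (chunks: string[]) => Promise<number[][]>;

async function reindexConversation(
  conversationId: string,
  chunks: string[],
  store: ChunkStore,
  embedAll: EmbedAll,
): Promise<void> {
  // Embed first, so a failed API call leaves the existing chunks in place.
  const embeddings = await embedAll(chunks);
  if (embeddings.length !== chunks.length) {
    throw new Error("embedding count does not match chunk count");
  }
  const rows = chunks.map((text, i) => ({ text, embedding: embeddings[i] }));
  // Replace everything: old chunks were embedded against the old transcript.
  await store.deleteByConversation(conversationId);
  await store.insert(conversationId, rows);
}
```

In a real database the delete and insert would normally run inside one transaction, so searches never see a conversation with zero chunks.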

What I Took Away

The RAG landscape has moved fast — agentic RAG, vectorless approaches, GraphRAG for multi-hop reasoning. Each solves different problems. But the principle that guided these three rewrites holds:

Before adding complexity, check if a simpler architecture can do the same job.

In my case, one embedding model replaced an LLM plus a separate pipeline. The retrieval quality improved. The cost dropped 450x. And the client’s team finally started accepting the AI’s suggestions.

Sometimes the right rewrite isn’t adding more — it’s finding the tool that was designed for exactly your problem.
