
Redesigning ProustGPT

ProustGPT started as a simple RAG chatbot. You typed a question about In Search of Lost Time, the backend retrieved relevant passages from Pinecone, and a single LLM call synthesized an answer. Clean, direct, and good enough for a while.

After a few months I hit a ceiling. The app could answer factual questions about the book, but it couldn’t compare themes across volumes, couldn’t hold a multi-turn conversation that built on itself, and had no concept of why you were asking. Were you studying the novel? Or were you using it as a lens on your own life? Those are different conversations and a single flow can’t serve both well.

One redesign later: two modes, a complexity-aware routing system, and agentic orchestration on top of the retrieval pipeline.

Why the Original Design Failed

The core assumption was that every query is a retrieval problem. You ask → we fetch relevant passages → LLM answers. That’s the right model for a reference tool.

But Proust readers aren’t always looking for references. Some questions are analytical: “how does jealousy manifest differently in Swann and Marcel?” That requires finding passages about two different characters, reading what surrounds them, and synthesizing across volumes. A single retrieval step can’t do that.

Other questions aren’t retrieval problems at all. Someone describing a smell that brought back a childhood memory isn’t looking for a citation. They’re reaching for a conversation. Immediately fetching passages about involuntary memory would be the wrong move. Technically correct, tonally obtuse.

The original design collapsed these two things into one flow. The redesign separates them explicitly.

Designing Two Modes

The central design decision: split the product into two modes that reflect the actual two reasons people use the app.

  • Explore: literary analysis. You’re studying the book — comparing characters, tracing themes, understanding structure. Retrieval is the point.
  • Reflect: introspective conversation. You’re using the book as a lens on your own experience. The corpus is optional context, not the destination.

The temptation was to make this a UI preference, like a toggle that swapped the system prompt. I made a different call. Mode is a full architectural contract: different tool budgets, different backend endpoints, different routing logic. What you pick changes what the system can do, not just how it talks.

  • Explore mode (literary analysis, all six tools): search_passages, search_by_volume, get_adjacent_passages, get_chapter_overview, find_character_mentions, get_toc.
  • Reflect mode (introspective, three tools used sparingly): search_passages, get_adjacent_passages, find_character_mentions. The default is pure conversation; search only when the resonance is specific.

This asymmetry is the core design decision. Mode isn’t UI skin. It changes the tool budget and the reasoning behavior.
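As a sketch, that contract can live in configuration rather than in prompts. The `ModeConfig` structure and endpoint paths below are assumptions for illustration; only the tool names and budgets come from the app's description.

```python
from dataclasses import dataclass

# Illustrative sketch: mode as an architectural contract, not a UI flag.
# The structure and endpoints are assumed, not the app's actual code.

@dataclass(frozen=True)
class ModeConfig:
    endpoint: str
    tools: tuple[str, ...]   # the tool budget for this mode
    default_behavior: str

MODES = {
    "explore": ModeConfig(
        endpoint="/api/explore",  # hypothetical path
        tools=(
            "search_passages", "search_by_volume", "get_adjacent_passages",
            "get_chapter_overview", "find_character_mentions", "get_toc",
        ),
        default_behavior="retrieve aggressively",
    ),
    "reflect": ModeConfig(
        endpoint="/api/reflect",  # hypothetical path
        tools=("search_passages", "get_adjacent_passages",
               "find_character_mentions"),
        default_behavior="pure conversation; search only on specific resonance",
    ),
}

def tools_for(mode: str) -> tuple[str, ...]:
    """The agent only ever sees the tools its mode allows."""
    return MODES[mode].tools
```

Because the budget is data, restraint doesn't depend on the model honoring a prompt: a tool that isn't in the tuple simply never gets bound.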

The Complexity Router

Inside Explore mode, a second decision arises: not every query needs a multi-step agent. A simple first-turn question like “who is Albertine?” can be answered in a single retrieval pass. Routing it through a full LangGraph loop adds latency and cost for no benefit.

The solution is a lightweight router that decides at call time whether to invoke the agent or go straight to RAG:

An incoming query plus its conversation history passes through a needs_agent() complexity check:

  • False → Fast RAG (~2–3 seconds). Simple, self-contained questions: "What is the madeleine scene?", "Tell me about Combray", "Quote about time and memory."
  • True → ReAct agent (~5–15 seconds). Comparative, sequential, or follow-up queries: "Compare Swann's jealousy to Marcel's", "How does memory evolve across volumes?", "What happens after the Guermantes party?"

The routing logic is intentionally heuristic: regex patterns for comparative/thematic language, conversation-depth checks, and short-query detection inside active sessions. No ML model. The heuristics are crude but that’s fine. The wrong call costs a few hundred milliseconds of extra latency, not a failed response.

# backend/agent.py
import re

# Abridged illustration of the comparative/thematic patterns;
# the production list is broader.
_COMPLEX_PATTERNS = re.compile(
    r"\b(compare|contrast|versus|differ|evolv\w*|across|theme)\b",
    re.IGNORECASE,
)

def needs_agent(query: str, history: list[dict] | None = None) -> bool:
    if history:
        user_msgs = [m for m in history if m.get("role") == "user"]
        if len(user_msgs) >= 2:
            return True          # Follow-ups need context
    if _COMPLEX_PATTERNS.search(query):
        return True              # Comparative / multi-hop queries
    if history and len(query.split()) <= 6:
        return True              # Short reference in active conversation
    return False

The design principle: optimize the common case. Most first-turn questions are simple. Keep those fast.

Agentic Orchestration: The ReAct Loop

When the router decides a query needs the agent, it invokes a LangGraph ReAct agent. ReAct stands for Reason + Act. The model thinks about what it needs, calls a tool, observes the result, and loops until it has enough to answer.

The loop, capped at four iterations:

  • Think: "I need passages about Swann's jealousy."
  • Act: search_passages("Swann jealousy").
  • Observe: five passages about Swann's torment come back.
  • Respond: once there is enough information, synthesize the findings into prose with citations.

A multi-step example: search_passages("Swann") → search_passages("Albertine") → get_adjacent_passages(#4521) → final synthesis.

The agent caps at four steps (AGENT_MAX_STEPS = 4). That’s a deliberate product choice. It bounds latency and cost, and forces the agent to plan rather than over-search. In practice most complex queries resolve in two or three steps anyway.
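The loop shape and the step cap can be sketched without any framework. This is a schematic stand-in for the LangGraph agent, not its implementation; `model.decide` and `model.synthesize` are hypothetical interfaces.

```python
AGENT_MAX_STEPS = 4  # matches the cap described above

def react_loop(model, tools: dict, query: str) -> str:
    """Schematic ReAct loop: think -> act -> observe, bounded at four steps."""
    observations = []
    for _ in range(AGENT_MAX_STEPS):
        # THINK: the model decides whether it needs another tool call.
        decision = model.decide(query, observations)
        if decision["action"] == "respond":
            break
        # ACT: call the chosen tool, e.g. search_passages("Swann jealousy").
        result = tools[decision["tool"]](**decision["args"])
        # OBSERVE: feed the result back into the next reasoning step.
        observations.append(result)
    # RESPOND: synthesize whatever was gathered, even if the cap was hit.
    return model.synthesize(query, observations)
```

The cap is enforced by the loop itself, so a model that wants a fifth search simply doesn't get one; it has to answer with what it has.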

Reflect Mode: Designing for Restraint

The interesting design problem in Reflect mode was the opposite of Explore: how do you keep an AI from being too eager to retrieve?

A first pass might give Reflect mode the same tools as Explore and rely on a good system prompt to restrain it. That doesn’t work well. LLMs find reasons to use tools when tools are available. The model convinces itself that fetching a Proust passage is “helpful” even when it would break the conversational register.

The solution was to remove the tools that enable aggressive retrieval entirely. Reflect gets three tools instead of Explore’s six: basic passage search, adjacent context, and character lookup. The volume-filtered search and table-of-contents tools are gone. This makes deep literary excavation mechanically harder, and ordinary conversation the natural default.

The system prompt makes the intent explicit:

Your DEFAULT mode is pure introspective conversation. Most responses should NOT use tools at all. Only search when the user’s experience strongly echoes a specific Proustian theme or moment.

But the prompt alone isn’t enough. The constraint in the tool budget is what actually enforces the behavior. You can’t prompt your way to reliable restraint.

The Retrieval Pipeline

Both modes share the same underlying retrieval stack. Whether a request goes through the agent’s tool calls or the fast RAG path, it hits the same three-stage pipeline:

For a query like "What is involuntary memory?":

  1. Embed the query (Cohere embed-v4.0).
  2. Vector search with k=20 against Pinecone → 20 candidates.
  3. Rerank down to the 5 best passages (Cohere rerank-v3.5).
  4. Generate (Groq / Kimi K2, 131K context).

Context stitching sits between rerank and generate: if a passage’s text starts or ends mid-sentence, adjacent passages are fetched.

Vector search. The query is embedded with Cohere’s multilingual model and matched against 12,900 Proust passages in Pinecone. Twenty candidates come back. One nice side effect: because Cohere’s embeddings are language-agnostic, a French query about jealousy returns relevant English passages without needing a separate index.

Rerank. The Cohere reranker reads all twenty candidates against the query and picks the five most relevant. This two-stage approach consistently beats just increasing top_k. Fast ANN search gets you in the neighborhood; the reranker gets you the right passages.

Context stitching. Reranked passages are checked for mid-sentence breaks. If a passage starts “…but this only made her jealousy worse,” we fetch what came before. Literary text doesn’t chunk cleanly.
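A minimal version of the stitching check might look like the sketch below. The heuristics (a lowercase opening suggests a mid-sentence start; missing terminal punctuation suggests a mid-sentence end) and the `fetch_passage` lookup are assumptions for illustration, not the app's actual code.

```python
_TERMINAL = (".", "!", "?", '"', "»", "…")

def needs_previous(text: str) -> bool:
    """A chunk that opens lowercase (possibly after an ellipsis) likely starts mid-sentence."""
    stripped = text.lstrip("…. ")
    return bool(stripped) and stripped[0].islower()

def needs_next(text: str) -> bool:
    """A chunk without terminal punctuation likely breaks off mid-sentence."""
    return not text.rstrip().endswith(_TERMINAL)

def stitch(passage: dict, fetch_passage) -> str:
    """Prepend/append adjacent passages when the chunk breaks mid-sentence.

    `fetch_passage(pid)` is a hypothetical lookup into the passage store;
    passages are assumed to have sequential integer ids.
    """
    text = passage["text"]
    if needs_previous(text):
        text = fetch_passage(passage["id"] - 1)["text"] + " " + text
    if needs_next(text):
        text = text + " " + fetch_passage(passage["id"] + 1)["text"]
    return text
```

For the example above, a passage beginning "…but this only made her jealousy worse" trips `needs_previous`, so the preceding chunk is fetched and prepended before the text reaches the generator.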

One retrieval layer, two execution paths above it.

The Transport Layer

Both modes stream responses over Server-Sent Events. This part of the redesign was less interesting architecturally and more interesting as a debugging exercise.

Production proxies introduce failures that don’t exist locally: per-line byte limits that silently truncate large JSON payloads, idle timeouts while the agent is reasoning, and trailing buffer edge cases when the generator ends without a final delimiter. Each was a separate incident. Each needed a targeted fix: chunked SSE framing, keepalive comments, explicit buffer flushing on stream close.

One design choice worth noting: in the agent path, passage sources are emitted before the first text token. If a response gets interrupted mid-stream, the passage context has already arrived. The user sees what the model was working from even if they never see the full answer. Context is more durable than prose.
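The sources-before-tokens ordering can be sketched as a generator over SSE frames. The event names below are illustrative, not the app's actual wire format; only the ordering reflects the design choice described above.

```python
import json

def sse_event(event: str, data) -> str:
    """Frame one Server-Sent Event: event name, JSON data, blank-line delimiter."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_response(sources: list[dict], tokens):
    """Emit passage sources first, then text tokens, then an explicit done frame.

    If the stream is cut mid-answer, the client already holds the passages
    the model was working from.
    """
    yield sse_event("sources", sources)   # context lands before any prose
    for tok in tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {})           # explicit end-of-stream marker
```

The explicit `done` frame is one way to avoid the trailing-buffer edge case mentioned above: the client never has to infer the end of the stream from a proxy closing the connection.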

What the Redesign Was Actually About

Agentic routing landed February 14, the two-mode product model February 15, and transport hardening rolled out through early March. Each piece shipped separately, behind a feature flag, over about three weeks.

But the real change wasn’t architectural. It was the shift from thinking about the app as a retrieval system to thinking about it as two distinct modes with different intents. Once that framing clicked, the technical decisions followed naturally: separate endpoints, different tool budgets, a router to keep the fast path fast.

The original design was fine for what it was. The new one just recognizes that someone studying Proust’s prose and someone processing their own involuntary memories are having fundamentally different conversations, and tries to serve both on their own terms.

The app is at proustgpt.com. Explore if you’re reading the book. Reflect if you’re using it to think.