Redesigning ProustGPT
ProustGPT started as a simple RAG chatbot. You typed a question about In Search of Lost Time, the backend retrieved relevant passages from Pinecone, and a single LLM call synthesized an answer. Clean, direct, and good enough for a while.
After a few months I hit a ceiling. The app could answer factual questions about the book, but it couldn’t compare themes across volumes, couldn’t hold a multi-turn conversation that built on itself, and had no concept of why you were asking. Were you studying the novel? Or were you using it as a lens on your own life? Those are different conversations and a single flow can’t serve both well.
One redesign later: two modes, a complexity-aware routing system, and agentic orchestration on top of the retrieval pipeline.
Why the Original Design Failed
The core assumption was that every query is a retrieval problem. You ask → we fetch relevant passages → LLM answers. That’s the right model for a reference tool.
But Proust readers aren’t always looking for references. Some questions are analytical: “how does jealousy manifest differently in Swann and Marcel?” That requires finding passages about two different characters, reading what surrounds them, and synthesizing across volumes. A single retrieval step can’t do that.
Other questions aren’t retrieval problems at all. Someone describing a smell that brought back a childhood memory isn’t looking for a citation. They’re reaching for a conversation. Immediately fetching passages about involuntary memory would be the wrong move. Technically correct, tonally obtuse.
The original design collapsed these two things into one flow. The redesign separates them explicitly.
Designing Two Modes
The central design decision: split the product into two modes that reflect the actual two reasons people use the app.
- Explore: literary analysis. You’re studying the book — comparing characters, tracing themes, understanding structure. Retrieval is the point.
- Reflect: introspective conversation. You’re using the book as a lens on your own experience. The corpus is optional context, not the destination.
The temptation was to make this a UI preference, like a toggle that swapped the system prompt. I made a different call. Mode is a full architectural contract: different tool budgets, different backend endpoints, different routing logic. What you pick changes what the system can do, not just how it talks.
That separation is the core of the redesign. Mode isn’t a UI skin; it changes the tool budget and the reasoning behavior.
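In code, the contract might look something like this. A minimal sketch, assuming a config-style registry: the endpoint paths, the dataclass, and the sixth Explore tool name (`quote_check`) are placeholders, not the app's actual identifiers; the tool counts (six vs. three) are from the post.

```python
# Hypothetical sketch: mode as a backend contract, not a prompt swap.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModeContract:
    endpoint: str           # each mode gets its own backend route
    tools: tuple[str, ...]  # tool budget differs per mode
    max_agent_steps: int    # bounds reasoning depth

MODES = {
    "explore": ModeContract(
        "/api/explore",
        ("search", "search_by_volume", "adjacent_context",
         "character_lookup", "table_of_contents", "quote_check"),
        4,
    ),
    "reflect": ModeContract(
        "/api/reflect",
        ("search", "adjacent_context", "character_lookup"),
        4,
    ),
}
```

The point of encoding it this way is that nothing downstream can accidentally blur the modes: a request handler gets one contract and never sees the other mode's tools.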
The Complexity Router
Inside Explore mode, a second decision arises: not every query needs a multi-step agent. A simple first-turn question like “who is Albertine?” can be answered in a single retrieval pass. Routing it through a full LangGraph loop adds latency and cost for no benefit.
The solution is a lightweight router that decides at call time whether to invoke the agent or go straight to RAG. The routing logic is intentionally heuristic: regex patterns for comparative/thematic language, conversation-depth checks, and short-query detection inside active sessions. No ML model. The heuristics are crude, and that's fine: a wrong call costs a few hundred milliseconds of extra latency, not a failed response.
# backend/agent.py
import re

# Illustrative pattern set; the real list of comparative/thematic
# triggers is longer than this.
_COMPLEX_PATTERNS = re.compile(
    r"\b(compare|contrast|versus|theme|across|both|differ)\b", re.IGNORECASE
)

def needs_agent(query: str, history: list[dict] | None = None) -> bool:
    if history:
        user_msgs = [m for m in history if m.get("role") == "user"]
        if len(user_msgs) >= 2:
            return True  # Follow-ups need context
    if _COMPLEX_PATTERNS.search(query):
        return True  # Comparative / multi-hop queries
    if history and len(query.split()) <= 6:
        return True  # Short reference in active conversation
    return False
The design principle: optimize the common case. Most first-turn questions are simple. Keep those fast.
Agentic Orchestration: The ReAct Loop
When the router decides a query needs the agent, it invokes a LangGraph ReAct agent. ReAct stands for Reason + Act. The model thinks about what it needs, calls a tool, observes the result, and loops until it has enough to answer.
The agent caps at four steps (AGENT_MAX_STEPS = 4). That’s a deliberate product choice. It bounds latency and cost, and forces the agent to plan rather than over-search. In practice most complex queries resolve in two or three steps anyway.
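The shape of that bounded loop can be sketched without LangGraph. This is a schematic, not the library's internals: `decide` stands in for the LLM (returning either an answer or a tool call), and `call_tool` returns an observation.

```python
# Schematic of a bounded ReAct loop: reason -> act -> observe,
# capped at AGENT_MAX_STEPS tool calls per query.
AGENT_MAX_STEPS = 4

def react_loop(decide, call_tool, max_steps=AGENT_MAX_STEPS):
    """`decide(observations)` returns ("answer", text) to finish
    or ("tool", name, args) to act."""
    observations = []
    for _ in range(max_steps):
        step = decide(observations)
        if step[0] == "answer":
            return step[1]
        _, name, args = step
        observations.append(call_tool(name, args))
    # Budget exhausted: force an answer from what was gathered.
    return decide(observations, force_answer=True)[1]
```

The cap does double duty: it bounds worst-case latency, and because the model knows its budget is finite, it tends to pick tools more deliberately.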
Reflect Mode: Designing for Restraint
The interesting design problem in Reflect mode was the opposite of Explore: how do you keep an AI from being too eager to retrieve?
A first pass might give Reflect mode the same tools as Explore and rely on a good system prompt to restrain it. That doesn’t work well. LLMs find reasons to use tools when tools are available. The model convinces itself that fetching a Proust passage is “helpful” even when it would break the conversational register.
The solution was to remove the tools that enable aggressive retrieval entirely. Reflect gets three tools instead of Explore’s six: basic passage search, adjacent context, and character lookup. The volume-filtered search and table-of-contents tools are gone. This makes deep literary excavation mechanically harder, and ordinary conversation the natural default.
The system prompt makes the intent explicit:
Your DEFAULT mode is pure introspective conversation. Most responses should NOT use tools at all. Only search when the user’s experience strongly echoes a specific Proustian theme or moment.
But the prompt alone isn’t enough. The tool budget is what actually enforces the behavior; you can’t prompt your way to reliable restraint.
The Retrieval Pipeline
Both modes share the same underlying retrieval stack. Whether a request goes through the agent’s tool calls or the fast RAG path, it hits the same three-stage pipeline:
Vector search. The query is embedded with Cohere’s multilingual model and matched against 12,900 Proust passages in Pinecone. Twenty candidates come back. One nice side effect: because Cohere’s embeddings are language-agnostic, a French query about jealousy returns relevant English passages without needing a separate index.
Rerank. The Cohere reranker reads all twenty candidates against the query and picks the five most relevant. This two-stage approach consistently beats just increasing top_k. Fast ANN search gets you in the neighborhood; the reranker gets you the right passages.
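A toy illustration of the two-stage shape. The scorers here are stand-ins (the app uses Cohere embeddings for the first stage and Cohere's reranker for the second); what the sketch shows is the over-retrieve-then-rerank structure itself.

```python
# Toy two-stage retrieval: a cheap scorer casts a wide net (top 20),
# then an expensive scorer reranks those candidates down to 5.
def retrieve(query, corpus, cheap_score, expensive_score,
             top_k=20, top_n=5):
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p),
                        reverse=True)[:top_k]
    return sorted(candidates, key=lambda p: expensive_score(query, p),
                  reverse=True)[:top_n]
```

The economics are the point: the cheap stage touches the whole index, and the expensive stage only ever reads `top_k` passages, so its cost stays flat no matter how large the corpus grows.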
Context stitching. Reranked passages are checked for mid-sentence breaks. If a passage starts “…but this only made her jealousy worse,” we fetch what came before. Literary text doesn’t chunk cleanly.
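A sketch of what that stitching check might look like, under simple assumptions: `fetch_previous` is a hypothetical lookup for the adjacent chunk, and the mid-sentence heuristic (leading ellipsis or lowercase start) is illustrative.

```python
# Illustrative context stitching: if a reranked passage begins
# mid-sentence, prepend the chunk that precedes it in the source text.
def starts_mid_sentence(text: str) -> bool:
    stripped = text.lstrip("\u2026. ")  # drop leading ellipses/dots
    return text.startswith(("\u2026", "...")) or (
        bool(stripped) and stripped[0].islower()
    )

def stitch(passage: str, fetch_previous) -> str:
    if starts_mid_sentence(passage):
        prev = fetch_previous(passage)  # look up the adjacent chunk
        if prev:
            return prev.rstrip() + " " + passage.lstrip("\u2026. ")
    return passage
```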
One retrieval layer, two execution paths above it.
The Transport Layer
Both modes stream responses over Server-Sent Events. This part of the redesign was less interesting architecturally and more interesting as a debugging exercise.
Production proxies introduce failures that don’t exist locally: per-line byte limits that silently truncate large JSON payloads, idle timeouts while the agent is reasoning, and trailing buffer edge cases when the generator ends without a final delimiter. Each was a separate incident. Each needed a targeted fix: chunked SSE framing, keepalive comments, explicit buffer flushing on stream close.
One design choice worth noting: in the agent path, passage sources are emitted before the first text token. If a response gets interrupted mid-stream, the passage context has already arrived. The user sees what the model was working from even if they never see the full answer. Context is more durable than prose.
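The framing can be sketched as a generator. Event names and payloads here are illustrative, not the app's actual wire format; the sketch shows the ordering and the defensive pieces described above: sources first, keepalive comments, an explicit terminator.

```python
# Schematic SSE framing: emit passage sources before the first text
# token, interleave keepalive comments, end with an explicit sentinel.
import json

def sse_stream(sources, tokens):
    # Sources first: if the stream dies mid-answer, context already arrived.
    yield f"event: sources\ndata: {json.dumps(sources)}\n\n"
    yield ": keepalive\n\n"  # SSE comment line; keeps idle proxies alive
    for tok in tokens:
        yield f"event: token\ndata: {json.dumps(tok)}\n\n"
    yield "event: done\ndata: {}\n\n"  # explicit end, no dangling buffer
```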
What the Redesign Was Actually About
Agentic routing landed February 14, the two-mode product model on February 15, and transport hardening stretched through early March. Each piece shipped separately, behind a feature flag, over about three weeks.
But the real change wasn’t architectural. It was the shift from thinking about the app as a retrieval system to thinking about it as two distinct modes with different intents. Once that framing clicked, the technical decisions followed naturally: separate endpoints, different tool budgets, a router to keep the fast path fast.
The original design was fine for what it was. The new one just recognizes that someone studying Proust’s prose and someone processing their own involuntary memories are having fundamentally different conversations, and tries to serve both on their own terms.
The app is at proustgpt.com. Explore if you’re reading the book. Reflect if you’re using it to think.