Introduction
Semantic search over documents is increasingly vital for knowledge-driven apps. In this post, we explore a system that extends RAG beyond a single PDF: it fetches candidate files from the web or your database, processes and embeds their content, and streams relevance scores back to the user in real time.
1. Motivation and Problem Statement
Most RAG demos show you asking questions of one PDF at a time. That works if you know exactly which file holds the answer. In practice, you often need to search across many documents: some in your local storage, some out on the web.
How can we:
- Discover candidate files relevant to a query (e.g., via Google Search or your own API)?
- Process each document into retrievable chunks and embeddings?
- Combine semantic similarity with GPT-based ranking to surface the most relevant snippets?
- Serve results with low latency under load and keep costs in check?
This system meets these needs with a cache-first, queued-processing architecture, streaming results back as they arrive.
2. High-Level Architecture
- Client (Next.js App Router)
  - Provides a search UI.
  - Calls a REST endpoint on `/api/search`.
  - Uses Server-Sent Events (SSE) or HTTP streaming to update relevance scores in real time.
- Core Logic
  - Cache layer: Redis stores entries for fast hits.
  - Decision flow: on cache miss, look up embeddings in Supabase; if missing, enqueue work; otherwise compute similarity and rank (sketched below).
- Relational DB & Vector Store (Supabase)
  - Stores document metadata, file URLs, text chunks, and embeddings.
  - Choice of Supabase: built‑in storage, Postgres for metadata, and the pgvector extension for vector queries.
- Workers (Node.js + LangChain)
  - Use a simple in-built queue for task scheduling.
  - For each new URL or PDF: download, extract text, chunk, call the OpenAI embeddings API, and save to Supabase.
- Message Queue (In-Built)
  - Decouples on‑demand search from heavy embedding work.
  - Enables horizontal scaling of workers.
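To make the decision flow concrete, here is a minimal sketch of the cache-first branch in the core search logic. The helper names `findEmbeddings`, `enqueueProcessing`, and `rankChunks` are hypothetical placeholders for the Supabase lookup, queue submission, and similarity/rerank steps described later.

```js
// searchFlow.js — cache-first decision flow (sketch; helper names are placeholders)
const crypto = require("node:crypto");
const Redis = require("ioredis");

const redis = new Redis(process.env.REDIS_URL);
const hash = (s) => crypto.createHash("sha256").update(s).digest("hex");

async function handleSearch(query, url, chunkSize) {
  const cacheKey = `results:${hash(query + url + chunkSize)}`;

  // 1. Cache hit: return immediately.
  const cached = await redis.get(cacheKey);
  if (cached) return { status: "cached", results: JSON.parse(cached) };

  // 2. Cache miss: check whether embeddings already exist in Supabase.
  const embeddings = await findEmbeddings(url); // hypothetical Supabase lookup
  if (!embeddings || embeddings.length === 0) {
    // 3. No embeddings yet: enqueue background processing and report "pending".
    await enqueueProcessing({ query, url, chunkSize }); // hypothetical queue call
    return { status: "processing" };
  }

  // 4. Embeddings exist: compute similarity, rerank, cache, and return.
  const results = await rankChunks(query, embeddings); // hypothetical ranking step
  await redis.set(cacheKey, JSON.stringify(results), "EX", 3600);
  return { status: "ranked", results };
}
```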
3. Design Decisions & Trade‑Offs
3.1 Entry Point: Google Search vs. Custom Connector
- Google Search API gives broad web coverage but adds cost and rate limits.
- Custom connector (e.g., your internal doc store) offers tighter control and security.
- Decision: start with Google for prototyping and keep connectors pluggable for future needs (a minimal interface is sketched below).
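One way to keep connectors pluggable is to hide each source behind the same tiny function signature. This is only a sketch under that assumption: the Google connector wraps the Custom Search JSON API, and the custom connector is a hypothetical internal document-store endpoint; the env var names are placeholders.

```js
// connectors.js — pluggable source connectors (sketch; endpoints and env vars are assumptions)
async function googleConnector(query) {
  // Google Programmable Search (Custom Search JSON API); key/cx come from your own setup.
  const params = new URLSearchParams({
    key: process.env.GOOGLE_API_KEY,
    cx: process.env.GOOGLE_CSE_ID,
    q: `${query} filetype:pdf`,
  });
  const res = await fetch(`https://www.googleapis.com/customsearch/v1?${params}`);
  const data = await res.json();
  return (data.items || []).map((item) => ({ url: item.link, title: item.title }));
}

async function customConnector(query) {
  // Hypothetical internal document-store API.
  const res = await fetch(`${process.env.DOCS_API_URL}/search?q=${encodeURIComponent(query)}`);
  return res.json();
}

const connectors = { google: googleConnector, custom: customConnector };

// Usage: const candidates = await connectors[source](query);
```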
3.2 Cache Strategy (Redis)
- Cache key: `hash(search_query + url + chunk_size)` (example below).
- TTL: configurable (e.g., 1 hour) to balance freshness vs. cost.
- Why Redis: in‑memory, atomic ops, streaming pub/sub for SSE triggers.
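A minimal sketch of the cache key and TTL handling with `ioredis`, assuming results are stored as a JSON string and using the 1-hour TTL mentioned above.

```js
// cache.js — Redis cache key + TTL (sketch)
const crypto = require("node:crypto");
const Redis = require("ioredis");

const redis = new Redis(process.env.REDIS_URL);
const TTL_SECONDS = 60 * 60; // 1 hour, configurable

function cacheKey(searchQuery, url, chunkSize) {
  // hash(search_query + url + chunk_size)
  return crypto
    .createHash("sha256")
    .update(`${searchQuery}|${url}|${chunkSize}`)
    .digest("hex");
}

async function getCachedResults(searchQuery, url, chunkSize) {
  const raw = await redis.get(cacheKey(searchQuery, url, chunkSize));
  return raw ? JSON.parse(raw) : null;
}

async function setCachedResults(searchQuery, url, chunkSize, results) {
  await redis.set(
    cacheKey(searchQuery, url, chunkSize),
    JSON.stringify(results),
    "EX",
    TTL_SECONDS
  );
}
```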
3.3 Chunking and Embedding
- Chunk size: ~500 tokens with 50-token overlap to preserve context.
- Embedding model: `text-embedding-3-small` vs. larger variants: a trade-off between latency and accuracy.
- LangChain automates text splitting and vectorization (sketched below).
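A minimal sketch of the splitting/embedding step with LangChain's JS packages (`@langchain/textsplitters` and `@langchain/openai` here; older `langchain` entry points expose the same classes). Note that `RecursiveCharacterTextSplitter` counts characters by default, so the 500/50 figures are approximate token targets.

```js
// embed.js — chunking + embedding with LangChain (sketch)
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";

// ~500-token chunks with 50 overlap; the splitter counts characters by default,
// so treat these numbers as approximate targets.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 50,
});

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });

export async function chunkAndEmbed(rawText) {
  const chunks = await splitter.splitText(rawText); // string[]
  const vectors = await embeddings.embedDocuments(chunks); // number[][], batched in one call
  return chunks.map((text, i) => ({ text, embedding: vectors[i], chunk_index: i }));
}
```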
3.4 Storage: Supabase + pgvector
- Embeddings are upserted into a `vector` column.
- Use pgvector's cosine-distance operator (`<=>`) with an `ivfflat` index for fast nearest-neighbor search (query sketch below).
- Alternative: specialized vector DBs (Pinecone, Weaviate).
- Decision: Supabase simplifies infra and costs for MVP.
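A sketch of the nearest-neighbor lookup from the Node side. It assumes a `match_documents` SQL function has been created in Postgres (the common Supabase pgvector pattern) that orders rows by the `<=>` cosine-distance operator; the function name and arguments are assumptions, not something defined elsewhere in this post.

```js
// similarity.js — nearest-neighbor lookup via Supabase + pgvector (sketch)
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY
);

// Assumes a Postgres function along the lines of:
//   create function match_documents(query_embedding vector(1536), match_count int) ...
// that returns chunks ordered by cosine distance (embedding <=> query_embedding).
export async function topKChunks(queryEmbedding, k = 5) {
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: k,
  });
  if (error) throw error;
  return data; // e.g., [{ id, content, similarity, ... }]
}
```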
3.5 Streaming API (SSE)
- REST + SSE gives progressive UI updates.
- Each time a worker completes a document or GPT ranking finishes, core logic publishes an event on Redis pub/sub.
- Client listens and updates the table of relevancy scores in real time.
3.6 GPT‑Based Reranking
- Initial similarity gives a coarse ordering.
- Pass the top K (e.g., 5) snippets to GPT with a prompt like: "Rank these snippets by relevance to the query: ..."
- Trade‑off: extra cost for higher precision (a reranking sketch follows below).
- Fallback: skip reranking for an ultra‑low‑latency mode.
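A sketch of the reranking call with the OpenAI Node SDK. The exact prompt wording, model name, and expected JSON output shape are assumptions for illustration; production code would need more robust parsing.

```js
// rerank.js — GPT-based reranking of top-K snippets (sketch)
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function rerank(query, snippets) {
  const prompt = [
    `Rank these snippets by relevance to the query: "${query}".`,
    `Return a JSON array of objects like {"index": 0, "score": 0.95}, best first.`,
    ...snippets.map((s, i) => `[${i}] ${s.snippet}`),
  ].join("\n\n");

  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // assumed model; any chat model works
    messages: [{ role: "user", content: prompt }],
    temperature: 0,
  });

  // Parse the model's ordering and map it back onto the original snippets.
  const ranking = JSON.parse(res.choices[0].message.content);
  return ranking.map(({ index, score }) => ({ ...snippets[index], score }));
}
```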
4. API Specification
```
POST /api/search?source=google
Content-Type: application/json

{ "query": "Explain LangChain architecture" }
```

The response is streamed:

```
event: partial
data: { "url": "https://...pdf", "snippet": "LangChain is...", "score": 0.82 }

event: partial
data: { ... }

event: done
```

- `source` parameter: `google` or `custom`.
- Streaming: send `partial` events as each document ranking arrives, then a final `done` event.
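Because the endpoint is a POST, the browser's `EventSource` cannot be used directly; here is a sketch of consuming the stream with `fetch` and a reader instead. The `updateRow` callback is a hypothetical UI hook, and the SSE parsing is deliberately simplified.

```js
// client-side consumption of the /api/search stream (sketch)
async function search(query) {
  const res = await fetch("/api/search?source=google", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE frames are separated by a blank line.
    const frames = buffer.split("\n\n");
    buffer = frames.pop(); // keep any incomplete frame for the next read

    for (const frame of frames) {
      const event = /^event: (.+)$/m.exec(frame)?.[1];
      const data = /^data: (.+)$/m.exec(frame)?.[1];
      if (event === "partial" && data) updateRow(JSON.parse(data)); // hypothetical UI update
      if (event === "done") return;
    }
  }
}
```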
5. Implementation Highlights
- Next.js Route Handler under `/app/api/search/route.js`.
- Redis Client with `ioredis` for cache lookups and pub/sub.
- In-built queue for job scheduling.
- Worker Script (e.g., `worker.js`) uses LangChain's `PDFLoader`, `RecursiveCharacterTextSplitter`, and `OpenAIEmbeddings`.
- Supabase integration with `@supabase/supabase-js`.
- SSE Streaming via a `ReadableStream` in the route handler (sketched below).
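A sketch of the SSE piece of the route handler: subscribe to the Redis channels the workers publish on and forward each message as an SSE frame. The channel names match the worker section below; error handling and the cache/queue branches from the decision flow are omitted.

```js
// app/api/search/route.js — SSE streaming via ReadableStream (sketch)
import Redis from "ioredis";

export async function POST(req) {
  const { query } = await req.json();
  // `query` would kick off the cache/queue flow described earlier.

  // A dedicated connection is required for subscribe mode.
  const subscriber = new Redis(process.env.REDIS_URL);
  await subscriber.subscribe("worker:progress", "worker:done");

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    start(controller) {
      subscriber.on("message", (channel, message) => {
        if (channel === "worker:progress") {
          controller.enqueue(encoder.encode(`event: partial\ndata: ${message}\n\n`));
        } else {
          controller.enqueue(encoder.encode(`event: done\ndata: ${message}\n\n`));
          subscriber.quit();
          controller.close();
        }
      });
    },
    cancel() {
      subscriber.quit();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```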
Worker Task Details
After receiving a job from the in‑built queue, each worker executes the following steps (a condensed `worker.js` sketch follows the list):
1. Fetch PDF
   - Download the file from the provided URL or Supabase storage.
   - Verify file integrity (checksum or HTTP status).
2. Load and Extract Text
   - Use LangChain's `PDFLoader` to parse the PDF pages.
   - Convert pages into raw text blocks.
3. Chunking
   - Initialize `RecursiveCharacterTextSplitter` with a 500‑token chunk size and 50‑token overlap.
   - Split the raw text into coherent chunks that preserve semantic context.
4. Embedding Generation
   - Produce a vector for each text chunk via LangChain's `OpenAIEmbeddings` (e.g., `embedDocuments`).
   - Batch embedding calls where possible to reduce latency and cost.
5. Upsert to Supabase
   - Insert or update each chunk's text, metadata (page, position), and embedding vector into the `documents_chunks` table in Supabase.
   - Ensure the `ON CONFLICT` upsert logic uses `(document_id, chunk_index)` as the unique key.
6. Progress Publishing
   - After each chunk upsert, publish a Redis pub/sub message on the `worker:progress` channel: `{ "event": "chunk_processed", "url": "<source_url>", "chunk_index": 3 }`
   - Core logic listens to these events and streams partial updates to the client via SSE.
7. Initial Similarity Scoring
   - Once all embeddings for a document are stored, compute cosine similarity between the query embedding and each chunk vector using pgvector's cosine-distance operator (`<=>`) in Supabase.
   - Select the top K chunks (e.g., K=5) by similarity score.
8. GPT-Based Reranking
   - Format a prompt bundling the original query and the top K snippets.
   - Call the OpenAI chat completions API to rank the snippets by relevance.
   - Parse the response and generate a refined ordering with updated scores.
9. Final Result Publication
   - Save the reranked snippets and their scores into Redis under the key `results:<query_hash>`.
   - Publish a `worker:done` event on Redis pub/sub with the final batch of data.
10. Cleanup and Metrics
    - Emit internal logs or metrics (execution time, embedding cost) to a monitoring system.
    - Acknowledge job completion in the in‑built queue.
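A condensed `worker.js` sketch of steps 1–6 above. The temp-file handling, table columns, and channel name mirror the descriptions in this list; progress is published once per document here rather than per chunk, and queue integration, scoring, and reranking are left out for brevity. Import paths follow recent LangChain JS packages.

```js
// worker.js — download → extract → chunk → embed → upsert → publish (condensed sketch)
import { writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import Redis from "ioredis";
import { createClient } from "@supabase/supabase-js";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf"; // needs the pdf-parse peer dependency
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";

const redis = new Redis(process.env.REDIS_URL);
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_ROLE_KEY);
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });

export async function processDocument({ documentId, url }) {
  // 1. Fetch the PDF and verify the HTTP status.
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  const tmpPath = join(tmpdir(), `${documentId}.pdf`);
  await writeFile(tmpPath, Buffer.from(await res.arrayBuffer()));

  // 2. Load and extract text, one Document per page.
  const pages = await new PDFLoader(tmpPath).load();
  const rawText = pages.map((p) => p.pageContent).join("\n");

  // 3–4. Chunk and embed (embedDocuments batches the API calls).
  const chunks = await splitter.splitText(rawText);
  const vectors = await embeddings.embedDocuments(chunks);

  // 5. Upsert chunks keyed on (document_id, chunk_index).
  const rows = chunks.map((content, i) => ({
    document_id: documentId,
    chunk_index: i,
    content,
    embedding: vectors[i],
  }));
  const { error } = await supabase
    .from("documents_chunks")
    .upsert(rows, { onConflict: "document_id,chunk_index" });
  if (error) throw error;

  // 6. Publish progress so the SSE route can stream partial updates.
  await redis.publish(
    "worker:progress",
    JSON.stringify({ event: "chunk_processed", url, chunk_count: chunks.length })
  );
}
```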
6. Future Enhancements
- Pluggable document connectors (S3, Google Drive).
- Advanced chunk merging: semantic boundary detection.
- Multi‑modal support (images, audio transcripts).
- Real‑time embeddings update with user feedback loop.
7. Conclusion
This system demonstrates how to build a resilient, high‑throughput PDF search platform that:
- Discovers candidate files via web or custom sources.
- Processes large batches of documents asynchronously.
- Embeds and ranks content semantically.
- Streams results to deliver instant feedback.
With Next.js, Supabase, Redis, and LangChain, you can bootstrap powerful document‑driven applications with minimal ops overhead. The complete code and deployment instructions are linked below—feel free to clone and adapt for your own use case!
🚀 GitHub Repo with code samples and config files