Large Language Models (LLMs) are powerful, but they only know what they were trained on. To make them truly useful, we need to give them access to our own data. This is the core idea behind Retrieval-Augmented Generation (RAG), and it's now possible to build a complete RAG pipeline on the edge using Cloudflare's serverless stack.
I built a tool that lets a user upload a PDF, ask questions about it, and get answers sourced directly from the document's content. This article explains the end-to-end architecture, combining Workers AI for intelligence, R2 for file storage, and the new Vectorize database for blazingly fast semantic search.
The Challenge: Building a RAG Pipeline Is Hard
A typical RAG system requires a complex, multi-stage pipeline:
- A way to upload and parse documents (like PDFs).
- An embedding model to convert text chunks into vector representations.
- A dedicated vector database (like Pinecone or Weaviate) to store and search these vectors.
- An LLM to generate answers based on the search results.
- A backend server to orchestrate all these moving parts.
This is expensive, difficult to scale, and involves stitching together multiple services.
The All-in-One Edge Solution: The Cloudflare AI Stack
Cloudflare now provides all the necessary components in one tightly integrated, serverless platform. Cloudflare Vectorize is a vector database built for the edge, designed to work seamlessly with Workers and Workers AI.
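To make the "tightly integrated" part concrete, here is a minimal `wrangler.toml` sketch showing how the three bindings the Worker relies on could be wired up. The project, bucket, and index names (`pdf-rag-worker`, `pdf-uploads`, `pdf-rag-index`) are placeholders for illustration, not values from the original project.

```toml
name = "pdf-rag-worker"
main = "src/index.js"
compatibility_date = "2024-01-01"

# Workers AI binding (exposed to the Worker as env.AI)
[ai]
binding = "AI"

# R2 bucket for the uploaded PDFs (exposed as env.PDF_BUCKET)
[[r2_buckets]]
binding = "PDF_BUCKET"
bucket_name = "pdf-uploads"

# Vectorize index for the embedded chunks (exposed as env.VECTORIZE_INDEX)
[[vectorize]]
binding = "VECTORIZE_INDEX"
index_name = "pdf-rag-index"
```

The index itself is created once up front, for example with `npx wrangler vectorize create pdf-rag-index --dimensions=768 --metric=cosine`, since `bge-base-en-v1.5` produces 768-dimensional embeddings.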
The RAG Architecture on Cloudflare
- Upload & Chunk: A user uploads a PDF. A Worker saves it to R2, parses the text, and splits it into smaller, manageable chunks.
- Embed & Insert: The Worker sends each text chunk to a Workers AI embedding model (`@cf/baai/bge-base-en-v1.5`). The model returns a vector. The Worker then inserts this vector, along with the original text chunk, into a Vectorize index.
- Query: The user asks a question. The Worker takes the question and sends it to the *same* embedding model to create a query vector.
- Search: The Worker uses this query vector to search the Vectorize index using `.query()`. Vectorize returns the most semantically similar text chunks from the original PDF.
- Generate: The Worker constructs a new prompt for an LLM (like Llama 2), including the user's question and the relevant text chunks from the search. It sends this prompt to Workers AI, which generates a final, context-aware answer.
This entire sophisticated process runs without a single traditional server, with data and compute co-located at the edge for maximum performance.
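The query half of the pipeline is shown in the next section. The ingest half (steps 1 and 2) looks roughly like the sketch below. This is a minimal illustration under a few assumptions: the PDF-to-text extraction has already happened, the fixed 1,000-character chunking is deliberately naive, and names like `ingestDocument`, `documentId`, and `PDF_BUCKET` are placeholders rather than part of the original project.

```js
// Minimal ingest sketch: store the PDF, chunk its text, embed, and insert into Vectorize.
// Assumes `pdfBytes` and the already-extracted `documentText` are provided, and that the
// AI / PDF_BUCKET / VECTORIZE_INDEX bindings are configured in wrangler.toml.
async function ingestDocument(pdfBytes, documentText, documentId, env) {
  // Step 1: keep the original file in R2.
  await env.PDF_BUCKET.put(`${documentId}.pdf`, pdfBytes);

  // Naive fixed-size chunking; a real pipeline would split on paragraphs or sentences.
  const chunkSize = 1000;
  const chunks = [];
  for (let i = 0; i < documentText.length; i += chunkSize) {
    chunks.push(documentText.slice(i, i + chunkSize));
  }

  // Step 2: embed every chunk with the same model used at query time.
  // (Very large documents would need to be embedded in smaller batches.)
  const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: chunks });

  // Insert one vector per chunk, keeping the original text as metadata
  // so index.query() can hand it back later.
  const vectors = data.map((values, i) => ({
    id: `${documentId}-${i}`,
    values,
    metadata: { text: chunks[i] },
  }));
  await env.VECTORIZE_INDEX.insert(vectors);
}
```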
Diving into the Code: Querying Vectorize
The real magic of RAG is finding the right context. Here’s a simplified look at how a Worker can query a Vectorize index to find relevant document chunks.
```js
// Simplified Worker logic for querying Vectorize and generating a response
export default {
  async fetch(request, env, ctx) {
    const { question } = await request.json();

    // 1. Create a vector embedding from the user's question
    const embeddingResponse = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [question] });
    const queryVector = embeddingResponse.data[0];

    // 2. Query the Vectorize index to find similar vectors,
    //    asking for the stored metadata (the original chunk text) back
    const index = env.VECTORIZE_INDEX; // Binding to the Vectorize index
    const searchResults = await index.query(queryVector, { topK: 3, returnMetadata: true });

    // 3. Reassemble the original text chunks from the search results
    const context = searchResults.matches.map((match) => match.metadata.text).join('\n---\n');

    // 4. Build a prompt and ask the LLM
    const prompt = `Context: ${context}\n\nQuestion: ${question}\n\nAnswer:`;
    const llmResponse = await env.AI.run('@cf/meta/llama-2-7b-chat-int8', { prompt });

    return new Response(JSON.stringify(llmResponse), {
      headers: { 'Content-Type': 'application/json' },
    });
  },
};
```
This code demonstrates the core RAG loop: embed, search, and generate. `env.VECTORIZE_INDEX` is the binding to our index, and `index.query()` performs the similarity search. Note that the original chunk text only comes back because it was stored as metadata when the vectors were inserted and the query asks for metadata to be returned.
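For completeness, calling the deployed Worker is just a JSON POST. The URL in this snippet is a made-up placeholder, not the project's real endpoint.

```js
// Example client call to the query Worker (hypothetical URL).
const res = await fetch('https://pdf-rag.example.workers.dev/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: 'What are the key findings in this document?' }),
});
const answer = await res.json();
console.log(answer.response); // The LLM's generated, context-aware answer
```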
The Future of AI Applications
This project showcases a powerful new reality for developers:
- Accessible RAG: Sophisticated AI architectures are no longer the exclusive domain of large tech companies.
- Integrated Stack: No more "glue code" between different cloud providers. Everything works together seamlessly.
- Unmatched Performance: Vector search at the edge means your AI can retrieve context and answer questions faster than ever.
- Predictable Cost: Usage-based pricing and generous free tiers across the stack make building powerful AI tools more affordable than ever.
The ability to build full-stack, stateful, and intelligent applications entirely on the edge is here, and Cloudflare is leading the charge.