Chatting with Large PDFs (100–500 Pages): Using RAG with OpenAI Embeddings (Local vs. API)

At COXIT, in 2025, we see a growing number of potential customers asking us to implement a way to chat with massive documents, or even sets of documents: to search for information in them and to run analysis on them dynamically. Those documents come from various domains: technical specifications, documentation, and legal contracts.

How to read this article

This article contains a wealth of information, both practically validated results and insights for further consideration and research. It is dense with numbers and domain-specific terms, which may make it challenging to read on its own; its true value emerges when explored alongside an LLM (Large Language Model). So if you'd like to pull out specific information, create summaries, or better understand the technical details, don't hesitate to use an LLM for that.

Motivation and Purpose

Responding to this demand from the market, COXIT decided to invest resources into researching, evaluating, and developing a prototype that could be integrated into existing and future projects. COXIT wanted to validate, and then evaluate, approaches to deliberately embedding large documents (100–500 pages) into an LLM's context: how to make the model operate on, refer to, and quote information from those PDFs, and what the document extraction and processing pipelines should look like. Hence I, the author of this article, Valerii, AI Developer and ex-Founding Engineer with a code-generation background, was brought in to design and implement prototypes built around large language models.

App's Architecture

Before I began implementing the prototype this article is about, Iryna Mykytyn (CEO of COXIT), Vladyslav (Lead Engineer), and I decided on the following stack of technologies: RAG (Retrieval-Augmented Generation), MCP (Model Context Protocol), and Function Calling. All three have become more or less standard in AI chat applications. Function Calling, for example, is supported not only by bigger models such as GPT-4o and Claude-3.7-Sonnet, but also by much smaller open-weight models that can be run locally. If I had to explain these technologies to someone without much prior experience with LLMs, I would define them as:

RAG: combines generation with search to provide more accurate, up-to-date answers that are more robust to hallucinations.
MCP: keeps access to external functions or APIs in a dedicated server with a standardized API.
Function Calling: lets the model decide to call external functions or APIs through structured outputs.

On a lower level, I chose Python for most of the services; it is the de-facto standard language for AI-based applications and keeps technological overhead and dependencies minimal. I intentionally keep avoiding agentic frameworks like LangChain, as they show no benefit in the long term and tend to be poorly optimized and bloated.

Speaking of the app's architecture (Figure 1), I designed it to consist of 4 microservices:

ChatUI: a React-powered web UI. I'll admit the React code is notoriously badly written, as it was mainly gpt-4o and claude-3.7-sonnet that coded this service (Figure 2).
Chat Tools Service: handles completion requests and implements message-history limiting, local tool execution, model validation, and file handling, including file-stream uploads.
Chat Proxy Streamer: re-routes incoming chat completion requests to various LLM providers,
e.g., OpenAI, Anthropic, and Google, using the standardized OpenAI format; it also provides authentication and collects usage stats.
Document Search MCP: a dedicated service used only for operations on documents: extraction, processing, storage, and vector search.

Figure 1: App's Architecture
Figure 2: ChatUI's UI

Why do we use RAG?

In reality, many of the documents could fit into an LLM's context directly, especially with Llama 4, which reportedly has a 10M-token context window, or Gemini-2.0-Flash, which has a 1M-token window; a 1M-token window fits roughly ~2,000 pages of text. At first sight, giving the model an entire document at once seems better than giving it only small chunks of text, as RAG does. But the price of an LLM request, its performance, TTFT (time to first token) and TPS (tokens per second), and, more importantly, its accuracy all depend on how many tokens the model receives as input. When a model receives a large continuous corpus of text, or a long message history, it tends to become distracted and unfocused, which makes it prone to missing vital details in a document. That, missing details, is exactly what we endeavor to avoid. Thus RAG, and specifically its Retrieval step, which produces only the text chunks relevant to a query, is what we needed: keeping the context as small as possible, filled only with what's needed, is the right way to get a high-quality response.

Another valid argument for RAG is that it gives the ability to chat with documents not only to models with hundreds of billions of parameters and hundreds of thousands of tokens of context, but also to much smaller self-hosted models such as Qwen2.5-72B-Instruct with GPTQ-Int8 quantization, which could easily run on premises or in the cloud on, for example, 2x A6000 GPUs.

Having seen the high-level app architecture, let's descend into the domain of the documents and how they provide additional context for LLMs.

Retrieval of Relevant Information from Documents

For Retrieval, the 'R' in RAG, we perform a vector search over all chunks of a single document, with optional pre-filter arguments that narrow the search to a chapter or a section instead of the whole document. The model only uses that pre-filter when it decides the user wants to search specific parts of the document. In our case, Retrieval produces up to 10 of the most relevant search results for a given query (Figure 3).

Figure 3: Typical message history when the RAG tool is called

Alas, with no reranking or post-filtering added, not all top-10 results will be relevant to a search: the top-1 or top-3 might be, the rest might not. Those top-N results are therefore better post-filtered, either by an additional inexpensive LLM (e.g., Gemini-2.0-Flash) or simply by applying a threshold to the proximity metric.

One more thing to consider when embedding RAG in a chat is limiting the number of messages in the message history, since that number can grow indefinitely. Indeed, up-to-date LLMs keep expanding their context windows, and the 128–200k-token windows we have now feel refreshing after the 2–4–8k windows of the past. Yet that does not mean we have to use all of it without a compelling reason. There are different strategies for limiting the message history; prioritizing dropping old Retrieval context messages over dropping other messages is a good practice. For this limiting, I prefer not to run a tokenizer to count the tokens in each message; I simply assume that 1 token ≈ 4 characters, which maximizes performance in a place where accuracy doesn't really matter.
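To make that concrete, here is a minimal sketch of such a trimming pass, assuming OpenAI-style message dicts where retrieval results come back as "tool" messages; the token budget and helper names are illustrative, not the exact ones used in Chat Tools Service.

```python
# Minimal message-history trimming sketch (illustrative, not the production code).
# Assumptions: OpenAI-style message dicts; retrieval results arrive as role == "tool";
# 1 token is approximated as 4 characters.

def approx_tokens(message: dict) -> int:
    # Cheap token estimate: 1 token ~= 4 characters of content.
    return len(str(message.get("content") or "")) // 4

def trim_history(messages: list[dict], budget_tokens: int = 24_000) -> list[dict]:
    kept = list(messages)

    def total() -> int:
        return sum(approx_tokens(m) for m in kept)

    # Pass 1: drop the oldest retrieval (tool) messages first.
    # Note: in a real chat, the assistant message carrying the matching tool_call
    # should be dropped together with its tool result to keep the history valid.
    for message in list(kept):
        if total() <= budget_tokens:
            break
        if message["role"] == "tool":
            kept.remove(message)

    # Pass 2: if still over budget, drop the oldest non-system messages.
    for message in list(kept):
        if total() <= budget_tokens:
            break
        if message["role"] != "system":
            kept.remove(message)

    return kept
```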
Now, where did those chunks come from?

Extracting Text from Documents

When a document is shared with Document Search MCP, it gets extracted. Using regular expressions and the pymupdf library in Python, we extract a set of paragraphs from each page: blocks of text surrounded by line breaks. Along with the paragraphs comes metadata, which normally includes the page number, chapter, section, and coordinates. The coordinates are 2 pairs of numeric values that describe the imaginary box wrapping a paragraph; the UI can highlight that box, if needed, to visually show users where an answer came from. Additionally, paragraphs are split into chunks of 250–1,000 tokens (using a linebreak-based strategy), which are passed on for vectorization; the resulting embeddings are what the vector search runs on.

However, this approach has a serious limitation: it cannot handle raster PDFs, i.e., PDFs where the text is embedded in an image, such as scanned documents. Coordinate extraction is also a weak part of this approach, as it relies on regexps and heuristics, so it only works well for the specific, limited set of PDFs it was tested on. In a real-world, production-ready application, it's vital to use a CV model that detects the boxes of paragraphs, schemas, images, and tables, and an OCR model that extracts text values from the text-type boxes and technical summaries from the image-type boxes. Such a pipeline lets the extractor handle any input, raster PDF or not, by simply converting every page of a document into an image and running a single pipeline for them all.

Figure 4: Document's Lifetime
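For reference, here is a minimal sketch of the text-layer extraction described above, built on pymupdf's block output; the chunking step, section detection, and the regexp heuristics are omitted, and the output field names are illustrative.

```python
# Minimal paragraph-extraction sketch using pymupdf (illustrative; the real
# extractor adds regexp-based chapter/section detection and 250-1,000-token chunking).
import pymupdf  # pip install pymupdf

def extract_paragraphs(pdf_path: str) -> list[dict]:
    paragraphs = []
    with pymupdf.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type).
            for x0, y0, x1, y1, text, _block_no, block_type in page.get_text("blocks"):
                if block_type != 0:  # 0 = text block, 1 = image block
                    continue
                text = text.strip()
                if not text:
                    continue
                paragraphs.append({
                    "page": page_number,
                    "text": text,
                    # Two pairs of coordinates: the box that wraps the paragraph.
                    "bbox": ((x0, y0), (x1, y1)),
                })
    return paragraphs
```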
Approaches to Work with Extracted Data from Documents

As OpenAI's Vector Storage API was released shortly after I began developing this prototype, we decided to test it as well. It was interesting to try: I would rather push extra complexity onto OpenAI's side than implement storage, indexing, and vector search locally myself. And in theory, OpenAI's API should be able to handle tens of thousands of files, so there would be no need to build a complex architecture locally.

After paragraphs, metadata, and text chunks are extracted from a document, they enter the Processing Phase, performed by a Thread-Processor. At the moment, Document Search MCP supports 2 different approaches:

Approach 1: storing, vectorizing, and searching using OpenAI's Files and Vector Storage APIs.
Approach 2: vectorizing with any third-party embeddings model, e.g., OpenAI's text-embedding-3-small, with the vector search performed locally.

Both approaches were tested on the same 100-page document, which contains ~600 parsed paragraphs, or ~44k tokens.

Approach 1: OpenAI-based vector search

As expected, this approach brought us almost no extra architectural complexity. That benefit, however, came with pretty disappointing, if predictable, processing performance. For each text chunk or paragraph (in case you are OK with OpenAI's chunking strategies), you need to perform 2 requests to upload it as a file; that's 1,200 requests for the document above. I managed to send all 1,200 requests in ~140 s of wall-clock time, with 10 concurrent requests at a time and a 5 s response timeout. Increasing the number of concurrent requests only increased the number of requests ending in timeout errors, so I settled on 10 as the optimal amount.

Even if 140 s doesn't seem too long for a 100-page document, on second thought it is. This service should be accessible to many users, or even multiple apps, at once. A user who uploads a document ends up waiting not just 140 s for their own document to be processed, but an additional 140 s for every user who uploaded a file before them. In other words, if users A, B, and C upload 100-page documents at the same time, in that order, user C waits 140 s × 3 = 7 minutes for their document.

Approach 2: local vector search and storage, external embedding requests

Funny enough, the approach that stores vectors locally, and consequently performs the vector search locally as well, needs only 6 requests* to fully process the same document. That's a 200x improvement over the previous approach. Total processing time came to 11 s, roughly a 13x improvement.

*For the 100-page document, embedding requests were sent with a batch size of 128. For documents of other sizes, the batch size should be adjusted to keep the response time of the external embedding model minimal; for a 400-page document, for example, you might try 128 × 4 = 512.

Processing time alone convinced me to declare the supremacy of the local approach over the OpenAI-based one, and hereinafter I would suggest it over other cloud-based solutions wherever the extra architectural overhead is acceptable. For storage and vector search there is an endless list of solutions: pgvector, milvus, qdrant, or milvus-lite for in-memory vector search.
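As a rough illustration of Approach 2, here is a minimal sketch that embeds chunks in batches with OpenAI's text-embedding-3-small and runs a brute-force cosine search in memory; the helper names are made up, and a real deployment would keep the vectors in one of the stores listed above rather than in a numpy array.

```python
# Minimal sketch of Approach 2 (illustrative): batched embedding requests plus
# an in-memory cosine search. Production setups would persist vectors in
# pgvector/qdrant/milvus instead of a plain numpy array.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

def embed_batched(texts: list[str], batch_size: int = 128) -> np.ndarray:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model=EMBED_MODEL, input=texts[i:i + batch_size]
        )
        vectors.extend(item.embedding for item in response.data)
    return np.asarray(vectors, dtype=np.float32)

def top_k(query: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 10) -> list[tuple[float, str]]:
    query_vec = embed_batched([query])[0]
    # Cosine similarity = dot product of L2-normalized vectors.
    chunk_norm = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = chunk_norm @ query_norm
    best = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunks[i]) for i in best]
```

With ~600 chunks and a batch size of 128, embedding the whole document takes only a handful of requests, which is where the order-of-magnitude gap with Approach 1 comes from.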
Approaches' Pricing Comparison

The pricing of Approach 1 (OpenAI-based vector search) has 3 components:

File Storage: $0.1/GB per day, with the 1st GB free.
Vector Search Calls: $2.5 per 1,000 calls.
Embedding model: $0.02 per 1M tokens for text-embedding-3-small.

1 GB of text really is a lot and difficult to exceed: a 100-page document is ~171 KB, which means the free gigabyte could hold ~6,132 such documents. The pricing would look like a sweet deal if the performance weren't so inefficient in our case.

In the case of Approach 2 (local vector search), you only need an embedding model from the list above, which puts the price of processing a 100-page document at ~$0.0009. Infrastructure cost, however, can vary widely depending on the project's needs.

RAG Results Validation

Out of all the domains, we chose technical specifications for validation, as we already had a purpose-built set of questions over a group of documents that an LLM with RAG should be able to handle on a daily basis. For that set of documents (nearly a dozen), we had the LLM compose answers using RAG, after which the results were carefully validated by a person. During the validation phase we also noticed that extraction wasn't working properly on some documents, which prevented the model from producing valid results for the whole set of questions on those documents. For the large majority of documents that had no extraction issues, the model responded with accurate results in ~80% of all cases.

When this prototype is used in real-life projects, we will have to set up a more precise validation pipeline that takes the documents' domain into account. That pipeline should be centered around the same idea: measure the difference between the model's responses in 2 cases: when the model has all the context needed to answer a question (i.e., it is given the complete document), and when it has to retrieve that context through RAG calls. Having a set of ground-truth answers and a set of predicted answers, we can use different metrics to evaluate how close the RAG-based answers get: e.g., BLEU/ROUGE or an embedding-based similarity. A human-review metric can also be important when a person with domain knowledge is available. In areas where a set of questions would be difficult or unreasonable to create by hand, we could use a model-generated set of questions instead; that set can be as diverse, thematically rich, and large as needed, and can additionally be post-filtered by a technical specialist.

Results

This article walks through the implementation step by step and shares concrete engineering insights from building an app that lets you chat with PDFs of practically any size. The app follows distributed-architecture guidelines so that it can scale and stay resilient in production. Along with the validated engineering approaches, the article offers practical advice, covering hardware and ML models, on how to embed this application into a production-ready project.
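As a closing illustration of the embedding-based similarity metric mentioned in the validation section, here is a minimal sketch that scores RAG answers against ground-truth answers; it reuses the embed_batched helper from the Approach 2 sketch, and the averaging at the end is just one possible way to turn the scores into a single number to track.

```python
# Minimal sketch of embedding-based answer validation (illustrative).
# Reuses embed_batched() from the Approach 2 sketch above.
import numpy as np

def answer_similarity(ground_truth: list[str], predicted: list[str]) -> list[float]:
    # One embedding per answer; pairs are compared position by position.
    truth_vecs = embed_batched(ground_truth)
    pred_vecs = embed_batched(predicted)
    truth_vecs /= np.linalg.norm(truth_vecs, axis=1, keepdims=True)
    pred_vecs /= np.linalg.norm(pred_vecs, axis=1, keepdims=True)
    return [float(t @ p) for t, p in zip(truth_vecs, pred_vecs)]

# Usage: the mean similarity over a question set gives one number to watch
# as the extraction and retrieval pipeline evolves.
# scores = answer_similarity(gt_answers, rag_answers)
# print(sum(scores) / len(scores))
```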