When I'm building RAG apps, the thing that quietly kills quality (and budgets) is turning a document into a pile of "semantic confetti": chunk everything, embed everything, then hope an ANN search surfaces the right bits.
If your document base is large, retrieval easily turns into semantic hide-and-seek across 1.2M chunks, and you can spend thousands of tokens confidently hallucinating near the answer.
Approaches like PageIndex might be a better fit for your case and are worth an experiment, which is the point of this article.
So instead of vectorizing chunks and doing ANN lookup, you can represent the document as a hierarchical tree, which is basically an LLM-optimized table of contents.
The model reasons its way down the tree (e.g. "we're in Risk Factors, then Liquidity, then Covenant breaches…") and pulls context from the exact branch it needs.
This eliminates embedding drift, "close enough" chunks, and accidental detours into unrelated sections that just happen to share a few buzzwords.
Mafin 2.5 is a reasoning-based RAG system for financial document analysis, powered by PageIndex. It achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.

Disclaimer: I'm not affiliated with PageIndex in any capacity.
At a high level, here's how PageIndex works:
(1) Tree generation (indexing): PDF is converted into a hierarchy of nodes (sections/subsections), each with metadata like title, node_id, page_index, and associated text.

(2) Reasoning-based retrieval: LLM chooses which nodes to open next (tree search), optionally producing a rationale + a list of node_ids to consult.

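To make step (1) concrete, here's a hypothetical sketch of the kind of tree that indexing produces. The field names (title, node_id, page_index, nodes) follow the description above, but the section titles, IDs, and exact output shape here are illustrative, not real PageIndex output:

```python
# Hypothetical tree for an annual report; field names follow the
# metadata described above (title, node_id, page_index, nodes).
tree = {
    "title": "Annual Report 2023",
    "node_id": "0000",
    "page_index": 1,
    "nodes": [
        {
            "title": "Risk Factors",
            "node_id": "0001",
            "page_index": 14,
            "nodes": [
                {"title": "Liquidity", "node_id": "0002", "page_index": 21, "nodes": []},
            ],
        },
    ],
}

def walk(node, depth=0):
    """Print the tree as an indented, LLM-friendly table of contents."""
    print("  " * depth + f"{node['node_id']}  {node['title']} (p.{node['page_index']})")
    for child in node.get("nodes", []):
        walk(child, depth + 1)

walk(tree)
```

This "table of contents as data" is what the retrieval step navigates: the model never sees raw chunks, only titles and IDs, until it decides which branch to open.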
This is important for applications, especially ones that have to behave in front of users, auditors, or me at 2 a.m.:
- Precision without brute force: You are not paying to embed and re-embed every paragraph or to ship huge top-k results downstream.
- Better UX and explainability: "I went to Management's Discussion and Analysis, then Results of Operations" is a story you can show users, and it's far more defensible than "the embedding returned these chunks."
- Lower latency + lower token burn: Tree navigation can keep context tight. Tight context means faster responses and fewer "let me restate the entire document back to you" moments.
- Structure-aware truthfulness: In complex financial docs (SEC filings, earnings releases, footnotes), where something appears is half the meaning. PageIndex can treat that structure as the retrieval primitive, not an afterthought.
Give it a try and let me know whether it performs better in your case; this is a truly promising approach.
The repository has a very simple structure:
- run_pageindex.py: entry-point script to run indexing on a PDF (local usage).
- pageindex/: the core library package used by the runner script.
- cookbook/: example notebooks / demos (often the fastest way to understand intended workflows).
- tutorials/: guided examples.
- tests/: sample PDFs + expected/generated outputs (useful for regression checks).
It's also very easy to set up.
Getting Started with PageIndex
(1) Create a virtualenv + install deps
```
python -m venv .venv
source .venv/bin/activate  # (Windows: .venv\Scripts\activate)
pip install --upgrade -r requirements.txt
```

(2) Create a .env file in the repo root with:

```
CHATGPT_API_KEY=your_openai_key_here
```

(3) Run the entry script:

```
python run_pageindex.py
```

You can customize the processing with additional optional arguments:
--model OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages Pages to check for table of contents (default: 20)
--max-pages-per-node Max pages per node (default: 10)
--max-tokens-per-node Max tokens per node (default: 20000)
--if-add-node-id Add node ID (yes/no, default: yes)
--if-add-node-summary Add node summary (yes/no, default: yes)
--if-add-doc-description Add doc description (yes/no, default: yes)

There is also markdown support in PageIndex.
You can use the --md_path flag to generate a tree structure for a markdown file:
```
python3 run_pageindex.py --md_path /path/to/your/document.md
```

How the indexing + retrieval flow fits together
Here's the end-to-end flow you should design around:
(1) Indexing / tree generation
- Input: a PDF (markdown is also supported via --md_path; other formats are not yet)
- Output: hierarchical tree of nodes (TOC-like).
(2) Retrieval (tree search)
- Provide (query + tree structure)
- Ask for JSON output containing "thinking" and "node_list"
This is the core mechanism for reasoning-based navigation.
(3) Answer synthesis
- gather node text (and/or page-level content),
- feed it into your final answer prompt.
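The three steps above can be sketched end to end. The prompt wording, the call_llm stub, and its canned reply below are all illustrative stand-ins (swap in your actual model client); only the JSON contract with "thinking" and "node_list" comes from the flow described above:

```python
import json

def build_search_prompt(query, tree):
    """Step (2): ask the LLM to navigate the tree and pick node IDs."""
    return (
        "You are given a document's table-of-contents tree as JSON.\n"
        f"Tree: {json.dumps(tree)}\n"
        f"Question: {query}\n"
        'Reply with JSON: {"thinking": "...", "node_list": ["..."]}'
    )

def call_llm(prompt):
    # Stub standing in for a real model call (e.g. an OpenAI chat completion).
    return '{"thinking": "Liquidity risk lives under Risk Factors.", "node_list": ["0002"]}'

def tree_search(query, tree):
    reply = json.loads(call_llm(build_search_prompt(query, tree)))
    return reply["thinking"], reply["node_list"]

def synthesize(query, node_texts):
    """Step (3): the final answer prompt sees only the selected branches."""
    context = "\n\n".join(node_texts)
    return f"Answer '{query}' using only:\n{context}"

toy_tree = {"title": "Report", "node_id": "0000", "nodes": []}
thinking, node_list = tree_search("What are the liquidity risks?", toy_tree)
print(node_list)  # ['0002']
```

The "thinking" string is the navigation rationale you can surface to users, which is exactly the explainability win discussed earlier.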
Production integration options (SDK + HTTP APIs)
Even if you run local indexing, you may want the official SDK/API patterns for production workflows:
(1) Python SDK (hosted service)
```
from pageindex import PageIndexClient

pi_client = PageIndexClient(api_key="YOUR_API_KEY")
result = pi_client.submit_document("./2023-annual-report.pdf")
doc_id = result["doc_id"]
```

Then:
- check status: pi_client.get_document(doc_id)
- fetch tree: pi_client.get_tree(doc_id) (supports node_summary)
- OCR: pi_client.get_ocr(doc_id, format="page|node|raw")
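Since document processing is asynchronous, a small polling helper around the SDK calls is handy. This is a sketch under one assumption I have not verified against the SDK docs: that get_document returns a dict with a "status" field that reads "completed" when the tree is ready. Check the actual response shape before relying on it:

```python
import time

def wait_for_tree(pi_client, doc_id, poll_seconds=5, timeout=600):
    """Poll the hosted service until processing finishes, then fetch the tree.

    Assumes get_document(doc_id) returns {"status": "completed", ...} when
    done -- an assumption; verify the field name against the SDK docs.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        doc = pi_client.get_document(doc_id)
        if doc.get("status") == "completed":
            return pi_client.get_tree(doc_id)
        time.sleep(poll_seconds)
    raise TimeoutError(f"document {doc_id} not processed in {timeout}s")
```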
(2) Direct REST endpoints (hosted service)
- POST https://api.pageindex.ai/doc/: upload a PDF; returns doc_id.
- GET https://api.pageindex.ai/doc/{doc_id}/?type=tree: retrieve processing status and the tree; summary toggles node summaries.
- GET https://api.pageindex.ai/doc/{doc_id}/?type=ocr&format=page|node|raw: retrieve OCR results in different formats.
- POST https://api.pageindex.ai/chat/completions: supports messages, optional doc_id (string or array), optional stream.
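If you're wiring these endpoints up yourself, small URL builders keep the query-string logic in one place. The paths and parameters below come straight from the endpoint list above; the auth header scheme is deliberately omitted, so check the official API docs for how to pass your key (e.g. with the requests library):

```python
BASE = "https://api.pageindex.ai"

def tree_url(doc_id, summary=False):
    """GET URL for processing status + tree; summary toggles node summaries."""
    url = f"{BASE}/doc/{doc_id}/?type=tree"
    return (url + "&summary=true") if summary else url

def ocr_url(doc_id, fmt="page"):
    """GET URL for OCR results; fmt is one of page | node | raw."""
    assert fmt in ("page", "node", "raw")
    return f"{BASE}/doc/{doc_id}/?type=ocr&format={fmt}"

# A real call would then be, roughly (auth header per the API docs):
#   requests.get(tree_url(doc_id), headers=...)
```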
Multi-document search patterns
By default, PageIndex focuses on reasoning-based RAG within a single document, and there are 3 recommended multi-doc workflows:
- Search by Metadata (when docs are distinguishable by metadata)
- Search by Semantics (when docs differ by topic/content)
- Search by Description (lightweight approach for a small number of docs)
In practice, this often becomes:
- Pick candidate doc(s) (metadata/semantic/description stage)
- Run tree search inside the chosen doc(s)
- Synthesize an answer
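The two-stage pattern above can be sketched for the "search by description" workflow. In practice stage one would be an LLM ranking call over the doc descriptions; the keyword-overlap select_docs below is a toy stand-in for that call, and the doc IDs and descriptions are made up:

```python
# Toy corpus: doc_id -> short human-written description.
docs = {
    "10k_2023": "Annual report (10-K) for fiscal year 2023",
    "q2_earnings": "Q2 2024 earnings release and call transcript",
}

def select_docs(query, docs):
    """Stage 1 (stand-in): pick candidate docs whose description overlaps
    the query. A real system would ask an LLM to rank the descriptions."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, desc in docs.items()
            if terms & set(desc.lower().split())]

picked = select_docs("What did the Q2 earnings say about margins?", docs)
print(picked)  # ['q2_earnings']
# Stage 2 would then run PageIndex tree search inside each picked doc,
# and stage 3 synthesizes one answer from the gathered node text.
```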
Here are other resources where you can dive deeper:
- Cookbooks: hands-on, runnable examples and advanced use cases.
- Tutorials: practical guides and strategies, including Document Search and Tree Search.
- Blog: technical articles, research insights, and product updates.
- MCP setup & API docs: integration details and configuration options.
Good luck in your experiments!