LangChain Integration

Most LangChain RAG pipelines start with RecursiveCharacterTextSplitter. It is fast and simple, but it loses document structure, breaks tables mid-row, and requires manual tuning of chunk size and overlap. POMA's LangChain integration replaces the splitter step with hierarchical chunksets that preserve full document context, while the rest of your LangChain pipeline stays unchanged.

The integration provides drop-in replacements for LangChain's document loading, splitting, and retrieval steps, so you do not need to rewrite your chain.

Installation

Install the integration:

bash
pip install 'poma[langchain]'

The LangChain integration gives you three helpers:

  • PomaFileLoader to load files from a path
  • PomaChunksetSplitter to turn documents into POMA chunkset documents
  • PomaCheatsheetRetrieverLC to wrap a LangChain vector store and return cheatsheet documents

Chunk documents with PrimeCut

python
from poma import PrimeCut
from poma.integrations.langchain import PomaFileLoader, PomaChunksetSplitter

client = PrimeCut()
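# Load the files under ./docs as LangChain documents.
documents = PomaFileLoader("./docs").load()

# Turn the loaded documents into POMA chunkset documents.
splitter = PomaChunksetSplitter(client, verbose=True)
chunkset_docs = splitter.split_documents(documents)

# Inspect the result: number of chunksets and the metadata keys on the first one.
print(len(chunkset_docs))
print(chunkset_docs[0].metadata.keys())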

Each output Document stores the chunkset text in page_content and includes the source chunks in metadata. The splitter expects each input document to carry a valid metadata["source_path"].
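Because the splitter relies on metadata["source_path"], it can help to check the loaded documents before splitting. The following is a minimal sketch and not part of the POMA API: `missing_source_path` is a hypothetical helper, and `FakeDoc` is a stand-in for LangChain's Document used only to keep the example self-contained. It assumes that "valid" means a non-empty string; the splitter itself may apply stricter checks.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def missing_source_path(documents):
    """Return the indices of documents without a usable metadata["source_path"].

    Assumption: "usable" means a non-empty string.
    """
    bad = []
    for index, doc in enumerate(documents):
        path = doc.metadata.get("source_path")
        if not isinstance(path, str) or not path.strip():
            bad.append(index)
    return bad


docs = [
    FakeDoc("intro", {"source_path": "./docs/intro.md"}),
    FakeDoc("orphan", {}),                       # key missing entirely
    FakeDoc("blank", {"source_path": ""}),       # key present but empty
]
bad_indices = missing_source_path(docs)
print(bad_indices)
```

Running a check like this before split_documents shows which inputs would need a source_path, for example documents that were built by hand rather than loaded with PomaFileLoader.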

Add the chunksets to your vector store

python
# Replace this with your preferred LangChain vector store.
vector_store = ...
vector_store.add_documents(chunkset_docs)

Retrieve cheatsheets instead of raw chunksets

python
from poma.integrations.langchain import PomaCheatsheetRetrieverLC

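# Wrap the vector store so queries return cheatsheet documents.
retriever = PomaCheatsheetRetrieverLC(vector_store, top_k=4)
cheatsheet_docs = retriever.invoke("How do I authenticate?")

# Print the text of the first cheatsheet.
print(cheatsheet_docs[0].page_content)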

PomaCheatsheetRetrieverLC groups hits by document and returns one cheatsheet Document per document.
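The grouping step can be pictured roughly as follows. This is an illustrative sketch of grouping by document, not POMA's implementation: the `doc_id` metadata key, the `Hit` type, and `group_hits_by_document` are hypothetical names, and the real retriever builds a cheatsheet from each group rather than returning a plain list.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical stand-in for one vector-store hit."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_hits_by_document(hits, key="doc_id"):
    """Group hits by a metadata key, keeping first-seen document order."""
    groups = {}  # plain dicts preserve insertion order
    for hit in hits:
        groups.setdefault(hit.metadata[key], []).append(hit)
    return groups


hits = [
    Hit("API keys are sent in a header.", {"doc_id": "auth.md"}),
    Hit("Rate limits reset hourly.", {"doc_id": "limits.md"}),
    Hit("Tokens expire after a fixed period.", {"doc_id": "auth.md"}),
]
groups = group_hits_by_document(hits)
for doc_id, doc_hits in groups.items():
    print(doc_id, len(doc_hits))
```

Here three hits collapse into two groups, which mirrors why the retriever returns one Document per source document rather than one per hit.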

Why replace RecursiveCharacterTextSplitter?

LangChain's default text splitter cuts documents into fixed-size fragments that lose structural context: section headers get separated from their content, tables break mid-row, and overlap inflates your index with near-duplicates. PomaChunksetSplitter preserves the full document hierarchy as chunksets, so every retrieved fact arrives with its lineage (chapter → section → paragraph). The result: more accurate retrieval with typically 77% fewer tokens.
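To see the mid-row problem concretely, here is a deliberately naive fixed-size splitter with overlap. It is a simplified stand-in written for this illustration, not LangChain's RecursiveCharacterTextSplitter (which tries separators such as paragraph and line breaks before falling back to smaller pieces). The sample table and the `chunk_size` and `overlap` values are made up.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Made-up sample table; every row ends with a newline.
table = (
    "| Plan  | Requests/min | Price |\n"
    "| Free  | 60           | $0    |\n"
    "| Pro   | 600          | $49   |\n"
)

chunks = fixed_size_chunks(table, chunk_size=40, overlap=10)

# Chunks that do not end on a newline cut a table row in half.
broken = [chunk for chunk in chunks if not chunk.endswith("\n")]

for chunk in chunks:
    print(repr(chunk))
print(f"{len(broken)} of {len(chunks)} chunks end mid-row")
```

The overlap also means the same characters are stored in two neighbouring chunks, which is the index inflation described above.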

For a detailed comparison of all chunking strategies, see The Ultimate Guide to RAG Chunking Strategies.
