The first month of any RAG project ends with the same realization: your chunks are the bottleneck. You can swap embedding models, tune the reranker, add HyDE, fiddle with k. None of it matters if the chunks you’re embedding are nonsense.
I’ve shipped enough of these to know what bad chunks look like. They cut mid-section. They strip out heading context. They split a single procedural answer across three chunks, none of which contain the whole instruction. The retrieval looks fine in your dev evals because your evals don’t cover the questions real users ask. Then your CEO asks "what’s our delivery policy for Connecticut" and gets back a chunk that’s the middle of the Connecticut delivery section with no header context, and the actual policy is in the next chunk you didn’t retrieve.
This post is the chunker I actually ship. It’s not novel, but the combination of details matters, and most public examples skip half of them.
What "fixed-size with overlap" does to real docs
The standard advice (LangChain’s RecursiveCharacterTextSplitter with chunk_size=1000 and chunk_overlap=200, the universal first thing you try) does this:
- Picks a separator list, usually ["\n\n", "\n", " ", ""].
- Recursively splits until each piece fits.
- Greedy-packs back to chunk_size with overlap.
The problem isn’t the algorithm. The problem is that it’s character-aware, not structure-aware. The character separators don’t know what a heading is. They don’t know that a sub-heading at the start of a section is the most important context to preserve. They don’t know that an unordered list belongs together, and that splitting a 5-item list at item 3 creates two useless half-lists.
On a doc like "Connecticut Operations Manual" with H2 sections "Delivery", "Returns", "Pricing", the recursive splitter happily produces chunk #14 whose content is the back half of Delivery and the front half of Returns. The embedding for that chunk is a smear of both topics. The retrieval is worse for it.
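To make the straddling concrete, here is a deliberately simplified fixed-size splitter. It is not LangChain’s implementation (there is no separator recursion); fixedSizeChunks, size, and overlap are hypothetical names, and the sample doc is made up. It slices by character count with overlap, which is enough to show one chunk swallowing the end of Delivery and the start of Returns.

```typescript
// Hypothetical stand-in for a fixed-size splitter: slices by character
// count with overlap and ignores document structure entirely.
const fixedSizeChunks = (text: string, size: number, overlap: number): string[] => {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const out: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    out.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last slice reached the end
  }
  return out;
};

// Made-up sample text, two H2 sections back to back.
const doc = [
  '## Delivery',
  'Same-day delivery is available Mon-Fri.',
  '## Returns',
  'Returns are accepted within 30 days.',
].join('\n');

const chunks = fixedSizeChunks(doc, 80, 20);

// Chunks that contain Delivery text and the Returns heading at once.
const straddlers = chunks.filter(
  (c) => c.includes('Mon-Fri') && c.includes('## Returns'),
);
console.log(straddlers);
console.log(JSON.stringify(chunks[1])); // the second chunk starts mid-word
```

Nothing in the splitter knows the heading exists, so whether a section boundary lands inside a chunk is down to arithmetic on size and overlap.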
The fix: structure first, tokens second
The chunker I run has four levels of fallback:
1. Split on heading boundaries first.
2. If a section is too large, sub-split on paragraphs.
3. If a paragraph is still too large, split on sentences.
4. If a sentence is somehow too large, hard token-cap.
The point is: you only fall back to dumber splits when the smarter ones can’t fit. 90% of your chunks come out of step 1 or step 2 and they preserve the document’s actual structure. The last two steps exist because some doc somewhere will have a 4000-token paragraph and you need to handle it without crashing.
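The lower three levels of that cascade can be sketched as below. This is a minimal illustration under stated assumptions, not the production chunker: countTokens is a word-count stand-in for a real tokenizer, the sentence split is naive (it will trip on abbreviations and decimals), and atomsOf and hardCap are hypothetical names. It assumes the heading split has already produced the section body.

```typescript
// ASSUMPTION: one token per whitespace-separated word. Swap in the
// tokenizer that matches your embedding model.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

// Last resort: cut a run of words into pieces of at most maxTokens words.
const hardCap = (s: string, maxTokens: number): string[] => {
  const words = s.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    out.push(words.slice(i, i + maxTokens).join(' '));
  }
  return out;
};

// Paragraphs first, then sentences, then the hard cap.
const atomsOf = (sectionBody: string, maxTokens: number): string[] => {
  const atoms: string[] = [];
  const paragraphs = sectionBody.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
  for (const para of paragraphs) {
    if (countTokens(para) <= maxTokens) {
      atoms.push(para); // paragraph fits: keep it whole
      continue;
    }
    // Naive sentence split: break after ., ! or ? followed by whitespace.
    const sentences = para
      .replace(/([.!?])\s+/g, '$1\n')
      .split('\n')
      .map((s) => s.trim())
      .filter(Boolean);
    for (const sentence of sentences) {
      if (countTokens(sentence) <= maxTokens) atoms.push(sentence);
      else atoms.push(...hardCap(sentence, maxTokens)); // sentence too big
    }
  }
  return atoms;
};

const body =
  'Short paragraph here.\n\n' +
  'One two three four five six. Seven eight nine ten eleven twelve thirteen fourteen.';
const atoms = atomsOf(body, 6);
console.log(atoms);
```

Each level only runs when the level above it fails to fit, so a well-structured doc never reaches the sentence split or the hard cap.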
The section-split is the interesting one. Markdown-style heading detection is easy enough:
const HEADING_RE = /^(#{1,6})\s+(.+)$/m;

interface Section {
  heading: string[]; // e.g. ['Operations', 'Delivery', 'Connecticut']
  body: string;
}

const sectionsOf = (text: string): Section[] => {
  const sections: Section[] = [];
  // Open headings with their depths. Tracking depth explicitly keeps the
  // path correct when levels are skipped (a doc that starts at H2, or jumps
  // from H1 to H3); slicing the previous path by depth alone does not.
  const stack: { depth: number; title: string }[] = [];
  let cur: Section = { heading: [], body: '' };
  for (const line of text.split('\n')) {
    const m = HEADING_RE.exec(line);
    if (m) {
      if (cur.body.trim()) sections.push(cur);
      const depth = m[1].length;
      // Close every heading at this depth or deeper, then open this one.
      while (stack.length && stack[stack.length - 1].depth >= depth) stack.pop();
      stack.push({ depth, title: m[2].trim() });
      cur = { heading: stack.map((h) => h.title), body: '' };
    } else {
      cur.body += line + '\n';
    }
  }
  if (cur.body.trim()) sections.push(cur);
  return sections;
};
The heading-path inheritance is the part nobody talks about. When you hit an H3 in the middle of an H2 section, the resulting chunks should know they’re inside both. So you carry the H1, H2, H3 down as a path. Stored on the chunk metadata, you can later reconstruct "Operations > Delivery > Connecticut" and either show it in the citation, or feed it back into the LLM context.
Heading-aware chunking is half the win. The other half is overlap that actually carries information.
Token-aware tail-carry overlap
"Chunk overlap" in most implementations is byte-level: take the last N characters of the previous chunk, prepend them to the next chunk. This is dumb. You end up with overlaps that start mid-word ("...the delivery process for Conn"). The embedding model handles it, sort of, but you’re wasting tokens on incoherent fragments.
Better: token-aware tail carry. When you flush a chunk, take roughly the last overlap tokens’ worth of complete atoms (paragraphs or sentences), where overlap is your configured token budget, and carry them into the next chunk as its head.
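// Closes over bucket, bucketTokens, chunks, overlap, and tokens()
// from the enclosing chunker.
const flush = () => {
  if (!bucket.length) return;
  const content = bucket.join('\n\n');
  chunks.push({ content, tokenCount: bucketTokens });
  if (overlap > 0) {
    // Carry complete atoms from the tail into the next chunk.
    // Token-aware so we never start mid-sentence. The carry can overshoot
    // overlap by up to one atom, because atoms are never split.
    const tail: string[] = [];
    let tailTokens = 0;
    // i > 0: never carry the whole bucket, or a single-atom chunk would be
    // re-emitted as pure overlap on the next flush.
    for (let i = bucket.length - 1; i > 0 && tailTokens < overlap; i--) {
      const t = tokens(bucket[i]);
      tail.unshift(bucket[i]);
      tailTokens += t;
    }
    bucket = tail;
    bucketTokens = tailTokens;
  } else {
    bucket = [];
    bucketTokens = 0;
  }
};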
Atoms (whole paragraphs or whole sentences) are the smallest unit. Overlap is measured in tokens, but the boundary is always at an atom edge. The result: clean overlaps that read like the natural end of one passage continuing into the next.
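The flush above only makes sense inside a packing loop, so here is a minimal self-contained sketch of one. Assumptions, all mine for illustration: packAtoms, PackedChunk, and the fresh counter are hypothetical names, countTokens is a word-count stand-in for a real tokenizer, and every atom is assumed to already fit within maxTokens. The fresh counter exists so a bucket holding nothing but carried overlap is never emitted as its own chunk.

```typescript
interface PackedChunk {
  content: string;
  tokenCount: number;
}

// ASSUMPTION: word count stands in for a real tokenizer.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

const packAtoms = (atoms: string[], maxTokens: number, overlap: number): PackedChunk[] => {
  const chunks: PackedChunk[] = [];
  let bucket: string[] = [];
  let bucketTokens = 0;
  let fresh = 0; // atoms in the bucket that no chunk has emitted yet

  const flush = (): void => {
    if (fresh === 0) return; // empty, or only carried overlap: nothing new to emit
    chunks.push({ content: bucket.join('\n\n'), tokenCount: bucketTokens });
    // Carry whole atoms from the tail until the overlap budget is met.
    // i > 0 keeps at least one atom behind so the carry is never the whole chunk.
    const tail: string[] = [];
    let tailTokens = 0;
    for (let i = bucket.length - 1; i > 0 && tailTokens < overlap; i--) {
      tail.unshift(bucket[i]);
      tailTokens += countTokens(bucket[i]);
    }
    bucket = tail;
    bucketTokens = tailTokens;
    fresh = 0;
  };

  for (const atom of atoms) {
    const t = countTokens(atom);
    if (fresh > 0 && bucketTokens + t > maxTokens) flush();
    // If the carried tail plus this atom would still overflow, drop the tail.
    if (bucketTokens + t > maxTokens) {
      bucket = [];
      bucketTokens = 0;
    }
    bucket.push(atom);
    bucketTokens += t;
    fresh++;
  }
  flush(); // emit whatever is left
  return chunks;
};

const packed = packAtoms(['a a a', 'b b b', 'c c c', 'd d d'], 7, 2);
console.log(packed.map((c) => c.content));
```

With these toy atoms each chunk holds two atoms and shares exactly one whole atom with its neighbor, so every overlap begins and ends on an atom edge.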
The cleanup pass nobody mentions
If you’re ingesting from Google Drive, every Doc with active suggestions or comments comes through with inline markers:
The delivery process [suggested edit: should be updated for 2025] runs as follows.
[comment: do we still do same-day?] Same-day delivery is available Mon-Fri.
If you embed that text as-is, your vectors are poisoned by review meta-commentary. The fix is two regexes in a pre-clean pass:
const cleaned = text
  .replace(/\[suggested edit:[^\]]*\]/g, '')
  .replace(/\[comment:[^\]]*\]/g, '');
Not glamorous. But your retrieval quality jumps. There are equivalent cleanups for PDF (footers, page numbers, line numbers in legal PDFs), .docx (track-changes markup), and spreadsheets (empty rows, merged-cell shadows). The principle: anything that’s metadata-about-the-doc, not content-of-the-doc, goes before the chunker sees it.
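One detail the two regexes leave behind is whitespace: removing a marker from the middle of a sentence leaves a double space, and removing one from the start of a line leaves a leading space. A minimal sketch of a pre-clean function that tidies both follows. preClean and REVIEW_MARKERS are hypothetical names, the patterns only cover the two marker formats shown above, and stripping leading whitespace assumes indentation carries no meaning in your docs.

```typescript
// Patterns for the two marker formats in the example above. Adjust them
// to whatever your own export path actually produces.
const REVIEW_MARKERS: RegExp[] = [/\[suggested edit:[^\]]*\]/g, /\[comment:[^\]]*\]/g];

const preClean = (text: string): string => {
  let out = text;
  for (const re of REVIEW_MARKERS) out = out.replace(re, '');
  return out
    .replace(/[ \t]{2,}/g, ' ') // collapse the gap a removed inline marker leaves
    .replace(/^[ \t]+/gm, '') // ASSUMPTION: leading indentation is not meaningful
    .trim();
};

const raw =
  'The delivery process [suggested edit: should be updated for 2025] runs as follows.\n' +
  '[comment: do we still do same-day?] Same-day delivery is available Mon-Fri.';
console.log(preClean(raw));
```

If your source is Markdown with nested lists or code blocks, drop the leading-whitespace step, since indentation is content there.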
Chunk metadata that survives retrieval
Every chunk gets stored with:
- heading_path: the array of parent headings, like ["Operations", "Delivery", "Connecticut"].
- source: file_id + mime_type, so you know what kind of doc this came from.
- chunk_index: position in the doc, so you can fetch neighbors at retrieval time.
- token_count: for cost accounting and debugging.
The heading_path is the one that earns its keep. At retrieval time you can prepend it to the chunk content before passing to the LLM, giving the model the section context it needs to answer correctly. You can also surface it as part of the citation: "according to Connecticut Operations Manual > Delivery, same-day is Mon-Fri."
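As a rough sketch, the stored record and the prepend step might look like this. The StoredChunk shape mirrors the four fields listed above, but the camelCase names, the nested source object, and the breadcrumb and withSectionContext helpers are assumptions for illustration, as are the sample values.

```typescript
interface StoredChunk {
  content: string;
  headingPath: string[]; // heading_path
  source: { fileId: string; mimeType: string }; // file_id + mime_type
  chunkIndex: number; // chunk_index
  tokenCount: number; // token_count
}

// "Operations > Delivery > Connecticut", usable in a citation as-is.
const breadcrumb = (c: StoredChunk): string => c.headingPath.join(' > ');

// Prepend the section path so the model sees where the text came from.
const withSectionContext = (c: StoredChunk): string =>
  c.headingPath.length ? `${breadcrumb(c)}\n\n${c.content}` : c.content;

// Sample values are placeholders, not real identifiers.
const example: StoredChunk = {
  content: 'Same-day delivery is available Mon-Fri.',
  headingPath: ['Operations', 'Delivery', 'Connecticut'],
  source: { fileId: 'hypothetical-file-id', mimeType: 'text/markdown' },
  chunkIndex: 14,
  tokenCount: 7,
};
console.log(withSectionContext(example));
```

Keeping the path as an array rather than a pre-joined string leaves you free to change the separator or truncate deep paths later without re-ingesting.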
The chunk_index is the second prize. With it, you can do window expansion at retrieval time: fetch chunk_index-1 and chunk_index+1 from the same doc and concatenate around the hit. This is the cheapest precision-boost in RAG. It needs no extra embedding or model calls (just a few more context tokens), it’s deterministic, and it covers the "the answer spans two paragraphs and you only retrieved the first" failure mode that drives most "the AI didn’t find it" complaints.
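Window expansion is only a few lines once chunk_index is stored. A minimal sketch, assuming chunks live in an in-memory array (in practice this would be a lookup by file_id and chunk_index in your store); IndexedChunk, expandWindow, and radius are hypothetical names, and the sample rows are made up.

```typescript
interface IndexedChunk {
  fileId: string;
  chunkIndex: number;
  content: string;
}

// Return the hit plus its neighbors within `radius` positions, in document
// order, restricted to the same file.
const expandWindow = (all: IndexedChunk[], hit: IndexedChunk, radius = 1): string =>
  all
    .filter((c) => c.fileId === hit.fileId && Math.abs(c.chunkIndex - hit.chunkIndex) <= radius)
    .sort((a, b) => a.chunkIndex - b.chunkIndex)
    .map((c) => c.content)
    .join('\n\n');

// Made-up rows from two files, deliberately out of order.
const store: IndexedChunk[] = [
  { fileId: 'ops', chunkIndex: 2, content: 'Orders placed after 2pm ship next day.' },
  { fileId: 'ops', chunkIndex: 0, content: 'Delivery overview.' },
  { fileId: 'hr', chunkIndex: 1, content: 'Unrelated HR text.' },
  { fileId: 'ops', chunkIndex: 1, content: 'Same-day delivery is available Mon-Fri.' },
  { fileId: 'ops', chunkIndex: 3, content: 'Returns overview.' },
];

const hit = store[3]; // the chunk retrieval actually matched
const expanded = expandWindow(store, hit);
console.log(expanded);
```

This sketch does not dedupe: if you chunk with overlap, neighboring chunks repeat the carried tail, so either strip the repeated atoms or accept a few duplicated tokens.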
Where to save your effort
You don’t need a vector DB to test chunking. A flat JSON dump with the raw chunks, the heading_paths, and a tiny script to compute cosine on the fly is enough to evaluate.
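The "tiny script" really is tiny. A minimal sketch, assuming the JSON dump has been parsed into an array whose items carry a precomputed embedding; EmbeddedChunk, cosine, and topK are hypothetical names, and the two-dimensional vectors are toy values standing in for real embeddings.

```typescript
interface EmbeddedChunk {
  content: string;
  headingPath: string[];
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|), with 0 for a zero vector.
const cosine = (a: number[], b: number[]): number => {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
};

// Brute-force top-k: score everything, sort descending, take k.
const topK = (query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] =>
  chunks
    .map((c) => ({ c, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);

// Toy data standing in for the parsed JSON dump.
const dump: EmbeddedChunk[] = [
  { content: 'A', headingPath: ['Delivery'], embedding: [1, 0] },
  { content: 'B', headingPath: ['Returns'], embedding: [0, 1] },
  { content: 'C', headingPath: ['Pricing'], embedding: [1, 1] },
];
const best = topK([1, 0], dump, 2);
console.log(best.map((c) => c.content));
```

Brute force is linear in the number of chunks, which is fine at evaluation scale and keeps the index out of the picture while you compare chunking strategies.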
You don’t need overlap above 100 tokens. Larger overlap rarely improves retrieval and burns embedding cost.
You don’t need a semantic chunker (one that calls an LLM to decide where to split). They’re slow, expensive, and the gains over structural splitting on real docs are smaller than the marketing claims. Save semantic chunking for very-long-form text without obvious structure (transcripts, novels). For business documents, structural is almost always better.
Related
- Permission-aware RAG. Chunking matters less if the wrong user can see the chunk.
- MCP tool responses that don’t make Claude lie. What to do with the chunks once you’ve retrieved them.