Chunking Strategies — Why Getting This Wrong Quietly Kills Your RAG
In the last post, I talked about vector databases and how retrieval actually works. At the end I mentioned chunking — the step that happens before any of that.
It's also the step most people get wrong without realizing it.
Let's talk about why.
What Is Chunking?
Before you can store a document in a vector database, you have to break it into pieces. Those pieces are called chunks.
Each chunk gets converted into an embedding and stored. When a query comes in, the system retrieves the most relevant chunks and passes them as context to the LLM.
Simple enough, right?
Except here's the problem: the quality of your chunks directly determines the quality of your retrieval. And if your retrieval is bad, your LLM has nothing useful to work with — and it'll hallucinate to fill the gap.
Bad chunking. Garbage in. Confident nonsense out.
The Naive Approach: Fixed-Size Chunking
Most people start here, and honestly, it makes sense as a first step:
Split every document into chunks of N characters (or tokens), with some overlap.
```python
chunk_size = 500    # characters per chunk
chunk_overlap = 50  # characters shared between adjacent chunks
```
It's fast, easy to implement, and works okay for simple use cases.
The problem? Text doesn't care about your chunk boundaries. A paragraph explaining a concept can get sliced in half, leaving one chunk with the setup and the next with the punchline — and neither one makes much sense alone.
Your retrieval ends up pulling incomplete thoughts.
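A minimal sketch of the fixed-size approach (the function name and defaults here are illustrative, not from any particular library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks, with each chunk
    sharing its first `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by this much between chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Note that the splitter never looks at the text itself — only at character offsets — which is exactly why it slices paragraphs mid-thought.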
Better: Structure-Aware Chunking
Instead of splitting blindly by size, split along natural boundaries in the document:
- Paragraphs — most documents have logical paragraph breaks for a reason
- Sentences — for fine-grained retrieval
- Sections/headers — especially useful for structured docs like wikis or technical manuals
- Semantic boundaries — grouping sentences that cover the same idea
This produces chunks that are actually about something, which means embeddings that actually represent something, which means retrieval that actually finds the right thing.
The tradeoff? More work to implement and it doesn't fit every document type cleanly.
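As a sketch of the paragraph-level variant: split on blank lines, then pack whole paragraphs into chunks up to a size budget, so no paragraph is ever cut mid-thought. (The function name and the 800-character budget are my own illustrative choices.)

```python
def paragraph_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars
    (a single oversized paragraph still becomes its own chunk)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)  # close the current chunk at a boundary
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Production splitters typically generalize this by trying a hierarchy of separators (sections, then paragraphs, then sentences) before falling back to raw size.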
The Overlap Problem
Whichever strategy you use, overlap matters more than people think.
If a key idea spans the boundary between two chunks and there's no overlap, it might not get retrieved at all. A small overlap (10–20% of chunk size) is usually enough to prevent important context from falling through the cracks.
Too much overlap though, and you're bloating your vector store with near-duplicate embeddings — which hurts retrieval quality in a different way.
It's a balance. Most people set it once and forget it. Don't.
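The storage cost of overlap is easy to estimate up front. A quick back-of-the-envelope helper (my own, for illustration):

```python
import math

def num_chunks(doc_len: int, chunk_size: int, overlap: int) -> int:
    """How many chunks (and therefore embeddings) a document produces."""
    step = chunk_size - overlap  # each new chunk advances by this much
    return math.ceil(doc_len / step)

# For a 100,000-character corpus with 500-char chunks:
#   overlap 0          -> 200 chunks
#   overlap 50  (10%)  -> 223 chunks (~11% more storage)
#   overlap 250 (50%)  -> 400 chunks (double the storage)
```

A 10% overlap costs roughly 11% more vectors; a 50% overlap doubles your store. That asymmetry is why modest overlaps are usually the right call.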
Chunk Size Is Not One-Size-Fits-All
Here's the bit that trips people up the most:
The right chunk size depends on what you're building.
| Use Case | Smaller Chunks | Larger Chunks |
|---|---|---|
| Precise Q&A | ✅ Better precision | ❌ Too much noise |
| Summarization | ❌ Loses context | ✅ More complete |
| Code retrieval | ✅ Function-level | ❌ File-level is too broad |
| Long-form docs | ❌ Fragments ideas | ✅ Keeps context intact |
There's no magic number. Test with your actual data and your actual queries.
One More Thing: Don't Forget Metadata
Chunks don't have to live alone. Attach metadata to each one:
```json
{
  "chunk": "...",
  "source": "docs/architecture.md",
  "section": "API Design",
  "page": 3
}
```
This unlocks metadata filtering during retrieval — so instead of searching your entire knowledge base, you can scope it down to relevant sections, sources, or document types before doing the similarity search.
It's one of the easiest wins in RAG quality, and most people skip it.
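Here's a toy sketch of metadata filtering — apply the filter first, then run similarity search only on the survivors. (The `retrieve` function, the two-dimensional vectors, and the sample store are all made up for illustration; real vector databases expose this as a filter parameter on the query.)

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def retrieve(query_vec, store, top_k=1, *, section=None):
    """Scope by metadata first, then rank the remaining chunks by similarity."""
    candidates = [c for c in store if section is None or c["section"] == section]
    return sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:top_k]

store = [
    {"chunk": "POST /users creates a user.", "section": "API Design", "vec": [1.0, 0.0]},
    {"chunk": "Deploys are blue-green.", "section": "Deployment", "vec": [0.9, 0.1]},
]
```

With `section="Deployment"`, the deployment chunk wins even though the API chunk scores higher globally — the filter changed what "most similar" gets computed over.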
The Bottom Line
Chunking is boring to talk about and easy to dismiss. It's not LLMs, it's not embeddings, it's not any of the exciting stuff.
But it sits at the foundation of your entire retrieval pipeline. Get it wrong and you'll spend hours debugging hallucinations, adjusting prompts, and tweaking model parameters — when the real problem was upstream the whole time.
Get it right and everything downstream gets easier.
That's the kind of detail that separates a RAG pipeline that works from one that works reliably.
— Cheers, NP