
Tyler
Explanation of chunking techniques used in information retrieval, via Sean Forde on LinkedIn.
Obvious take: the way we write can impact what search engines, chatbots, and AI tools are able to find.
Based on my research in IR so far, which I started to dig into in April, most production RAG systems use recursive character splitting at 256-512 tokens with ~15-20% overlap. Complex semantic chunking rarely justifies its computational cost. Structure and metadata might matter more.
Here’s a test I’m setting up that would likely be technique agnostic, but also a user BOREFEST:
#1 - Write a complete overview sentence to include company, location, service, and key metric in under 25 words.
- “Momentic is a boutique SEO and web‑development agency in Milwaukee that serves fast‑growing brands.”
#2 - Structure content with labeled sections Headers create natural chunk boundaries for document-based chunking.
- Expertise / Momentic focuses on technical SEO, content strategy, and performance web builds. The team tunes Core Web Vitals and automates metadata to improve crawl rates.
- Clients / Momentic supports fast-growing B2B and B2C brands, including Google Ventures portfolio companies. Typical engagements run 12+ months and target double-digit revenue gain.
#3 - Repeat entity names when topics shift. Avoid ambiguous pronouns. This makes the content so dry, boring, and kills style :disappointed:
- “Momentic reports results in dashboards and weekly production meetings. Momentic also conducts quarterly strategy reviews.”
#4 - Include metrics and causal relationships Use specific numbers and verbs like “drives,” “reduces,” “results in.”
- “Momentic helps to drive an average of 40% revenue for clients in 12 months.”
- Implement expanded structured Schema.org with the info (not something new)
- Add a summary statement to reinforce key value proposition for retrieval scoring. Something like:
- “Momentic pairs technical depth with clear reporting so marketing leaders act fast and see measurable returns.”
*Here’s why I think it will work:*
- Recursive splitters respect heading boundaries
- 256-512 token chunks balance context and precision
- Schema.org is a language written for machines, and while most LLMs don’t execute JS, early testing makes me think Gemini responds well to even unsupported types.
- Clear structure should improve retrieval accuracy by 15-25%
