← Oleksii Turovskyi

Semantic HTML for Machines — Why Skipping Headings Breaks AI Optimization


Machines don't see your design. They read structure.

When GPTBot, ClaudeBot, or PerplexityBot hits a page, its first move is to build a document outline from the headings. The H1 is the topic. H2s are sections. H3s are subsections. That tree is then converted into chunks (data blocks) for RAG indexing, into snippet candidates for chat answers, and into nodes in the citation graph. Broken structure, broken comprehension. It's that simple.

1. Why an LLM parser demands a clean outline

A modern crawler for a language model works differently from a classic search bot of a decade ago. Instead of extracting keywords and counting their density, it builds the document's semantic tree and slices it into chunks of a fixed size. During this process, every chunk inherits context from its parent headings. This is called hierarchical chunking, and it largely decides whether the model finds your page in response to a user query.
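To make the "chunk inherits context from its parent headings" idea concrete, here is a minimal sketch. It is not the code of any particular crawler or splitter: the function name chunkWithHeadingPath and the flat { level, text } input shape are assumptions made for illustration.

```javascript
// Hypothetical input: blocks in document order.
// Headings carry a level (1-6); body text carries level: null.
function chunkWithHeadingPath(blocks) {
  const path = [];   // path[i] holds the current heading text at level i + 1
  const chunks = [];

  for (const block of blocks) {
    if (block.level !== null) {
      // A new heading closes every heading at the same or a deeper level.
      path.length = block.level - 1;
      path[block.level - 1] = block.text;
      continue;
    }
    // Skipped levels leave holes in the path; label them explicitly.
    const headingPath = Array.from(path, (heading) => heading ?? '(unnamed)');
    chunks.push({ headingPath, text: block.text });
  }
  return chunks;
}

const sampleChunks = chunkWithHeadingPath([
  { level: 1, text: 'Automation platform' },
  { level: 2, text: 'Features' },
  { level: 3, text: 'Speed' },
  { level: null, text: 'Orchestrates tasks in 200 ms on average.' },
  { level: 2, text: 'Pricing' },
  { level: 4, text: 'Starter plan' }, // skips H3 on purpose
  { level: null, text: 'For small teams up to 5 people.' },
]);

// Each chunk is printed with the heading trail it would carry into an index.
for (const chunk of sampleChunks) {
  console.log(chunk.headingPath.join(' > '), '|', chunk.text);
}
```

The second chunk ends up with an "(unnamed)" entry in its heading trail, because the sample data jumps from H2 straight to H4.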

Three concrete things that break with bad structure:

  1. The outline algorithm starts hallucinating. A parser builds the document outline level by level. If you skip from H2 to H4, it inserts a phantom H3 — an empty node with no name. The chunk that belongs to that phantom lands in the index without a readable heading. The model sees it as "unnamed fragment under section X" and downranks it.
  2. The RAG ranker indexes the wrong place. Production splitters (MarkdownHeaderTextSplitter in LangChain, MarkdownNodeParser in LlamaIndex, or custom regex-based pipelines) anchor specifically on heading levels. Two H1 tags on one page mean two "root" documents in the index. Your content gets split. A query that should have returned one cohesive answer returns half of one.
  3. The accessibility tree matches what AI agents see. This is the most interesting development of the last year. Agents like Claude Computer Use, OpenAI's Operator, and Claude in Chrome don't parse visual CSS — they read the accessibility tree, the same tree screen readers use for blind users. Broken structure gives an agent the same disorientation that a vision-impaired user gets. Your A11y practices now correlate directly with how well an AI agent can perform actions on your site.
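The first two failure modes can be reproduced with a toy outline builder. This is a sketch of the behavior described above, not the implementation used by any named crawler or library; buildOutline and countNodes are hypothetical helpers, and the phantom node is modeled as a node whose text is null.

```javascript
// Builds a tree from headings given in document order.
function buildOutline(headings) {
  const root = { level: 0, text: '(document)', children: [] };
  const stack = [root];

  for (const { level, text } of headings) {
    // Climb back up until the top of the stack is a shallower heading.
    while (stack[stack.length - 1].level >= level) stack.pop();

    // Fill every skipped level with a nameless placeholder node.
    while (stack[stack.length - 1].level < level - 1) {
      const parent = stack[stack.length - 1];
      const phantom = { level: parent.level + 1, text: null, children: [] };
      parent.children.push(phantom);
      stack.push(phantom);
    }

    const node = { level, text, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root;
}

// Counts nodes matching a predicate anywhere in the tree.
function countNodes(node, predicate) {
  const own = predicate(node) ? 1 : 0;
  return node.children.reduce((sum, child) => sum + countNodes(child, predicate), own);
}

const sampleOutline = buildOutline([
  { level: 1, text: 'Docs' },
  { level: 2, text: 'Features' },
  { level: 4, text: 'Audit logs' }, // skips H3
  { level: 1, text: 'Pricing' },    // second H1
]);

console.log('phantom nodes:', countNodes(sampleOutline, (n) => n.text === null));
console.log('top-level roots:', sampleOutline.children.length);
```

With this sample input the tree contains one nameless node and two top-level roots, which is exactly the split described in points 1 and 2.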

Bottom line: H1 is not "the biggest font." It's the document's topic declaration for machine readers. Treat it accordingly.

2. Correct structure: one H1, sequential descent

The rule is short: one H1 per page, no skipping levels, going back up is fine, jumping forward is not.

Here's what a correct structure looks like for a typical product page:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Automation platform for engineering teams</title>
</head>
<body>
  <header>
    <a href="/" aria-label="Home">
      <img src="/logo.svg" alt="Company logo">
    </a>
  </header>
 
  <main>
    <article>
      <h1>Automation platform for engineering teams</h1>
 
      <section aria-labelledby="features">
        <h2 id="features">Features</h2>
 
        <h3>Speed</h3>
        <p>Orchestrates tasks in 200 ms on average.</p>
 
        <h3>Security</h3>
        <p>SOC 2 Type II, end-to-end encryption.</p>
 
        <h4>Audit logs</h4>
        <p>Export to SIEM via webhook or S3.</p>
      </section>
 
      <section aria-labelledby="pricing">
        <h2 id="pricing">Pricing</h2>
        <h3>Starter plan</h3>
        <p>For small teams up to 5 people.</p>
 
        <h3>Team plan</h3>
        <p>For growing businesses without limits.</p>
      </section>
    </article>
  </main>
</body>
</html>

Three details to note:

  • The logo in <header> is not an H1. It's a link with aria-label. The logo repeats on every page; it doesn't describe the topic of a specific document.
  • The H1 lives in <main> and names the page exactly. Not "Welcome!", not "We are." A concrete topic the model can index.
  • The H4 is logically nested inside the H3 "Security," not flying solo. You did not skip a level, because "Audit logs" is a sub-point of "Security." If you placed the H4 directly under H2 "Features," the parser would conclude there's an invisible H3 between them — and add a phantom node to the index.

⚠️ A note on the HTML5 spec. Formally, HTML5 allowed multiple H1 tags inside sectioning content (<article>, <section>, <nav>, <aside>), each supposedly with a "local" level. But no real browser ever implemented the outline algorithm that would have made this work, and in 2022 the algorithm was removed from the WHATWG HTML Standard, so a heading's level is now taken at face value from its tag name. Conclusion: one H1 per document, period. The old spec be damned.

3. Visually hidden headings: when semantics matters more than design

Some parts of a page don't need a visible heading for users: primary navigation, search forms, filter sidebars, footers. A designer says, "It's clear from context." To a machine, it isn't. To a crawler or agent, an unnamed <nav> is just a list of links without context.

The right compromise is an sr-only (screen-reader only) heading. Present in the DOM, present in the accessibility tree, read by LLM agents and screen readers, but invisible visually.

<nav aria-label="Main navigation">
  <h2 class="sr-only">Main navigation</h2>
  <ul>
    <li><a href="/products">Products</a></li>
    <li><a href="/pricing">Pricing</a></li>
    <li><a href="/docs">Documentation</a></li>
  </ul>
</nav>
 
<aside aria-label="Catalog filters">
  <h2 class="sr-only">Catalog filters</h2>
  <form action="/search" method="GET">
    <label for="category">Category</label>
    <select id="category" name="category">
      <option value="all">All</option>
      <option value="tools">Tools</option>
    </select>
  </form>
</aside>

The canonical sr-only implementation (the same rules used by Tailwind CSS and Bootstrap 4; Bootstrap 5 renamed its class to .visually-hidden):

sr-only.css
.sr-only {
  position: absolute;
  width: 1px;
  height: 1px;
  padding: 0;
  margin: -1px;
  overflow: hidden;
  clip: rect(0, 0, 0, 0);
  white-space: nowrap;
  border: 0;
}

What you must not do here:

  • display: none removes the node from the accessibility tree. Invisible to both agents and screen readers. The heading effectively does not exist.
  • visibility: hidden has the same problem, plus it leaves an empty space in the layout. Worst option.
  • opacity: 0 keeps the node in the tree, but the invisible element still takes up layout space and, if it is focusable, still receives keyboard focus.

What works: the clip + position: absolute technique from the CSS above. The element is clipped to a 1 px box, so nothing is painted, but it stays fully present in the DOM and exposed to assistive technology (AT).
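The differences between these hiding techniques can be summarized as a small decision function. This is a simplified sketch under stated assumptions: classifyHiding and its verdict strings are hypothetical, the input is a plain object of CSS values (in a browser you could pass the result of getComputedStyle), and it only inspects the element's own styles, not ancestors or aria-hidden.

```javascript
// Returns a rough verdict on how a set of CSS values hides an element.
function classifyHiding(style) {
  // Both of these drop the element from the accessibility tree.
  if (style.display === 'none') return 'removed-from-accessibility-tree';
  if (style.visibility === 'hidden') return 'removed-from-accessibility-tree';

  // The sr-only pattern: out of flow, clipped down to a 1px box.
  const clipped =
    style.position === 'absolute' &&
    style.overflow === 'hidden' &&
    style.width === '1px' &&
    style.height === '1px';
  if (clipped) return 'visually-hidden-but-exposed';

  // Fully transparent, yet still laid out and still exposed.
  if (Number(style.opacity) === 0) return 'invisible-but-takes-space';

  return 'visible';
}

console.log(classifyHiding({ display: 'none' }));
console.log(classifyHiding({ visibility: 'hidden' }));
console.log(classifyHiding({ opacity: '0' }));
console.log(classifyHiding({
  position: 'absolute', overflow: 'hidden', width: '1px', height: '1px',
}));
```

Only the last call returns 'visually-hidden-but-exposed', the outcome you want for a heading that should be read by machines but not shown to sighted users.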

An sr-only heading is not "an SEO crutch." It's a declaration: "a logically separate text section with a concrete topic starts here." That's exactly what the ranker reads, and what the agent uses when planning actions on the page.

Quick check for your site

Open any page, go to the Console tab in DevTools, and run this snippet. It relies only on standard DOM APIs, so it needs no jQuery or any other library.

audit-headings.js
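// Collect every heading on the page in document order.
const headings = document.querySelectorAll('h1, h2, h3, h4, h5, h6');
const headingsData = Array.from(headings).map((heading, index) => {
  return {
    order: index,
    level: heading.tagName, // "H1" through "H6"
    text: heading.innerText.trim().slice(0, 60),
    // Visually hidden via the sr-only class, still in the accessibility tree.
    srOnly: heading.classList.contains('sr-only'),
    // No layout box at all, e.g. display: none on the heading or an ancestor.
    notRendered: heading.getClientRects().length === 0,
  };
});

// One row per heading; read the level column from top to bottom.
console.table(headingsData);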

Look at the level column in the resulting table. Answer three questions:

  1. Is there exactly one H1 on the page?
  2. Is the descent sequential, with no skipped levels?
  3. Is every H2/H3 you don't see visually explicitly marked with the sr-only class (and not hidden via display: none)?

If the answer to any of these is "no," you have two paths. The first (correct one): rewrite the DOM so the structure mirrors the logic of the content. The second (quick band-aid): add sr-only headings where a semantic "bridge" between levels is needed, until proper refactoring catches up with the backlog.
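The first two questions can also be answered mechanically. The helper below is a sketch, not part of the original audit snippet: auditHeadingRows is a hypothetical name, and it only assumes rows with a level field such as "H2", listed in document order.

```javascript
// rows: objects with a `level` field like "H1" ... "H6", in document order.
function auditHeadingRows(rows) {
  const levels = rows.map((row) => Number(row.level.slice(1)));
  const h1Count = levels.filter((level) => level === 1).length;

  const skips = [];
  for (let i = 0; i < levels.length; i++) {
    // Treat the document start as level 0, so a page opening with H2 is flagged.
    const previous = i === 0 ? 0 : levels[i - 1];
    // Going deeper by more than one level is a skip; going back up is fine.
    if (levels[i] > previous + 1) {
      skips.push({ order: i, from: previous, to: levels[i] });
    }
  }

  return {
    exactlyOneH1: h1Count === 1,
    sequential: skips.length === 0,
    skips,
  };
}

const sampleReport = auditHeadingRows([
  { level: 'H1' },
  { level: 'H2' },
  { level: 'H4' }, // skips H3
  { level: 'H2' },
  { level: 'H3' },
]);
console.log(sampleReport);
```

In the DevTools console, after running the audit snippet, you could paste the function and call auditHeadingRows(headingsData). The third question, how hidden headings are hidden, still needs a manual look.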

