Turning messy HTML into clean, agent-ready markdown

How the Read engine strips boilerplate and preserves structure — deterministically, with no LLM in the loop.

The Synoppy team
Jun 20, 2026 · 5 min read

Feeding raw HTML to a language model is a waste of tokens and attention. Nav bars, cookie banners, ad slots, and script tags drown out the few hundred words that actually matter. The Read endpoint exists to hand your model only the signal.

The pipeline

A read is three deterministic steps — no model in the loop, so it's fast, cheap, and repeatable. First we fetch the page with a real browser user-agent and follow redirects. Then we run Mozilla Readability — the library behind Firefox Reader View — on a clone of the DOM to isolate the main article and drop the chrome. Finally we convert that subtree to Markdown with Turndown, normalizing headings, links, lists, and code blocks.

```bash
curl -X POST https://synoppy.com/api/scrape \
  -H "Authorization: Bearer $SYNOPPY_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://stripe.com/blog", "formats": ["markdown"] }'
```

The response carries the markdown and a metadata object (title, description, language, word count) at the top level, so it drops straight into a prompt or a chunker. There is no nested envelope to unwrap.

```json
{
  "success": true,
  "markdown": "# Post title\n\nClean article body...",
  "metadata": { "title": "Post title", "language": "en", "wordCount": 842 },
  "latencyMs": 512,
  "creditsUsed": 1
}
```
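
For readers calling the endpoint from Python rather than curl, a minimal standard-library sketch might look like the following. The URL, headers, and response fields mirror the examples above; the helper names (build_request, read, to_prompt) are illustrative assumptions, not part of an official SDK.

```python
import json
import urllib.request

API_URL = "https://synoppy.com/api/scrape"  # endpoint from the curl example


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl example (hypothetical helper)."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def read(url: str, api_key: str) -> dict:
    """Send the request and return the parsed JSON response (needs network)."""
    with urllib.request.urlopen(build_request(url, api_key), timeout=30) as resp:
        return json.load(resp)


def to_prompt(response: dict) -> str:
    """Turn a response into a prompt-ready string: short header, then markdown."""
    if not response.get("success"):
        raise ValueError("read failed")
    meta = response.get("metadata", {})
    header = f"Source: {meta.get('title', 'untitled')} ({meta.get('wordCount', 0)} words)"
    return f"{header}\n\n{response['markdown']}"
```

Because markdown and metadata sit at the top level, to_prompt needs no unwrapping step: it reads both keys directly from the parsed response.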

Only the main content

By default onlyMainContent is on, so navigation, headers, footers, and other boilerplate are stripped and only the primary article body remains. Turn it off when you genuinely want the whole page, such as a pricing grid or a directory listing, rather than a single article.
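
As a sketch of how the toggle might be sent, the helper below adds onlyMainContent to the request body only when it differs from the default. Placing the field as a top-level boolean in the JSON body is an assumption based on the option name; the helper name scrape_payload and the example.com URLs are hypothetical.

```python
import json


def scrape_payload(url: str, only_main_content: bool = True) -> str:
    """Build a JSON request body for a read (hypothetical helper).

    Assumes onlyMainContent is a top-level boolean in the body. The
    default (on) is left implicit so the payload matches the basic example.
    """
    payload = {"url": url, "formats": ["markdown"]}
    if not only_main_content:
        payload["onlyMainContent"] = False  # ask for the whole page
    return json.dumps(payload)


# Article: rely on the default. Pricing grid: keep the full page.
article_body = scrape_payload("https://example.com/blog/post")
pricing_body = scrape_payload("https://example.com/pricing", only_main_content=False)
```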

Why deterministic beats “ask an LLM to clean it”

Using a model to extract content is slow and non-deterministic, and you pay for tokens on the messiest part of the page. Readability + Turndown gives you the same clean result every time, in well under a second, for one credit. We reserve the model for the one job it's uniquely good at, structured extraction, where you actually want judgment rather than cleanup.
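
To make the determinism argument concrete, here is a toy stand-in for the pipeline. It is not Readability or Turndown: it uses Python's built-in html.parser, drops a fixed set of boilerplate tags, and emits Markdown for headings, paragraphs, and list items only. The class name, tag lists, and sample page are all illustrative assumptions. The point is that the same input always yields the same output, which is easy to check with a hash.

```python
import hashlib
from html.parser import HTMLParser

SKIP = {"nav", "header", "footer", "aside", "script", "style"}  # boilerplate
PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class ToyReader(HTMLParser):
    """Toy stand-in for Readability + Turndown: drop boilerplate, emit Markdown."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.prefix = None   # Markdown prefix of the open block, if any
        self.text = []       # text pieces collected for the open block
        self.blocks = []     # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in PREFIX:
            self.prefix = PREFIX[tag]
            self.text = []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in PREFIX and self.prefix is not None:
            body = " ".join("".join(self.text).split())  # collapse whitespace
            if body:
                self.blocks.append(self.prefix + body)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.text.append(data)


def to_markdown(html: str) -> str:
    reader = ToyReader()
    reader.feed(html)
    reader.close()
    return "\n\n".join(reader.blocks)


PAGE = """<html><body>
<nav><a href="/">Home</a><p>Menu</p></nav>
<script>track()</script>
<article><h1>Post title</h1><p>Clean   article <b>body</b>.</p>
<ul><li>First</li><li>Second</li></ul></article>
<footer><p>Cookie banner</p></footer>
</body></html>"""

first = to_markdown(PAGE)
second = to_markdown(PAGE)
# Same input, same bytes out: the digests match on every run.
assert hashlib.sha256(first.encode()).hexdigest() == hashlib.sha256(second.encode()).hexdigest()
print(first)
```

A model-based cleaner offers no such guarantee, which is why caching, diffing, and regression-testing its output are harder.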

Read uses a plain HTTP fetch by default — fast, cheap, and enough for the vast majority of the web. For pages that render their text entirely client-side, pass render: true (or render: "auto") and the request runs through a real browser so you get the fully hydrated content.
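
One way to use the render option without running a browser on every request is to try the plain fetch first and retry with rendering only when the result looks empty. The sketch below assumes render is a top-level field in the request body and reads metadata.wordCount from the response shown earlier. The 50-word threshold, the helper names, and the fake transport are arbitrary choices for illustration.

```python
from typing import Callable


def read_with_fallback(
    url: str,
    send: Callable[[dict], dict],
    min_words: int = 50,
) -> dict:
    """Try a plain fetch first; retry with render=True if the page looks empty.

    send is a caller-supplied function that POSTs a payload to the Read
    endpoint and returns the parsed JSON response.
    """
    payload = {"url": url, "formats": ["markdown"]}
    response = send(payload)
    words = response.get("metadata", {}).get("wordCount", 0)
    if response.get("success") and words >= min_words:
        return response  # plain HTTP fetch was enough
    # Likely a client-rendered page: ask for a real browser this time.
    return send({**payload, "render": True})


# Demo with a fake transport so the sketch runs without network access.
def fake_send(payload: dict) -> dict:
    words = 842 if payload.get("render") else 3
    return {"success": True, "markdown": "...", "metadata": {"wordCount": words}}


result = read_with_fallback("https://example.com/app", fake_send)
```

Each attempt is a separate request, so a fallback costs a second call. Where that matters, passing render: "auto" in a single request may be the simpler choice.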