Turning messy HTML into clean, agent-ready markdown

Feeding raw HTML to a language model is a waste of tokens and attention. Nav bars, cookie banners, ad slots, and script tags drown out the few hundred words that actually matter. The Read endpoint exists to hand your model only the signal.

The pipeline

A read is three deterministic steps — no model in the loop, so it's fast, cheap, and repeatable. First we fetch the page with a real browser user-agent and follow redirects. Then we run Mozilla Readability — the library behind Firefox Reader View — on a clone of the DOM to isolate the main article and drop the chrome. Finally we convert that subtree to Markdown with Turndown, normalizing headings, links, lists, and code blocks.
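
To make the extract-then-convert idea concrete, here is a toy sketch using only the Python standard library. It is not Readability or Turndown: it swaps in a crude tag blocklist for Readability's content scoring and handles only headings, paragraphs, and links. The fetch step is left out so the example runs offline. The names `ToyReader`, `SKIP`, and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this toy model (an assumption; the real
# Readability library scores candidate nodes instead of using a blocklist).
SKIP = {"script", "style", "nav", "header", "footer", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class ToyReader(HTMLParser):
    """Drop chrome subtrees, then emit Markdown for what is left."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.blocks = []     # finished Markdown blocks
        self.buf = []        # text pieces of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.href = None     # target of the link being read, if any

    def flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buf, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS or tag == "p":
            self.flush()
            self.prefix = HEADINGS.get(tag, "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.buf.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in HEADINGS or tag == "p":
            self.flush()
        elif tag == "a" and self.href:
            self.buf.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = ToyReader()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return "\n\n".join(parser.blocks)


page = (
    '<nav><a href="/">Home</a></nav>'
    "<article><h1>Post title</h1>"
    '<p>Clean <a href="https://example.com/x">article</a> body.</p></article>'
    "<footer>Copyright</footer><script>track()</script>"
)
print(html_to_markdown(page))
```

The same input always yields the same output, which is the property the pipeline relies on. The production libraries cover far more cases (lists, code blocks, nested markup) than this sketch does.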

bash

curl -X POST https://synoppy.com/api/scrape \
  -H "Authorization: Bearer $SYNOPPY_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://stripe.com/blog", "formats": ["markdown"] }'

You get the markdown plus metadata — title, description, language, word count — at the top level of the response, so it drops straight into a prompt or a chunker. No nested envelope to unwrap.

json

{
  "success": true,
  "markdown": "# Post title\n\nClean article body...",
  "metadata": { "title": "Post title", "language": "en", "wordCount": 842 },
  "latencyMs": 512,
  "creditsUsed": 1
}
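
A minimal Python sketch of the same call, assuming the endpoint, headers, and response fields shown above and nothing else about the API. The helper names `build_request`, `read`, and `to_prompt` are hypothetical, and the error handling is deliberately thin.

```python
import json
import urllib.request

API_URL = "https://synoppy.com/api/scrape"  # endpoint from the curl example


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"url": url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def read(url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(url, api_key), timeout=timeout) as resp:
        return json.load(resp)


def to_prompt(response: dict) -> str:
    """Turn a response into prompt text; fields are read from the top level."""
    if not response.get("success"):
        raise ValueError("read did not succeed")
    title = response.get("metadata", {}).get("title", "untitled")
    return f"Source: {title}\n\n{response['markdown']}"


# The sample response from above, used here so the example runs offline.
sample = {
    "success": True,
    "markdown": "# Post title\n\nClean article body...",
    "metadata": {"title": "Post title", "language": "en", "wordCount": 842},
    "latencyMs": 512,
    "creditsUsed": 1,
}
print(to_prompt(sample))
```

Because the markdown and metadata sit at the top level, `to_prompt` needs no unwrapping step before the text goes into a prompt or a chunker.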

Only the main content

By default onlyMainContent is on, so navigation, headers, footers, and other boilerplate are stripped, leaving only the primary article body. Turn it off when you genuinely want the whole page (a pricing grid or a directory listing, for example) instead of a single article.
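
As a sketch of turning the flag off, assuming onlyMainContent is a top-level boolean in the same JSON body as url and formats (the post names the flag but does not show where it goes). The `build_payload` helper and its `whole_page` parameter are hypothetical.

```python
import json


def build_payload(url: str, whole_page: bool = False) -> str:
    """Return the JSON request body as a string."""
    payload = {"url": url, "formats": ["markdown"]}
    if whole_page:
        # Assumption: the flag lives at the top level of the request body.
        # It is only sent when overriding the default (on).
        payload["onlyMainContent"] = False
    return json.dumps(payload)


print(build_payload("https://example.com/pricing", whole_page=True))
```

Leaving the key out in the default case keeps the request identical to the curl example above.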

Why deterministic beats “ask an LLM to clean it”

Using a model to extract content is slow and non-deterministic, and you pay for tokens on the messiest part of the page. Readability and Turndown together give you the same clean result for the same input every time, in well under a second, for one credit. We reserve the model for the one job it's uniquely good at, structured extraction, where you actually want judgment, not cleanup.

Read uses a plain HTTP fetch by default — fast, cheap, and enough for the vast majority of the web. For pages that render their text entirely client-side, pass render: true (or render: "auto") and the request runs through a real browser so you get the fully hydrated content.
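
One way a client might decide when to pay for a browser render is sketched below: try the plain fetch first, and retry with render: true only if the result looks empty. This is an illustration, not documented behavior. It assumes render is a top-level field in the request body, the 50-word threshold is arbitrary, and render: "auto" may already do something similar on the server. The `fetch` argument is a hypothetical callable that sends a payload and returns the decoded response.

```python
from typing import Callable

MIN_WORDS = 50  # hypothetical cutoff for "this page came back empty"


def read_with_fallback(
    url: str,
    fetch: Callable[[dict], dict],
    min_words: int = MIN_WORDS,
) -> dict:
    """Plain fetch first; rerun through a browser only when the text is thin."""
    first = fetch({"url": url, "formats": ["markdown"]})
    words = first.get("metadata", {}).get("wordCount", 0)
    if first.get("success") and words >= min_words:
        return first
    # Assumption: `render` sits at the top level of the request body.
    return fetch({"url": url, "formats": ["markdown"], "render": True})


# Stand-in for a real HTTP call so the example runs offline: the plain fetch
# returns an empty shell, the rendered fetch returns the hydrated article.
def fake_fetch(payload: dict) -> dict:
    count = 842 if payload.get("render") else 3
    return {"success": True, "markdown": "...", "metadata": {"wordCount": count}}


result = read_with_fallback("https://example.com/spa", fake_fetch)
print(result["metadata"]["wordCount"])
```

Passing the transport in as `fetch` keeps the retry logic independent of any HTTP library and easy to test.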
