Building a RAG pipeline on Synoppy

Retrieval is only as good as the corpus behind it. If your knowledge base is a stale PDF dump, your agent answers from stale facts. Synoppy lets you build retrieval on top of the live web and keep it fresh on a schedule — here's the whole shape of it.

1. Discover the URLs

Start with Map to get every URL on a domain without reading each page. It is the cheapest way (1 credit) to scope what you'll ingest.

typescript

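import { Synoppy } from "@synoppy/sdk";

// The API key is read from the environment; the "!" asserts that it is set.
const client = new Synoppy({ apiKey: process.env.SYNOPPY_API_KEY! });

// Map lists the URLs on the domain without reading any page content.
const { urls } = await client.map("https://docs.yourtarget.com");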
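
Map can hand back more URLs than you actually want in the index. Here is a minimal sketch of narrowing the list before you read anything, assuming `urls` is a plain array of absolute URL strings as the snippet above suggests. `scopeUrls` and its options are hypothetical helper names, not part of the Synoppy SDK.

```typescript
// Hypothetical helper (not part of the Synoppy SDK): keep only URLs on one
// host and under one path prefix, with fragments stripped and duplicates removed.
interface ScopeOptions {
  host: string;        // e.g. "docs.yourtarget.com"
  pathPrefix?: string; // e.g. "/guides/"; defaults to the whole site
}

function scopeUrls(urls: string[], options: ScopeOptions): string[] {
  const pathPrefix = options.pathPrefix ?? "/";
  const seen = new Set<string>();
  for (const raw of urls) {
    let parsed: URL;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (parsed.host !== options.host) continue;
    if (!parsed.pathname.startsWith(pathPrefix)) continue;
    parsed.hash = ""; // "#section" links point at the same page
    seen.add(parsed.toString());
  }
  return Array.from(seen);
}
```

Filtering here costs nothing extra, and every URL you drop is a page you never pay to read or embed.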

2. Read each page as markdown

Pipe the URLs through Read (or use Crawl to do discovery and reading in one call). Markdown chunks cleanly on headings, which gives you tidy, semantically coherent chunks instead of arbitrary character windows.

typescript

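// Read the first 50 URLs concurrently, asking for markdown only.
const pages = await Promise.all(
  urls.slice(0, 50).map((u) => client.read(u, { formats: ["markdown"] }))
);

// Split before each "## " heading so every chunk keeps its own heading.
const chunks = pages.flatMap((p) =>
  p.markdown
    .split(/\n(?=## )/)
    .map((c) => ({ url: p.metadata.sourceUrl, text: c.trim() }))
    .filter((c) => c.text.length > 0)
);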
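
Splitting on headings alone can still leave a section longer than your embedding model accepts. One possible refinement, sketched under the assumption that a character budget is a good enough proxy for token limits: split on H2 headings first, then pack paragraphs into chunks up to `maxChars`. The `chunkMarkdown` helper and its default of 2000 characters are illustrative choices, not something Synoppy prescribes.

```typescript
interface Chunk {
  url: string;
  text: string;
}

// Hypothetical helper: heading-first chunking with a soft size cap.
function chunkMarkdown(url: string, markdown: string, maxChars = 2000): Chunk[] {
  const chunks: Chunk[] = [];
  // The lookahead splits *before* each "## " so the heading stays in its section.
  for (const section of markdown.split(/\n(?=## )/)) {
    let buffer = "";
    // Paragraphs are separated by one or more blank lines.
    for (const para of section.split(/\n{2,}/)) {
      // Flush when adding this paragraph (plus its separator) would exceed the cap.
      if (buffer && buffer.length + para.length + 2 > maxChars) {
        chunks.push({ url, text: buffer });
        buffer = "";
      }
      buffer = buffer ? `${buffer}\n\n${para}` : para;
    }
    if (buffer.trim()) chunks.push({ url, text: buffer });
  }
  return chunks;
}
```

A single paragraph longer than `maxChars` is kept whole rather than cut mid-sentence, so treat the cap as soft.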

3. Embed and store

Embed the chunks with your model of choice and upsert them into your vector store with the source URL as metadata, so every answer can cite where it came from.
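
The shape of this step is the same whichever model and store you pick. The sketch below assumes two hypothetical interfaces, `Embedder` and `VectorStore`, standing in for your real embedding client and vector database; `MemoryStore` exists only so the example runs on its own.

```typescript
// Hypothetical interfaces: swap in your real embedding client and vector store.
type Embedder = (texts: string[]) => Promise<number[][]>;

interface VectorRecord {
  id: string;
  vector: number[];
  metadata: { url: string; text: string };
}

interface VectorStore {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Tiny in-memory store, here only so the sketch runs end to end.
class MemoryStore implements VectorStore {
  readonly records = new Map<string, VectorRecord>();
  async upsert(records: VectorRecord[]): Promise<void> {
    for (const r of records) this.records.set(r.id, r);
  }
}

async function embedAndStore(
  chunks: { url: string; text: string }[],
  embed: Embedder,
  store: VectorStore,
  batchSize = 64
): Promise<number> {
  const position = new Map<string, number>(); // chunk counter per source URL
  let written = 0;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embed(batch.map((c) => c.text));
    const records = batch.map((c, j) => {
      const n = position.get(c.url) ?? 0;
      position.set(c.url, n + 1);
      // Deterministic id (URL + position on that page): a re-run overwrites
      // the old record instead of adding a duplicate.
      return { id: `${c.url}#${n}`, vector: vectors[j], metadata: { url: c.url, text: c.text } };
    });
    await store.upsert(records);
    written += records.length;
  }
  return written;
}
```

Storing the URL in `metadata` is what lets the answer step cite its source. The deterministic ids make re-runs idempotent, as long as you pass in all of a page's chunks together.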

4. Keep it fresh

Re-run the whole flow on a cron. Because the input is the live web rather than a one-time export, your index is never staler than your schedule interval. Diff the new markdown against the last run and re-embed only what changed, so embedding costs track how much the site changed rather than how big it is.
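
One way to do that diff is to keep a hash of each page's markdown from the previous run and compare. This is a minimal sketch, assuming you persist the `HashIndex` somewhere between runs (a JSON file, a database row); `diffPages` and the type names are hypothetical, and the hashing uses Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

interface Page {
  url: string;
  markdown: string;
}

// url -> SHA-256 hex digest of the markdown seen on the last run
type HashIndex = Record<string, string>;

function hashMarkdown(markdown: string): string {
  return createHash("sha256").update(markdown).digest("hex");
}

// Hypothetical helper: compare this run's pages against last run's hashes.
function diffPages(pages: Page[], previous: HashIndex) {
  const next: HashIndex = {};
  const changed: Page[] = [];
  for (const page of pages) {
    const hash = hashMarkdown(page.markdown);
    next[page.url] = hash;
    // New pages have no previous hash, so they count as changed too.
    if (previous[page.url] !== hash) changed.push(page);
  }
  // URLs seen last run but missing now: their vectors should be deleted.
  const removed = Object.keys(previous).filter((url) => !(url in next));
  return { changed, removed, next };
}
```

Re-embed only `changed`, delete vectors for `removed`, and save `next` as the baseline for the following run.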

Scope with Map before you Crawl. Crawl bills per page it actually reads, so checking the URL count first keeps a 10,000-page domain from turning into a surprise bill.
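
A guard like the following can make that check explicit before any Crawl call. It is only a sketch: `checkCrawlBudget` is a hypothetical helper, and `creditsPerPage` is a placeholder you should fill in from your plan's pricing, since the per-page rate is not stated here.

```typescript
// Hypothetical helper: estimate the cost of crawling N pages and compare it to a budget.
// creditsPerPage is an assumption supplied by the caller, not a known Synoppy rate.
function checkCrawlBudget(urlCount: number, creditsPerPage: number, budgetCredits: number) {
  const estimatedCredits = urlCount * creditsPerPage;
  return { estimatedCredits, withinBudget: estimatedCredits <= budgetCredits };
}

// Usage idea: feed it the length of the URL list returned by Map,
// and only start the crawl when withinBudget is true.
const plan = checkCrawlBudget(10_000, 1, 500);
if (!plan.withinBudget) {
  console.log(`Estimated ${plan.estimatedCredits} credits exceeds the budget; narrow the scope first.`);
}
```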
