How to Convert Web Pages to Markdown for LLMs

Extract clean text from web pages and prepare it as context for ChatGPT, Claude, and other AI tools.

You want to give an LLM context from a web page. Maybe it's documentation you want it to reference, a blog post you want it to summarize, or a product page you want it to analyze. You could copy-paste the page, but you'll end up with navigation menus, cookie banners, sidebar links, footer text, and all the other junk that has nothing to do with the actual content.

Converting the page to Markdown first strips all that away and gives you clean, structured text that LLMs can work with much more effectively.

Why LLMs Need Clean Text

LLMs process text in tokens, and every token counts. When you paste raw HTML or a messy copy-paste from a browser, a large percentage of those tokens go to things like <div class="nav-wrapper"> and JavaScript snippets. The model has to wade through all of that to find the actual information.

Clean Markdown solves this in a few ways:

  • Fewer tokens. Markdown is much more compact than HTML. A page that's 50KB of HTML might be 5KB of Markdown. That means you can fit more content into the context window, or save money on API calls.
  • Better signal-to-noise ratio. Without navigation, ads, and boilerplate, the model can focus on the content that actually matters for your question.
  • Preserved structure. Unlike plain text, Markdown keeps headings, lists, code blocks, and links intact. This helps the model understand the document's organization.
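To see the size difference concretely, here's a rough sketch comparing the same content as HTML and as Markdown. The snippet and its byte counts are illustrative, not measurements from a real page:

```python
# Rough illustration: the same content as HTML vs. Markdown.
html = (
    '<div class="post-wrapper"><nav class="nav-wrapper">...</nav>'
    '<article><h1 class="title">Getting Started</h1>'
    '<p class="lead">Install the CLI, then run <code>init</code>.</p>'
    '</article><footer>&copy; 2024</footer></div>'
)

markdown = "# Getting Started\n\nInstall the CLI, then run `init`.\n"

print(len(html), "bytes of HTML")
print(len(markdown), "bytes of Markdown")
```

On real pages the gap is usually far larger, because the wrapper markup, scripts, and styles dwarf the actual prose.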

How URL-to-Markdown Works

The conversion process typically works like this:

  1. Fetch the HTML content of the page
  2. Parse the DOM and identify the main content area
  3. Strip navigation, headers, footers, scripts, styles, and ads
  4. Convert the remaining HTML elements to their Markdown equivalents
  5. Clean up whitespace and formatting

The tricky part is step 2 — figuring out where the "real" content lives. Most tools use heuristics similar to browser reader modes. They look for <article> tags, content density (text-to-tag ratio), and common patterns to separate content from chrome.
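The text-to-tag ratio heuristic can be sketched in a few lines with Python's standard library. This is a minimal illustration of the idea, not how any particular tool implements it:

```python
from html.parser import HTMLParser

class DensityCounter(HTMLParser):
    """Counts tags and visible text to estimate content density."""
    def __init__(self):
        super().__init__()
        self.tags = 0
        self.text_chars = 0
    def handle_starttag(self, tag, attrs):
        self.tags += 1
    def handle_data(self, data):
        self.text_chars += len(data.strip())

def density(fragment: str) -> float:
    """Text-to-tag ratio: higher usually means real content."""
    counter = DensityCounter()
    counter.feed(fragment)
    return counter.text_chars / max(counter.tags, 1)

nav = '<ul><li><a href="/">Home</a></li><li><a href="/blog">Blog</a></li></ul>'
article = ('<article><p>Markdown keeps headings, lists, and links intact, '
           'which helps the model follow the document.</p></article>')

print(density(nav), density(article))
```

Navigation markup packs many tags around very little text, so its ratio is low; an article paragraph scores much higher. Real extractors combine this signal with tag names like `<article>` and other patterns.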

The conversion maps HTML to Markdown roughly like this:

  • <h1> through <h6> become # headings
  • <p> becomes plain text with blank lines
  • <a> becomes [text](url)
  • <ul>/<ol> become Markdown lists
  • <pre>/<code> become fenced code blocks
  • <table> becomes pipe tables
  • <img> becomes ![alt](src)
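A toy converter covering a few of these mappings (headings, paragraphs, links) shows the shape of the translation. This is a deliberately minimal sketch, not a production converter:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter for h1-h6, p, and a tags."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = ""
    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("#" * int(tag[1]) + " ")   # heading level from tag name
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")
    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6", "p"):
            self.out.append("\n\n")                    # block elements end a paragraph
        elif tag == "a":
            self.out.append(f"]({self.href})")
    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown('<h1>Docs</h1><p>See the <a href="/api">API</a>.</p>'))
```

Real converters like turndown or markdownify handle nesting, lists, tables, and edge cases this sketch ignores, but the core mapping is the same.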

Using MDConvert's URL to Markdown Tool

The URL to Markdown tool on MDConvert is the simplest way to do this. Paste a URL, and it extracts the main content as Markdown.

A few things to keep in mind when using it:

  • It works best on content-heavy pages. Blog posts, documentation, articles, and wiki pages convert cleanly. Single-page apps that render everything in JavaScript may not work as well, since the content isn't in the initial HTML.
  • Review the output. Automated extraction isn't perfect. Sometimes a sidebar or related-posts section sneaks in. A quick scan and trim fixes that.
  • Images are preserved as links. The Markdown will include image references, but the images themselves aren't downloaded. For LLM context, this is usually fine since most models work with text anyway.

Preparing Good Context for AI

Getting clean Markdown is step one. How you feed it to the LLM matters too.

Trim What You Don't Need

If you only care about one section of a long article, cut the rest. Smaller, focused context gives better answers than dumping an entire page and hoping the model finds the relevant part.

Add a Prompt Wrapper

Tell the model what the content is and what you want it to do with it. Something like:

Here is the documentation for the Stripe Checkout API:

---
[your extracted Markdown here]
---

Based on this documentation, how do I add a discount code field to the checkout session?

Separating the context from your question with dividers helps the model distinguish between reference material and instructions.
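If you build prompts like this often, a small helper keeps the wrapping consistent. The function name and template here are just one way to do it:

```python
def wrap_context(label: str, markdown: str, question: str) -> str:
    """Wrap extracted Markdown with a label, dividers, and the question."""
    return (
        f"Here is {label}:\n\n"
        "---\n"
        f"{markdown.strip()}\n"
        "---\n\n"
        f"{question}"
    )

prompt = wrap_context(
    "the documentation for the Stripe Checkout API",
    "# Checkout\n\nCreate a session with...",
    "Based on this documentation, how do I add a discount code field?",
)
print(prompt)
```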

Respect Token Limits

Check the model's context window size. If your extracted Markdown is 20,000 tokens and you're using a model with a 4K context window, you need to trim. Even with large context windows, shorter and more relevant input tends to produce better output.
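A common rule of thumb is roughly four characters per token for English prose. Exact counts depend on the model's tokenizer (OpenAI models use tiktoken, for example), but the heuristic is good enough for a quick sanity check:

```python
def rough_token_estimate(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose.
    Use the model's real tokenizer for exact counts."""
    return max(1, len(text) // 4)

markdown = "word " * 4000          # ~20,000 characters of extracted content
estimate = rough_token_estimate(markdown)
print(estimate)                    # ~5,000 tokens
if estimate > 4000:
    print("Too long for a 4K context window -- trim before sending.")
```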

Batch Processing Tips

If you need to convert multiple pages — say, all the pages in a documentation site — you have a few options:

  • Scripted fetching. Use a tool like wget or curl to download pages, then pipe them through an HTML-to-Markdown converter like turndown (JavaScript) or markdownify (Python).
  • Sitemaps. Most sites have a sitemap.xml. Parse it to get a list of all URLs, then convert each one. This is better than crawling because you get a clean list upfront.
  • Rate limiting. If you're hitting a site repeatedly, add a delay between requests. Being a good citizen means not hammering someone's server.

```shell
# Example: batch convert using curl and a Python script
# (html_to_md.py stands in for whatever converter you use)
while read -r url; do
  curl -s "$url" | python3 html_to_md.py >> output.md
  echo "---" >> output.md
  sleep 1   # be polite: one request per second
done < urls.txt
```
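The sitemap approach can be sketched with Python's standard library. The `example.com` URLs are placeholders; note that sitemaps put their elements in a namespace, which the query has to account for:

```python
import xml.etree.ElementTree as ET

# Sitemaps declare this namespace on <urlset>, so <loc> lookups need it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
```

Feed the resulting list into the conversion loop above and you have a clean, crawl-free batch pipeline.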

Limitations and Workarounds

URL-to-Markdown conversion isn't perfect. Here are the common issues and how to deal with them:

  • JavaScript-rendered content. If the page content loads via JavaScript (common with SPAs), the fetched HTML may be empty or just a loading spinner. Workaround: use a headless browser like Puppeteer to render the page first, then convert the rendered HTML.
  • Login-gated content. Pages behind authentication can't be fetched without credentials. Workaround: use your browser's developer tools to copy the rendered HTML, then paste it into MDConvert's HTML to Markdown converter.
  • Complex layouts. Multi-column layouts, tabbed content, and accordion sections often don't convert cleanly. The converter may flatten or reorder content. Workaround: review the output and manually fix any structural issues.
  • Tables in images. If data is in an image rather than an HTML table, no text converter will extract it. You'd need an OCR tool for that.
  • Encoding issues. Pages with unusual character encodings may produce garbled text. Most modern sites use UTF-8, but if you hit issues, check the page's encoding and convert accordingly.

For most typical web pages — articles, docs, blog posts, wiki entries — the conversion works well enough that a quick manual review is all you need before feeding it to an LLM.

Next Steps

If you're working with LLMs regularly, check out our Markdown for AI hub for a complete overview of all the tools and workflows for preparing content for AI. And if you're specifically using ChatGPT or Claude, our guide on converting URLs to Markdown for ChatGPT and Claude walks through prompt wrapping, token optimization, and platform-specific tips.

Building a RAG pipeline? See our Markdown for RAG Pipelines guide for chunking strategies, embedding workflows, and best practices for vector database ingestion.