Web-to-Markdown RAG Tools
Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. This category does not include PDF extraction, search indexing, or tools that don't produce Markdown output.
101 web-to-Markdown RAG tools are tracked. 7 score above 50 (Established tier). The highest-rated is any4ai/AnyCrawl at 69/100 with 2,763 stars. 3 of the top 10 are actively maintained.
Get all 101 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
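The same request can be made from Python. The sketch below uses only the standard library and reuses the endpoint and query parameters from the curl command above. The helper names `build_url` and `fetch_projects` are hypothetical, and the response shape is not documented here, so the code only parses the body as JSON and makes no assumption about its fields. Whether a larger `limit` returns all 101 projects in one call is also not stated, so the default mirrors the value in the curl example.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Build the request URL (hypothetical helper, not part of any SDK)."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=30):
    """Fetch the URL and parse the body as JSON.

    The structure of the returned JSON is an unknown here, so the
    parsed value is returned unchanged for the caller to inspect.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_projects(build_url())` performs the request without a key, which counts against the 100 requests/day allowance mentioned above.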
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | any4ai/AnyCrawl: AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready... | | Established |
| 2 | ScrapeGraphAI/Scrapegraph-ai: Python scraper based on AI | | Established |
| 3 | adbar/trafilatura: Python & Command-line tool to gather text and metadata on the Web: Crawling,... | | Established |
| 4 | kreuzberg-dev/html-to-markdown: High performance and CommonMark compliant HTML to Markdown converter.... | | Established |
| 5 | lightfeed/extractor: Using LLMs and AI browser automation to robustly extract web data | | Established |
| 6 | paulpierre/markdown-crawler: A multithreaded 🕸️ web crawler that recursively crawls a website and creates... | | Established |
| 7 | luisleo526/doc2mark: AI-powered Python library that converts any document (PDF, Word, Excel,... | | Established |
| 8 | rodricios/wxpath: wxpath - declarative web crawling with XPath; a Web Query Language (WQL) | | Emerging |
| 9 | AnkitNayak-eth/CrawlAI-RAG: CrawlAI RAG is an AI-powered website intelligence platform that allows users... | | Emerging |
| 10 | firecrawl/firecrawl-app-examples: 🔥 This repository contains complete application examples, including websites... | | Emerging |
| 11 | sigoden/rag-crawler: Crawl a website to generate knowledge file for RAG | | Emerging |
| 12 | raintree-technology/docpull: Crawl any website and convert it to clean, AI-ready Markdown — async Python... | | Emerging |
| 13 | apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications and RAG... | | Emerging |
| 14 | intergalacticalvariable/reader: 📚 This is an adapted version of Jina AI's Reader for local deployment using... | | Emerging |
| 15 | m92vyas/llm-reader: Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina... | | Emerging |
| 16 | opendatalab/MinerU-HTML: MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean... | | Emerging |
| 17 | Thordata/thordata-firecrawl: Thordata Firecrawl – Firecrawl-compatible web crawling & scraping API built... | | Emerging |
| 18 | KimSeogyu/undocx: Extract clean, structured Markdown from DOCX for LLM and RAG contexts. | | Emerging |
| 19 | BjornMelin/ai-docs-vector-db-hybrid-scraper: Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant... | | Emerging |
| 20 | vishwajeetdabholkar/eGet-Crawler-for-ai: Web scraping framework built for AI applications. Extract clean, structured... | | Emerging |
| 21 | dezoito/markitdown-api: Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into... | | Emerging |
| 22 | Tendo33/arxiv-md: One-click conversion of arXiv papers to Markdown with perfect LaTeX formula... | | Emerging |
| 23 | mensfeld/llm-docs-builder: Transform and optimize your markdown documentation for Large Language Models... | | Emerging |
| 24 | supacrawler/supacrawler: Supacrawler's ultralight engine for scraping and crawling the web. Written... | | Emerging |
| 25 | mrmps/pdf2md: Browser based tool to convert PDFs to Markdown | | Emerging |
| 26 | Thordata/Thordata: Official Thordata developer portal repository. Curated overview of... | | Emerging |
| 27 | KylinMountain/markify: Convert files into markdown to help RAG or LLM understand, based on... | | Emerging |
| 28 | philschmid/clipper.js: HTML to Markdown converter and crawler. | | Emerging |
| 29 | jtgsystems/free-sitemap-generator: 🗺️ Free sitemap generator - Create XML sitemaps for SEO | | Emerging |
| 30 | iamarunbrahma/pdf-to-markdown: Conversion of PDF documents to structured Markdown, optimized for Retrieval... | | Emerging |
| 31 | BrowserCash/browser-serp: Real-time Google Search API for AI Agents & RAG pipelines. Get structured... | | Emerging |
| 32 | pc8544/Website-Crawler: Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape... | | Emerging |
| 33 | yaniv-golan/ostruct: Schema-first AI analysis CLI that transforms messy data into structured... | | Emerging |
| 34 | 483218131/github-stars-to-markdown: A lightweight tool that exports GitHub Stars to clean Markdown notes in one click. | | Emerging |
| 35 | malvads/mojo: Non sucking cross-platform extremely fast C++ crawler to convert entire... | | Emerging |
| 36 | agoodway/html2markdown: Convert HTML to Markdown with Elixir | | Emerging |
| 37 | aqueeb/confluence2md: Convert Confluence MIME exports (.doc) to clean Markdown | | Experimental |
| 38 | buildwithfiroz/Web2-LLM.txt: Web2LLM.txt – A fast, open-source website-to-LLM context file generator.... | | Experimental |
| 39 | WebCrawlerAPI/webcrawlerapi-js-sdk: A WebcrawlerAPI SDK for Node JS | | Experimental |
| 40 | ctokx/url-to-markdown: Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based... | | Experimental |
| 41 | TylerMorrison21/paperflow: Open-source PDF-to-Markdown post-processor with footnotes, LaTeX... | | Experimental |
| 42 | Paparusi/crawlkit: 🕷️ Open-source web crawling toolkit — Video, OCR, NLP, Stealth, 10+ parsers | | Experimental |
| 43 | sethupavan12/Markdownify: Convert documents, images to high-quality Markdown using Vision LLMs. Built... | | Experimental |
| 44 | Karthick-840/Crawl4ai-RAG-with-Local-LLM: A tool for scraping web documentation using Crawl4AI, converting it to... | | Experimental |
| 45 | wldevries/confluence-rag: Tool that fetches Confluence pages, converts them to markdown and chunks... | | Experimental |
| 46 | pgEdge/pgedge-docloader: A tool for converting HTML and RST docs into Markdown, and loading them into... | | Experimental |
| 47 | arkeodev/scraper: RAG-based Web Scraping | | Experimental |
| 48 | sgowdaks/nichirin: RAG and Webcrawler in a single package | | Experimental |
| 49 | isSpicyCode/scrappe-tout: Scrappe-Tout is a web scraping tool designed to convert HTML documentation... | | Experimental |
| 50 | jackise69/pdf-sentinel: 🛡️ Convert PDF files to Markdown for LLM workflows with event-driven... | | Experimental |
| 51 | vinaes/md-succ-ai: URL to Markdown API — md.succ.ai | | Experimental |
| 52 | pengboyu-dev/Athanor-Epub-Converter: 📘EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output. | | Experimental |
| 53 | ngpepin/pdftomd-RAG: RAG workflow-friendly enhancement of Marker that converts PDFs into a... | | Experimental |
| 54 | bill-work/md-pdf-md: 📄 Convert Markdown to visually appealing PDFs and extract Markdown from PDFs... | | Experimental |
| 55 | chris-c-thomas/LexBuild: Open-source toolchain that converts the U.S. Code from legislative XML... | | Experimental |
| 56 | GTA509FX/scrappe-tout: 🚀 Convert web pages to clean Markdown fast with Playwright, perfect for... | | Experimental |
| 57 | zcag/readdown: HTML to clean Markdown optimized for LLMs. Replaces readability + turndown.... | | Experimental |
| 58 | Santex12/confluence2md: 🛠️ Convert Confluence `.doc` exports to clean Markdown effortlessly with... | | Experimental |
| 59 | Quippy22/web2llm: Fetch web pages and convert to clean Markdown for LLM pipelines | | Experimental |
| 60 | sumit7235/Domfie: 🛠️ Simplify web scraping with Domfie, the self-healing scraper that adapts... | | Experimental |
| 61 | Horlicks-p/Moelog-LLMs.txt: This plugin implements the emerging llms.txt specification for WordPress,... | | Experimental |
| 62 | marimo-marine23/xlmelt: Convert complex Excel files into AI-readable JSON/HTML | | Experimental |
| 63 | nadya1992024/llm-parse: Parse HTML and markdown offline with a lightweight, single-header C++... | | Experimental |
| 64 | AlphaDev007/AlphaCrawl: A high-performance, asynchronous Go web crawler built to extract LLM-ready... | | Experimental |
| 65 | auto-medica-labs/md-tree: Convert Markdown files into hierarchical JSON tree structures with optional... | | Experimental |
| 66 | danke-global/crawl2kb: Crawl a website and export embedding-ready chunks for RAG pipelines | | Experimental |
| 67 | pinion05/llm-page-context: Turn any web page into clean LLM-ready context strings and structured documents. | | Experimental |
| 68 | Thordata/thordata-cookbook: Real-world recipes and examples for building AI data pipelines with Thordata. | | Experimental |
| 69 | Thordata/thordata-web-qa-agent: Web-native QA agent built on Thordata that delivers a Perplexity-style... | | Experimental |
| 70 | moria97/fastpdf4llm: Lightweight and fast library to convert PDF to markdown format. | | Experimental |
| 71 | davidjsors/br-pdf-to-md-to-rag: PDF to structured Markdown converter, optimized for ingestion into... | | Experimental |
| 72 | EasyDevv/project-to-markdown: Project To Markdown: Project files into structured markdown, optimizing... | | Experimental |
| 73 | ilyashusterman/doc-to-readable: Universal document-to-markdown and section splitter for HTML, URLs, and PDFs. | | Experimental |
| 74 | wmahfoudh/pdf-to-md: Automates the pipeline of converting PDF documents and images into clean... | | Experimental |
| 75 | gsusI/llm-docs-sync: Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into... | | Experimental |
| 76 | StripFeed/stripfeed-js: Official TypeScript SDK for StripFeed - convert any URL to clean Markdown... | | Experimental |
| 77 | pedrokohler/github-repo-to-single-file: TypeScript CLI that pulls a GitHub repo and merges all text-like files into... | | Experimental |
| 78 | Ai4GenXers/pdf-sentinel: Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero... | | Experimental |
| 79 | JamesN-dev/Scroll-Scribe: ScrollScribe is a Python CLI toolkit that grab docs or index website pages... | | Experimental |
| 80 | PetrAPConsulting/image2md: Convert batch of pictures with structured data like tables, formulas, charts... | | Experimental |
| 81 | itsmeyessir/Domfie: An autonomous web scraper that fixes its own broken selectors using a... | | Experimental |
| 82 | QLangstaff/qrawl: Composable web crawling tools for Rust | | Experimental |
| 83 | abcd2113004/url-reader: 🔍 Extract content from any URL with smart platform detection and automatic... | | Experimental |
| 84 | ShaniPlayx/newsweek-scraper: 📰 Collect and analyze fresh articles, headlines, and stories from Newsweek... | | Experimental |
| 85 | the-ai-entrepreneur-ai-hub/ai-training-data-scraper: AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine... | | Experimental |
| 86 | OutofAi/manemark: Manemark allows users to capture and save the text content of webpages so it... | | Experimental |
| 87 | beepboop2025/social-scraper: Economic data collection & AI analysis platform — 14 collectors (RBI, NSE,... | | Experimental |
| 88 | elementarpartikel/ultimate-web-crawler: Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and... | | Experimental |
| 89 | siddueswar/doc-crawler-rag: 🕷️ Ingest clean documentation into LLM pipelines effortlessly, filtering out... | | Experimental |
| 90 | Thordata/thordata-rag-pipeline: 🚀 Production-grade RAG pipeline powered by Thordata Scrapers. Turn any... | | Experimental |
| 91 | SupervisedCo/HyperCrawlTurbo: HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage. | | Experimental |
| 92 | kwanLeeFrmVi/Crawler4AI-to-mardown-files: This project is designed to crawl documentation websites and convert them... | | Experimental |
| 93 | aaronlifton/fastcrawl: an agentic, atomics-driven Rust web crawler optimized for low heap usage,... | | Experimental |
| 94 | amadou-6e/pymdt2json: pymdt2json is a Python CLI and library for converting markdown tables into... | | Experimental |
| 95 | AhmedZeyadTareq/Content_To_Markdown_OCR: convert any file to markdown format | | Experimental |
| 96 | bloomresearch/InSite: A lightning fast tool for crawling websites and compiling PDFs of their pages | | Experimental |
| 97 | QuiddityAI/PDFerret: An all-in-one converter to make your files LLM-understandable | | Experimental |
| 98 | m1r4g3-code/Distill: Distill — Turn any URL into clean, structured data for AI pipelines, RAG... | | Experimental |
| 99 | JeremySmythDigital/sitevac: scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked... | | Experimental |
| 100 | im-shashanks/PdfToMarkdown: Lightweight PDF to Markdown converter. | | Experimental |
| 101 | Edgaras0x4E/web-loader-engine: High-performance web content extraction engine built in Rust. Primary... | | Experimental |