Web-to-Markdown RAG Tools

Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. This category does NOT include PDF extraction, search indexing, or tools that don't produce Markdown output.

There are 101 Web-to-Markdown RAG tools tracked. 7 score above 50 (Established tier). The highest-rated is any4ai/AnyCrawl at 69/100 with 2,763 stars. 3 of the top 10 are actively maintained.

Get all 101 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
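
The curl request above can also be issued from a script. The following is a minimal Python sketch using only the standard library; the helper names `build_url` and `fetch_quality` are hypothetical, and the shape of the JSON response is not documented on this page, so the code only decodes it without assuming any fields. The sample request uses limit=20; whether a larger limit returns all 101 projects in one call is an assumption that would need checking against the API.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Build the dataset URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_quality(limit=20, timeout=30):
    """Fetch and decode the JSON body (hypothetical helper).

    The response schema is an unknown here, so the decoded value is
    returned as-is for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit=limit), timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a network request and counts against the daily quota):
#   data = fetch_quality()
#   print(type(data).__name__)
```

Keeping URL construction separate from the network call makes it easy to check the request offline before spending any of the 100 keyless requests per day.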

#	Tool	Score	Tier	Stars	Language
1	any4ai/AnyCrawl AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready...	69	Established	2,763	TypeScript
2	ScrapeGraphAI/Scrapegraph-ai Python scraper based on AI	62	Established	22,929	Python
3	adbar/trafilatura Python & Command-line tool to gather text and metadata on the Web: Crawling,...	60	Established	5,481	Python
4	kreuzberg-dev/html-to-markdown High performance and CommonMark compliant HTML to Markdown converter....	60	Established	565	HTML
5	lightfeed/extractor Using LLMs and AI browser automation to robustly extract web data	56	Established	60	TypeScript
6	paulpierre/markdown-crawler A multithreaded 🕸️ web crawler that recursively crawls a website and creates...	53	Established	431	Python
7	luisleo526/doc2mark AI-powered Python library that converts any document (PDF, Word, Excel,...	51	Established	47	Python
8	rodricios/wxpath wxpath - declarative web crawling with XPath; a Web Query Language (WQL)	49	Emerging	108	Python
9	AnkitNayak-eth/CrawlAI-RAG CrawlAI RAG is an AI-powered website intelligence platform that allows users...	47	Emerging	93	Python
10	firecrawl/firecrawl-app-examples 🔥 This repository contains complete application examples, including websites...	46	Emerging	690	Jupyter Notebook
11	sigoden/rag-crawler Crawl a website to generate knowledge file for RAG	45	Emerging	50	TypeScript
12	raintree-technology/docpull Crawl any website and convert it to clean, AI-ready Markdown — async Python...	43	Emerging	20	Python
13	apify/rag-web-browser RAG Web Browser is an Apify Actor to feed your LLM applications and RAG...	42	Emerging	72	TypeScript
14	intergalacticalvariable/reader 📚 This is an adapted version of Jina AI's Reader for local deployment using...	42	Emerging	295	TypeScript
15	m92vyas/llm-reader Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina...	41	Emerging	280	Python
16	opendatalab/MinerU-HTML MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean...	41	Emerging	217	HTML
17	Thordata/thordata-firecrawl Thordata Firecrawl – Firecrawl-compatible web crawling & scraping API built...	39	Emerging	2	Python
18	KimSeogyu/undocx Extract clean, structured Markdown from DOCX for LLM and RAG contexts.	39	Emerging	2	Rust
19	BjornMelin/ai-docs-vector-db-hybrid-scraper Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant...	38	Emerging	10	Python
20	vishwajeetdabholkar/eGet-Crawler-for-ai Web scraping framework built for AI applications. Extract clean, structured...	38	Emerging	53	Python
21	dezoito/markitdown-api Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into...	38	Emerging	65	Python
22	Tendo33/arxiv-md One-click conversion of arXiv papers to Markdown with perfect LaTeX formula...	37	Emerging	4	JavaScript
23	mensfeld/llm-docs-builder Transform and optimize your markdown documentation for Large Language Models...	37	Emerging	80	Ruby
24	supacrawler/supacrawler Supacrawler's ultralight engine for scraping and crawling the web. Written...	35	Emerging	52	Go
25	mrmps/pdf2md Browser based tool to convert PDFs to Markdown	34	Emerging	303	TypeScript
26	Thordata/Thordata Official Thordata developer portal repository. Curated overview of...	34	Emerging	4	—
27	KylinMountain/markify Convert files into markdown to help RAG or LLM understand, based on...	33	Emerging	133	Python
28	philschmid/clipper.js HTML to Markdown converter and crawler.	33	Emerging	614	TypeScript
29	jtgsystems/free-sitemap-generator 🗺️ Free sitemap generator - Create XML sitemaps for SEO	32	Emerging	1	Python
30	iamarunbrahma/pdf-to-markdown Conversion of PDF documents to structured Markdown, optimized for Retrieval...	32	Emerging	115	Python
31	BrowserCash/browser-serp Real-time Google Search API for AI Agents & RAG pipelines. Get structured...	32	Emerging	22	TypeScript
32	pc8544/Website-Crawler Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape...	32	Emerging	74	Java
33	yaniv-golan/ostruct Schema-first AI analysis CLI that transforms messy data into structured...	32	Emerging	8	Python
34	483218131/github-stars-to-markdown A lightweight tool that exports GitHub Stars to clean Markdown notes in one click.	32	Emerging	1	Python
35	malvads/mojo Non sucking cross-platform extremely fast C++ crawler to convert entire...	30	Emerging	12	C++
36	agoodway/html2markdown Convert HTML to Markdown with Elixir	30	Emerging	37	Elixir
37	aqueeb/confluence2md Convert Confluence MIME exports (.doc) to clean Markdown	29	Experimental	37	Go
38	buildwithfiroz/Web2-LLM.txt Web2LLM.txt – A fast, open-source website-to-LLM context file generator....	29	Experimental	7	Python
39	WebCrawlerAPI/webcrawlerapi-js-sdk A WebcrawlerAPI SDK for Node JS	28	Experimental	2	TypeScript
40	ctokx/url-to-markdown Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based...	26	Experimental	7	JavaScript
41	TylerMorrison21/paperflow Open-source PDF-to-Markdown post-processor with footnotes, LaTeX...	26	Experimental	5	Python
42	Paparusi/crawlkit 🕷️ Open-source web crawling toolkit — Video, OCR, NLP, Stealth, 10+ parsers	26	Experimental	5	Python
43	sethupavan12/Markdownify Convert documents, images to high-quality Markdown using Vision LLMs. Built...	25	Experimental	21	Python
44	Karthick-840/Crawl4ai-RAG-with-Local-LLM A tool for scraping web documentation using Crawl4AI, converting it to...	25	Experimental	6	Python
45	wldevries/confluence-rag Tool that fetches Confluence pages, converts them to markdown and chunks...	24	Experimental	1	C#
46	pgEdge/pgedge-docloader A tool for converting HTML and RST docs into Markdown, and loading them into...	24	Experimental	10	Go
47	arkeodev/scraper RAG-based Web Scraping	24	Experimental	14	Python
48	sgowdaks/nichirin RAG and Webcrawler in a single package	23	Experimental	2	Python
49	isSpicyCode/scrappe-tout Scrappe-Tout is a web scraping tool designed to convert HTML documentation...	23	Experimental	7	JavaScript
50	jackise69/pdf-sentinel 🛡️ Convert PDF files to Markdown for LLM workflows with event-driven...	23	Experimental	1	JavaScript
51	vinaes/md-succ-ai URL to Markdown API — md.succ.ai	23	Experimental	1	JavaScript
52	pengboyu-dev/Athanor-Epub-Converter 📘EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output.	23	Experimental	1	Go
53	ngpepin/pdftomd-RAG RAG workflow-friendly enhancement of Marker that converts PDFs into a...	22	Experimental	4	Shell
54	bill-work/md-pdf-md 📄 Convert Markdown to visually appealing PDFs and extract Markdown from PDFs...	22	Experimental	—	TypeScript
55	chris-c-thomas/LexBuild Open-source toolchain that converts the U.S. Code from legislative XML...	22	Experimental	—	TypeScript
56	GTA509FX/scrappe-tout 🚀 Convert web pages to clean Markdown fast with Playwright, perfect for...	22	Experimental	—	JavaScript
57	zcag/readdown HTML to clean Markdown optimized for LLMs. Replaces readability + turndown....	22	Experimental	—	JavaScript
58	Santex12/confluence2md 🛠️ Convert Confluence `.doc` exports to clean Markdown effortlessly with...	22	Experimental	—	Go
59	Quippy22/web2llm Fetch web pages and convert to clean Markdown for LLM pipelines	22	Experimental	—	Rust
60	sumit7235/Domfie 🛠️ Simplify web scraping with Domfie, the self-healing scraper that adapts...	22	Experimental	—	Jupyter Notebook
61	Horlicks-p/Moelog-LLMs.txt This plugin implements the emerging llms.txt specification for WordPress,...	22	Experimental	—	PHP
62	marimo-marine23/xlmelt Convert complex Excel files into AI-readable JSON/HTML	22	Experimental	—	Python
63	nadya1992024/llm-parse Parse HTML and markdown offline with a lightweight, single-header C++...	22	Experimental	—	C++
64	AlphaDev007/AlphaCrawl A high-performance, asynchronous Go web crawler built to extract LLM-ready...	22	Experimental	—	Go
65	auto-medica-labs/md-tree Convert Markdown files into hierarchical JSON tree structures with optional...	22	Experimental	—	TypeScript
66	danke-global/crawl2kb Crawl a website and export embedding-ready chunks for RAG pipelines	22	Experimental	—	Go
67	pinion05/llm-page-context Turn any web page into clean LLM-ready context strings and structured documents.	22	Experimental	—	JavaScript
68	Thordata/thordata-cookbook Real-world recipes and examples for building AI data pipelines with Thordata.	21	Experimental	2	Jupyter Notebook
69	Thordata/thordata-web-qa-agent Web-native QA agent built on Thordata that delivers a Perplexity-style...	21	Experimental	2	Python
70	moria97/fastpdf4llm Lightweight and fast library to convert PDF to markdown format.	20	Experimental	1	Python
71	davidjsors/br-pdf-to-md-to-rag Converter from PDFs to structured Markdown, optimized for ingestion into...	20	Experimental	1	Python
72	EasyDevv/project-to-markdown Project To Markdown: Project files into structured markdown, optimizing...	20	Experimental	17	Python
73	ilyashusterman/doc-to-readable Universal document-to-markdown and section splitter for HTML, URLs, and PDFs.	20	Experimental	6	JavaScript
74	wmahfoudh/pdf-to-md Automates the pipeline of converting PDF documents and images into clean...	19	Experimental	—	Shell
75	gsusI/llm-docs-sync Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into...	19	Experimental	—	Shell
76	StripFeed/stripfeed-js Official TypeScript SDK for StripFeed - convert any URL to clean Markdown...	19	Experimental	—	TypeScript
77	pedrokohler/github-repo-to-single-file TypeScript CLI that pulls a GitHub repo and merges all text-like files into...	18	Experimental	12	TypeScript
78	Ai4GenXers/pdf-sentinel Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero...	17	Experimental	2	JavaScript
79	JamesN-dev/Scroll-Scribe ScrollScribe is a Python CLI toolkit that grabs docs or index website pages...	16	Experimental	1	Python
80	PetrAPConsulting/image2md Convert batch of pictures with structured data like tables, formulas, charts...	16	Experimental	1	Python
81	itsmeyessir/Domfie An autonomous web scraper that fixes its own broken selectors using a...	16	Experimental	1	Jupyter Notebook
82	QLangstaff/qrawl Composable web crawling tools for Rust	15	Experimental	—	Rust
83	abcd2113004/url-reader 🔍 Extract content from any URL with smart platform detection and automatic...	14	Experimental	—	Python
84	ShaniPlayx/newsweek-scraper 📰 Collect and analyze fresh articles, headlines, and stories from Newsweek...	14	Experimental	—	—
85	the-ai-entrepreneur-ai-hub/ai-training-data-scraper AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine...	14	Experimental	—	—
86	OutofAi/manemark Manemark allows users to capture and save the text content of webpages so it...	14	Experimental	—	JavaScript
87	beepboop2025/social-scraper Economic data collection & AI analysis platform — 14 collectors (RBI, NSE,...	14	Experimental	—	Python
88	elementarpartikel/ultimate-web-crawler Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and...	14	Experimental	—	Python
89	siddueswar/doc-crawler-rag 🕷️ Ingest clean documentation into LLM pipelines effortlessly, filtering out...	14	Experimental	—	Python
90	Thordata/thordata-rag-pipeline 🚀 Production-grade RAG pipeline powered by Thordata Scrapers. Turn any...	13	Experimental	2	Python
91	SupervisedCo/HyperCrawlTurbo HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage.	13	Experimental	10	Python
92	kwanLeeFrmVi/Crawler4AI-to-mardown-files This project is designed to crawl documentation websites and convert them...	13	Experimental	2	Python
93	aaronlifton/fastcrawl an agentic, atomics-driven Rust web crawler optimized for low heap usage,...	13	Experimental	2	HTML
94	amadou-6e/pymdt2json pymdt2json is a Python CLI and library for converting markdown tables into...	12	Experimental	1	Jupyter Notebook
95	AhmedZeyadTareq/Content_To_Markdown_OCR convert any file to markdown format	12	Experimental	1	Python
96	bloomresearch/InSite A lightning fast tool for crawling websites and compiling PDFs of their pages	12	Experimental	1	Python
97	QuiddityAI/PDFerret An all-in-one converter to make your files LLM-understandable	11	Experimental	2	HTML
98	m1r4g3-code/Distill Distill — Turn any URL into clean, structured data for AI pipelines, RAG...	11	Experimental	—	TypeScript
99	JeremySmythDigital/sitevac scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked...	11	Experimental	—	HTML
100	im-shashanks/PdfToMarkdown Lightweight PDF to Markdown converter.	11	Experimental	—	Python
101	Edgaras0x4E/web-loader-engine High-performance web content extraction engine built in Rust. Primary...	11	Experimental	—	Rust
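
For offline use, the rows above can be read into typed records. The sketch below is in Python and assumes the table is tab-separated with the six columns in the header, that the first space in the Tool column separates the repository name from its description, and that "—" marks a missing value. The names `ToolRow` and `parse_row` are hypothetical, and the sample rows are copied from the table.

```python
from typing import NamedTuple, Optional

MISSING = "—"  # placeholder the table uses for an absent value


class ToolRow(NamedTuple):
    rank: int
    repo: str
    description: str
    score: int
    tier: str
    stars: Optional[int]
    language: Optional[str]


def parse_row(line: str) -> ToolRow:
    """Parse one tab-separated table row into a ToolRow."""
    rank, tool, score, tier, stars, language = line.rstrip("\n").split("\t")
    # Assumption: the repo name never contains a space, so the first
    # space splits "owner/name" from the free-text description.
    repo, _, description = tool.partition(" ")
    return ToolRow(
        rank=int(rank),
        repo=repo,
        description=description,
        score=int(score),
        tier=tier,
        stars=None if stars == MISSING else int(stars.replace(",", "")),
        language=None if language == MISSING else language,
    )


SAMPLE_ROWS = [
    "1\tany4ai/AnyCrawl AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready...\t69\tEstablished\t2,763\tTypeScript",
    "7\tluisleo526/doc2mark AI-powered Python library that converts any document (PDF, Word, Excel,...\t51\tEstablished\t47\tPython",
    "8\trodricios/wxpath wxpath - declarative web crawling with XPath; a Web Query Language (WQL)\t49\tEmerging\t108\tPython",
    "84\tShaniPlayx/newsweek-scraper 📰 Collect and analyze fresh articles, headlines, and stories from Newsweek...\t14\tExperimental\t—\t—",
]

rows = [parse_row(line) for line in SAMPLE_ROWS]

# Repositories scoring above 50, the cutoff the summary gives for the Established tier.
above_50 = [row.repo for row in rows if row.score > 50]
print(above_50)  # ['any4ai/AnyCrawl', 'luisleo526/doc2mark']
```

Run over all 101 rows, the same filter should reproduce the count of 7 tools scoring above 50 stated in the summary.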

Comparisons in this category

Scrapegraph-ai and eGet-Crawler-for-ai (62 vs 38)