Web-to-Markdown RAG Tools

Tools that crawl websites, documentation, and other web content and convert it into clean Markdown optimized for RAG pipelines and offline use. This category does NOT include PDF extraction, search indexing, or tools that don't produce Markdown output.

101 web-to-Markdown RAG tools are tracked. 7 score above 50 (Established tier). The highest-rated is any4ai/AnyCrawl at 69/100 with 2,763 stars. 3 of the top 10 are actively maintained.

Get all 101 projects as JSON (the example request below sets limit=20):

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=web-to-markdown-rag&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
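The same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_url` and `fetch_quality_dataset` are hypothetical, the query parameters are copied from the curl example above, and the shape of the JSON response is not documented here, so the code only decodes it without assuming any fields. How an API key is supplied is also not stated, so the sketch uses the keyless tier.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example; everything else here is illustrative.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="web-to-markdown-rag", limit=20):
    """Assemble the query string used in the curl example (hypothetical helper)."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_quality_dataset(limit=20, timeout=30):
    """Send the GET request and decode the JSON body.

    The response schema is an unknown here, so the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality_dataset()` performs the network request and counts against the daily quota, so it is worth caching the result locally when experimenting.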

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | any4ai/AnyCrawl | AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready... | 69 | Established |
| 2 | ScrapeGraphAI/Scrapegraph-ai | Python scraper based on AI | 62 | Established |
| 3 | adbar/trafilatura | Python & Command-line tool to gather text and metadata on the Web: Crawling,... | 60 | Established |
| 4 | kreuzberg-dev/html-to-markdown | High performance and CommonMark compliant HTML to Markdown converter.... | 60 | Established |
| 5 | lightfeed/extractor | Using LLMs and AI browser automation to robustly extract web data | 56 | Established |
| 6 | paulpierre/markdown-crawler | A multithreaded 🕸️ web crawler that recursively crawls a website and creates... | 53 | Established |
| 7 | luisleo526/doc2mark | AI-powered Python library that converts any document (PDF, Word, Excel,... | 51 | Established |
| 8 | rodricios/wxpath | wxpath - declarative web crawling with XPath; a Web Query Language (WQL) | 49 | Emerging |
| 9 | AnkitNayak-eth/CrawlAI-RAG | CrawlAI RAG is an AI-powered website intelligence platform that allows users... | 47 | Emerging |
| 10 | firecrawl/firecrawl-app-examples | 🔥 This repository contains complete application examples, including websites... | 46 | Emerging |
| 11 | sigoden/rag-crawler | Crawl a website to generate knowledge file for RAG | 45 | Emerging |
| 12 | raintree-technology/docpull | Crawl any website and convert it to clean, AI-ready Markdown - async Python... | 43 | Emerging |
| 13 | apify/rag-web-browser | RAG Web Browser is an Apify Actor to feed your LLM applications and RAG... | 42 | Emerging |
| 14 | intergalacticalvariable/reader | 📚 This is an adapted version of Jina AI's Reader for local deployment using... | 42 | Emerging |
| 15 | m92vyas/llm-reader | Turn Webpage to LLM friendly input text. Similar to Firecrawl and Jina... | 41 | Emerging |
| 16 | opendatalab/MinerU-HTML | MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean... | 41 | Emerging |
| 17 | Thordata/thordata-firecrawl | Thordata Firecrawl – Firecrawl-compatible web crawling & scraping API built... | 39 | Emerging |
| 18 | KimSeogyu/undocx | Extract clean, structured Markdown from DOCX for LLM and RAG contexts. | 39 | Emerging |
| 19 | BjornMelin/ai-docs-vector-db-hybrid-scraper | Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant... | 38 | Emerging |
| 20 | vishwajeetdabholkar/eGet-Crawler-for-ai | Web scraping framework built for AI applications. Extract clean, structured... | 38 | Emerging |
| 21 | dezoito/markitdown-api | Ultra lightweight API server to convert files (.pdf, .docx, .xlsx) into... | 38 | Emerging |
| 22 | Tendo33/arxiv-md | One-click conversion of arXiv papers to Markdown with perfect LaTeX formula... | 37 | Emerging |
| 23 | mensfeld/llm-docs-builder | Transform and optimize your markdown documentation for Large Language Models... | 37 | Emerging |
| 24 | supacrawler/supacrawler | Supacrawler's ultralight engine for scraping and crawling the web. Written... | 35 | Emerging |
| 25 | mrmps/pdf2md | Browser based tool to convert PDFs to Markdown | 34 | Emerging |
| 26 | Thordata/Thordata | Official Thordata developer portal repository. Curated overview of... | 34 | Emerging |
| 27 | KylinMountain/markify | Convert files into markdown to help RAG or LLM understand, based on... | 33 | Emerging |
| 28 | philschmid/clipper.js | HTML to Markdown converter and crawler. | 33 | Emerging |
| 29 | jtgsystems/free-sitemap-generator | 🗺️ Free sitemap generator - Create XML sitemaps for SEO | 32 | Emerging |
| 30 | iamarunbrahma/pdf-to-markdown | Conversion of PDF documents to structured Markdown, optimized for Retrieval... | 32 | Emerging |
| 31 | BrowserCash/browser-serp | Real-time Google Search API for AI Agents & RAG pipelines. Get structured... | 32 | Emerging |
| 32 | pc8544/Website-Crawler | Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape... | 32 | Emerging |
| 33 | yaniv-golan/ostruct | Schema-first AI analysis CLI that transforms messy data into structured... | 32 | Emerging |
| 34 | 483218131/github-stars-to-markdown | A lightweight tool that exports GitHub Stars to clean Markdown notes in one click. | 32 | Emerging |
| 35 | malvads/mojo | Non sucking cross-platform extremely fast C++ crawler to convert entire... | 30 | Emerging |
| 36 | agoodway/html2markdown | Convert HTML to Markdown with Elixir | 30 | Emerging |
| 37 | aqueeb/confluence2md | Convert Confluence MIME exports (.doc) to clean Markdown | 29 | Experimental |
| 38 | buildwithfiroz/Web2-LLM.txt | Web2LLM.txt – A fast, open-source website-to-LLM context file generator.... | 29 | Experimental |
| 39 | WebCrawlerAPI/webcrawlerapi-js-sdk | A WebcrawlerAPI SDK for Node JS | 28 | Experimental |
| 40 | ctokx/url-to-markdown | Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based... | 26 | Experimental |
| 41 | TylerMorrison21/paperflow | Open-source PDF-to-Markdown post-processor with footnotes, LaTeX... | 26 | Experimental |
| 42 | Paparusi/crawlkit | 🕷️ Open-source web crawling toolkit: Video, OCR, NLP, Stealth, 10+ parsers | 26 | Experimental |
| 43 | sethupavan12/Markdownify | Convert documents, images to high-quality Markdown using Vision LLMs. Built... | 25 | Experimental |
| 44 | Karthick-840/Crawl4ai-RAG-with-Local-LLM | A tool for scraping web documentation using Crawl4AI, converting it to... | 25 | Experimental |
| 45 | wldevries/confluence-rag | Tool that fetches Confluence pages, converts them to markdown and chunks... | 24 | Experimental |
| 46 | pgEdge/pgedge-docloader | A tool for converting HTML and RST docs into Markdown, and loading them into... | 24 | Experimental |
| 47 | arkeodev/scraper | RAG-based Web Scraping | 24 | Experimental |
| 48 | sgowdaks/nichirin | RAG and Webcrawler in a single package | 23 | Experimental |
| 49 | isSpicyCode/scrappe-tout | Scrappe-Tout is a web scraping tool designed to convert HTML documentation... | 23 | Experimental |
| 50 | jackise69/pdf-sentinel | 🛡️ Convert PDF files to Markdown for LLM workflows with event-driven... | 23 | Experimental |
| 51 | vinaes/md-succ-ai | URL to Markdown API: md.succ.ai | 23 | Experimental |
| 52 | pengboyu-dev/Athanor-Epub-Converter | 📘EPUB to RAG-ready Markdown with chunking, diagnostics, and clean structured output. | 23 | Experimental |
| 53 | ngpepin/pdftomd-RAG | RAG workflow-friendly enhancement of Marker that converts PDFs into a... | 22 | Experimental |
| 54 | bill-work/md-pdf-md | 📄 Convert Markdown to visually appealing PDFs and extract Markdown from PDFs... | 22 | Experimental |
| 55 | chris-c-thomas/LexBuild | Open-source toolchain that converts the U.S. Code from legislative XML... | 22 | Experimental |
| 56 | GTA509FX/scrappe-tout | 🚀 Convert web pages to clean Markdown fast with Playwright, perfect for... | 22 | Experimental |
| 57 | zcag/readdown | HTML to clean Markdown optimized for LLMs. Replaces readability + turndown.... | 22 | Experimental |
| 58 | Santex12/confluence2md | 🛠️ Convert Confluence `.doc` exports to clean Markdown effortlessly with... | 22 | Experimental |
| 59 | Quippy22/web2llm | Fetch web pages and convert to clean Markdown for LLM pipelines | 22 | Experimental |
| 60 | sumit7235/Domfie | 🛠️ Simplify web scraping with Domfie, the self-healing scraper that adapts... | 22 | Experimental |
| 61 | Horlicks-p/Moelog-LLMs.txt | This plugin implements the emerging llms.txt specification for WordPress,... | 22 | Experimental |
| 62 | marimo-marine23/xlmelt | Convert complex Excel files into AI-readable JSON/HTML | 22 | Experimental |
| 63 | nadya1992024/llm-parse | Parse HTML and markdown offline with a lightweight, single-header C++... | 22 | Experimental |
| 64 | AlphaDev007/AlphaCrawl | A high-performance, asynchronous Go web crawler built to extract LLM-ready... | 22 | Experimental |
| 65 | auto-medica-labs/md-tree | Convert Markdown files into hierarchical JSON tree structures with optional... | 22 | Experimental |
| 66 | danke-global/crawl2kb | Crawl a website and export embedding-ready chunks for RAG pipelines | 22 | Experimental |
| 67 | pinion05/llm-page-context | Turn any web page into clean LLM-ready context strings and structured documents. | 22 | Experimental |
| 68 | Thordata/thordata-cookbook | Real-world recipes and examples for building AI data pipelines with Thordata. | 21 | Experimental |
| 69 | Thordata/thordata-web-qa-agent | Web-native QA agent built on Thordata that delivers a Perplexity-style... | 21 | Experimental |
| 70 | moria97/fastpdf4llm | Lightweight and fast library to convert PDF to markdown format. | 20 | Experimental |
| 71 | davidjsors/br-pdf-to-md-to-rag | Converter from PDFs to structured Markdown, optimized for ingestion into... | 20 | Experimental |
| 72 | EasyDevv/project-to-markdown | Project To Markdown: Project files into structured markdown, optimizing... | 20 | Experimental |
| 73 | ilyashusterman/doc-to-readable | Universal document-to-markdown and section splitter for HTML, URLs, and PDFs. | 20 | Experimental |
| 74 | wmahfoudh/pdf-to-md | Automates the pipeline of converting PDF documents and images into clean... | 19 | Experimental |
| 75 | gsusI/llm-docs-sync | Fetch official LLM provider docs (OpenAI, Gemini) from llms.txt into... | 19 | Experimental |
| 76 | StripFeed/stripfeed-js | Official TypeScript SDK for StripFeed - convert any URL to clean Markdown... | 19 | Experimental |
| 77 | pedrokohler/github-repo-to-single-file | TypeScript CLI that pulls a GitHub repo and merges all text-like files into... | 18 | Experimental |
| 78 | Ai4GenXers/pdf-sentinel | Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero... | 17 | Experimental |
| 79 | JamesN-dev/Scroll-Scribe | ScrollScribe is a Python CLI toolkit that grab docs or index website pages... | 16 | Experimental |
| 80 | PetrAPConsulting/image2md | Convert batch of pictures with structured data like tables, formulas, charts... | 16 | Experimental |
| 81 | itsmeyessir/Domfie | An autonomous web scraper that fixes its own broken selectors using a... | 16 | Experimental |
| 82 | QLangstaff/qrawl | Composable web crawling tools for Rust | 15 | Experimental |
| 83 | abcd2113004/url-reader | 🔍 Extract content from any URL with smart platform detection and automatic... | 14 | Experimental |
| 84 | ShaniPlayx/newsweek-scraper | 📰 Collect and analyze fresh articles, headlines, and stories from Newsweek... | 14 | Experimental |
| 85 | the-ai-entrepreneur-ai-hub/ai-training-data-scraper | AI Training Data Scraper - Extract LLM & RAG-Ready Web Content for Machine... | 14 | Experimental |
| 86 | OutofAi/manemark | Manemark allows users to capture and save the text content of webpages so it... | 14 | Experimental |
| 87 | beepboop2025/social-scraper | Economic data collection & AI analysis platform: 14 collectors (RBI, NSE,... | 14 | Experimental |
| 88 | elementarpartikel/ultimate-web-crawler | Webbdammsugare Pro v3.0 is a GUI-based web crawler for AI and... | 14 | Experimental |
| 89 | siddueswar/doc-crawler-rag | 🕷️ Ingest clean documentation into LLM pipelines effortlessly, filtering out... | 14 | Experimental |
| 90 | Thordata/thordata-rag-pipeline | 🚀 Production-grade RAG pipeline powered by Thordata Scrapers. Turn any... | 13 | Experimental |
| 91 | SupervisedCo/HyperCrawlTurbo | HypercrawlTurbo is a turbocharged web scraper for extracting URLs from a webpage. | 13 | Experimental |
| 92 | kwanLeeFrmVi/Crawler4AI-to-mardown-files | This project is designed to crawl documentation websites and convert them... | 13 | Experimental |
| 93 | aaronlifton/fastcrawl | an agentic, atomics-driven Rust web crawler optimized for low heap usage,... | 13 | Experimental |
| 94 | amadou-6e/pymdt2json | pymdt2json is a Python CLI and library for converting markdown tables into... | 12 | Experimental |
| 95 | AhmedZeyadTareq/Content_To_Markdown_OCR | convert any file to markdown format | 12 | Experimental |
| 96 | bloomresearch/InSite | A lightning fast tool for crawling websites and compiling PDFs of their pages | 12 | Experimental |
| 97 | QuiddityAI/PDFerret | An all-in-one converter to make your files LLM-understandable | 11 | Experimental |
| 98 | m1r4g3-code/Distill | Distill: Turn any URL into clean, structured data for AI pipelines, RAG... | 11 | Experimental |
| 99 | JeremySmythDigital/sitevac | scrape any docs site into one AI-ready file TXT, Markdown, or pre-chunked... | 11 | Experimental |
| 100 | im-shashanks/PdfToMarkdown | Lightweight PDF to Markdown converter. | 11 | Experimental |
| 101 | Edgaras0x4E/web-loader-engine | High-performance web content extraction engine built in Rust. Primary... | 11 | Experimental |