sigoden/rag-crawler
Crawl a website to generate a knowledge file for RAG
Extracts page content via CSS selectors and outputs structured JSON or individual markdown files, with configurable concurrency limits and path exclusion patterns. Includes auto-detected presets for popular platforms like GitHub Wiki and Markdown repositories, eliminating manual configuration for common documentation sources. Supports both HTML crawling and direct GitHub tree traversal for markdown-native documentation.
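The combination of path exclusion patterns and a concurrency limit described above can be sketched roughly as follows. This is a minimal illustration, not rag-crawler's actual code: the names `shouldCrawl` and `mapWithLimit` are hypothetical, and treating exclusion patterns as regular expressions tested against the URL path is an assumption.

```typescript
// Hypothetical helpers for illustration only; not part of rag-crawler's API.

/** True when `url` shares the start URL's origin and path prefix and matches no exclusion pattern. */
function shouldCrawl(url: string, startUrl: string, exclude: RegExp[]): boolean {
  const target = new URL(url);
  const start = new URL(startUrl);
  if (target.origin !== start.origin) return false;
  if (!target.pathname.startsWith(start.pathname)) return false;
  // Assumption: patterns are tested against the path only, not the full URL.
  return !exclude.some((pattern) => pattern.test(target.pathname));
}

/** Runs `worker` over `items` with at most `limit` calls in flight, preserving input order. */
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

In a crawler built this way, discovered links would first be filtered through `shouldCrawl`, and the surviving URLs passed to `mapWithLimit` with a page-fetching worker, so the number of simultaneous requests never exceeds the configured limit.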
No commits in the last 6 months. Available on npm.
Stars: 50
Forks: 11
Language: TypeScript
License: MIT
Last pushed: Apr 03, 2025
Monthly downloads: 11
Commits (30d): 0
Dependencies: 6
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/sigoden/rag-crawler"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
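The same request can be made from TypeScript. The sketch below assumes the endpoint follows a `/quality/{category}/{owner}/{repo}` pattern, which is inferred from the single curl example above and not confirmed elsewhere; the function names are hypothetical, and the response is typed as `unknown` because its JSON shape is not documented here.

```typescript
// Assumed base path, taken from the curl example above.
const API_BASE = "https://pt-edge.onrender.com/api/v1/quality";

/** Builds the quality endpoint URL. The path layout is an assumption inferred from one example. */
function qualityUrl(category: string, owner: string, repo: string): string {
  return [API_BASE, category, owner, repo]
    .map((part, i) => (i === 0 ? part : encodeURIComponent(part)))
    .join("/");
}

/** Fetches the quality data; relies on the global `fetch` available in Node 18+ and browsers. */
async function fetchQuality(category: string, owner: string, repo: string): Promise<unknown> {
  const response = await fetch(qualityUrl(category, owner, repo));
  if (!response.ok) {
    // A non-2xx status may indicate the keyless daily limit was reached.
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Calling `fetchQuality("rag", "sigoden", "rag-crawler")` would request the same URL as the curl command.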
Related tools
any4ai/AnyCrawl
AnyCrawl: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...
kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...
lightfeed/extractor
Using LLMs and AI browser automation to robustly extract web data
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
paulpierre/markdown-crawler
A multithreaded web crawler that recursively crawls a website and creates a markdown file...