philschmid/clipper.js

HTML to Markdown converter and crawler.

Score: 40 / 100 · Emerging

Leverages Mozilla's Readability for intelligent content extraction and Turndown for HTML-to-Markdown conversion, with optional Playwright-based crawling for batch processing entire sites. Supports multiple input formats (URLs, local HTML files, directories) and output formats (Markdown, JSONL), making it useful for dataset generation and web archival workflows. Can be chained with tools like poppler for PDF-to-Markdown conversion pipelines.
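As a rough illustration of the JSONL output mentioned above, the sketch below serializes converted pages as one JSON object per line. The `ClippedPage` shape and the `toJsonl` helper are hypothetical names chosen for this example; they are not clipper.js's actual schema or API.

```typescript
// Hypothetical record shape; clipper.js's real JSONL fields may differ.
interface ClippedPage {
  url: string;
  markdown: string;
}

// Serialize records as JSONL: one JSON object per line, newline-terminated.
// JSON.stringify escapes newlines inside the Markdown, so each record
// stays on a single physical line.
function toJsonl(pages: ClippedPage[]): string {
  return pages.map((page) => JSON.stringify(page) + "\n").join("");
}

// Illustrative sample data, not real crawl output.
const sample: ClippedPage[] = [
  { url: "https://example.com/a", markdown: "# A\n\nFirst page." },
  { url: "https://example.com/b", markdown: "# B\n\nSecond page." },
];

const jsonl = toJsonl(sample);
```

Keeping one record per line is what makes the format convenient for dataset tooling: a consumer can stream the file and parse each line independently.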

614 stars. No commits in the last 6 months.

Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 14 / 25


Stars: 614
Forks: 39
Language: TypeScript
License: Apache-2.0
Category: web-to-markdown-rag
Last pushed: Jan 09, 2024
Commits (30d): 0

Web-to-Markdown RAG · 101 tools

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/philschmid/clipper.js"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
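The same request can be sketched in TypeScript. The base path is copied from the curl example; treating the last two path segments as `owner/repo` for other repositories is an assumption, and the response schema is not documented here, so the result is typed as `unknown`. The helper names `qualityUrl` and `fetchQuality` are hypothetical.

```typescript
// Base path taken from the curl example above.
const API_BASE = "https://pt-edge.onrender.com/api/v1/quality/rag";

// Build the endpoint URL for a repository.
// Assumption: other repositories follow the same /owner/repo pattern.
function qualityUrl(owner: string, repo: string): string {
  return `${API_BASE}/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}`;
}

// Fetch the quality data using the global fetch (Node 18+ or a browser).
// The response shape is unknown here, so callers must validate it themselves.
async function fetchQuality(owner: string, repo: string): Promise<unknown> {
  const response = await fetch(qualityUrl(owner, repo));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

A caller would use it as `await fetchQuality("philschmid", "clipper.js")`, keeping the daily request limit in mind when looping over many repositories.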

Higher-rated alternatives

any4ai/AnyCrawl

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...

kreuzberg-dev/html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...

lightfeed/extractor

Using LLMs and AI browser automation to robustly extract web data

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

paulpierre/markdown-crawler

A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...
