File Content Extraction RAG Tools

Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). This category does not include chunking strategies, vector storage, or post-extraction processing pipelines.

There are 61 file content extraction tools tracked. 2 score above 70 (verified tier). The highest-rated is PaddlePaddle/PaddleOCR at 95/100 with 72,167 stars and 1,622,419 monthly downloads. 3 of the top 10 are actively maintained.

Get all 61 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"

The API is open to everyone: 100 requests/day without a key, or 1,000/day with a free key.
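The same request can be made from a script. The sketch below uses only the Python standard library and rebuilds the URL from the curl command above. The function names `build_url` and `fetch_dataset` are illustrative, not part of any published client. The response schema is not documented on this page, so the code only decodes the JSON and makes no assumption about its fields. The `limit=20` default is copied from the curl command; whether a higher value returns all 61 projects is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="file-content-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    The shape of the returned object is not documented on this page,
    so it is handed back to the caller unchanged.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_dataset(build_url())` performs one request, which counts against the daily quota described above.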

#	Tool	Score	Tier	Stars	Language
1	PaddlePaddle/PaddleOCR Turn any PDF or image document into structured data for your AI. A powerful,...	95	Verified	72,167	Python
2	kreuzberg-dev/kreuzberg A polyglot document intelligence framework with a Rust core. Extract text,...	92	Verified	6,689	Rust
3	yfedoseev/pdf_oxide The fastest PDF library for Python and Rust. Text extraction, image...	67	Established	421	Rust
4	opendataloader-project/opendataloader-pdf PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.	62	Established	1,958	Java
5	NanoNets/docext An on-premises, OCR-free unstructured data extraction, markdown conversion...	55	Established	1,871	Python
6	AKSarav/pdfstract PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -...	52	Established	128	Python
7	docling-project/docling-java A Java API for Docling	48	Emerging	87	Java
8	explosion/spacy-layout 📚 Process PDFs, Word documents and more with spaCy	44	Emerging	869	Python
9	velocitybolt/open-extract Structured Data Extractor for AI Agents. Search your documents or the web...	44	Emerging	185	Python
10	quarkiverse/quarkus-docling Docling simplifies document processing, parsing diverse formats — including...	41	Emerging	17	Java
11	lazyFrogLOL/llmdocparser A package for parsing PDFs and analyzing their content using LLMs.	40	Emerging	269	Python
12	drmingler/smart-llm-loader smart-llm-loader is a lightweight yet powerful Python package that...	39	Emerging	75	Python
13	anyparser/anyparser_core Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction....	39	Emerging	2	Python
14	y3ex/ragtable-extract Extract tables precisely from PDFs and convert them to clean HTML for RAG...	37	Emerging	1	HTML
15	beenguelllayounes/ragtable-extract Extract tables precisely from PDFs and convert them to clean HTML for RAG...	37	Emerging	1	HTML
16	loryanstrant/unifi-documenter Auto generation of UniFi network documentation	35	Emerging	2	Python
17	risshe92/docprobe Universal documentation extraction tool	28	Experimental	6	Python
18	novatechflow/docai Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom...	28	Experimental	1	Python
19	zhangyu1818/apple-docs-for-rag Apple Documentation Markdown For RAG	27	Experimental	41	CoffeeScript
20	baughmann/tikara The metadata and text content extractor for almost every file type.	26	Experimental	5	Python
21	Anecha9610/document-parser-ai 📄 Simplify data extraction from PDFs and documents using AI APIs for...	25	Experimental	3	Python
22	Blacksuan19/structx Type-safe structured data extraction from text using LLMs.	24	Experimental	10	Python
23	ParthaPRay/Docling_Colab This repo contains google colab notebook for handing Docling for data...	24	Experimental	4	Jupyter Notebook
24	Huang-lab/figure-extractor Flask-based service using PDFFigures 2.0 to extract figures and tables from...	23	Experimental	15	Python
25	msbayindir/rag-chunker PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual...	23	Experimental	1	TypeScript
26	mylxsw/extractor extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,...	23	Experimental	6	Python
27	anyparser/anyparser_crewai Supercharge your AI workflows by combining Anyparser’s advanced content...	23	Experimental	2	Python
28	R0mb0/DocScraper_GUI Automate your OSINT and document research. This desktop app searches the web...	23	Experimental	1	Python
29	tarrantwrong366/OCR-Document-parser 📝 Streamline document analysis by extracting key fields from PAN cards,...	22	Experimental	—	Python
30	sussskiiirocks189/Scanned-PDF-to-Vector 📄 Convert scanned PDFs into searchable, copyable, and vectorized documents...	22	Experimental	—	Python
31	KoDiit/llm-cerebroscope 🕵️ Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine...	22	Experimental	—	Python
32	tbast24/docling_preprocessor_factory_public Provide a local preprocessing pipeline to extract and standardize...	22	Experimental	—	Python
33	DS4SD/quackling Build document-native LLM applications	22	Experimental	56	Python
34	anyparser/anyparser_langchain Integrate Anyparser's powerful content extraction capabilities with...	21	Experimental	3	Python
35	thomassuedbroecker/docling_preprocessor_factory_public Docling Preprocessor Factory is an open-source project that provides a...	21	Experimental	2	Python
36	ZhuJiaxin2/ragtable-extract PDF table extraction for RAG — convert to clean HTML. Fast, local, no GPU.	20	Experimental	1	HTML
37	qbxlvnf11/ocr-document-parser-for-rag OCR Document Markdown/HTML Parser for RAG	19	Experimental	—	Python
38	segunalabi383/Data-Extractor Structured Data Extractor for AI Agents. Search your documents or the web...	19	Experimental	—	Python
39	jtgsystems/OCR-TOOL-REALTIME 📝 Real-time OCR tool - Extract text from images and videos with live processing	19	Experimental	—	Python
40	rlozanointel/Vromlix-AI-Engine Cognitive ETL Engine & Architecture for Personal Knowledge Graphs....	19	Experimental	—	Python
41	syw2014/langparse LangParse is a universal document parsing and text chunking engine for LLM...	18	Experimental	4	Python
42	amirkiarafiei/docling-processor A Docling extension for superior PDF/DOCX to Markdown conversion, featuring...	17	Experimental	2	Python
43	xaman27x/Adobe-PDF-CTD A high-performance, multi-stage document processor with two interconnected...	16	Experimental	1	Python
44	muradali4442/thesis_extractor Use text + tables from PDFs for RAG (BM25 + LLM).	16	Experimental	1	Python
45	rangga276/ocr-llm-agent 🖼️ Extract and process text from images with an OCR AI agent, featuring...	15	Experimental	1	Python
46	MuntahaShams/Document_AI_for_Custom_Data_Extraction Automated extraction of structured information from semi-structured...	14	Experimental	—	Jupyter Notebook
47	mkai80/DocMeld Transform documents into structured, agent-ready knowledge efficiently with...	14	Experimental	—	Python
48	the-ai-entrepreneur-ai-hub/pdf-parser-api PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API	14	Experimental	—	JavaScript
49	sitemap-ai/backend SitemapRAG is an open-source tool designed to leverage your website's...	13	Experimental	8	Python
50	AlwaysSany/doc-extract-parse-index The project is designed to streamline the workflow of extracting, parsing,...	12	Experimental	1	JavaScript
51	anyparser/anyparserjs Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction....	12	Experimental	3	TypeScript
52	elchemista/doc_dig DocDig is an Elixir wrapper around the Rust-based extractous library,...	12	Experimental	1	Elixir
53	kreuzberg-dev/.github Kreuzberg is a fast, polyglot document intelligence engine with a Rust core....	12	Experimental	1	—
54	qlfv/Docling-Testing Repository for testing and demonstrating the capabilities of Docling for...	12	Experimental	—	HTML
55	r00ters/tika-plus-docker Docker image to build Apache Tika Full + JPEG2000 + JBIG2	11	Experimental	—	Dockerfile
56	jeehoonyu/PDF_Seperator A lightweight tool for splitting PDF documents into chapters, optimized for...	11	Experimental	—	Python
57	DGloi/utillity-files-to-text Creates an endpoint to extract text content, images and document from...	11	Experimental	—	Python
58	AhmedZeyadTareq/Llama-Parse-Content-Extraction extract and analyze content from various file formats including PDFs, text...	11	Experimental	—	Python
59	johnzfitch/human-interface-markdown Apple Human Interface Guidelines archive (1980-2014) - 35 documents...	11	Experimental	—	—
60	kevv1m/tikara The metadata and text content extractor for almost every file type.	11	Experimental	—	—
61	anyparser/anyparser_llamaindex Instantly access Anyparser's robust document processing and data extraction...	10	Experimental	1	Python
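The tier labels in the table are consistent with simple score cutoffs. The sketch below shows one such mapping. Only the Verified boundary near 70 comes from the summary above ("2 score above 70"), and whether a score of exactly 70 qualifies is not stated. The cutoffs of 50 and 30 are assumptions inferred from gaps in the listed scores (Established ends at 52 and Emerging starts at 48; Emerging ends at 35 and Experimental starts at 28), so the real boundaries may sit anywhere inside those gaps. The function name `tier_for` is hypothetical.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 score to a tier label.

    Cutoffs are assumptions inferred from the table, not documented values:
    70 is taken from the page summary, 50 and 30 are guesses that happen
    to reproduce every row listed above.
    """
    if score >= 70:
        return "Verified"
    if score >= 50:
        return "Established"
    if score >= 30:
        return "Emerging"
    return "Experimental"


# A few (score, tier) pairs copied from the table to check the mapping.
sample_rows = [(95, "Verified"), (67, "Established"), (52, "Established"),
               (48, "Emerging"), (35, "Emerging"), (28, "Experimental")]
mismatches = [(s, t) for s, t in sample_rows if tier_for(s) != t]
```

With these cutoffs `mismatches` stays empty for the sampled rows, which shows only that the assumed boundaries fit the data, not that they are the ones the site uses.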

Comparisons in this category

PaddleOCR and opendataloader-pdf (95 vs 62)
kreuzberg and pdf_oxide (92 vs 67)