File Content Extraction RAG Tools
Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). Does NOT include chunking strategies, vector storage, or post-extraction processing pipelines.
There are 61 file content extraction tools tracked. 2 score above 70 (verified tier). The highest-rated is PaddlePaddle/PaddleOCR at 95/100 with 72,167 stars and 1,622,419 monthly downloads. 3 of the top 10 are actively maintained.
Get the projects as JSON (the example query sets limit=20, so it does not request all 61 at once):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
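For scripted access, the same request as the curl command above can be sketched in Python using only the standard library. This is a minimal sketch, not a documented client: the function names `build_url` and `fetch` are hypothetical, the query parameters are copied from the curl example, and the response schema is not described on this page, so the decoded JSON is returned as-is without assuming any field names. Whether the endpoint accepts a `limit` larger than 20 is also an assumption.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit: int = 20) -> str:
    """Build the query URL with the same parameters as the curl example."""
    params = {
        "domain": "rag",
        "subcategory": "file-content-extraction",
        "limit": limit,
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch(limit: int = 20, timeout: float = 30.0):
    """Fetch the dataset and decode the JSON body.

    The response shape is not documented here, so the parsed
    object is returned unchanged for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a network request, counts against the daily quota):
# data = fetch()
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the exact request before spending any of the 100 keyless requests per day.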
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | PaddlePaddle/PaddleOCR | Turn any PDF or image document into structured data for your AI. A powerful,... | | Verified |
| 2 | kreuzberg-dev/kreuzberg | A polyglot document intelligence framework with a Rust core. Extract text,... | | Verified |
| 3 | yfedoseev/pdf_oxide | The fastest PDF library for Python and Rust. Text extraction, image... | | Established |
| 4 | opendataloader-project/opendataloader-pdf | PDF Parser for AI-ready data. Automate PDF accessibility. Open-source. | | Established |
| 5 | NanoNets/docext | An on-premises, OCR-free unstructured data extraction, markdown conversion... | | Established |
| 6 | AKSarav/pdfstract | PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -... | | Established |
| 7 | docling-project/docling-java | A Java API for Docling | | Emerging |
| 8 | explosion/spacy-layout | Process PDFs, Word documents and more with spaCy | | Emerging |
| 9 | velocitybolt/open-extract | Structured Data Extractor for AI Agents. Search your documents or the web... | | Emerging |
| 10 | quarkiverse/quarkus-docling | Docling simplifies document processing, parsing diverse formats, including... | | Emerging |
| 11 | lazyFrogLOL/llmdocparser | A package for parsing PDFs and analyzing their content using LLMs. | | Emerging |
| 12 | drmingler/smart-llm-loader | smart-llm-loader is a lightweight yet powerful Python package that... | | Emerging |
| 13 | anyparser/anyparser_core | Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction.... | | Emerging |
| 14 | y3ex/ragtable-extract | Extract tables precisely from PDFs and convert them to clean HTML for RAG... | | Emerging |
| 15 | beenguelllayounes/ragtable-extract | Extract tables precisely from PDFs and convert them to clean HTML for RAG... | | Emerging |
| 16 | loryanstrant/unifi-documenter | Auto generation of UniFi network documentation | | Emerging |
| 17 | risshe92/docprobe | Universal documentation extraction tool | | Experimental |
| 18 | novatechflow/docai | Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom... | | Experimental |
| 19 | zhangyu1818/apple-docs-for-rag | Apple Documentation Markdown For RAG | | Experimental |
| 20 | baughmann/tikara | The metadata and text content extractor for almost every file type. | | Experimental |
| 21 | Anecha9610/document-parser-ai | Simplify data extraction from PDFs and documents using AI APIs for... | | Experimental |
| 22 | Blacksuan19/structx | Type-safe structured data extraction from text using LLMs. | | Experimental |
| 23 | ParthaPRay/Docling_Colab | This repo contains google colab notebook for handing Docling for data... | | Experimental |
| 24 | Huang-lab/figure-extractor | Flask-based service using PDFFigures 2.0 to extract figures and tables from... | | Experimental |
| 25 | msbayindir/rag-chunker | PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual... | | Experimental |
| 26 | mylxsw/extractor | extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,... | | Experimental |
| 27 | anyparser/anyparser_crewai | Supercharge your AI workflows by combining Anyparser's advanced content... | | Experimental |
| 28 | R0mb0/DocScraper_GUI | Automate your OSINT and document research. This desktop app searches the web... | | Experimental |
| 29 | tarrantwrong366/OCR-Document-parser | Streamline document analysis by extracting key fields from PAN cards,... | | Experimental |
| 30 | sussskiiirocks189/Scanned-PDF-to-Vector | Convert scanned PDFs into searchable, copyable, and vectorized documents... | | Experimental |
| 31 | KoDiit/llm-cerebroscope | Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine... | | Experimental |
| 32 | tbast24/docling_preprocessor_factory_public | Provide a local preprocessing pipeline to extract and standardize... | | Experimental |
| 33 | DS4SD/quackling | Build document-native LLM applications | | Experimental |
| 34 | anyparser/anyparser_langchain | Integrate Anyparser's powerful content extraction capabilities with... | | Experimental |
| 35 | thomassuedbroecker/docling_preprocessor_factory_public | Docling Preprocessor Factory is an open-source project that provides a... | | Experimental |
| 36 | ZhuJiaxin2/ragtable-extract | PDF table extraction for RAG: convert to clean HTML. Fast, local, no GPU. | | Experimental |
| 37 | qbxlvnf11/ocr-document-parser-for-rag | OCR Document Markdown/HTML Parser for RAG | | Experimental |
| 38 | segunalabi383/Data-Extractor | Structured Data Extractor for AI Agents. Search your documents or the web... | | Experimental |
| 39 | jtgsystems/OCR-TOOL-REALTIME | Real-time OCR tool - Extract text from images and videos with live processing | | Experimental |
| 40 | rlozanointel/Vromlix-AI-Engine | Cognitive ETL Engine & Architecture for Personal Knowledge Graphs.... | | Experimental |
| 41 | syw2014/langparse | LangParse is a universal document parsing and text chunking engine for LLM... | | Experimental |
| 42 | amirkiarafiei/docling-processor | A Docling extension for superior PDF/DOCX to Markdown conversion, featuring... | | Experimental |
| 43 | xaman27x/Adobe-PDF-CTD | A high-performance, multi-stage document processor with two interconnected... | | Experimental |
| 44 | muradali4442/thesis_extractor | Use text + tables from PDFs for RAG (BM25 + LLM). | | Experimental |
| 45 | rangga276/ocr-llm-agent | Extract and process text from images with an OCR AI agent, featuring... | | Experimental |
| 46 | MuntahaShams/Document_AI_for_Custom_Data_Extraction | Automated extraction of structured information from semi-structured... | | Experimental |
| 47 | mkai80/DocMeld | Transform documents into structured, agent-ready knowledge efficiently with... | | Experimental |
| 48 | the-ai-entrepreneur-ai-hub/pdf-parser-api | PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API | | Experimental |
| 49 | sitemap-ai/backend | SitemapRAG is an open-source tool designed to leverage your website's... | | Experimental |
| 50 | AlwaysSany/doc-extract-parse-index | The project is designed to streamline the workflow of extracting, parsing,... | | Experimental |
| 51 | anyparser/anyparserjs | Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction.... | | Experimental |
| 52 | elchemista/doc_dig | DocDig is an Elixir wrapper around the Rust-based extractous library,... | | Experimental |
| 53 | kreuzberg-dev/.github | Kreuzberg is a fast, polyglot document intelligence engine with a Rust core.... | | Experimental |
| 54 | qlfv/Docling-Testing | Repository for testing and demonstrating the capabilities of Docling for... | | Experimental |
| 55 | r00ters/tika-plus-docker | Docker image to build Apache Tika Full + JPEG2000 + JBIG2 | | Experimental |
| 56 | jeehoonyu/PDF_Seperator | A lightweight tool for splitting PDF documents into chapters, optimized for... | | Experimental |
| 57 | DGloi/utillity-files-to-text | Creates an endpoint to extract text content, images and document from... | | Experimental |
| 58 | AhmedZeyadTareq/Llama-Parse-Content-Extraction | extract and analyze content from various file formats including PDFs, text... | | Experimental |
| 59 | johnzfitch/human-interface-markdown | Apple Human Interface Guidelines archive (1980-2014) - 35 documents... | | Experimental |
| 60 | kevv1m/tikara | The metadata and text content extractor for almost every file type. | | Experimental |
| 61 | anyparser/anyparser_llamaindex | Instantly access Anyparser's robust document processing and data extraction... | | Experimental |