File Content Extraction RAG Tools

Tools for extracting text, metadata, and structured data from various file formats (PDF, Office docs, images, web pages, audio). Does NOT include chunking strategies, vector storage, or post-extraction processing pipelines.

There are 61 file content extraction tools tracked. 2 score above 70 (verified tier). The highest-rated is PaddlePaddle/PaddleOCR at 95/100 with 72,167 stars and 1,622,419 monthly downloads. 3 of the top 10 are actively maintained.

Get all 61 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=file-content-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
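The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes nothing beyond the URL and query parameters shown in the curl command above. The helper names `build_url` and `fetch_dataset` are hypothetical, the response schema is not documented here so the code only decodes the body as JSON, and whether raising `limit` to 61 returns every project is an untested assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="rag", subcategory="file-content-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl command."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=30):
    """Fetch the URL and decode the body as JSON.

    No particular response schema is assumed; the decoded object is
    returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# data = fetch_dataset(build_url())
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the query string offline and to stay within the 100 requests/day keyless quota while experimenting.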

# Tool Score Tier
1 PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful,...

95
Verified
2 kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text,...

92
Verified
3 yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image...

67
Established
4 opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

62
Established
5 NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion...

55
Established
6 AKSarav/pdfstract

PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline -...

52
Established
7 docling-project/docling-java

A Java API for Docling

48
Emerging
8 explosion/spacy-layout

📚 Process PDFs, Word documents and more with spaCy

44
Emerging
9 velocitybolt/open-extract

Structured Data Extractor for AI Agents. Search your documents or the web...

44
Emerging
10 quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats - including...

41
Emerging
11 lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

40
Emerging
12 drmingler/smart-llm-loader

smart-llm-loader is a lightweight yet powerful Python package that...

39
Emerging
13 anyparser/anyparser_core

Anyparser Python SDK for RAG/ETL Pipelines - File Content Extraction....

39
Emerging
14 y3ex/ragtable-extract

Extract tables precisely from PDFs and convert them to clean HTML for RAG...

37
Emerging
15 beenguelllayounes/ragtable-extract

Extract tables precisely from PDFs and convert them to clean HTML for RAG...

37
Emerging
16 loryanstrant/unifi-documenter

Auto generation of UniFi network documentation

35
Emerging
17 risshe92/docprobe

Universal documentation extraction tool

28
Experimental
18 novatechflow/docai

Local-first OCR → Markdown → RAG toolkit with optional Hugging Face/custom...

28
Experimental
19 zhangyu1818/apple-docs-for-rag

Apple Documentation Markdown For RAG

27
Experimental
20 baughmann/tikara

The metadata and text content extractor for almost every file type.

26
Experimental
21 Anecha9610/document-parser-ai

📄 Simplify data extraction from PDFs and documents using AI APIs for...

25
Experimental
22 Blacksuan19/structx

Type-safe structured data extraction from text using LLMs.

24
Experimental
23 ParthaPRay/Docling_Colab

This repo contains a Google Colab notebook for handling Docling for data...

24
Experimental
24 Huang-lab/figure-extractor

Flask-based service using PDFFigures 2.0 to extract figures and tables from...

23
Experimental
25 msbayindir/rag-chunker

PDF → Mistral OCR → deterministic AST chunker with Anthropic contextual...

23
Experimental
26 mylxsw/extractor

extractor is an HTTP service used to convert PDF, Markdown, HTML, Docx,...

23
Experimental
27 anyparser/anyparser_crewai

Supercharge your AI workflows by combining Anyparser's advanced content...

23
Experimental
28 R0mb0/DocScraper_GUI

Automate your OSINT and document research. This desktop app searches the web...

23
Experimental
29 tarrantwrong366/OCR-Document-parser

Streamline document analysis by extracting key fields from PAN cards,...

22
Experimental
30 sussskiiirocks189/Scanned-PDF-to-Vector

📄 Convert scanned PDFs into searchable, copyable, and vectorized documents...

22
Experimental
31 KoDiit/llm-cerebroscope

🕵️ Analyze forensic data with LLM-CerebroScope, a powerful AI-driven engine...

22
Experimental
32 tbast24/docling_preprocessor_factory_public

Provide a local preprocessing pipeline to extract and standardize...

22
Experimental
33 DS4SD/quackling

Build document-native LLM applications

22
Experimental
34 anyparser/anyparser_langchain

Integrate Anyparser's powerful content extraction capabilities with...

21
Experimental
35 thomassuedbroecker/docling_preprocessor_factory_public

Docling Preprocessor Factory is an open-source project that provides a...

21
Experimental
36 ZhuJiaxin2/ragtable-extract

PDF table extraction for RAG - convert to clean HTML. Fast, local, no GPU.

20
Experimental
37 qbxlvnf11/ocr-document-parser-for-rag

OCR Document Markdown/HTML Parser for RAG

19
Experimental
38 segunalabi383/Data-Extractor

Structured Data Extractor for AI Agents. Search your documents or the web...

19
Experimental
39 jtgsystems/OCR-TOOL-REALTIME

Real-time OCR tool - Extract text from images and videos with live processing

19
Experimental
40 rlozanointel/Vromlix-AI-Engine

Cognitive ETL Engine & Architecture for Personal Knowledge Graphs....

19
Experimental
41 syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM...

18
Experimental
42 amirkiarafiei/docling-processor

A Docling extension for superior PDF/DOCX to Markdown conversion, featuring...

17
Experimental
43 xaman27x/Adobe-PDF-CTD

A high-performance, multi-stage document processor with two interconnected...

16
Experimental
44 muradali4442/thesis_extractor

Use text + tables from PDFs for RAG (BM25 + LLM).

16
Experimental
45 rangga276/ocr-llm-agent

🖼️ Extract and process text from images with an OCR AI agent, featuring...

15
Experimental
46 MuntahaShams/Document_AI_for_Custom_Data_Extraction

Automated extraction of structured information from semi-structured...

14
Experimental
47 mkai80/DocMeld

Transform documents into structured, agent-ready knowledge efficiently with...

14
Experimental
48 the-ai-entrepreneur-ai-hub/pdf-parser-api

PDF Parser API - Extract text, metadata & page data from PDF files via HTTP API

14
Experimental
49 sitemap-ai/backend

SitemapRAG is an open-source tool designed to leverage your website's...

13
Experimental
50 AlwaysSany/doc-extract-parse-index

The project is designed to streamline the workflow of extracting, parsing,...

12
Experimental
51 anyparser/anyparserjs

Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction....

12
Experimental
52 elchemista/doc_dig

DocDig is an Elixir wrapper around the Rust-based extractous library,...

12
Experimental
53 kreuzberg-dev/.github

Kreuzberg is a fast, polyglot document intelligence engine with a Rust core....

12
Experimental
54 qlfv/Docling-Testing

Repository for testing and demonstrating the capabilities of Docling for...

12
Experimental
55 r00ters/tika-plus-docker

Docker image to build Apache Tika Full + JPEG2000 + JBIG2

11
Experimental
56 jeehoonyu/PDF_Seperator

A lightweight tool for splitting PDF documents into chapters, optimized for...

11
Experimental
57 DGloi/utillity-files-to-text

Creates an endpoint to extract text content, images and document from...

11
Experimental
58 AhmedZeyadTareq/Llama-Parse-Content-Extraction

Extract and analyze content from various file formats including PDFs, text...

11
Experimental
59 johnzfitch/human-interface-markdown

Apple Human Interface Guidelines archive (1980-2014) - 35 documents...

11
Experimental
60 kevv1m/tikara

The metadata and text content extractor for almost every file type.

11
Experimental
61 anyparser/anyparser_llamaindex

Instantly access Anyparser's robust document processing and data extraction...

10
Experimental