PDF Document Processing RAG Tools

Tools and systems for extracting, parsing, and retrieving information from PDF documents through OCR, layout analysis, and structured data conversion. Does NOT include general chatbots, multi-source document handling beyond PDFs, or chat interfaces built on top of processed PDFs.

There are 58 PDF document processing tools tracked. 2 score above 50 (Established tier). The highest-rated is thiswillbeyourgithub/wdoc at 66/100 with 510 stars and 840 monthly downloads. 1 of the top 10 is actively maintained.

Get all 58 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=rag&subcategory=pdf-document-processing&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
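The same request can be made from a script. The sketch below is a minimal, standard-library-only example: the base URL and the `domain`, `subcategory`, and `limit` parameters are taken from the curl command above, while the helper names `build_quality_url` and `fetch_projects` are hypothetical. The shape of the JSON response is not documented on this page, so the code returns the decoded payload without interpreting it. Whether the API accepts a `limit` above 20 is also an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="rag", subcategory="pdf-document-processing", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(url, timeout=10.0):
    """Fetch the URL and decode the body as JSON.

    The response structure is not documented here, so the decoded
    payload is returned as-is for the caller to inspect.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request and counts against the daily quota):
#   payload = fetch_projects(build_quality_url())
#   print(type(payload))
```

Keeping URL construction separate from the network call makes it easy to check the query string before spending any of the 100 keyless requests per day.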

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | thiswillbeyourgithub/wdoc | Summarize and query from a lot of heterogeneous documents. Any LLM provider,... | 66 | Established |
| 2 | laxmimerit/RAGWire | Production-grade RAG toolkit: ingest PDFs, DOCX, XLSX into Qdrant with LLM... | 51 | Established |
| 3 | NoEdgeAI/pdfdeal | A Python wrapper for the Doc2X API that comes with native text processing... | 44 | Emerging |
| 4 | Arterning/DeepParseX | DeepParseX is a powerful multimodal document parsing and knowledge management platform supporting PDF, Word, Excel, PPT, images, video, audio... | 44 | Emerging |
| 5 | David-Lolly/ViewRAG | An image-and-text PDF RAG system with layout-aware chunking, deep chart and figure understanding, and precise visual source tracing. Multimodal PDF RAG: Features... | 40 | Emerging |
| 6 | 3DCF-Labs/doc2dataset | 3DCF / doc2dataset: token-efficient document layer with NumGuard numeric... | 36 | Emerging |
| 7 | preprocess-co/rag-document-viewer | RAG Document Viewer is an open-source library that generates high-fidelity... | 36 | Emerging |
| 8 | atpuxiner/docsloader | This is a document loader. (Document parsing loader, RAG document parsing, RAG knowledge base construction) | 35 | Emerging |
| 9 | zzstoatzz/raggy | Scraping and querying documents for LLMs | 31 | Emerging |
| 10 | ManiAm/RAG-Mail | RAG-Mail is a thread-aware email processing system that semantically indexes... | 27 | Experimental |
| 11 | e-kotov/rdocdump | rdocdump: Dump ‘R’ Package Source, Documentation, and Vignettes into One File | 25 | Experimental |
| 12 | salameaz/pdf-process-rag | A Python-based application that extracts and processes PDF content using a... | 24 | Experimental |
| 13 | MalayAgr/bookacle | bookacle is a RAPTOR-based RAG application to aid in understanding complex... | 23 | Experimental |
| 14 | antoninomariarizzo/rag | A Python library for Retrieval-Augmented Generation (RAG) that extracts text... | 23 | Experimental |
| 15 | MohammedNasserAhmed/RAGPost | RAGPost is an intelligent blog post generator that leverages... | 23 | Experimental |
| 16 | Nexialism-Friday/hwpx-toolkit | HWP/HWPX document processing toolkit: extraction, generation, vectorization... | 22 | Experimental |
| 17 | salim-lakhal/rag-document-pipeline | Production RAG pipeline: multi-format document extraction → intelligent... | 22 | Experimental |
| 18 | S0lkar/IntGathering-x-RAG--BlazingDocs | RAG-based tool for document batch querying. | 22 | Experimental |
| 19 | SStephanJX/Snowflake-RAG-System | Production-ready Snowflake RAG system with type-specific chunking | 22 | Experimental |
| 20 | natanhp/PythoRAG | PythoRAG is a simple, open-source project designed to facilitate... | 22 | Experimental |
| 21 | AKSHAYINDIA05/Document_Comparison_System | Implement a Retrieval Augmented Generation (RAG) with a user interface for... | 22 | Experimental |
| 22 | Besthope-Official/predoc | Preprocess document service for RAG (Retrieval Augmented Generation) | 21 | Experimental |
| 23 | iamarunbrahma/rag-ingest | RAG-Ingest: A tool for converting PDFs to markdown and indexing them for... | 20 | Experimental |
| 24 | ParthSareen/simple-rag | Too many docs? Quickly search over any PDF or Markdown documents | 20 | Experimental |
| 25 | yotaken/docuggez | Automatic project documentator | 19 | Experimental |
| 26 | FrostWillmott/FinDocBot | Modern RAG, designed for semantic search and question-answering over... | 19 | Experimental |
| 27 | juhaodong/large-file-translator | Extract the content while preserving the layout, images, and tables. Perform... | 18 | Experimental |
| 28 | liunian-Jay/MU-GOT | PDF Parsing Tool: GOT's vLLM acceleration implementation, MinerU for layout... | 18 | Experimental |
| 29 | JochiRaider/sievio | Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM... | 17 | Experimental |
| 30 | este6an13/checks-ocr | Software that applies OCR + RAG to extract bank checks information | 16 | Experimental |
| 31 | lolbigtime/Folio | Zero-config Swifty RAG toolkit for iOS & macOS: PDF/text loaders, universal... | 16 | Experimental |
| 32 | silas-rickards/PDF-LLM-RAG | A RAG pipeline specialized for local PDFs. | 16 | Experimental |
| 33 | Vibhuarvind/Content-Engine-RAG-for-PDF | Content Engine is a RAG system that analyzes and compares multiple PDF... | 15 | Experimental |
| 34 | slvg01/90.10d_RAG_OnTheFly | An app for uploading files (ppt, doc, pdf, zip) and running RAG on their content | 15 | Experimental |
| 35 | A-Najjar/rag-factory | Modular RAG system with Factory Pattern - Load PDF/Word docs, configure... | 15 | Experimental |
| 36 | husaynirfan1/PullData | RAG with response in what you need. Output directly with supported format... | 15 | Experimental |
| 37 | solomonjie/rag-processor | RAG index pipeline, from raw data cleaning to indexing. Each step communicates via... | 15 | Experimental |
| 38 | JuliaGenAI/DocsScraper.jl | Efficient RAG knowledge pack creator from online Julia documentation | 14 | Experimental |
| 39 | ashwyan/local-llm-pdf-analyzer | A local AI tool using Ollama (Llama 3) to analyze PDF documents and generate... | 14 | Experimental |
| 40 | Clearedge-AI/clearedge | Build a RAG preprocessing pipeline | 14 | Experimental |
| 41 | alrafiabdullah/doc_rag | Document RAG with HuggingFace Token | 14 | Experimental |
| 42 | yagmur-kurtbas/pdf-rag-pipeline | A RAG pipeline for PDF question answering using LangChain, ChromaDB and Groq... | 14 | Experimental |
| 43 | ahmad-albasha/DataForg | PDF to JSON pipeline with intelligent bilingual chunking (AR/EN) and a fully... | 14 | Experimental |
| 44 | ritheesh-dev/Local-PDF-RAG-System | Privacy-first local PDF RAG system using FAISS + Ollama: fully offline,... | 14 | Experimental |
| 45 | avocatt/ocr-rag-highlighted-viewer | OCR + RAG document viewer with highlighted search results | 14 | Experimental |
| 46 | will695672804/graphrag-engineering-pdfs | 🔍 Extract entities and build knowledge graphs from large engineering PDFs,... | 14 | Experimental |
| 47 | fllin1/mawa | RAG workflow (Mistral OCR + Gemini) for complex regulatory PDFs.... | 12 | Experimental |
| 48 | zenmakhlouf/arabic-rag-pipeline | A single-file RAG pipeline for Arabic PDF lectures with two-stage retrieval,... | 11 | Experimental |
| 49 | shivkhurana/technical-docs-rag-pipeline | Enterprise-grade RAG (Retrieval Augmented Generation) pipeline using... | 11 | Experimental |
| 50 | andersborgabiro/RagQueryDocuments | RAG application that makes it easy to search in multiple documents | 11 | Experimental |
| 51 | malkhabir/EasyRag | EasyRag shows you how to embed and query table documents from your own local... | 11 | Experimental |
| 52 | julicq/PDF-RAG-Query | RAG model for PDF database | 11 | Experimental |
| 53 | Qinnovation123/papers | PDF embedding workflow | 11 | Experimental |
| 54 | bazilicum/pdf-query | This project processes and retrieves information from PDF file or PDF... | 11 | Experimental |
| 55 | adrianizmi/Simple-RAG | Minimalist RAG system built from scratch using Python, local embeddings, and... | 11 | Experimental |
| 56 | 2dogsandanerd/rag_pdf_audit | Tool to compare PDF extraction methods | 10 | Experimental |
| 57 | nkarast/ask-my-pdf | A RAG application using local LLM to answer questions given a PDF. | 10 | Experimental |
| 58 | sfkunal/librarian | Librarian is a RAG-assisted LLM application that allows any user to query... | 10 | Experimental |