Document Data Extraction LLM Tools

Tools for extracting, parsing, and converting structured data from unstructured documents (PDFs, images, invoices, etc.) using OCR and LLMs. This category does not include general document summarization, web scraping, or downstream analytics applications.

There are 74 document data extraction tools tracked. 5 score above 50 (established tier). The highest-rated is NanoNets/docstrange at 62/100 with 1,379 stars and 2,912 monthly downloads. 2 of the top 10 are actively maintained.
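The tier labels in the listing below appear to follow score bands. As a rough sketch in Python: the "above 50" cutoff for Established comes from the summary above, while the cutoff of 30 between Emerging and Experimental is only an assumption inferred from the listing (a score of 31 is labeled Emerging and 29 is labeled Experimental). The function name `tier_for` and both constants are hypothetical, not part of any published scoring API.

```python
from collections import Counter

# Stated in the summary: scores above 50 fall in the Established tier.
ESTABLISHED_ABOVE = 50
# ASSUMPTION: inferred from the listing (31 -> Emerging, 29 -> Experimental);
# the real boundary is not documented and may differ.
EMERGING_MIN = 30


def tier_for(score: int) -> str:
    """Map a 0-100 quality score to a tier label (illustrative only)."""
    if score > ESTABLISHED_ABOVE:
        return "Established"
    if score >= EMERGING_MIN:
        return "Emerging"
    return "Experimental"


# A few scores taken from the listing, one pair per tier boundary.
sample_scores = [62, 51, 49, 31, 29, 10]
tier_counts = Counter(tier_for(s) for s in sample_scores)
```

How a score of exactly 50 or exactly 30 is treated cannot be checked against the listing, since no tool currently has either score.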

Get all 74 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=document-data-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
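The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint and query parameters are copied from the curl command above, but the function names (`build_quality_url`, `fetch_quality`) are hypothetical, and the shape of the JSON response is not documented here, so the parsed payload is returned as-is. Whether a `limit` larger than 20 is accepted, and how an API key would be passed, are not stated in this page and are not assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command in this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="llm-tools",
                      subcategory="document-data-extraction",
                      limit=20):
    """Build the request URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=30):
    """Perform the GET request and return the decoded JSON payload unchanged.

    The response schema is an unknown here, so no fields are accessed.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality(build_quality_url())` issues one request, which counts against the daily quota of 100 unauthenticated requests mentioned above.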

# Tool Score Tier
1 NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or...

62
Established
2 Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error...

58
Established
3 th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can...

54
Established
4 hashangit/Extract2MD

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript...

53
Established
5 CatchTheTornado/text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API using state of the...

51
Established
6 emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡

49
Emerging
7 langstruct-ai/langstruct

Extract structured data from any content using LLMs.

48
Emerging
8 QuivrHQ/MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx,...

44
Emerging
9 CambioML/uniflow

LLM-based text extraction from unstructured data like PDFs, Words and HTMLs....

43
Emerging
10 Xyntopia/pydoxtools

Effortlessly extract information from unstructured data with this library,...

43
Emerging
11 Capevace/data-wizard

Extract structured data from PDFs, Word docs and images. Embeddable directly...

42
Emerging
12 enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering...

41
Emerging
13 arshad-yaseen/ocr-llm

⚡️ Fast, ultra-accurate text extraction from any image or PDF, including...

41
Emerging
14 langchain-ai/langchain-extract

🦜⛏️ Did you say you like data?

38
Emerging
15 heripo-lab/heripo-engine

TypeScript library for extracting structured data from archaeological...

37
Emerging
16 ShengjieJin/pdftrim-for-llm

An open-source Zotero plugin for vibe reading and LLM-assisted paper...

37
Emerging
17 LM-150A/docflash

⚡ AI-powered content intelligence with structured data extraction....

35
Emerging
18 junhoyeo/BetterOCR

🔍 Better text detection by combining multiple OCR engines (EasyOCR,...

35
Emerging
19 Traves-Theberge/webform-cli

A CLI tool for extracting unstructured data from websites using customizable...

32
Emerging
20 kennethleungty/LangExtract-Gemma-Structured-Extraction

Using LangExtract and Gemma 3 for structured information extraction from...

31
Emerging
21 Lazzzer/structurizer

Structurizer is a web application that helps you extract structured data...

31
Emerging
22 messeb/py-openai-receipt-extractor

Extracts structured data from receipts via OpenAI API

29
Experimental
23 mohanbing/st_doc_ext

This repository contains the code for the information extraction app that...

27
Experimental
24 phanxuanquang/XCan-AI

Extract the text, style, format, and layout from any images, even...

27
Experimental
25 lias-laboratory/cidoccrm-llm-extractor

A tool for automating CIDOC CRM knowledge graph population using Large...

27
Experimental
26 CredentialEngine/ctdl-xtra

CTDL xTRA (eXtensible Extract and Transformation Assistant) is a tool for...

26
Experimental
27 jamesmcroft/azure-document-intelligence-markdown-to-openai-data-extraction-sample

This sample demonstrates how to use Document Intelligence's Layout model to...

26
Experimental
28 jamesmcroft/ai-document-data-extraction-evaluation

This project demonstrates how to evaluate the use of LLMs and SLMs for...

26
Experimental
29 isaiah76/Reviewer

Extracts text from PDFs and PowerPoint documents and summarizes it into key...

25
Experimental
30 sabber-slt/NetExtract

NetExtract: Efficiently extract core content from any webpage and convert it...

25
Experimental
31 CambioML/any-parser

Accurate, private and configurable document retrieval LLM

24
Experimental
32 Juliofal4822/deepseek-ocr-multigpu-infer

🚀 Run efficient DeepSeek-OCR inference with Python scripts, supporting both...

23
Experimental
33 pranavgupta2603/SplitwiseGPTVision

SplitwiseGPT Vision: Streamline bill splitting with AI-driven image...

23
Experimental
34 Nguyendu9096/langcore-api

Provide production-ready HTTP API for structured document extraction using...

22
Experimental
35 ilyassuelen/InsightAI

InsightAI: Python-based document processing platform with chunking,...

22
Experimental
36 jaimvizalla01/aiwhisperer

📄 Optimize your large documents for AI analysis by converting and splitting...

22
Experimental
37 QuartzUnit/docpick

Lightweight OCR + Local LLM → Schema-based Structured JSON Extraction

22
Experimental
38 Randika00/VisionGPT-Extractor

An AI-powered tool designed to extract structured data from documents,...

22
Experimental
39 AFLucas-UOM/Accurate-Name-Extraction

2026 IEEE Conference on Artificial Intelligence (CAI26) · A modular computer...

22
Experimental
40 lecuong1502/NanoOCR

NanoOCR: internal document OCR system powered by GLM-OCR, with a FastAPI...

22
Experimental
41 zero-nnkn/cook-extract

An automated CLI tool meant to securely scan, download, and extract...

22
Experimental
42 wmahfoudh/crabocr

PDF and image-to-text converter with XFA forms support. It extracts embedded...

22
Experimental
43 mike-grant/intelliextract

Extract structured data from your unstructured data

20
Experimental
44 Tek233/Document-Processing-with-OCR

An agent for document processing using OCR

19
Experimental
45 SH-Nihil-Mukkesh-25/fractaAI

FractaAI is a Streamlit-based application for exploring and visualizing text...

19
Experimental
46 ThePagePage/docschema

Document schema extraction framework for regulated industries. Parse complex...

19
Experimental
47 ycastorium/lextract

LLM-powered text extraction library for Elixir

19
Experimental
48 Danitilahun/Document-processing-Pdf-Structured-Data-Extractor

This project demonstrates how to extract structured information from PDF...

18
Experimental
49 Ja-yy/Invoice-extractor

Streamlit app leveraging OpenAI's LLM for accurate invoice extraction,...

17
Experimental
50 agxp/docpulse

Async document intelligence API: upload any PDF/DOCX/image + a JSON Schema,...

16
Experimental
51 mu373/vertex-ai-ocr

Convert scanned book images to Markdown with Gemini

16
Experimental
52 lisstasy/Receipt_Scanner

Advanced receipt OCR and analysis using PaddleOCR, GPT-3.5-turbo, Plotly,...

15
Experimental
53 leadershop/marksheet-information-extraction-api

🎓 Extract and validate data from academic marksheets using AI for accurate...

14
Experimental
54 kninepro09/intelligent-document-understanding

📄 Analyze unstructured documents with an end-to-end NLP system for...

14
Experimental
55 voidpenguin-28/Textractor-ExtraExtensions

Several useful Textractor extensions, which are not available by default in...

14
Experimental
56 isobarbaric/SnapTrack

a receipt CLI

14
Experimental
57 awalz92/schema-extract-deke

Schema-driven structured data extraction from unstructured text using local...

14
Experimental
58 cucumberian/__ai_draft-parser

structured data extraction from drafts

14
Experimental
59 obieg-zero/plugin-wibor-docs

OCR, data extraction from contracts, Q&A about the contract

14
Experimental
60 xiangjianxiaohuangyu/paper-extract-app

AI-powered desktop tool for extracting structured information from academic...

14
Experimental
61 haritha8503/langextract

🌐 Extract languages from text seamlessly using LangExtract. Simplify...

14
Experimental
62 Jishnnu/InvoiceAI-Document-Parser

Simple Streamlit application that parses the data from Invoice images and...

13
Experimental
63 amit-timalsina/document_classification

All in one package for Document (image, pdf) Classification. Unified...

13
Experimental
64 JannesKlaas/doxstractor

Extract structured data from documents in a modular way using NLP and LLMs.

13
Experimental
65 andyed/fascist-language-analyzer

langchain+langextract gemini-api breakdown of Project2025 text by Umberto...

12
Experimental
66 PMTheTechGuy/document-entity-extractor

AI-powered document extractor for names, emails, and organizations. License: MIT

12
Experimental
67 RPramodh/LLM-based-Invoice-Extractor

This repository hosts the source code for an Invoice Extractor application...

12
Experimental
68 HTLinh0604/invoice_ocr_craft_llama3

This CRAFT + Llama 3.1 pipeline automates invoice semantic extraction,...

12
Experimental
69 junotb/omniparse-ai-stack

Document & image parsing full-stack demo. OCR, VLM, document layout...

11
Experimental
70 abdulmanafsahito/Vision-OCR

A general OCR and image-understanding web app. Upload an image, write a...

11
Experimental
71 r0b0tan/document-ai-demo

Full-stack demo for AI-assisted document analysis. Upload a document and let...

11
Experimental
72 VianneyMI/amplifai

Amplifai is a package that allows you to transform your raw unstructured...

11
Experimental
73 aryanesmaili/JobExtractor

A simple job extractor for job posting websites that uses the llama3.2 model to...

10
Experimental
74 sahajrajmalla/invoice-data-extraction-llm

Extract invoice document data using Large Language Models.

10
Experimental