Document Data Extraction LLM Tools
Tools for extracting, parsing, and converting structured data from unstructured documents (PDFs, images, invoices, etc.) using OCR and LLMs. Does NOT include general document summarization, web scraping, or downstream analytics applications.
74 document data extraction tools are tracked. Five score above 50 (the Established tier). The highest-rated is NanoNets/docstrange at 62/100, with 1,379 stars and 2,912 monthly downloads. Two of the top 10 are actively maintained.
Get the projects as JSON (the request below sets limit=20, so it may return fewer than all 74):
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=document-data-extraction&limit=20"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
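The same request can be made from Python. The following is a minimal sketch using only the standard library; it reuses the URL and query parameters from the curl command above, while the helper names `build_url` and `fetch_dataset` are illustrative. Nothing is assumed about the shape of the JSON response, so the decoded value is returned as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="llm-tools", subcategory="document-data-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_dataset(url, timeout=10):
    """Send a GET request and decode the body as JSON.

    The response schema is not documented here, so no fields are assumed.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network request and counts against the daily limit):
#   data = fetch_dataset(build_url())
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending one of the 100 keyless daily requests.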
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | NanoNets/docstrange | Extract and convert data from any document, images, pdfs, word doc, ppt or... | | Established |
| 2 | Dicklesworthstone/llm_aided_ocr | Enhances Tesseract OCR output using LLMs (local or API) for error... | | Established |
| 3 | th1nhhdk/local_ai_ocr | A local, offline (after initial setup), portable OCR software that can... | | Established |
| 4 | hashangit/Extract2MD | Extract2MD is a powerful and versatile AI-enabled client-side JavaScript... | | Established |
| 5 | CatchTheTornado/text-extract-api | Document (PDF, Word, PPTX ...) extraction and parse API using state of the... | | Established |
| 6 | emcf/thepipe | Get clean data from tricky documents, powered by vision-language models | | Emerging |
| 7 | langstruct-ai/langstruct | Extract structured data from any content using LLMs. | | Emerging |
| 8 | QuivrHQ/MegaParse | File Parser optimised for LLM Ingestion with no loss. Parse PDFs, Docx,... | | Emerging |
| 9 | CambioML/uniflow | LLM-based text extraction from unstructured data like PDFs, Words and HTMLs.... | | Emerging |
| 10 | Xyntopia/pydoxtools | Effortlessly extract information from unstructured data with this library,... | | Emerging |
| 11 | Capevace/data-wizard | Extract structured data from PDFs, Word docs and images. Embeddable directly... | | Emerging |
| 12 | enoch3712/ExtractThinker | ExtractThinker is a Document Intelligence library for LLMs, offering... | | Emerging |
| 13 | arshad-yaseen/ocr-llm | Fast, ultra-accurate text extraction from any image or PDF, including... | | Emerging |
| 14 | langchain-ai/langchain-extract | Did you say you like data? | | Emerging |
| 15 | heripo-lab/heripo-engine | TypeScript library for extracting structured data from archaeological... | | Emerging |
| 16 | ShengjieJin/pdftrim-for-llm | An open-source Zotero plugin for vibe reading and LLM-assisted paper... | | Emerging |
| 17 | LM-150A/docflash | AI-powered content intelligence with structured data extraction.... | | Emerging |
| 18 | junhoyeo/BetterOCR | Better text detection by combining multiple OCR engines (EasyOCR,... | | Emerging |
| 19 | Traves-Theberge/webform-cli | A CLI tool for extracting unstructured data from websites using customizable... | | Emerging |
| 20 | kennethleungty/LangExtract-Gemma-Structured-Extraction | Using LangExtract and Gemma 3 for structured information extraction from... | | Emerging |
| 21 | Lazzzer/structurizer | Structurizer is a web application that helps you extract structured data... | | Emerging |
| 22 | messeb/py-openai-receipt-extractor | Extracts structured data from receipts via OpenAI API | | Experimental |
| 23 | mohanbing/st_doc_ext | This repository contains the code for the information extraction app that... | | Experimental |
| 24 | phanxuanquang/XCan-AI | Extract the text, style, format, and layout from any images, even... | | Experimental |
| 25 | lias-laboratory/cidoccrm-llm-extractor | A tool for automating CIDOC CRM knowledge graph population using Large... | | Experimental |
| 26 | CredentialEngine/ctdl-xtra | CTDL xTRA (eXtensible Extract and Transformation Assistant) is a tool for... | | Experimental |
| 27 | jamesmcroft/azure-document-intelligence-markdown-to-openai-data-extraction-sample | This sample demonstrates how to use Document Intelligence's Layout model to... | | Experimental |
| 28 | jamesmcroft/ai-document-data-extraction-evaluation | This project demonstrates how to evaluate the use of LLMs and SLMs for... | | Experimental |
| 29 | isaiah76/Reviewer | Extracts text from PDFs and PowerPoint documents and summarizes it into key... | | Experimental |
| 30 | sabber-slt/NetExtract | NetExtract: Efficiently extract core content from any webpage and convert it... | | Experimental |
| 31 | CambioML/any-parser | Accurate, private and configurable document retrieval LLM | | Experimental |
| 32 | Juliofal4822/deepseek-ocr-multigpu-infer | Run efficient DeepSeek-OCR inference with Python scripts, supporting both... | | Experimental |
| 33 | pranavgupta2603/SplitwiseGPTVision | SplitwiseGPT Vision: Streamline bill splitting with AI-driven image... | | Experimental |
| 34 | Nguyendu9096/langcore-api | Provide a production-ready HTTP API for structured document extraction using... | | Experimental |
| 35 | ilyassuelen/InsightAI | InsightAI: Python-based document processing platform with chunking,... | | Experimental |
| 36 | jaimvizalla01/aiwhisperer | Optimize your large documents for AI analysis by converting and splitting... | | Experimental |
| 37 | QuartzUnit/docpick | Lightweight OCR + Local LLM: Schema-based Structured JSON Extraction | | Experimental |
| 38 | Randika00/VisionGPT-Extractor | An AI-powered tool designed to extract structured data from documents,... | | Experimental |
| 39 | AFLucas-UOM/Accurate-Name-Extraction | 2026 IEEE Conference on Artificial Intelligence (CAI26) · A modular computer... | | Experimental |
| 40 | lecuong1502/NanoOCR | NanoOCR: Internal document OCR system powered by GLM-OCR, with a FastAPI... | | Experimental |
| 41 | zero-nnkn/cook-extract | An automated CLI tool meant to securely scan, download, and extract... | | Experimental |
| 42 | wmahfoudh/crabocr | PDF and image-to-text converter with XFA forms support. It extracts embedded... | | Experimental |
| 43 | mike-grant/intelliextract | Extract structured data from your unstructured data | | Experimental |
| 44 | Tek233/Document-Processing-with-OCR | An agent for document processing using OCR | | Experimental |
| 45 | SH-Nihil-Mukkesh-25/fractaAI | FractaAI is a Streamlit-based application for exploring and visualizing text... | | Experimental |
| 46 | ThePagePage/docschema | Document schema extraction framework for regulated industries. Parse complex... | | Experimental |
| 47 | ycastorium/lextract | LLM-powered text extraction library for Elixir | | Experimental |
| 48 | Danitilahun/Document-processing-Pdf-Structured-Data-Extractor | This project demonstrates how to extract structured information from PDF... | | Experimental |
| 49 | Ja-yy/Invoice-extractor | Streamlit app leveraging OpenAI's LLM for accurate invoice extraction,... | | Experimental |
| 50 | agxp/docpulse | Async document intelligence API: upload any PDF/DOCX/image + a JSON Schema,... | | Experimental |
| 51 | mu373/vertex-ai-ocr | Convert scanned book images to Markdown with Gemini | | Experimental |
| 52 | lisstasy/Receipt_Scanner | Advanced receipt OCR and analysis using PaddleOCR, GPT-3.5-turbo, Plotly,... | | Experimental |
| 53 | leadershop/marksheet-information-extraction-api | Extract and validate data from academic marksheets using AI for accurate... | | Experimental |
| 54 | kninepro09/intelligent-document-understanding | Analyze unstructured documents with an end-to-end NLP system for... | | Experimental |
| 55 | voidpenguin-28/Textractor-ExtraExtensions | Several useful Textractor extensions, which are not available by default in... | | Experimental |
| 56 | isobarbaric/SnapTrack | A receipt CLI | | Experimental |
| 57 | awalz92/schema-extract-deke | Schema-driven structured data extraction from unstructured text using local... | | Experimental |
| 58 | cucumberian/__ai_draft-parser | Structured data extraction from drafts | | Experimental |
| 59 | obieg-zero/plugin-wibor-docs | OCR, data extraction from contracts, Q&A about the contract | | Experimental |
| 60 | xiangjianxiaohuangyu/paper-extract-app | AI-powered desktop tool for extracting structured information from academic... | | Experimental |
| 61 | haritha8503/langextract | Extract languages from text seamlessly using LangExtract. Simplify... | | Experimental |
| 62 | Jishnnu/InvoiceAI-Document-Parser | Simple Streamlit application that parses the data from Invoice images and... | | Experimental |
| 63 | amit-timalsina/document_classification | All in one package for Document (image, pdf) Classification. Unified... | | Experimental |
| 64 | JannesKlaas/doxstractor | Extract structured data from documents in a modular way using NLP and LLMs. | | Experimental |
| 65 | andyed/fascist-language-analyzer | langchain+langextract gemini-api breakdown of Project2025 text by Umberto... | | Experimental |
| 66 | PMTheTechGuy/document-entity-extractor | AI-powered document extractor for names, emails, and organizations. License: MIT | | Experimental |
| 67 | RPramodh/LLM-based-Invoice-Extractor | This repository hosts the source code for an Invoice Extractor application... | | Experimental |
| 68 | HTLinh0604/invoice_ocr_craft_llama3 | This CRAFT + Llama 3.1 pipeline automates invoice semantic extraction,... | | Experimental |
| 69 | junotb/omniparse-ai-stack | Document & image parsing full-stack demo. OCR, VLM, document layout... | | Experimental |
| 70 | abdulmanafsahito/Vision-OCR | A general OCR and image-understanding web app. Upload an image, write a... | | Experimental |
| 71 | r0b0tan/document-ai-demo | Full-stack demo for AI-assisted document analysis. Upload a document and let... | | Experimental |
| 72 | VianneyMI/amplifai | Amplifai is a package that allows you to transform your raw unstructured... | | Experimental |
| 73 | aryanesmaili/JobExtractor | A simple job extractor for job posting websites that uses the llama3.2 model to... | | Experimental |
| 74 | sahajrajmalla/invoice-data-extraction-llm | Extract invoice document data using Large Language Models. | | Experimental |
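The tier labels in the table above fall into contiguous rank ranges, so the tier sizes can be checked with a few lines of Python. This is a small sketch: the rank ranges are read off the table, while the names `TIER_RANGES`, `tier_counts`, and `tier_shares` are illustrative and not part of any published API.

```python
# Inclusive rank ranges per tier, read off the table above.
TIER_RANGES = {
    "Established": range(1, 6),      # ranks 1-5
    "Emerging": range(6, 22),        # ranks 6-21
    "Experimental": range(22, 75),   # ranks 22-74
}


def tier_counts(ranges=TIER_RANGES):
    """Return the number of tools in each tier."""
    return {tier: len(ranks) for tier, ranks in ranges.items()}


def tier_shares(counts):
    """Return each tier's share of all tools, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {tier: round(100 * n / total, 1) for tier, n in counts.items()}


counts = tier_counts()
print(counts)               # {'Established': 5, 'Emerging': 16, 'Experimental': 53}
print(tier_shares(counts))  # {'Established': 6.8, 'Emerging': 21.6, 'Experimental': 71.6}
```

The counts sum to the 74 tracked tools, and the 5 Established entries match the number of tools reported as scoring above 50.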