Document OCR Extraction NLP Tools

Tools for extracting structured and unstructured text from documents (PDFs, scans, receipts, invoices, IDs) using OCR and computer vision. Does NOT include general document analysis, summarization, or retrieval systems without an extraction focus.

57 document OCR extraction tools are tracked. Two score above 70 (Verified tier). The highest-rated is deepdoctection/deepdoctection at 85/100, with 3,147 stars and 5,833 monthly downloads. Two of the top 10 are actively maintained.

Get all 57 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=document-ocr-extraction&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
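
The same request can be made from a script. The following is a minimal Python sketch, assuming only the URL and query parameters shown in the curl command above. The helper names `build_quality_url` and `fetch_quality` are illustrative and not part of the service, and the shape of the JSON response is not documented here, so the sketch returns it unparsed beyond JSON decoding. Whether a larger `limit` returns all 57 projects in one call is also an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="nlp", subcategory="document-ocr-extraction", limit=20):
    """Build the dataset URL with the same query parameters as the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=30):
    """Perform the GET request and decode the body as JSON.

    The structure of the decoded value is an unknown here; inspect it before
    relying on any particular field.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality(build_quality_url())` performs the network request and counts against the daily request quota mentioned above.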

# | Tool | Description | Score | Tier
1 | deepdoctection/deepdoctection | A Repo For Document AI | 85 | Verified
2 | deanmalmgren/textract | extract text from any document. no muss. no fuss. | 78 | Verified
3 | eikek/docspell | Assist in organizing your piles of documents, resulting from scanners,... | 54 | Established
4 | clovaai/donut | Official Implementation of OCR-free Document Understanding Transformer... | 45 | Emerging
5 | axa-group/Parsr | Transforms PDF, Documents and Images into Enriched Structured Data | 44 | Emerging
6 | zzzDavid/ICDAR-2019-SROIE | ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | 44 | Emerging
7 | Saransh-cpp/OCRed | Clever, simple, and intuitive wrapper functionalities for OCRing specific... | 40 | Emerging
8 | rithulkamesh/docproc | Document Intelligence Platform: Extract, refine, and query documents with... | 38 | Emerging
9 | gnana70/tamil_ocr | OCR Tamil is a powerful tool that can detect and recognize text in Tamil... | 37 | Emerging
10 | JonnoB/reading_the_unreadable | A pipeline for performing OCR on historical newspapers | 36 | Emerging
11 | Rushi-Balapure/pdf_2_json_extractor | A high-performance Python library for extracting structured content from PDF... | 34 | Emerging
12 | NjoyimPeguy/ICDAR-2019-RRC-SROIE | ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information... | 33 | Emerging
13 | s3nh/text-detector | Tool which allows you to detect and translate text. | 32 | Emerging
14 | gani114433/OCR_workflow | N8N OCR workflow | 30 | Emerging
15 | Shulk97/daniel | This repository contains the implementation of DANIEL. (A fast Document... | 29 | Experimental
16 | clovaai/webvicob | Official Implementation of Web-based Visual Corpus Builder (Webvicob), ICDAR 2023 | 29 | Experimental
17 | louisbrulenaudet/apple-ocr | Easy-to-Use Apple Vision wrapper for text extraction, scalar representation... | 28 | Experimental
18 | situx/CuneiPainter | An App to recognize cuneiform characters on your Android phone | 28 | Experimental
19 | lukevanin/OCRAI | Optical Character Recognition Artificial Intelligence iOS app for Udacity nanodegree | 28 | Experimental
20 | trhgquan/OCR_chu_nom | Chữ Nôm OCR course project (CSC15006) | 27 | Experimental
21 | Samuel310/Text-Recognition | Android application to extract text from an image using firebase MLkit. | 26 | Experimental
22 | codebywiam/invoice-ocr | This project extracts key fields (like invoice number, date, total, and... | 26 | Experimental
23 | ierolsen/Business-Card-Reader-App | The main idea of this project is extracting entities from the scanned... | 24 | Experimental
24 | jweissenberger/auto-docs | A CLI tool that automatically generates documentation for python code using... | 24 | Experimental
25 | Zer0-Bug/ID-Document_Recognition | End-to-end offline OCR and semantic parsing pipeline for identity documents... | 21 | Experimental
26 | macosnik/Recognize-text-from-image | Telegram bot for recognizing text in images using neural networks | 21 | Experimental
27 | DecisionNerd/docunderstand | A python system for Visually Rich Document Understanding | 19 | Experimental
28 | SundayOni/document-ocr-nlp-pipeline | End-to-end pipeline for extracting and structuring text from scanned, PDF... | 19 | Experimental
29 | nicdriebe/ocr-ner-sharepic-evaluation | Bachelor's Thesis: Evaluation of open-source OCR and NER pipelines... | 19 | Experimental
30 | michael-borck/document-lens | Analyzes text documents for readability, academic integrity, and linguistic... | 19 | Experimental
31 | avrtt/MobileEAST | Paper and code for a lightweight & fast scene text detection based on EAST... | 17 | Experimental
32 | itshivams/Persona-Driven-Document-Intelligence | Persona-Driven Document Intelligence – A lightweight, CPU-only system that... | 17 | Experimental
33 | isikmuhamm/unstructured-data-extraction-engine | Automated data ingestion pipeline for extracting plain text from proprietary... | 17 | Experimental
34 | transybao1393/android-ocr | Android OCR using CameraX, support MLKit, support offline mode, support... | 17 | Experimental
35 | fmadore/iwac-ai-pipelines | AI pipelines for Omeka S digital collections - OCR correction, entity... | 17 | Experimental
36 | erl-ang/interactive-ocr | Implementation of a couple of heuristics that estimate OCR quality without... | 16 | Experimental
37 | meck93/ScanOrUploadMe | A React-Native mobile application that digitalizes physical event... | 16 | Experimental
38 | xuan3986/Texthandle | Open source project provided to Baidu PaddlePaddle community. Apply... | 15 | Experimental
39 | marekpridal/Vision-OCR-Demo | Sample project for on-device text recognition | 15 | Experimental
40 | iytedbb/OSPA-SuryaOCR | OSPA SuryaOCR – Advanced document processing framework for historical... | 15 | Experimental
41 | SivaPA08/text-capture | Captures screen regions, extracts text and copies it to the clipboard | 14 | Experimental
42 | dev-sungman/recent-ocr-papers | this repo includes paper review, code in text detection, text recognition,... | 14 | Experimental
43 | asainov1/invoice-generator-agent | Telegram bot for invoice generation: OCR (Tesseract) → NLP parsing → PDF... | 14 | Experimental
44 | Komorebirumu/awe-ms-20260315-2211-01 | AI Historical Document Transcription & Analysis CLI Tool | 14 | Experimental
45 | archity/doc-scanner | Computer Vision and NLP based document scanner, text extractor and summarizer. | 14 | Experimental
46 | esteininger/file-processor | A Python library that uses AI to convert unstructured files (like PDFs,... | 14 | Experimental
47 | HySonLab/TeBaAb | TeBaAb: Text-Based Antigen-Conditioned Antibody Redesign via Directed Evolution | 14 | Experimental
48 | Keizouw8/OCR-Command-Line-Tool | A tool that can be used in the CLI or NodeJS environment to scan for text in... | 13 | Experimental
49 | shubh11220/PDF-Text-Extraction | Create a data extraction platform for users to conveniently obtain data in a... | 13 | Experimental
50 | avirajsa/DocuMind | DocuMind - Python project for document analysis. Analyze, summarize, and... | 13 | Experimental
51 | Cool-fire/Snipps | 📚 📝📜 A simple android app to convert information into digital snippets,... | 12 | Experimental
52 | mishaelaaa/OCR | This is a project in which I store all my attempts to create an application... | 12 | Experimental
53 | saloni-rangari/nlp-ocr-marathi | This mini-project implements Marathi handwritten text recognition using... | 12 | Experimental
54 | husnutass/ml_kit_app | A Flutter mobile app to read data from business cards and save that data in... | 11 | Experimental
55 | emilyhasson/Text-Recognition | Scripts to convert low-quality scanned PDFs to text files using Google Cloud... | 11 | Experimental
56 | fdovila/PDF2TXT4NLP | an online Python web app that accepts academic articles in PDF format and... | 10 | Experimental
57 | Prateek32177/TextlyAI | AI-powered tool to extract and classify text from images using OCR and... | 10 | Experimental
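
The "score above 70" cutoff from the summary can be checked against the listing with a short filter. This is a small Python sketch using the rank, repository, score, and tier of the first five entries copied from the listing above. The tuple layout and the function name `repos_above` are illustrative choices, and only the 70-point Verified cutoff is stated by the source; the boundaries of the other tiers are not given and are not assumed here.

```python
# (rank, repo, score, tier) for the first five entries of the listing.
TOP_ROWS = [
    (1, "deepdoctection/deepdoctection", 85, "Verified"),
    (2, "deanmalmgren/textract", 78, "Verified"),
    (3, "eikek/docspell", 54, "Established"),
    (4, "clovaai/donut", 45, "Emerging"),
    (5, "axa-group/Parsr", 44, "Emerging"),
]


def repos_above(rows, threshold=70):
    """Return repository names whose score is strictly above the threshold."""
    return [repo for _rank, repo, score, _tier in rows if score > threshold]


verified = repos_above(TOP_ROWS)
```

With the five rows shown, `verified` holds the two repositories labelled Verified, which agrees with the count in the summary.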