CatchTheTornado/text-extract-api
Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymize documents, remove PII, and convert any document or picture to structured JSON or Markdown.
Built on FastAPI with Celery task queues and Redis caching, it supports pluggable OCR strategies (EasyOCR, MiniCPM-V, Llama 3.2-Vision, and remote services such as Marker) that can be swapped to match language and accuracy needs. The system can run entirely self-hosted via Docker, with no cloud dependencies, and offers optional remote Ollama integration for scaling LLM-based post-processing, document structure parsing, table extraction, and PII redaction across diverse document types.
2,989 stars.
Stars: 2,989
Forks: 252
Language: Python
License: MIT
Category:
Last pushed: Dec 08, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/CatchTheTornado/text-extract-api"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
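The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns JSON, which this page does not confirm, and the helper names `quality_url` and `fetch_quality` are hypothetical. Passing an API key is omitted because the page does not say how the key is supplied.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot break the URL.
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record for a repository.

    Assumption: the response body is JSON. The response schema is not
    documented here, so the decoded value is returned unchanged.
    """
    request = urllib.request.Request(
        quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("CatchTheTornado", "text-extract-api")` issues the same GET request as the curl command above. With the keyless limit of 100 requests/day, it may be worth caching the result locally rather than re-fetching on every run.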
Related tools
NanoNets/docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...
hashangit/Extract2MD
Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting...
Dicklesworthstone/llm_aided_ocr
Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...
th1nhhdk/local_ai_ocr
Local, offline (after initial setup), portable OCR software that can process images and PDF...
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡