CatchTheTornado/text-extract-api

Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymize documents and remove PII. Convert any document or picture to structured JSON or Markdown.

Score: 51 / 100 (Established)

Built on FastAPI with Celery task queues and Redis caching, it supports pluggable OCR strategies (EasyOCR, MiniCPM-V, Llama 3.2-Vision, and remote services like Marker) that can be swapped based on language and accuracy needs. The system runs entirely self-hosted via Docker, with no cloud dependencies. Optional remote Ollama integration helps scale LLM-based post-processing, document structure parsing, table extraction, and PII redaction across diverse document types.
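The "pluggable OCR strategies" idea can be illustrated with a small registry. This is a minimal sketch of the general strategy pattern, not the project's actual code: the names OCRStrategy, register, get_strategy, and DummyOCR are hypothetical, and the dummy backend stands in for a real engine such as EasyOCR so the example runs without any OCR library installed.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class OCRStrategy(ABC):
    """Hypothetical base class: one subclass per OCR backend."""

    @abstractmethod
    def extract_text(self, image_bytes: bytes) -> str:
        """Return the text recognized in the given image."""


# Registry mapping a strategy name to its class (illustrative only).
_STRATEGIES: Dict[str, Type[OCRStrategy]] = {}


def register(name: str) -> Callable[[Type[OCRStrategy]], Type[OCRStrategy]]:
    """Class decorator that records a strategy under the given name."""

    def decorator(cls: Type[OCRStrategy]) -> Type[OCRStrategy]:
        _STRATEGIES[name] = cls
        return cls

    return decorator


@register("dummy")
class DummyOCR(OCRStrategy):
    """Stand-in backend; a real one would call an OCR engine here."""

    def extract_text(self, image_bytes: bytes) -> str:
        return f"<{len(image_bytes)} bytes of image data>"


def get_strategy(name: str) -> OCRStrategy:
    """Instantiate the strategy registered under `name`."""
    try:
        return _STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"unknown OCR strategy: {name!r}") from None
```

With this layout, switching backends is a matter of passing a different name, for example get_strategy("dummy").extract_text(data), while the calling code stays unchanged.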

No package. No dependents.

Maintenance: 6 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 19 / 25

Stars: 2,989
Forks: 252
Language: Python
License: MIT
Last pushed: Dec 08, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/CatchTheTornado/text-extract-api"

Open to everyone: 100 requests/day with no key needed, or 1,000/day with a free key.
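The same request can be made from Python using only the standard library. This is a sketch under stated assumptions: the path layout (category, then owner, then repository) is inferred from the single URL in the curl example, and the response is assumed to be JSON, which the listing does not confirm. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner: str, repo: str, category: str = "llm-tools") -> str:
    """Build the endpoint URL; the segment order is inferred, not documented."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(owner: str, repo: str, category: str = "llm-tools",
                  timeout: float = 10.0):
    """Fetch the quality data, assuming the endpoint returns a JSON body."""
    with urlopen(build_quality_url(owner, repo, category), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_quality("CatchTheTornado", "text-extract-api") issues one request, which counts against the daily limit described above.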