QuivrHQ/MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.


Emerging

Preserves structural elements such as tables, headers, footers, and images by using multimodal vision models (GPT-4o, Claude 3.5), which achieve a 0.87 similarity score against source documents. Offers both a Python library and a REST API, along with a modular postprocessing architecture and benchmark evaluation tools for comparing parser performance.
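The listing does not say how the 0.87 similarity figure is computed. As a rough illustration of what comparing parser output against a source document can look like, here is a minimal sketch that uses the character-level ratio from Python's standard `difflib` module. The metric, the function names `similarity` and `rank_parsers`, and the sample strings are all assumptions made for illustration, not MegaParse's own benchmark code.

```python
from difflib import SequenceMatcher


def similarity(reference: str, parsed: str) -> float:
    """Return a ratio in [0, 1] comparing parsed text to a reference text.

    Assumption: difflib's ratio stands in for whatever metric the
    project's benchmark actually uses.
    """
    return SequenceMatcher(None, reference, parsed).ratio()


def rank_parsers(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each parser's output against the reference, best score first."""
    scores = [(name, similarity(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reference text and parser outputs.
    reference = "| Name | Qty |\n| Bolt | 4 |"
    outputs = {
        "parser_a": "| Name | Qty |\n| Bolt | 4 |",  # keeps the table structure
        "parser_b": "Name Qty Bolt 4",               # flattens the table
    }
    for name, score in rank_parsers(reference, outputs):
        print(f"{name}: {score:.2f}")
```

A character-level ratio rewards output that keeps table delimiters and ordering intact, which is one reason a structure-preserving parser would tend to score higher than one that flattens tables into running text.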

7,347 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25


Stars: 7,347

Forks: 416

Language: Python

License: Apache-2.0

Higher-rated alternatives

NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...

hashangit/Extract2MD

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting...

Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...

th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can process images and PDF...

emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡
