eellak/glossAPI

Greek Dataset Production from PDF+

Established

Combines PyPDFium extraction with pluggable OCR backends (Docling, RapidOCR, DeepSeek) and Rust-accelerated text cleaning to convert academic PDFs into structured Markdown, with configurable noise filtering and math enrichment. The modular `Corpus` API enables resumable, multi-stage processing (download → extract → clean → section → annotate → export) with Greek-optimized metadata and section classification. Mode-specific venv provisioning supports vanilla, GPU-accelerated, and large-model OCR pipelines.
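To make "resumable, multi-stage processing" concrete, here is a minimal sketch of the general pattern: each finished stage is recorded on disk, so a later run skips what is already done. This is not glossAPI's code. The class `ResumablePipeline`, the `state.json` file, and the `handlers` mapping are hypothetical names invented for illustration; only the stage order comes from the description above.

```python
import json
import tempfile
from pathlib import Path

# Stage order as listed in the project description.
STAGES = ["download", "extract", "clean", "section", "annotate", "export"]


class ResumablePipeline:
    """Toy stand-in for a resumable pipeline (hypothetical, not glossAPI's Corpus)."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)
        self.workdir.mkdir(parents=True, exist_ok=True)
        # Hypothetical checkpoint file listing the stages already completed.
        self.state_file = self.workdir / "state.json"

    def completed(self):
        """Return the list of stages recorded as finished in earlier runs."""
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return []

    def run(self, handlers, stop_after=None):
        """Run every stage not yet completed, in order; return the stages run now."""
        done = self.completed()
        ran = []
        for stage in STAGES:
            if stage not in done:
                handlers[stage]()  # do the work for this stage
                done.append(stage)
                # Persist progress after each stage so an interruption loses little.
                self.state_file.write_text(json.dumps(done))
                ran.append(stage)
            if stage == stop_after:
                break
        return ran


# Usage: the first run stops early, the second resumes where it left off.
workdir = tempfile.mkdtemp()
handlers = {name: (lambda: None) for name in STAGES}  # no-op placeholder stages
first = ResumablePipeline(workdir).run(handlers, stop_after="extract")
second = ResumablePipeline(workdir).run(handlers)
print(first)
print(second)
```

Writing the checkpoint after every stage, rather than once at the end, is what makes a restart cheap. A real tool would likely track progress per document as well as per stage.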

128 stars. Available on PyPI.

Maintenance 13 / 25

Adoption 10 / 25

Maturity 25 / 25

Community 20 / 25

Stars

128

Forks

Language

Python

License

—

Related tools

mozilla-ai/structured-qa

Blueprint by Mozilla.ai for answering questions about structured documents

KalyanM45/DocGenius-Revolutionizing-PDFs-with-AI

This is a Python application that allows you to load a PDF and ask questions about it using...

alejandro-ao/langchain-ask-pdf

An AI-app that allows you to upload a PDF and ask questions about it. It uses OpenAI's LLMs to...

leehanchung/llm-pdf-qa-workshop

Introduction to LLM App Development Workshop: PDF Q&A App using OpenAI, Langchain, and Chainlit

pymupdf/langchain-pymupdf4llm

An integration package connecting PyMuPDF4LLM to LangChain
