eellak/glossAPI

Greek Dataset Production from PDF+

68 / 100 (Established)

Combines PyPDFium extraction with pluggable OCR backends (Docling, RapidOCR, DeepSeek) and Rust-accelerated text cleaning to convert academic PDFs into structured Markdown with configurable noise filtering and math enrichment. The modular `Corpus` API enables resumable, multi-stage processing (download → extract → clean → section → annotate → export) with Greek-optimized metadata and section classification, while mode-specific venv provisioning supports vanilla, GPU-accelerated, and large-model OCR pipelines.
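The resumable, multi-stage flow described above can be illustrated with a minimal sketch. This is not glossAPI's `Corpus` API: only the stage names come from the description, while `run_pipeline`, the `handlers` mapping, and the `pipeline_state.json` checkpoint file are hypothetical names chosen for illustration. The idea shown is simply that each completed stage is recorded on disk, so a rerun picks up at the first unfinished stage.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Stage names taken from the description; everything else here is hypothetical.
STAGES: List[str] = ["download", "extract", "clean", "section", "annotate", "export"]


def run_pipeline(workdir: Path, handlers: Dict[str, Callable[[Path], None]]) -> List[str]:
    """Run the stages in order, skipping any already recorded as done.

    Returns the names of the stages executed during this call.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    state_file = workdir / "pipeline_state.json"  # hypothetical checkpoint file
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    executed: List[str] = []
    for stage in STAGES:
        if stage in done:
            continue  # finished in an earlier run
        handlers[stage](workdir)  # an exception here leaves earlier stages recorded
        done.append(stage)
        # Persist after every stage so a crash costs at most one stage of work.
        state_file.write_text(json.dumps(done))
        executed.append(stage)
    return executed
```

If a handler raises partway through, the stages that finished before it stay recorded, and calling `run_pipeline` again with the same `workdir` starts from the stage that failed.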

128 stars. Available on PyPI.

Maintenance 13 / 25
Adoption 10 / 25
Maturity 25 / 25
Community 20 / 25


Stars: 128
Forks: 29
Language: Python
License: (not listed)
Category: pdf-qa-systems
Last pushed: Mar 10, 2026
Commits (30d): 0
Dependencies: 11

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/eellak/glossAPI"

Open to everyone: 100 requests/day with no key. A free key raises the limit to 1,000/day.
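The same request can be made from Python with the standard library. The sketch below reuses the URL from the curl command; the helper names `quality_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns a JSON body, which this page does not confirm. No API key is sent, since the page does not say how a key would be passed.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("eellak", "glossAPI")` issues the same GET as the curl command. The fields of the returned object are not documented here, so inspect the response before relying on any key.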