eellak/glossAPI
Greek Dataset Production from PDF+
Combines PyPDFium extraction with pluggable OCR backends (Docling, RapidOCR, DeepSeek) and Rust-accelerated text cleaning to convert academic PDFs into structured Markdown with configurable noise filtering and math enrichment. The modular `Corpus` API enables resumable, multi-stage processing (download → extract → clean → section → annotate → export) with Greek-optimized metadata and section classification, while mode-specific venv provisioning supports vanilla, GPU-accelerated, and large-model OCR pipelines.
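The resumable, multi-stage flow described above can be illustrated with a small generic sketch. This is not glossAPI's `Corpus` API: the function `run_pipeline`, the `pipeline_state.json` checkpoint file, and the handler signature are all hypothetical names chosen for illustration. Only the stage names come from the description. The idea is that each completed stage is recorded on disk, so a rerun after a failure picks up at the first unfinished stage.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Stage order taken from the description; everything else here is an assumption.
STAGES = ["download", "extract", "clean", "section", "annotate", "export"]


def run_pipeline(workdir: Path, handlers: Dict[str, Callable[[Path], None]]) -> List[str]:
    """Run stages in order, skipping any already recorded as done.

    Returns the list of stages actually executed during this call.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    state_file = workdir / "pipeline_state.json"  # hypothetical checkpoint file

    # Load the list of completed stages from a previous run, if any.
    done: List[str] = json.loads(state_file.read_text()) if state_file.exists() else []

    executed: List[str] = []
    for stage in STAGES:
        if stage in done:
            continue  # finished in an earlier run, so skip it
        handlers[stage](workdir)  # may raise; the checkpoint then stays at the last success
        done.append(stage)
        state_file.write_text(json.dumps(done))  # persist progress after each stage
        executed.append(stage)
    return executed
```

If a handler raises partway through, the checkpoint still lists every stage that finished before it, so calling `run_pipeline` again with the same `workdir` resumes from the failed stage rather than starting over.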
128 stars. Available on PyPI.
Stars: 128
Forks: 29
Language: Python
License: not listed
Category:
Last pushed: Mar 10, 2026
Commits (30d): 0
Dependencies: 11
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/eellak/glossAPI"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
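The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `quality_url` and `fetch_quality` are hypothetical, and because the response format is not described on this page, the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for a given GitHub owner and repository."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body; the schema is not assumed here."""
    with urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (performs a network request and counts against the daily limit):
# print(fetch_quality("eellak", "glossAPI"))
```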
Related tools
mozilla-ai/structured-qa
Blueprint by Mozilla.ai for answering questions about structured documents
KalyanM45/DocGenius-Revolutionizing-PDFs-with-AI
This is a Python application that allows you to load a PDF and ask questions about it using...
alejandro-ao/langchain-ask-pdf
An AI-app that allows you to upload a PDF and ask questions about it. It uses OpenAI's LLMs to...
leehanchung/llm-pdf-qa-workshop
Introduction to LLM App Development Workshop: PDF Q&A App using OpenAI, Langchain, and Chainlit
pymupdf/langchain-pymupdf4llm
An integration package connecting PyMuPDF4LLM to LangChain