PT-Perkasa-Pilar-Utama/ppu-pdf
PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvases.
Offers two extraction modes, `PdfReader` (mupdfjs-based) and `PdfReaderLegacy` (pdfjs-dist-based), with bounding-box and font metadata, plus LLM-oriented Token-Oriented Object Notation (TOON) encoding for structured data. Distinguishes scanned from digital PDFs and handles scanned documents by rendering pages to canvas and running OCR via `ppu-paddle-ocr`, which allows searchable PDFs to be rebuilt with invisible text overlays. Also provides line-grouping post-processing and configurable DPI/viewport resizing.
Available on npm.
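To make the line-grouping step mentioned above concrete, here is a minimal sketch of one common way to do it: text boxes whose vertical centers fall close together are treated as one line, then ordered left to right. This is not taken from ppu-pdf's source. The `TextBox` shape, the `groupIntoLines` and `linesToText` helpers, and the `tolerance` parameter are all hypothetical names chosen for illustration, and the library's actual types and algorithm may differ.

```typescript
// Hypothetical box shape for illustration; ppu-pdf's real types may differ.
interface TextBox {
  text: string;
  x: number; // left edge
  y: number; // top edge
  width: number;
  height: number;
}

// Internal accumulator: a line is anchored to the first box placed in it.
interface Line {
  center: number; // vertical center of the anchoring box
  height: number; // height of the anchoring box
  boxes: TextBox[];
}

// Group boxes into lines. A box joins the current line when its vertical
// center is within `tolerance` times the taller of (anchor box, this box)
// of the line's anchor center. Otherwise it starts a new line.
function groupIntoLines(boxes: TextBox[], tolerance = 0.5): TextBox[][] {
  const centerOf = (b: TextBox): number => b.y + b.height / 2;
  const sorted = [...boxes].sort((a, b) => centerOf(a) - centerOf(b));

  const lines: Line[] = [];
  for (const box of sorted) {
    const center = centerOf(box);
    const last = lines[lines.length - 1];
    if (
      last !== undefined &&
      Math.abs(center - last.center) <= Math.max(last.height, box.height) * tolerance
    ) {
      last.boxes.push(box);
    } else {
      lines.push({ center, height: box.height, boxes: [box] });
    }
  }

  // Order each line left to right.
  return lines.map((line) => [...line.boxes].sort((a, b) => a.x - b.x));
}

// Join each line's fragments with single spaces.
function linesToText(lines: TextBox[][]): string[] {
  return lines.map((line) => line.map((b) => b.text).join(" "));
}
```

Anchoring each line to its first box keeps the sketch short, but a long, slightly skewed line can drift away from that anchor. A running average of the line's centers would be a reasonable refinement for scanned input.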
Stars: 12
Forks: 2
Language: TypeScript
License: MIT
Category:
Last pushed: Mar 08, 2026
Commits (30d): 0
Dependencies: 4
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/PT-Perkasa-Pilar-Utama/ppu-pdf"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
joungminsung/OpenDocuments
Self-hosted open-source RAG platform that unifies organizational documents and answers natural...
osllmai/inDox
The Indox Ecosystem offers integrated AI tools for data workflows. Our four components...
pega2077/ai_file_manager
AIFileManager: an AI-based file manager. Auto-tag, classify, and RAG your documents, images, and videos.
Harry-027/DocuMind
A document based RAG application
kbrisso/byte-vision
Byte-Vision is a privacy-first document intelligence platform that transforms static documents...