allenai/papermage

A library supporting NLP and CV research on scientific papers.

58 / 100 (Established)

Provides multi-layered document parsing that segments PDFs into dynamically indexed entity types (pages, tokens, sentences, sections, figures, tables, etc.). Layers can be queried against one another, so entities in one layer can be reached from another even when their boundaries do not nest cleanly. Built on modular `Parser`/`Rasterizer`/`Predictor` components that wrap pdfplumber for text extraction, pdf2image for rendering, and custom ML models for semantic segmentation. Documents serialize to JSON, which supports reproducible pipelines that combine heuristic and learned extraction.
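The cross-layer idea described above can be illustrated with a minimal, self-contained sketch. It does not use papermage's own API: the `Entity` and `Document` classes, the `annotate`/`find`/`to_json` methods, and the sample text are all hypothetical names chosen for illustration. The sketch assumes each entity is a character span and that a cross-layer query reduces to a span-overlap test.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity: a character span [start, end) in the document text."""
    start: int
    end: int

    def overlaps(self, other: "Entity") -> bool:
        return self.start < other.end and other.start < self.end


@dataclass
class Document:
    """Hypothetical container holding the text plus named layers of entities."""
    text: str
    layers: dict = field(default_factory=dict)

    def annotate(self, name: str, spans) -> list:
        # Register a layer of entities under `name`.
        self.layers[name] = [Entity(s, e) for s, e in spans]
        return self.layers[name]

    def find(self, query: Entity, layer: str) -> list:
        # Cross-layer query: every entity in `layer` that overlaps `query`.
        return [e for e in self.layers[layer] if e.overlaps(query)]

    def text_of(self, entity: Entity) -> str:
        return self.text[entity.start:entity.end]

    def to_json(self) -> str:
        spans = {n: [[e.start, e.end] for e in es] for n, es in self.layers.items()}
        return json.dumps({"text": self.text, "layers": spans})


doc = Document("PaperMage parses PDFs. It builds layers.")
tokens = doc.annotate("tokens", (m.span() for m in re.finditer(r"\S+", doc.text)))
sentences = doc.annotate(
    "sentences", (m.span() for m in re.finditer(r"[^.\s][^.]*\.", doc.text))
)
# Visual lines need not respect sentence boundaries: the second line starts
# in the middle of the first sentence.
lines = doc.annotate("lines", [(0, 16), (17, 40)])

first_sentence_tokens = [doc.text_of(t) for t in doc.find(sentences[0], "tokens")]
second_line_sentences = [doc.text_of(s) for s in doc.find(lines[1], "sentences")]
print(first_sentence_tokens)
print(second_line_sentences)
```

Because the second line overlaps both sentences, the query returns both, even though neither sentence is nested inside that line. Storing layers as plain spans also keeps the JSON form small and easy to reload.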

791 stars and 313 monthly downloads. No commits in the last 6 months. Available on PyPI.

Stale 6m
Maintenance 0 / 25
Adoption 16 / 25
Maturity 25 / 25
Community 17 / 25
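The four sub-scores listed above add up to the overall score. The short check below makes that arithmetic explicit. The page does not state its scoring formula here, so treating the total as a plain sum of four 25-point categories is an assumption based only on the numbers shown.

```python
# Sub-scores as listed on this page; each category is out of 25.
sub_scores = {"Maintenance": 0, "Adoption": 16, "Maturity": 25, "Community": 17}

# Assumption: the overall score is the unweighted sum of the categories.
total = sum(sub_scores.values())
maximum = 25 * len(sub_scores)
print(f"{total} / {maximum}")  # prints "58 / 100"
```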


Stars: 791
Forks: 64
Language: Python
License: Apache-2.0
Last pushed: Nov 08, 2024
Monthly downloads: 313
Commits (30d): 0
Dependencies: 10

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/allenai/papermage"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
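For scripted use, the same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the `category/owner/repo` path layout is inferred from the single URL shown above, the response is assumed to be JSON, and `build_url` and `fetch_quality` are hypothetical helper names. No response fields are assumed.

```python
import json
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    # Path layout inferred from the example URL; other layouts are not confirmed.
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    # Assumes the endpoint returns a JSON body; raises on HTTP or network errors.
    with urllib.request.urlopen(build_url(category, owner, repo), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    data = fetch_quality("ml-frameworks", "allenai", "papermage")
    print(json.dumps(data, indent=2))
```

Unauthenticated calls count against the 100 requests/day limit, so caching the result locally is sensible when polling several repositories.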