allenai/papermage

A library supporting NLP and CV research on scientific papers.

Established

papermage parses a PDF into layers of entities (pages, tokens, sentences, sections, figures, tables, and so on) that are indexed against one another, so an entity in one layer can be queried for the entities it overlaps in any other layer, even when the two do not nest cleanly. It is built from modular `Parser`, `Rasterizer`, and `Predictor` components, which wrap PDFPlumber for text extraction, PDF2Image for rendering, and custom ML models for semantic segmentation. Documents serialize to JSON, which supports reproducible pipelines that combine heuristic and learned extraction.
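The idea of cross-layer queries over entities that do not nest cleanly can be illustrated with a small stand-alone sketch. It does not use papermage's own API: the `Entity` and `Document` classes, the `annotate`, `find`, and `text_of` methods, and the character-offset representation are all hypothetical names chosen for illustration. It assumes that each entity is a half-open character span and that two entities are related when their spans overlap.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity: a half-open character span [start, end)."""
    start: int
    end: int

    def overlaps(self, other: "Entity") -> bool:
        return self.start < other.end and other.start < self.end


class Document:
    """Hypothetical layered document, not papermage's Document class."""

    def __init__(self, text: str):
        self.text = text
        self.layers: dict[str, list[Entity]] = {}

    def annotate(self, name: str, spans) -> None:
        # Store a named layer as a list of entities.
        self.layers[name] = [Entity(s, e) for s, e in spans]

    def find(self, query: Entity, layer: str) -> list[Entity]:
        # Cross-layer query: every entity in `layer` that overlaps `query`.
        return [e for e in self.layers[layer] if e.overlaps(query)]

    def text_of(self, entity: Entity) -> str:
        return self.text[entity.start:entity.end]

    def to_json(self) -> str:
        layers = {n: [[e.start, e.end] for e in es]
                  for n, es in self.layers.items()}
        return json.dumps({"text": self.text, "layers": layers})

    @classmethod
    def from_json(cls, payload: str) -> "Document":
        data = json.loads(payload)
        doc = cls(data["text"])
        for name, spans in data["layers"].items():
            doc.annotate(name, spans)
        return doc


text = "Deep nets work. We show why."
doc = Document(text)
doc.annotate("tokens", [m.span() for m in re.finditer(r"\S+", text)])
doc.annotate("sentences", [(0, 15), (16, 28)])
# Visual lines: the first line ends mid-sentence, so lines and
# sentences overlap without either one containing the other.
doc.annotate("lines", [(0, 18), (19, 28)])

first_line = doc.layers["lines"][0]
second_sentence = doc.layers["sentences"][1]

# The first line touches both sentences.
print([doc.text_of(s) for s in doc.find(first_line, "sentences")])
# -> ['Deep nets work.', 'We show why.']

# Tokens of the second sentence, regardless of which line they sit on.
print([doc.text_of(t) for t in doc.find(second_sentence, "tokens")])
# -> ['We', 'show', 'why.']

# A JSON round trip preserves every layer.
restored = Document.from_json(doc.to_json())
print(restored.layers == doc.layers)  # -> True
```

Overlap-based lookup is what lets a line return both sentences it touches; a strictly nested tree (page, then line, then token) could not express that relationship. The linear scan in `find` is only for brevity, and a real implementation would likely use an interval index.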

791 stars and 313 monthly downloads. No commits in the last 6 months. Available on PyPI.

Stale 6m

Maintenance 0 / 25

Adoption 16 / 25

Maturity 25 / 25

Community 17 / 25

Stars 791

Forks

Language Python

License Apache-2.0

Related frameworks

neuml/paperai

📄 🤖 AI for medical and scientific papers

asreview/asreview-makita

Workflow generator for simulation studies using the command line interface of ASReview LAB

supriya46788/Research-Paper-Organizer

Open-source beginner-friendly project

Tavris1/AI-Toolkit-Easy-Install

One-click Portable Windows installation of 'AI-Toolkit by Ostris'

alibaba/AliceMind

ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab
