allenai/papermage
library supporting NLP and CV research on scientific papers
Parses PDFs into multiple layers of entities (pages, tokens, sentences, sections, figures, tables, and so on) that are indexed against one another. A query can therefore move from an entity in one layer to the overlapping entities in another layer, even when the two do not nest cleanly. The library is built from modular `Parser`/`Rasterizer`/`Predictor` components: parsers wrap PDFPlumber for text extraction, rasterizers wrap PDF2Image for page rendering, and predictors apply custom ML models for semantic segmentation. Documents serialize to JSON, which keeps pipelines that combine heuristic and learned extraction reproducible.
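The cross-layer idea can be illustrated with a small, self-contained sketch. This is not papermage's code or API: `Entity`, `ToyDocument`, `annotate`, `find`, and the sample text are hypothetical names chosen for illustration, and the sketch assumes entities are plain character spans matched by overlap. The library's own indexing may work differently.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A character span [start, end) over the document text (hypothetical type)."""
    start: int
    end: int


@dataclass
class ToyDocument:
    """Text plus named layers of entities (hypothetical, not papermage's Document)."""
    text: str
    layers: dict = field(default_factory=dict)

    def annotate(self, name, spans):
        """Add or replace a layer from (start, end) pairs."""
        self.layers[name] = [Entity(start, end) for start, end in spans]

    def text_of(self, entity):
        """Return the text covered by an entity."""
        return self.text[entity.start:entity.end]

    def find(self, entity, layer):
        """Return the entities in `layer` whose spans overlap `entity`."""
        return [
            other for other in self.layers[layer]
            if other.start < entity.end and entity.start < other.end
        ]

    def to_json(self):
        """Convert to plain dicts and lists that json.dumps accepts."""
        return {
            "text": self.text,
            "layers": {
                name: [[e.start, e.end] for e in entities]
                for name, entities in self.layers.items()
            },
        }

    @classmethod
    def from_json(cls, data):
        """Rebuild a document from the structure produced by to_json."""
        doc = cls(text=data["text"])
        for name, spans in data["layers"].items():
            doc.annotate(name, spans)
        return doc


doc = ToyDocument(text="Papers have pages. Sentences can cross them.")
doc.annotate("sentences", [(0, 18), (19, 44)])
doc.annotate("pages", [(0, 29), (29, 44)])  # pretend a page break falls at offset 29

first, second = doc.layers["sentences"]
pages_of_first = doc.find(first, "pages")    # overlaps one page
pages_of_second = doc.find(second, "pages")  # overlaps both pages

# Round-trip through a JSON string to check that nothing is lost.
restored = ToyDocument.from_json(json.loads(json.dumps(doc.to_json())))
```

In this toy model the second sentence is found on both pages even though neither page contains it entirely, which is the kind of non-nested relationship the description refers to. The JSON round trip stores only text and span offsets, so the rebuilt document compares equal to the original.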
791 stars and 313 monthly downloads. No commits in the last 6 months. Available on PyPI.
Stars: 791
Forks: 64
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 08, 2024
Monthly downloads: 313
Commits (30d): 0
Dependencies: 10
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/allenai/papermage"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Related frameworks
neuml/paperai: 📄 🤖 AI for medical and scientific papers
asreview/asreview-makita: Workflow generator for simulation studies using the command line interface of ASReview LAB
supriya46788/Research-Paper-Organizer: Open-source beginner-friendly project
Tavris1/AI-Toolkit-Easy-Install: One-click Portable Windows installation of 'AI-Toolkit by Ostris'
alibaba/AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab