titipata/pubmed_parser

A Python parser for the PubMed Open-Access XML subset and MEDLINE XML dataset

Score: 71 / 100 (Verified)

Extracts structured metadata and full-text content (abstracts, paragraphs, citations, images, tables) from PubMed OA and MEDLINE XML using `lxml` for efficient parsing into Python dictionaries. Supports both compressed and uncompressed XML files, NCBI E-utils API queries, and reference linking via PMIDs and DOIs. Built for biomedical text mining and NLP workflows with specialized parsers for figures, captions, MeSH terms, and chemical entities.
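To make the "XML into Python dictionaries" idea concrete, here is a minimal sketch of that style of extraction. It is not the library's own code: it uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml`, and the sample record, the function name `parse_medline_records`, and the output keys are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written record shaped like a MEDLINE citation.
SAMPLE_MEDLINE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText>An example abstract.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000001">Example Term</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_medline_records(xml_text):
    """Return one dictionary per PubmedArticle element (illustrative only)."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue  # skip malformed entries rather than fail
        records.append({
            "pmid": citation.findtext("PMID", default=""),
            "title": citation.findtext("Article/ArticleTitle", default=""),
            # An abstract may be split into several AbstractText sections.
            "abstract": " ".join(
                (node.text or "").strip()
                for node in citation.iterfind("Article/Abstract/AbstractText")
            ),
            # Keep the MeSH descriptor ID together with its label.
            "mesh_terms": [
                f"{node.get('UI')}:{node.text}"
                for node in citation.iterfind(
                    "MeshHeadingList/MeshHeading/DescriptorName"
                )
            ],
        })
    return records


records = parse_medline_records(SAMPLE_MEDLINE_XML)
```

With the package itself installed, the project's README describes comparable entry points such as `pubmed_parser.parse_medline_xml(path)`, which take a file path rather than a string; check the project documentation for the exact arguments and returned keys.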

727 stars and 9,126 monthly downloads. No commits in the last 6 months. Available on PyPI.

Stale 6m
Maintenance 2 / 25
Adoption 19 / 25
Maturity 25 / 25
Community 25 / 25


Stars: 727
Forks: 178
Language: Python
License: MIT
Last pushed: Jul 31, 2025
Monthly downloads: 9,126
Commits (30d): 0
Dependencies: 5

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/titipata/pubmed_parser"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
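The same request can be made from Python. This is a rough sketch using only the standard library; the helper names `build_quality_url` and `fetch_quality` are hypothetical, reading the `nlp` path segment as a category is an assumption, and so is the expectation that the endpoint returns a JSON body.

```python
import json
import urllib.request

# Base path taken from the curl example; everything after it is
# category / owner / repository (the "category" reading is an assumption).
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the quality-data URL for one repository."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch and decode the response (assumes the body is JSON)."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "titipata", "pubmed_parser")` issues the same keyless request as the curl command, so it counts toward the 100 requests/day limit.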