titipata/pubmed_parser
A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Extracts structured metadata and full-text content (abstracts, paragraphs, citations, images, tables) from PubMed OA and MEDLINE XML using `lxml` for efficient parsing into Python dictionaries. Supports both compressed and uncompressed XML files, NCBI E-utils API queries, and reference linking via PMIDs and DOIs. Built for biomedical text mining and NLP workflows with specialized parsers for figures, captions, MeSH terms, and chemical entities.
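To make the "XML into Python dictionaries" idea concrete, here is a minimal standard-library sketch. It is not the library's implementation (which is described above as using `lxml`) and does not call its API. The sample record, its PMID, and the `article_to_dict` helper are made up for illustration, and the output keys are assumptions rather than the library's actual field names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written MEDLINE-style record (not real data).
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>Example title</ArticleTitle>
      <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
    </Article>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName UI="D000001">Calcimycin</DescriptorName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""


def article_to_dict(xml_text):
    """Flatten one MEDLINE-style article element into a plain dict."""
    root = ET.fromstring(xml_text)
    citation = root.find("MedlineCitation")
    # Join each MeSH descriptor as "UI:name" so the value stays a flat string.
    mesh = [f"{d.get('UI')}:{d.text}" for d in citation.iter("DescriptorName")]
    return {
        "pmid": citation.findtext("PMID", default=""),
        "title": citation.findtext("Article/ArticleTitle", default=""),
        "abstract": citation.findtext("Article/Abstract/AbstractText", default=""),
        "mesh_terms": "; ".join(mesh),
    }


record = article_to_dict(SAMPLE)
print(record)
```

Flat dictionaries like this load directly into a list of records or a data frame, which is what makes the format convenient for text-mining pipelines.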
727 stars and 9,126 monthly downloads. No commits in the last 6 months. Available on PyPI.
Stars: 727
Forks: 178
Language: Python
License: MIT
Category:
Last pushed: Jul 31, 2025
Monthly downloads: 9,126
Commits (30d): 0
Dependencies: 5
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/titipata/pubmed_parser"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
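The same request can be sketched in Python using only the standard library. The path layout `quality/<category>/<owner>/<repo>` is inferred from the single URL above, and the JSON response format is an assumption. `quality_url` and `fetch_quality` are hypothetical helper names. How an API key is passed is not stated here, so the sketch omits it.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the path layout is inferred from one example."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(url, timeout=10):
    """Fetch and decode the response, assuming the endpoint returns JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


url = quality_url("nlp", "titipata", "pubmed_parser")
print(url)
# data = fetch_quality(url)  # uncomment to make the network call
```

With an unauthenticated limit of 100 requests/day, caching the response locally is a sensible choice for anything that runs more than occasionally.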
Related tools
nfflow/pubmedflow
Data Collection API for pubmed
greenelab/snorkeling
Extracting biomedical relationships from literature with Snorkel 🏊
KarelDO/BioDEX
BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance.
jind11/PubMed-PICO-Detection
PubMed PICO Element Detection Dataset
purplepotion/sadrat
Smart Adverse Drug Reaction Assessment Tools.