rom1504/img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20 hours on one machine.
Supports flexible output formats (WebDataset, Parquet, TFRecord) and preserves captions for multimodal datasets, with configurable resizing strategies and metadata tracking via JSON/Parquet sidecars. Uses a multiprocess, multithreaded architecture for scalability, respects machine-readable opt-out directives (X-Robots-Tag headers), and integrates with PySpark for distributed processing across clusters.
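As a rough illustration of how the options described above map onto the Python entry point, here is a minimal sketch. The helper names `build_download_kwargs` and `run_download`, the file names, the column names `URL` and `TEXT`, and all numeric values are assumptions chosen for the example. The keyword names follow the img2dataset README as best understood here and should be checked against the installed version.

```python
def build_download_kwargs(url_list, output_folder, use_spark=False):
    """Collect keyword arguments for img2dataset.download.

    All values are illustrative assumptions, not recommended settings.
    """
    return {
        "url_list": url_list,              # file or folder listing the image URLs
        "input_format": "parquet",         # assumes the URL list is a Parquet file
        "url_col": "URL",                  # hypothetical column holding the URLs
        "caption_col": "TEXT",             # hypothetical column holding the captions
        "output_folder": output_folder,
        "output_format": "webdataset",     # tar shards plus metadata sidecars
        "image_size": 256,
        "resize_mode": "border",           # one of the resizing strategies
        "processes_count": 16,             # worker processes
        "thread_count": 64,                # download threads per process
        # Images whose X-Robots-Tag header carries one of these are skipped.
        "disallowed_header_directives": ["noai", "noimageai", "noindex", "noimageindex"],
        "distributor": "pyspark" if use_spark else "multiprocessing",
    }


def run_download(kwargs):
    """Start the download; needs img2dataset installed and network access."""
    from img2dataset import download  # imported lazily so the sketch loads without it

    download(**kwargs)
```

Calling `run_download(build_download_kwargs("urls.parquet", "dataset_out"))` would start a single-machine run. Passing `use_spark=True` selects the PySpark distributor instead, which also needs a working Spark environment.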
4,380 stars and 88,786 monthly downloads. Used by 1 other package. Available on PyPI.
Stars: 4,380
Forks: 372
Language: Python
License: MIT
Category:
Last pushed: Oct 19, 2025
Monthly downloads: 88,786
Commits (30d): 0
Dependencies: 11
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/rom1504/img2dataset"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
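The same request can be issued from Python using only the standard library. This is a minimal sketch: the path layout (category, owner, repository) is inferred from the curl URL above, the helper names `quality_url` and `fetch_quality` are hypothetical, and the response is only assumed to be JSON, so the raw text is returned if parsing fails.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL; the segment order is inferred from the curl example."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(url, timeout=10.0):
    """Fetch the URL and return parsed JSON, or the raw text if it is not JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body
```

For this project, `quality_url("ml-frameworks", "rom1504", "img2dataset")` reproduces the URL in the curl command, and passing the result to `fetch_quality` performs the request. Each call counts toward the daily request limit.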
Related frameworks
devrimcavusoglu/pybboxes
Lightweight toolkit for bounding boxes providing conversion between bounding box types and...
PyRetri/PyRetri
Open source deep learning based unsupervised image retrieval toolbox built on PyTorch🔥
Particle1904/DatasetHelpers
Dataset Helper program to automatically select, rescale and tag Datasets (composed of image and...
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
haltakov/natural-language-image-search
Search photos on Unsplash using natural language