rom1504/img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20 hours on one machine.

71 / 100, Verified

Supports flexible output formats (WebDataset, Parquet, TFRecord) and preserves captions for multimodal datasets, with configurable resizing strategies and metadata tracking via JSON/Parquet sidecars. Uses a multiprocess, multithreaded architecture for scalability, respects machine-readable opt-out directives (X-Robots-Tag headers), and integrates with PySpark for distributed processing across clusters.
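As a rough illustration of the options described above, the sketch below assembles arguments for the package's Python entry point, `img2dataset.download`. The parameter names are the ones documented in the project README as best recalled here, so check them against the installed version. The values (256 px, 16 processes, 64 threads) and the helper names `build_download_args` and `run_download` are illustrative assumptions, not recommendations from the project.

```python
def build_download_args(url_list, output_folder, caption_col=None):
    """Assemble keyword arguments for img2dataset.download (illustrative values)."""
    args = {
        "url_list": url_list,              # file listing the image URLs
        "output_folder": output_folder,
        "input_format": "parquet" if caption_col else "txt",
        "output_format": "webdataset",     # "parquet" and "tfrecord" are also listed above
        "image_size": 256,                 # assumed target size
        "resize_mode": "border",           # one of the configurable resizing strategies
        "processes_count": 16,             # assumed; tune to the machine's cores
        "thread_count": 64,                # assumed; download threads
        "distributor": "multiprocessing",  # "pyspark" for cluster runs
    }
    if caption_col:
        # Keep captions alongside images for multimodal datasets.
        args["url_col"] = "url"            # assumed column name in the input file
        args["caption_col"] = caption_col
    return args


def run_download(args):
    """Run the download; imported lazily so the helper above works without the package."""
    from img2dataset import download
    download(**args)
```

Calling `run_download(build_download_args("urls.parquet", "out", caption_col="caption"))` would start a real download, so try it on a small URL list first.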

4,380 stars and 88,786 monthly downloads. Used by 1 other package. Available on PyPI.

Maintenance 6 / 25
Adoption 21 / 25
Maturity 25 / 25
Community 19 / 25

Stars: 4,380
Forks: 372
Language: Python
License: MIT
Last pushed: Oct 19, 2025
Monthly downloads: 88,786
Commits (30d): 0
Dependencies: 11
Reverse dependents: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/rom1504/img2dataset"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
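The same request can be made from Python with only the standard library. This is a minimal sketch: the base URL and path come from the curl command above, but the idea that other packages follow the same category/owner/repo pattern, and that the response body is JSON, are assumptions. The function names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (path pattern assumed from the single example)."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the quality record; assumes the endpoint returns JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch_quality("ml-frameworks", "rom1504", "img2dataset")` issues one request, which counts against the daily limit.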