datachain-ai/datachain
Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images
Provides a Python dataframe-like API with vectorized operations and delta/retry processing for efficient incremental workflows on unstructured data stored in S3, GCP, Azure, or local filesystems. Integrates with LLM APIs and ML frameworks (PyTorch, TensorFlow) for enrichment and model application. Data stays in its original storage and is referenced rather than copied, while metadata is kept in an internal queryable database.
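To make the "delta/retry processing" idea concrete, here is a minimal plain-Python sketch of the general technique. It does not use DataChain and is not DataChain's API: the function `process_delta`, the path-to-version dictionaries, and the handler callback are all hypothetical names chosen for illustration. The assumption is that each file has some version marker (for example an ETag) that changes when the file changes.

```python
from typing import Callable, Dict


def process_delta(
    current: Dict[str, str],
    processed: Dict[str, str],
    handler: Callable[[str], None],
) -> Dict[str, str]:
    """Run handler only on paths that are new or whose version changed.

    current   -- path -> version marker for the files present now (hypothetical)
    processed -- path -> version marker recorded by earlier successful runs
    Returns the updated record of successfully processed files.
    """
    done = dict(processed)
    for path, version in sorted(current.items()):
        if processed.get(path) == version:
            continue  # unchanged since the last run: skip (the "delta" part)
        try:
            handler(path)
        except Exception:
            # Do not record the failure, so the next run picks this file up
            # again (the "retry" part).
            continue
        done[path] = version
    return done
```

Calling `process_delta` again with the returned record and a fresh listing touches only files that were added, changed, or failed previously, which is the point of an incremental workflow.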
2,729 stars and 17,066 monthly downloads. Used by 1 other package. Actively maintained with 40 commits in the last 30 days. Available on PyPI.
Stars: 2,729
Forks: 136
Language: Python
License: Apache-2.0
Category:
Last pushed: Mar 12, 2026
Monthly downloads: 17,066
Commits (30d): 40
Dependencies: 36
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/mlops/datachain-ai/datachain"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
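The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the reading of the URL path as category/owner/repository is inferred from the curl example above, the helper names `quality_url` and `fetch_quality` are hypothetical, and the response format is not documented here, so the body is returned as raw text.

```python
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; everything after it is assumed
# to be <category>/<owner>/<repo>.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the endpoint and return the response body as text.

    No API key is sent, so the keyless daily limit applies.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For this repository the call would be `fetch_quality("mlops", "datachain-ai", "datachain")`. Given the 100 requests/day keyless limit, caching the result locally is a reasonable choice for repeated use.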