jack-tol/usda-food-data-pipeline
Code for the USDA Branded Food Dataset pipeline and the USDA Food Assistant. The project consolidates USDA FoodData Central data into a structured dataset and provides an interactive tool for conversational exploration of food items, nutrients, and ingredients.
The pipeline automates ingestion and transformation of 34 USDA FoodData Central CSV files into a normalized, ML-ready dataset. The Food Assistant uses semantic search via Pinecone vector indexing with multilingual-e5-large embeddings to enable conversational queries, combining retrieval with language generation to answer nutrition and ingredient questions. The cleaned dataset is published on HuggingFace Datasets with a live demo available on HuggingFace Spaces.
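The consolidation step described above can be illustrated with a minimal sketch. This is not the repository's code: the file contents, column names (`fdc_id`, `nutrient_id`, `unit_name`, and so on), and the helper names `read_rows` and `build_records` are assumptions made for illustration, and the rows are made-up sample data. It only shows the general idea of joining several related CSV tables into one record per food item.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for four of the CSV files.
# Column names are assumptions, not taken from the repository.
FOOD_CSV = """fdc_id,description
1001,EXAMPLE GRANOLA BAR
1002,EXAMPLE TOMATO SOUP
"""

BRANDED_FOOD_CSV = """fdc_id,brand_owner,ingredients
1001,Example Foods,"OATS, HONEY, ALMONDS"
1002,Sample Kitchen,"TOMATOES, WATER, SALT"
"""

NUTRIENT_CSV = """id,name,unit_name
1003,Protein,G
1008,Energy,KCAL
"""

FOOD_NUTRIENT_CSV = """fdc_id,nutrient_id,amount
1001,1003,7.5
1001,1008,410
1002,1008,45
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_records(food, branded, nutrient, food_nutrient):
    """Join the four tables into one record per food item (hypothetical helper)."""
    nutrients = {row["id"]: row for row in nutrient}
    brands = {row["fdc_id"]: row for row in branded}

    # Group nutrient amounts by food, labelled as "Name (UNIT)".
    amounts = defaultdict(dict)
    for row in food_nutrient:
        info = nutrients.get(row["nutrient_id"])
        if info is None:
            continue  # skip amounts that reference an unknown nutrient
        label = f'{info["name"]} ({info["unit_name"]})'
        amounts[row["fdc_id"]][label] = float(row["amount"])

    records = []
    for row in food:
        brand = brands.get(row["fdc_id"], {})
        records.append({
            "fdc_id": int(row["fdc_id"]),
            "description": row["description"],
            "brand_owner": brand.get("brand_owner", ""),
            "ingredients": brand.get("ingredients", ""),
            "nutrients": dict(amounts.get(row["fdc_id"], {})),
        })
    return records


records = build_records(
    read_rows(FOOD_CSV),
    read_rows(BRANDED_FOOD_CSV),
    read_rows(NUTRIENT_CSV),
    read_rows(FOOD_NUTRIENT_CSV),
)
```

Each entry in `records` carries the description, brand, ingredient text, and a nutrient dictionary for one food. A real pipeline over 34 files would likely use a dataframe library and handle missing or malformed rows more carefully.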
No commits in the last 6 months.
Stars: 7
Forks: 1
Language: Jupyter Notebook
License: MIT
Category:
Last pushed: Nov 07, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/data-engineering/jack-tol/usda-food-data-pipeline"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
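The same request can be made from Python. The sketch below uses only the standard library. The function names `build_quality_url` and `fetch_quality` are hypothetical, reading the three path segments as category, owner, and repository is an inference from the URL above, and treating the response as JSON is an assumption, since the response format is not documented here.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL (hypothetical helper; segment meaning is assumed)."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the listing data over HTTP GET.

    Assumption: the endpoint returns a JSON body. Adjust the parsing
    if the actual response uses a different format.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_quality("data-engineering", "jack-tol", "usda-food-data-pipeline")` issues the same GET request as the curl command. Keyless use is limited to 100 requests/day, so repeated calls are worth caching.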