Multimodal Vision-Language Voice AI Tools

Six multimodal vision-language tools are tracked; two score above 50 (the established tier). The highest-rated is canopyai/Orpheus-TTS at 51/100 with 6,000 stars.

Get all 6 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multimodal-vision-language&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
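If you want to script the endpoint rather than call curl by hand, here is a minimal Python sketch. The query parameters match the curl example above; the response field names (`name`, `score`) are assumptions about the JSON payload, not a documented schema, so check the actual response before relying on them.

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    # Assemble the same query string as the curl example above.
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def established(projects: list[dict], threshold: int = 50) -> list[dict]:
    # Keep only projects above the "established" threshold, highest
    # score first. The "score"/"name" keys are an assumed payload
    # shape, mirroring the Score column in the table below.
    return sorted(
        (p for p in projects if p["score"] > threshold),
        key=lambda p: p["score"],
        reverse=True,
    )

if __name__ == "__main__":
    print(build_url("voice-ai", "multimodal-vision-language"))
    # Sample data shaped like the table below (assumed schema):
    sample = [
        {"name": "Plachtaa/VALL-E-X", "score": 46},
        {"name": "canopyai/Orpheus-TTS", "score": 51},
    ]
    print([p["name"] for p in established(sample)])
```

Fetching the URL (e.g. with `urllib.request.urlopen`) and passing the decoded JSON list into `established` would reproduce the "score above 50" split described above, assuming that payload shape holds.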

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | canopyai/Orpheus-TTS | Towards Human-Sounding Speech | 51 | Established |
| 2 | lifeiteng/vall-e | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | 51 | Established |
| 3 | Plachtaa/VALL-E-X | An open source implementation of Microsoft's VALL-E X zero-shot TTS model.... | 46 | Emerging |
| 4 | umbertocappellazzo/Omni-AVSR | Official Pytorch implementation of "Omni-AVSR: Towards Unified Multimodal... | 33 | Emerging |
| 5 | primepake/learnable-speech | This repo is text to speech with learnable audio encoder without alignment... | 29 | Experimental |
| 6 | ExplainableML/ZerAuCap | [NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning... | 12 | Experimental |

Comparisons in this category