# Multimodal Vision-Language Voice AI Tools

Six multimodal vision-language tools are tracked, two of which score above 50 (Established tier). The highest-rated is canopyai/Orpheus-TTS at 51/100, with 6,000 stars.
Get all 6 projects as JSON:

```shell
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multimodal-vision-language&limit=20"
```
The API is open to everyone: 100 requests/day with no key required, or 1,000/day with a free key.
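The same query can be issued from Python. A minimal sketch, assuming only the endpoint and query parameters shown in the curl command above; the `build_quality_url` helper is a hypothetical name introduced here for illustration:

```python
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_quality_url(domain: str, subcategory: str, limit: int = 20) -> str:
    """Build the full query URL for the quality dataset endpoint."""
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = build_quality_url("voice-ai", "multimodal-vision-language")
print(url)
# The URL can then be fetched with urllib.request.urlopen(url) or requests.get(url).
```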
| # | Tool | Description | Score | Tier |
|---|---|---|---|---|
| 1 | canopyai/Orpheus-TTS | Towards Human-Sounding Speech | 51 | Established |
| 2 | lifeiteng/vall-e | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | | Established |
| 3 | Plachtaa/VALL-E-X | An open source implementation of Microsoft's VALL-E X zero-shot TTS model.... | | Emerging |
| 4 | umbertocappellazzo/Omni-AVSR | Official Pytorch implementation of "Omni-AVSR: Towards Unified Multimodal... | | Emerging |
| 5 | primepake/learnable-speech | This repo is text to speech with learnable audio encoder without alignment... | | Experimental |
| 6 | ExplainableML/ZerAuCap | [NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning... | | Experimental |