Multimodal Vision-Language Voice AI Tools

Six multimodal vision-language tools are tracked; two score above 50 (the established tier). The highest-rated is canopyai/Orpheus-TTS at 51/100 with 6,000 stars.

Get all 6 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=voice-ai&subcategory=multimodal-vision-language&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
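If you want to script the endpoint rather than call curl by hand, here is a minimal Python sketch. The query parameters match the curl example above; the response field names (`name`, `score`) are assumptions about the JSON payload, not a documented schema, so check the actual response before relying on them.

```python
from urllib.parse import urlencode

BASE = "https://pt-edge.onrender.com/api/v1/datasets/quality"

def build_url(domain: str, subcategory: str, limit: int = 20) -> str:
    # Assemble the same query string as the curl example above.
    params = {"domain": domain, "subcategory": subcategory, "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def established(projects: list[dict], threshold: int = 50) -> list[dict]:
    # Keep only projects above the "established" threshold, highest
    # score first. The "score"/"name" keys are an assumed payload
    # shape, mirroring the Score column in the table below.
    return sorted(
        (p for p in projects if p["score"] > threshold),
        key=lambda p: p["score"],
        reverse=True,
    )

if __name__ == "__main__":
    print(build_url("voice-ai", "multimodal-vision-language"))
    # Sample data shaped like the table below (assumed schema):
    sample = [
        {"name": "Plachtaa/VALL-E-X", "score": 46},
        {"name": "canopyai/Orpheus-TTS", "score": 51},
    ]
    print([p["name"] for p in established(sample)])
```

Fetching the URL (e.g. with `urllib.request.urlopen`) and passing the decoded JSON list into `established` would reproduce the "score above 50" split described above, assuming that payload shape holds.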

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | canopyai/Orpheus-TTS | Towards Human-Sounding Speech | 51 | Established |
| 2 | lifeiteng/vall-e | PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo... | 51 | Established |
| 3 | Plachtaa/VALL-E-X | An open source implementation of Microsoft's VALL-E X zero-shot TTS model.... | 46 | Emerging |
| 4 | umbertocappellazzo/Omni-AVSR | Official Pytorch implementation of "Omni-AVSR: Towards Unified Multimodal... | 33 | Emerging |
| 5 | primepake/learnable-speech | This repo is text to speech with learnable audio encoder without alignment... | 29 | Experimental |
| 6 | ExplainableML/ZerAuCap | [NeurIPS 2023 - ML for Audio Workshop (Oral)] Zero-shot audio captioning... | 12 | Experimental |

Comparisons in this category