PyTorch implementation of the paper "M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis"
$\Large \boldsymbol{\mathsf{\color{#6366f1}M\color{#a855f7}3\color{#ec4899}\text{-}TTS}}: \mathsf{\color{#de2910}M\color{black}\text{ulti-}\color{#de2910}M\color{black}\text{odal\ DiT\ Alignment}\ \color{black}\&\ \color{#de2910}M\color{black}\text{el-latent}}$

arXiv Demo Page

πŸ“… Roadmap

  • Release model code
  • Release training and inference code
  • Release pre-trained model weights

πŸ”₯ Key Features

  • No Pseudo-Alignment: Achieves stable alignment implicitly via Joint-DiT attention.
  • Mel-VAE Codec: Efficient latent representation for faster training and high-fidelity reconstruction.
  • Unified Architecture: A simple, end-to-end framework without complex multi-stage pipelines.
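
To make the first feature concrete, here is a minimal sketch (not the authors' released code, which is still on the roadmap) of the Joint-DiT attention idea: text tokens and mel-latent tokens are concatenated into a single sequence and processed by ordinary self-attention, so text-to-speech alignment can emerge implicitly from the attention weights rather than from an explicit pseudo-alignment (e.g. duration) module. All class names and dimensions below are illustrative assumptions.

```python
# Illustrative sketch of joint multi-modal attention; module names and
# dimensions are assumptions, not the M3-TTS reference implementation.
import torch
import torch.nn as nn


class JointAttentionBlock(nn.Module):
    """One block that attends jointly over text and mel-latent tokens."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, mel: torch.Tensor):
        # Concatenate both modalities into one sequence: (B, T_text + T_mel, D).
        x = torch.cat([text, mel], dim=1)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)  # full joint self-attention across modalities
        x = x + out                  # residual connection
        # Split the updated sequence back into per-modality streams.
        return x[:, : text.size(1)], x[:, text.size(1):]


# Toy shapes: batch of 2, 16 text tokens, 64 mel-latent frames, width 128.
block = JointAttentionBlock(dim=128, n_heads=8)
text = torch.randn(2, 16, 128)
mel = torch.randn(2, 64, 128)
t_out, m_out = block(text, mel)
print(t_out.shape, m_out.shape)
```

Because the two token streams share one attention map, cross-modal alignment is learned end-to-end, which is what lets the framework avoid a separate alignment stage.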

πŸ™Œ Acknowledgements

This project is built upon the excellent work of F5-TTS, MMAudio and Zip-Voice. We thank the authors for their open-source contributions.

πŸ“ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation πŸ“:

@article{wang2025m3tts,
  title={M3-TTS: Multi-modal DiT Alignment \& Mel-latent for Zero-shot High-fidelity Speech Synthesis},
  author={Wang, Xiaopeng and Qiang, Chunyu and Fu, Ruibo and others},
  journal={arXiv preprint arXiv:2512.04720},
  year={2025}
}
