- Release model code
- Release training and inference code
- Release pre-trained model weights
- No Pseudo-Alignment: Achieves stable alignment implicitly via Joint-DiT attention.
- Mel-VAE Codec: Efficient latent representation for faster training and high-fidelity reconstruction.
- Unified Architecture: A simple, end-to-end framework without complex multi-stage pipelines.
This project is built upon the excellent work of F5-TTS, MMAudio and Zip-Voice. We thank the authors for their open-source contributions.
If you find our work helpful for your research, please consider giving the repository a star and citing the paper:
@article{wang2025m3tts,
title={M3-TTS: Multi-modal DiT Alignment \& Mel-latent for Zero-shot High-fidelity Speech Synthesis},
author={Wang, Xiaopeng and Qiang, Chunyu and Fu, Ruibo and others},
journal={arXiv preprint arXiv:2512.04720},
year={2025}
}