Dia is a 1.6B parameter text to speech model created by Nari Labs.

Dia **directly generates highly realistic dialogue from a transcript**. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.

To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on [Hugging Face](https://huggingface.co/nari-labs/Dia-1.6B). The model only supports English generation at the moment.

We also provide a [demo page](https://yummy-fir-7a4.notion.site/dia) comparing our model to [ElevenLabs Studio](https://elevenlabs.io/studio) and [Sesame CSM-1B](https://github.com/SesameAILabs/csm).

- (Update) We have a ZeroGPU Space running! Try it now [here](https://huggingface.co/spaces/nari-labs/Dia-1.6B). Thanks to the HF team for the support :)
- Join our [Discord server](https://discord.gg/pgdB5YRe) for community support and access to new features.
- Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the [waitlist](https://tally.so/r/meokbo) for early access.

Running the commands below will open a Gradio UI that you can work with.
```bash
git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py
```
or if you do not have `uv` pre-installed:
```bash
git clone https://github.com/nari-labs/dia.git
cd dia
pip install uv
uv run app.py
```
Note that the model was not fine-tuned on a specific voice, so you will get a different voice every time you run it. You can keep speaker consistency by either adding an audio prompt (a guide is coming VERY soon - for now, try the second example on Gradio) or by fixing the seed.
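The seed option can be illustrated with a short, self-contained sketch. The sampler below is a hypothetical stand-in, not Dia's generation code; with Dia itself you would seed the framework RNG (for example `torch.manual_seed`) before generating.

```python
import random

# Hypothetical stand-in for a stochastic sampler (NOT Dia's code): it only
# illustrates that fixing the seed makes random choices reproducible.
def sample_with_seed(seed: int, n_steps: int = 5) -> list[int]:
    rng = random.Random(seed)  # dedicated RNG; avoids touching global state
    return [rng.randrange(256) for _ in range(n_steps)]

# Same seed, same draws -- which is what keeps the generated voice stable.
assert sample_with_seed(42) == sample_with_seed(42)
```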
## Features
- Generate dialogue via `[S1]` and `[S2]` tags.
- Generate non-verbal sounds like `(laughs)`, `(coughs)`, etc.
- Voice cloning. See [`example/voice_clone.py`](example/voice_clone.py) for more information.
  - In the Hugging Face space, you can upload the audio you want to clone and place its transcript before your script. Make sure the transcript follows the required format. The model will then output only the content of your script.
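As a rough sketch of the transcript format described above, the helper below (hypothetical, not part of the `dia` package) joins tagged turns into one script string, with non-verbal cues written inline in parentheses.

```python
# Hypothetical helper (not part of the dia package) that formats
# (speaker, line) turns into the [S1]/[S2]-tagged transcript style above.
def build_transcript(turns: list[tuple[str, str]]) -> str:
    return " ".join(f"[{speaker}] {line}" for speaker, line in turns)

script = build_transcript([
    ("S1", "Dia generates dialogue directly from a transcript."),
    ("S2", "It even handles non-verbals. (laughs)"),  # inline non-verbal cue
])
print(script)
# [S1] Dia generates dialogue directly from a transcript. [S2] It even handles non-verbals. (laughs)
```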
## ⚙️ Usage

### As a Python Library
Dia has been tested only on GPUs (pytorch 2.0+, CUDA 12.6); CPU support is to be added soon.

The initial run will take longer as the Descript Audio Codec also needs to be downloaded.

On enterprise GPUs, Dia can generate audio in real-time. On older GPUs, inference time will be slower.
For reference, on an A4000 GPU, Dia roughly generates 40 tokens/s (86 tokens equal 1 second of audio).
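Taken together, those two figures imply a bit under half real time on an A4000; a quick back-of-the-envelope check (numbers taken from the sentence above):

```python
# Figures from the text: ~40 generated tokens/s on an A4000,
# with 86 tokens corresponding to 1 second of audio.
tokens_per_sec = 40
tokens_per_audio_sec = 86

realtime_factor = tokens_per_sec / tokens_per_audio_sec        # ~0.47x real time
wall_sec_per_audio_sec = tokens_per_audio_sec / tokens_per_sec # ~2.1s per audio second

print(f"~{realtime_factor:.2f}x real time, "
      f"about {wall_sec_per_audio_sec:.1f}s of compute per second of audio")
```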
`torch.compile` will increase speeds for supported GPUs.

The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## ⚠️ Disclaimer
This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are **strictly forbidden**:
- **Identity Misuse**: Do not produce audio resembling real individuals without permission.
- **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).

Join our [Discord Server](https://discord.gg/pgdB5YRe) for discussions.

- We thank the [Google TPU Research Cloud program](https://sites.research.google/trc/about/) for providing computation resources.
- Our work was heavily inspired by [SoundStorm](https://arxiv.org/abs/2305.09636), [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/), and [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec).
- We thank Hugging Face for providing the ZeroGPU grant.
- "Nari" is a pure Korean word for lily.
- We thank Jason Y. for providing help with data filtering.