thepushkarp committed on
Commit
4b3b208
·
verified ·
1 Parent(s): b54cf0d

Add files using upload-large-folder tool

Files changed (3)
  1. README.md +136 -0
  2. config.json +49 -0
  3. dia-v0_1-fp16.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - Text-to-Speech
+ pipeline_tag: text-to-speech
+ library_name: dia
+ ---
+
+ **Note:** This repository contains the FP16 (half-precision) version of the [Dia-1.6B model](https://huggingface.co/nari-labs/Dia-1.6B), converted to the SafeTensors format for potentially faster loading and reduced file size compared to the original `.pth` file.
+
+ **FP16 Conversion Statistics:**
+ ```text
+ Original size: 6.002177 GB
+ Converted size: 3.001058 GB
+ Size reduction: 50.000510%
+ Maximum absolute tensor difference: 0.000487
+ Maximum relative tensor difference: 0.229572
+ Average absolute tensor difference: 0.000010
+ ```
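
The statistics above pair a tiny maximum absolute difference with a fairly large maximum relative difference. One possible reading, offered only as a sketch, is ordinary half-precision rounding: values in [1, 2) are spaced 2^-10 apart in FP16, so their rounding error is at most 2^-11 (about 0.000488), while very small values fall into the subnormal range and lose a large fraction of their magnitude. The snippet below illustrates that behaviour with Python's standard `struct` half-precision format. The sample values are made up for illustration and are not taken from the checkpoint, and the actual conversion script is not shown in this commit.

```python
import struct


def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def errors(x):
    """Return (absolute, relative) round-trip error for a nonzero x."""
    y = to_fp16(x)
    abs_err = abs(x - y)
    return abs_err, abs_err / abs(x)


# Illustrative values only (hypothetical, not read from the model file).
samples = [1.7003, 0.0123, 1e-7]
for w in samples:
    a, r = errors(w)
    print(f"{w:>10.4g}  abs={a:.3e}  rel={r:.3e}")
```

In this sketch the value near 1.7 shows an absolute error on the order of 1e-4 with a negligible relative error, whereas 1e-7 shows a negligible absolute error but a relative error of roughly 19%.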
+
+ <center>
+ <a href="https://github.com/nari-labs/dia">
+ <img src="https://github.com/nari-labs/dia/raw/main/dia/static/images/banner.png">
+ </a>
+ </center>
+
+ Dia is a 1.6B-parameter text-to-speech model created by Nari Labs.
+
+ Dia **directly generates highly realistic dialogue from a transcript**. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal sounds such as laughter, coughing, and throat clearing.
+
+ To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on [Hugging Face](https://huggingface.co/nari-labs/Dia-1.6B). The model only supports English generation at the moment.
+
+ We also provide a [demo page](https://yummy-fir-7a4.notion.site/dia) comparing our model to [ElevenLabs Studio](https://elevenlabs.io/studio) and [Sesame CSM-1B](https://github.com/SesameAILabs/csm).
+
+ - (Update) We have a ZeroGPU Space running! Try it now [here](https://huggingface.co/spaces/nari-labs/Dia-1.6B). Thanks to the HF team for the support :)
+ - Join our [Discord server](https://discord.gg/pgdB5YRe) for community support and access to new features.
+ - Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the [waitlist](https://tally.so/r/meokbo) for early access.
+
+ ## ⚡️ Quickstart
+
+ The commands below open a Gradio UI that you can work in.
+
+ ```bash
+ git clone https://github.com/nari-labs/dia.git
+ cd dia && uv run app.py
+ ```
+
+ Or, if you do not have `uv` pre-installed:
+
+ ```bash
+ git clone https://github.com/nari-labs/dia.git
+ cd dia
+ python -m venv .venv
+ source .venv/bin/activate
+ pip install uv
+ uv run app.py
+ ```
+
+ Note that the model was not fine-tuned on a specific voice, so you will get a different voice each time you run it.
+ You can keep the speaker consistent either by adding an audio prompt (a guide is coming very soon; for now, try the second example in the Gradio UI) or by fixing the seed.
+
+ ## Features
+
+ - Generate dialogue via the `[S1]` and `[S2]` tags.
+ - Generate nonverbal sounds such as `(laughs)`, `(coughs)`, etc.
+ - Voice cloning. See [`example/voice_clone.py`](example/voice_clone.py) for more information.
+ - In the Hugging Face space, you can upload the audio you want to clone and place its transcript before your script. Make sure the transcript follows the required format. The model will then output only the content of your script.
+
+ ## ⚙️ Usage
+
+ ### As a Python Library
+
+ ```python
+ import soundfile as sf
+
+ from dia.model import Dia
+
+
+ # Load the pretrained checkpoint from the Hugging Face Hub.
+ model = Dia.from_pretrained("nari-labs/Dia-1.6B")
+
+ # [S1]/[S2] mark speaker turns; (laughs) is a nonverbal cue.
+ text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
+
+ output = model.generate(text)
+
+ # Write the audio at a 44.1 kHz sample rate (MP3 output needs an MP3-capable libsndfile).
+ sf.write("simple.mp3", output, 44100)
+ ```
+
+ A PyPI package and a working CLI tool will be available soon.
+
+ ## 💻 Hardware and Inference Speed
+
+ Dia has only been tested on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon.
+ The initial run will take longer because the Descript Audio Codec also needs to be downloaded.
+
+ On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower.
+ For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s (86 tokens equal 1 second of audio).
+ `torch.compile` can increase speed on supported GPUs.
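
To make the A4000 figure concrete, the arithmetic can be sketched as follows. The two constants come from the README text above. The function names are hypothetical helpers introduced for illustration and are not part of the `dia` package.

```python
TOKENS_PER_AUDIO_SECOND = 86  # from the README: 86 tokens equal 1 second of audio
A4000_TOKENS_PER_SECOND = 40  # from the README: approximate A4000 throughput


def real_time_factor(tokens_per_second):
    """Seconds of audio produced per second of wall-clock time."""
    return tokens_per_second / TOKENS_PER_AUDIO_SECOND


def generation_time(audio_seconds, tokens_per_second):
    """Estimated wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds * TOKENS_PER_AUDIO_SECOND / tokens_per_second


rtf = real_time_factor(A4000_TOKENS_PER_SECOND)
print(f"real-time factor on A4000: {rtf:.2f}")
print(f"10 s of audio takes about {generation_time(10, A4000_TOKENS_PER_SECOND):.1f} s")
```

Under these numbers an A4000 runs at a little under half of real time, and real-time generation would need roughly 86 tokens/s.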
+
+ The full version of Dia requires around 10 GB of VRAM to run. We will be adding a quantized version in the future.
+
+ If you don't have hardware available, or if you want to play with bigger versions of our models, join the waitlist [here](https://tally.so/r/meokbo).
+
+ ## 🪪 License
+
+ This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
+
+ ## ⚠️ Disclaimer
+
+ This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are **strictly forbidden**:
+
+ - **Identity Misuse**: Do not produce audio resembling real individuals without permission.
+ - **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).
+ - **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.
+
+ By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.
+
+ ## 🔭 TODO / Future Work
+
+ - Docker support.
+ - Optimize inference speed.
+ - Add quantization for memory efficiency.
+
+ ## 🤝 Contributing
+
+ We are a tiny team of 1 full-time and 1 part-time research engineer. We warmly welcome any contributions!
+ Join our [Discord Server](https://discord.gg/pgdB5YRe) for discussions.
+
+ ## 🤗 Acknowledgements
+
+ - We thank the [Google TPU Research Cloud program](https://sites.research.google/trc/about/) for providing computation resources.
+ - Our work was heavily inspired by [SoundStorm](https://arxiv.org/abs/2305.09636), [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/), and [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec).
+ - We thank Hugging Face for providing the ZeroGPU grant.
+ - "Nari" is a pure Korean word for lily.
+ - We thank Jason Y. for providing help with data filtering.
config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "version": "0.1",
+   "model": {
+     "encoder": {
+       "n_layer": 12,
+       "n_embd": 1024,
+       "n_hidden": 4096,
+       "n_head": 16,
+       "head_dim": 128
+     },
+     "decoder": {
+       "n_layer": 18,
+       "n_embd": 2048,
+       "n_hidden": 8192,
+       "gqa_query_heads": 16,
+       "cross_query_heads": 16,
+       "kv_heads": 4,
+       "gqa_head_dim": 128,
+       "cross_head_dim": 128
+     },
+     "src_vocab_size": 256,
+     "tgt_vocab_size": 1028,
+     "dropout": 0.0
+   },
+   "training": {
+     "dtype": "bfloat16",
+     "logits_dot_in_fp32": false
+   },
+   "data": {
+     "text_length": 1024,
+     "audio_length": 3072,
+     "channels": 9,
+     "text_pad_value": 0,
+     "audio_eos_value": 1024,
+     "audio_pad_value": 1025,
+     "audio_bos_value": 1026,
+     "delay_pattern": [
+       0,
+       8,
+       9,
+       10,
+       11,
+       12,
+       13,
+       14,
+       15
+     ]
+   }
+ }
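
A few relationships between the fields in this config can be checked mechanically. The sketch below copies a subset of the values above into a Python dict and asserts some consistency properties. The interpretations in the comments (one delay offset per audio channel, special tokens sharing the target vocabulary, query heads grouped over key/value heads) are assumptions read off the field names, not documented behaviour of the `dia` library.

```python
# Subset of config.json, copied by hand for illustration.
config = {
    "decoder": {
        "n_embd": 2048,
        "gqa_query_heads": 16,
        "kv_heads": 4,
        "gqa_head_dim": 128,
    },
    "tgt_vocab_size": 1028,
    "data": {
        "channels": 9,
        "audio_eos_value": 1024,
        "audio_pad_value": 1025,
        "audio_bos_value": 1026,
        "delay_pattern": [0, 8, 9, 10, 11, 12, 13, 14, 15],
    },
}

dec, data = config["decoder"], config["data"]

# Assumption: delay_pattern holds one offset per audio channel.
assert len(data["delay_pattern"]) == data["channels"]

# Assumption: EOS/PAD/BOS ids must be valid indices into the target vocabulary.
special = [data["audio_eos_value"], data["audio_pad_value"], data["audio_bos_value"]]
assert len(set(special)) == 3 and max(special) < config["tgt_vocab_size"]

# Assumption: grouped-query attention, so query heads divide evenly over KV heads.
assert dec["gqa_query_heads"] % dec["kv_heads"] == 0
queries_per_kv_head = dec["gqa_query_heads"] // dec["kv_heads"]

# Total self-attention width implied by the head count and head size.
attn_width = dec["gqa_query_heads"] * dec["gqa_head_dim"]

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"decoder attention width: {attn_width} (n_embd = {dec['n_embd']})")
```

For the decoder, the implied attention width matches `n_embd`. The encoder's 16 heads of size 128 give 2048 against an `n_embd` of 1024, so its head size appears to be set independently of the embedding width.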
dia-v0_1-fp16.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d480a138b8f38ce31374e674eb5e1f0cfa51445cae16091a12cc4ca8d4c0646
+ size 3222361608
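
The pointer file records the real size of the weights in bytes, which can be cross-checked against the README's "Converted size" figure. The sketch below parses the three pointer fields and converts the size. `parse_lfs_pointer` is a hypothetical helper written for this example, and the parameter estimate assumes every stored value takes 2 bytes and ignores the SafeTensors header.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9d480a138b8f38ce31374e674eb5e1f0cfa51445cae16091a12cc4ca8d4c0646
size 3222361608
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
size_gib = info["size"] / 2**30   # binary gigabytes (GiB)
size_gb = info["size"] / 10**9    # decimal gigabytes (GB)
approx_params = info["size"] // 2  # rough upper bound at 2 bytes per FP16 value

print(f"{size_gib:.6f} GiB, {size_gb:.3f} GB, ~{approx_params / 1e9:.2f}B values")
```

The size in binary units comes out at about 3.001058, which matches the README's "Converted size" and suggests that the "GB" there means GiB. The rough value count of about 1.61 billion is consistent with the model's 1.6B name.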