not-lain committed
Commit b2d8a8c · verified · 1 Parent(s): 8ff9cf3

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/logo/yue.mp3 filter=lfs diff=lfs merge=lfs -text
ORIGINAL_README.md ADDED
@@ -0,0 +1,163 @@
<p align="center">
<img src="./assets/logo/白底.png" width="400" />
</p>

<p align="center">
<a href="https://map-yue.github.io/">Demo 🎶</a> &nbsp;|&nbsp; 📑 <a href="">Paper (coming soon)</a>
<br>
<a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">YuE-s1-7B-anneal-en-cot 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl">YuE-s1-7B-anneal-en-icl 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-jp-kr-cot">YuE-s1-7B-anneal-jp-kr-cot 🤗</a>
<br>
<a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-jp-kr-icl">YuE-s1-7B-anneal-jp-kr-icl 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-zh-cot">YuE-s1-7B-anneal-zh-cot 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-zh-icl">YuE-s1-7B-anneal-zh-icl 🤗</a>
<br>
<a href="https://huggingface.co/m-a-p/YuE-s2-1B-general">YuE-s2-1B-general 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-upsampler">YuE-upsampler 🤗</a>
</p>

---
Our model's name is **YuE (乐)**. In Chinese, the word means "music" and "happiness." Some of you may find words that start with Yu hard to pronounce. If so, you can just call it "yeah." We wrote a song with our model's name.

<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6555e8d8a0c34cd61a6b9ce3/rG-ELxMyzDU7zH-inB9DV.mpga"></audio>

YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and complementary accompaniment, ensuring a polished and cohesive result. YuE is capable of modeling diverse genres and vocal styles. Below are examples of songs in the pop and metal genres; for more styles, please visit the demo page.

Pop: Quiet Evening
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/640701cb4dc5f2846c91d4eb/gnBULaFjcUyXYzzIwXLZq.mpga"></audio>
Metal: Step Back
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6555e8d8a0c34cd61a6b9ce3/kmCwl4GRS70UYDEELL-Tn.mpga"></audio>

## News and Updates

* **2025.01.26 🔥**: We have released the **YuE** series.

<br>

## Requirements

Python >= 3.8 is recommended.

Install dependencies with the following command:

```
pip install -r requirements.txt
```

### **Important: Install FlashAttention 2**
To save GPU memory, **FlashAttention 2 is mandatory**. Without it, long sequence lengths will lead to out-of-memory (OOM) errors, especially on GPUs with limited memory. Install it using the following command:
```
pip install flash-attn --no-build-isolation
```
Before installing FlashAttention, ensure that your CUDA environment is set up correctly.
For example, if you are using CUDA 11.8:
- If using a module system:
``` module load cuda11.8/toolkit/11.8.0 ```
- Or manually configure CUDA in your shell:
```
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
```
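Once FlashAttention is installed, a quick sanity check can confirm that CUDA is visible and that the package imports cleanly. This is a minimal sketch; it only assumes `torch` and `flash-attn` are installed in the current environment:

```python
# Quick environment check: CUDA visibility and FlashAttention 2 import
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # installed via `pip install flash-attn --no-build-isolation`
    print("flash-attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError as exc:
    print("FlashAttention 2 is not available:", exc)
```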

---

## GPU Memory Usage and Sessions

YuE requires significant GPU memory to generate long sequences. Below are the recommended configurations:

- **For GPUs with 24GB of memory or less**: Run **up to 2 sessions** concurrently to avoid out-of-memory (OOM) errors.
- **For full-song generation** (many sessions, e.g., 4 or more): Use **GPUs with at least 80GB of memory**. This can be achieved by combining multiple GPUs and enabling tensor parallelism.

To customize the number of sessions, the interface allows you to specify the desired session count. By default, the model runs **2 sessions** for optimal memory usage.
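If you are unsure which configuration applies to your machine, PyTorch (which YuE already depends on) can report the total memory of each visible GPU. A small check like the following is enough:

```python
# Print the total memory of every visible CUDA device
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible.")
for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    print(f"GPU {idx}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```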

---

## Quickstart

```
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://github.com/multimodal-art-projection/YuE.git

cd YuE/inference/
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
```
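If you prefer to fetch the `xcodec_mini_infer` checkpoint from Python rather than `git clone` (this is what `app.py` in this commit does), `huggingface_hub.snapshot_download` works as well. The target folder below matches the layout the inference script expects:

```python
# Download the codec checkpoint into YuE/inference/xcodec_mini_infer
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="m-a-p/xcodec_mini_infer",
    local_dir="./inference/xcodec_mini_infer",  # run this from the YuE repo root
)
```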

Here's a quick guide to help you generate music with **YuE** using 🤗 Transformers. Before running the code, make sure your environment is properly set up and all dependencies are installed.

### Running the Script

In the following example, customize the `genres` and `lyrics` in the script, then execute it to generate a song with **YuE**.

Notice: Set `--run_n_segments` to the number of lyric sections if you want to generate a full song. Additionally, you can increase `--stage2_batch_size` based on your available GPU memory.

```bash
cd YuE/inference/
python infer.py \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt prompt_examples/genre.txt \
    --lyrics_txt prompt_examples/lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ./output \
    --cuda_idx 0 \
    --max_new_tokens 3000
```
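The Gradio app in this commit drives the same script programmatically by assembling the argument list and invoking it with `subprocess`. A trimmed sketch of that pattern (the prompt file paths are illustrative) looks like this:

```python
# Invoke infer.py from Python, mirroring what app.py does
import subprocess

command = [
    "python", "infer.py",
    "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-cot",
    "--stage2_model", "m-a-p/YuE-s2-1B-general",
    "--genre_txt", "prompt_examples/genre.txt",
    "--lyrics_txt", "prompt_examples/lyrics.txt",
    "--run_n_segments", "2",
    "--stage2_batch_size", "4",
    "--output_dir", "./output",
    "--cuda_idx", "0",
    "--max_new_tokens", "3000",
]
subprocess.run(command, check=True)  # raises CalledProcessError if generation fails
```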

If you want to use an audio prompt, enable `--use_audio_prompt` and provide the audio file:
```bash
cd YuE/inference/
python infer.py \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt prompt_examples/genre.txt \
    --lyrics_txt prompt_examples/lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ./output \
    --cuda_idx 0 \
    --max_new_tokens 3000 \
    --audio_prompt_path {YOUR_AUDIO_FILE} \
    --prompt_start_time 0 \
    --prompt_end_time 30
```

---

### **Execution Time**
On an **H800 GPU**, generating 30 seconds of audio takes **150 seconds**.
On an **RTX 4090 GPU**, generating 30 seconds of audio takes approximately **360 seconds**.

**Tips:**
1. `genres` should include details like instruments, genre, mood, vocal timbre, and vocal gender.
2. The length of each `lyrics` segment should match the `--max_new_tokens` value. For example, if `--max_new_tokens` is set to 3000, the maximum duration for a segment is around 30 seconds, so make sure the lyrics fit within that time frame (see the helper sketched below).
3. If you use an audio prompt, a duration of around 30 seconds works well.
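As a rule of thumb from the tips above, 3000 new tokens corresponds to roughly 30 seconds per segment, i.e., about 100 tokens per second of audio. The helper below is a sketch based only on that ratio, not an exact guarantee:

```python
# Rough --max_new_tokens estimate for a target segment length
TOKENS_PER_SECOND = 100  # from the "3000 tokens ~ 30 s" rule of thumb above

def max_new_tokens_for(segment_seconds: float) -> int:
    """Approximate --max_new_tokens for one lyric segment."""
    return int(segment_seconds * TOKENS_PER_SECOND)

print(max_new_tokens_for(30))  # 3000, matching the example command
```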

---

### Notice
1. A suitable [Genre] tag consists of five components: genre, instrument, mood, gender, and timbre. All five should be included if possible, separated by spaces. The timbre values should include "vocal" (e.g., "bright vocal").

2. Although our tags have an open vocabulary, we have provided the 200 most commonly used [tags](./wav_top_200_tags.json). It is recommended to select tags from this list for more stable results.

3. The order of the tags is flexible. For example, a stable genre control string might look like: "[Genre] inspiring female uplifting pop airy vocal electronic bright vocal vocal." (See the prompt-file sketch after this list.)

4. Additionally, we have introduced the "Mandarin" and "Cantonese" tags to distinguish between Mandarin and Cantonese, as their lyrics often share similarities.
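Putting the tag and lyric conventions together, the sketch below writes a genre/lyrics file pair in the plain-text format `infer.py` reads. The genre string reuses the example from item 3 (without the `[Genre]` prefix, which `infer.py` adds itself), and the lyric section headers follow the `[verse]`/`[chorus]` pattern used by the Space's examples; the file names are illustrative:

```python
# Write genre.txt and lyrics.txt for use with --genre_txt / --lyrics_txt
genre = "inspiring female uplifting pop airy vocal electronic bright vocal vocal"

lyrics = """[verse]
In the quiet of the evening, shadows start to fall
Whispers of the night wind echo through the hall

[chorus]
Don't let this moment fade, hold me close tonight
With you here beside me, everything's alright
"""

with open("genre.txt", "w") as f:
    f.write(genre + "\n")

with open("lyrics.txt", "w") as f:
    f.write(lyrics)
```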

## License Agreement

Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)

---

## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

```BibTeX
@misc{yuan2025yue,
    title={YuE: Open Music Foundation Models for Full-Song Generation},
    author={Ruibin Yuan and Hanfeng Lin and Shawn Guo and Ge Zhang and Jiahao Pan and Yongyi Zang and Haohe Liu and Xingjian Du and Xeron Du and Zhen Ye and Tianyu Zheng and Yinghao Ma and Minghao Liu and Lijun Yu and Zeyue Tian and Ziya Zhou and Liumeng Xue and Xingwei Qu and Yizhi Li and Tianhao Shen and Ziyang Ma and Shangda Wu and Jun Zhan and Chunhui Wang and Yatian Wang and Xiaohuan Zhou and Xiaowei Chi and Xinyue Zhang and Zhenzhu Yang and Yiming Liang and Xiangzhou Wang and Shansong Liu and Lingrui Mei and Peng Li and Yong Chen and Chenghua Lin and Xie Chen and Gus Xia and Zhaoxiang Zhang and Chao Zhang and Wenhu Chen and Xinyu Zhou and Xipeng Qiu and Roger Dannenberg and Jiaheng Liu and Jian Yang and Stephen Huang and Wei Xue and Xu Tan and Yike Guo},
    howpublished={\url{https://github.com/multimodal-art-projection/YuE}},
    year={2025},
    note={GitHub repository}
}
```
<br>
README.md CHANGED
@@ -1,14 +1,12 @@
 ---
-title: YuE Music Generator Demo
-emoji: 🌍
-colorFrom: yellow
+title: YuE
+emoji: 👩‍🎤
+colorFrom: pink
 colorTo: green
 sdk: gradio
 sdk_version: 5.13.1
 app_file: app.py
 pinned: false
-license: apache-2.0
-short_description: Suno lvl open source music generator
 ---

 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -1,333 +1,252 @@
1
- import os
2
- import subprocess
3
-
4
- # Install flash attention
5
- subprocess.run(
6
- "pip install flash-attn --no-build-isolation",
7
- env={"FLASH_ATTENTION_SKIP_CUDA_BUILD": "TRUE"},
8
- shell=True,
9
- )
10
-
11
- import spaces
12
- import os
13
- import torch
14
- import numpy as np
15
- from omegaconf import OmegaConf
16
- import torchaudio
17
- from torchaudio.transforms import Resample
18
- import soundfile as sf
19
- import uuid
20
- from tqdm import tqdm
21
- from einops import rearrange
22
  import gradio as gr
23
- import re
24
- from collections import Counter
25
- from codecmanipulator import CodecManipulator
26
- from mmtokenizer import _MMSentencePieceTokenizer
27
- from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
28
- # from models.soundstream_hubert_new import SoundStream
29
- from vocoder import build_codec_model, process_audio
30
- from post_process_audio import replace_low_freq_with_energy_matched
31
-
32
- # Initialize global variables and models
33
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
34
- mmtokenizer = _MMSentencePieceTokenizer("./mm_tokenizer_v0.2_hf/tokenizer.model")
35
- codectool = CodecManipulator("xcodec", 0, 1)
36
- codectool_stage2 = CodecManipulator("xcodec", 0, 8)
 
 
 
 
 
37
 
38
- # Load models once at startup
39
- def load_models():
40
- # Stage 1 Model
41
- stage1_model = AutoModelForCausalLM.from_pretrained(
42
- "m-a-p/YuE-s1-7B-anneal-en-cot",
43
- torch_dtype=torch.bfloat16,
44
- attn_implementation="flash_attention_2"
45
- ).to(device)
46
- stage1_model.eval()
47
 
48
- # Stage 2 Model
49
- stage2_model = AutoModelForCausalLM.from_pretrained(
50
- "m-a-p/YuE-s2-1B-general",
51
- torch_dtype=torch.float16,
52
- attn_implementation="flash_attention_2"
53
- ).to(device)
54
- stage2_model.eval()
55
 
56
- # Codec Model
57
- model_config = OmegaConf.load('./xcodec_mini_infer/final_ckpt/config.yaml')
58
- codec_model = eval(model_config.generator.name)(**model_config.generator.config).to(device)
59
- parameter_dict = torch.load('./xcodec_mini_infer/final_ckpt/ckpt_00360000.pth', map_location='cpu')
60
- codec_model.load_state_dict(parameter_dict['codec_model'])
61
- codec_model.eval()
62
 
63
- return stage1_model, stage2_model, codec_model
 
 
 
 
 
64
 
65
- stage1_model, stage2_model, codec_model = load_models()
 
 
 
66
 
67
- # Helper functions
68
- def split_lyrics(lyrics):
69
- pattern = r"\[(\w+)\](.*?)\n(?=\[|\Z)"
70
- segments = re.findall(pattern, lyrics, re.DOTALL)
71
- return [f"[{seg[0]}]\n{seg[1].strip()}\n\n" for seg in segments]
 
 
72
 
73
- def load_audio_mono(filepath, sampling_rate=16000):
74
- audio, sr = torchaudio.load(filepath)
75
- audio = torch.mean(audio, dim=0, keepdim=True) # Convert to mono
76
- if sr != sampling_rate:
77
- resampler = Resample(orig_freq=sr, new_freq=sampling_rate)
78
- audio = resampler(audio)
79
- return audio
 
 
 
 
 
 
 
 
 
 
 
 
80
 
81
- def save_audio(wav: torch.Tensor, path, sample_rate: int, rescale: bool = False):
82
- folder_path = os.path.dirname(path)
83
- if not os.path.exists(folder_path):
84
- os.makedirs(folder_path)
85
- limit = 0.99
86
- max_val = wav.abs().max()
87
- wav = wav * min(limit / max_val, 1) if rescale else wav.clamp(-limit, limit)
88
- torchaudio.save(str(path), wav, sample_rate=sample_rate, encoding='PCM_S', bits_per_sample=16)
89
 
90
- # Stage 1 Generation
91
- def stage1_generate(genres, lyrics_text, use_audio_prompt, audio_prompt_path, prompt_start_time, prompt_end_time):
92
- structured_lyrics = split_lyrics(lyrics_text)
93
- full_lyrics = "\n".join(structured_lyrics)
94
- prompt_texts = [f"Generate music from the given lyrics segment by segment.\n[Genre] {genres}\n{full_lyrics}"] + structured_lyrics
95
 
96
- random_id = str(uuid.uuid4())
97
- output_dir = os.path.join("./output", random_id)
98
  os.makedirs(output_dir, exist_ok=True)
 
 
99
 
100
- stage1_output_set = []
101
- for i, p in enumerate(tqdm(prompt_texts)):
102
- section_text = p.replace('[start_of_segment]', '').replace('[end_of_segment]', '')
103
- guidance_scale = 1.5 if i <= 1 else 1.2
104
-
105
- if i == 0:
106
- continue
107
-
108
- if i == 1 and use_audio_prompt:
109
- audio_prompt = load_audio_mono(audio_prompt_path)
110
- audio_prompt.unsqueeze_(0)
111
- with torch.no_grad():
112
- raw_codes = codec_model.encode(audio_prompt.to(device), target_bw=0.5)
113
- raw_codes = raw_codes.transpose(0, 1).cpu().numpy().astype(np.int16)
114
- audio_prompt_codec = codectool.npy2ids(raw_codes[0])[int(prompt_start_time * 50): int(prompt_end_time * 50)]
115
- audio_prompt_codec_ids = [mmtokenizer.soa] + codectool.sep_ids + audio_prompt_codec + [mmtokenizer.eoa]
116
- sentence_ids = mmtokenizer.tokenize("[start_of_reference]") + audio_prompt_codec_ids + mmtokenizer.tokenize("[end_of_reference]")
117
- head_id = mmtokenizer.tokenize(prompt_texts[0]) + sentence_ids
118
- else:
119
- head_id = mmtokenizer.tokenize(prompt_texts[0])
120
-
121
- prompt_ids = head_id + mmtokenizer.tokenize("[start_of_segment]") + mmtokenizer.tokenize(section_text) + [mmtokenizer.soa] + codectool.sep_ids
122
- prompt_ids = torch.as_tensor(prompt_ids).unsqueeze(0).to(device)
123
-
124
- with torch.no_grad():
125
- output_seq = stage1_model.generate(
126
- input_ids=prompt_ids,
127
- max_new_tokens=3000,
128
- min_new_tokens=100,
129
- do_sample=True,
130
- top_p=0.93,
131
- temperature=1.0,
132
- repetition_penalty=1.2,
133
- eos_token_id=mmtokenizer.eoa,
134
- pad_token_id=mmtokenizer.eoa,
135
- )
136
-
137
- if i > 1:
138
- raw_output = torch.cat([raw_output, prompt_ids, output_seq[:, prompt_ids.shape[-1]:]], dim=1)
139
  else:
140
- raw_output = output_seq
141
-
142
- # Save Stage 1 outputs
143
- ids = raw_output[0].cpu().numpy()
144
- soa_idx = np.where(ids == mmtokenizer.soa)[0].tolist()
145
- eoa_idx = np.where(ids == mmtokenizer.eoa)[0].tolist()
146
-
147
- vocals = []
148
- instrumentals = []
149
- for i in range(len(soa_idx)):
150
- codec_ids = ids[soa_idx[i] + 1:eoa_idx[i]]
151
- if codec_ids[0] == 32016:
152
- codec_ids = codec_ids[1:]
153
- codec_ids = codec_ids[:2 * (codec_ids.shape[0] // 2)]
154
- vocals_ids = codectool.ids2npy(rearrange(codec_ids, "(n b) -> b n", b=2)[0])
155
- vocals.append(vocals_ids)
156
- instrumentals_ids = codectool.ids2npy(rearrange(codec_ids, "(n b) -> b n", b=2)[1])
157
- instrumentals.append(instrumentals_ids)
158
-
159
- vocals = np.concatenate(vocals, axis=1)
160
- instrumentals = np.concatenate(instrumentals, axis=1)
161
- vocal_save_path = os.path.join(output_dir, f"vocal_{random_id}.npy")
162
- inst_save_path = os.path.join(output_dir, f"instrumental_{random_id}.npy")
163
- np.save(vocal_save_path, vocals)
164
- np.save(inst_save_path, instrumentals)
165
- stage1_output_set.append(vocal_save_path)
166
- stage1_output_set.append(inst_save_path)
167
-
168
- return stage1_output_set, output_dir
169
-
170
- # Stage 2 Generation
171
- def stage2_generate(model, prompt, batch_size=16):
172
- codec_ids = codectool.unflatten(prompt, n_quantizer=1)
173
- codec_ids = codectool.offset_tok_ids(
174
- codec_ids,
175
- global_offset=codectool.global_offset,
176
- codebook_size=codectool.codebook_size,
177
- num_codebooks=codectool.num_codebooks,
178
- ).astype(np.int32)
179
-
180
- if batch_size > 1:
181
- codec_list = []
182
- for i in range(batch_size):
183
- idx_begin = i * 300
184
- idx_end = (i + 1) * 300
185
- codec_list.append(codec_ids[:, idx_begin:idx_end])
186
- codec_ids = np.concatenate(codec_list, axis=0)
187
- prompt_ids = np.concatenate(
188
- [
189
- np.tile([mmtokenizer.soa, mmtokenizer.stage_1], (batch_size, 1)),
190
- codec_ids,
191
- np.tile([mmtokenizer.stage_2], (batch_size, 1)),
192
- ],
193
- axis=1
 
 
194
  )
195
- else:
196
- prompt_ids = np.concatenate([
197
- np.array([mmtokenizer.soa, mmtokenizer.stage_1]),
198
- codec_ids.flatten(),
199
- np.array([mmtokenizer.stage_2])
200
- ]).astype(np.int32)
201
- prompt_ids = prompt_ids[np.newaxis, ...]
202
-
203
- codec_ids = torch.as_tensor(codec_ids).to(device)
204
- prompt_ids = torch.as_tensor(prompt_ids).to(device)
205
- len_prompt = prompt_ids.shape[-1]
206
-
207
- block_list = LogitsProcessorList([BlockTokenRangeProcessor(0, 46358), BlockTokenRangeProcessor(53526, mmtokenizer.vocab_size)])
208
-
209
- for frames_idx in range(codec_ids.shape[1]):
210
- cb0 = codec_ids[:, frames_idx:frames_idx + 1]
211
- prompt_ids = torch.cat([prompt_ids, cb0], dim=1)
212
- input_ids = prompt_ids
213
-
214
- with torch.no_grad():
215
- stage2_output = model.generate(
216
- input_ids=input_ids,
217
- min_new_tokens=7,
218
- max_new_tokens=7,
219
- eos_token_id=mmtokenizer.eoa,
220
- pad_token_id=mmtokenizer.eoa,
221
- logits_processor=block_list,
222
- )
223
-
224
- assert stage2_output.shape[1] - prompt_ids.shape[1] == 7, f"output new tokens={stage2_output.shape[1] - prompt_ids.shape[1]}"
225
- prompt_ids = stage2_output
226
-
227
- if batch_size > 1:
228
- output = prompt_ids.cpu().numpy()[:, len_prompt:]
229
- output_list = [output[i] for i in range(batch_size)]
230
- output = np.concatenate(output_list, axis=0)
231
- else:
232
- output = prompt_ids[0].cpu().numpy()[len_prompt:]
233
-
234
- return output
235
-
236
- def stage2_inference(model, stage1_output_set, output_dir, batch_size=4):
237
- stage2_result = []
238
- for i in tqdm(range(len(stage1_output_set))):
239
- output_filename = os.path.join(output_dir, os.path.basename(stage1_output_set[i]))
240
- if os.path.exists(output_filename):
241
- continue
242
-
243
- prompt = np.load(stage1_output_set[i]).astype(np.int32)
244
- output_duration = prompt.shape[-1] // 50 // 6 * 6
245
- num_batch = output_duration // 6
246
-
247
- if num_batch <= batch_size:
248
- output = stage2_generate(model, prompt[:, :output_duration * 50], batch_size=num_batch)
249
- else:
250
- segments = []
251
- num_segments = (num_batch // batch_size) + (1 if num_batch % batch_size != 0 else 0)
252
- for seg in range(num_segments):
253
- start_idx = seg * batch_size * 300
254
- end_idx = min((seg + 1) * batch_size * 300, output_duration * 50)
255
- current_batch_size = batch_size if seg != num_segments - 1 or num_batch % batch_size == 0 else num_batch % batch_size
256
- segment = stage2_generate(model, prompt[:, start_idx:end_idx], batch_size=current_batch_size)
257
- segments.append(segment)
258
- output = np.concatenate(segments, axis=0)
259
-
260
- if output_duration * 50 != prompt.shape[-1]:
261
- ending = stage2_generate(model, prompt[:, output_duration * 50:], batch_size=1)
262
- output = np.concatenate([output, ending], axis=0)
263
- output = codectool_stage2.ids2npy(output)
264
-
265
- fixed_output = copy.deepcopy(output)
266
- for i, line in enumerate(output):
267
- for j, element in enumerate(line):
268
- if element < 0 or element > 1023:
269
- counter = Counter(line)
270
- most_frequant = sorted(counter.items(), key=lambda x: x[1], reverse=True)[0][0]
271
- fixed_output[i, j] = most_frequant
272
- np.save(output_filename, fixed_output)
273
- stage2_result.append(output_filename)
274
- return stage2_result
275
-
276
- # Main Gradio function
277
- @spaces.GPU()
278
- def generate_music(genres, lyrics_text, use_audio_prompt, audio_prompt, start_time, end_time, progress=gr.Progress()):
279
- progress(0.1, "Running Stage 1 Generation...")
280
- stage1_output_set, output_dir = stage1_generate(genres, lyrics_text, use_audio_prompt, audio_prompt, start_time, end_time)
281
-
282
- progress(0.6, "Running Stage 2 Refinement...")
283
- stage2_result = stage2_inference(stage2_model, stage1_output_set, output_dir)
284
-
285
- progress(0.8, "Processing Audio...")
286
- vocal_decoder, inst_decoder = build_codec_model('./xcodec_mini_infer/decoders/config.yaml', './xcodec_mini_infer/decoders/decoder_131000.pth', './xcodec_mini_infer/decoders/decoder_151000.pth')
287
- vocoder_output_dir = os.path.join(output_dir, "vocoder")
288
- os.makedirs(vocoder_output_dir, exist_ok=True)
289
-
290
- for npy in stage2_result:
291
- if 'instrumental' in npy:
292
- process_audio(npy, os.path.join(vocoder_output_dir, 'instrumental.mp3'), False, None, inst_decoder, codec_model)
293
- else:
294
- process_audio(npy, os.path.join(vocoder_output_dir, 'vocal.mp3'), False, None, vocal_decoder, codec_model)
295
-
296
- return [
297
- os.path.join(vocoder_output_dir, 'instrumental.mp3'),
298
- os.path.join(vocoder_output_dir, 'vocal.mp3')
299
- ]
300
-
301
- # Gradio UI
302
- with gr.Blocks(title="AI Music Generation") as demo:
303
- gr.Markdown("# 🎵 AI Music Generation Pipeline")
304
 
305
- with gr.Row():
306
- with gr.Column():
307
- genre_input = gr.Textbox(label="Genre Tags", placeholder="e.g., Pop, Happy, Female Vocal")
308
- lyrics_input = gr.Textbox(label="Lyrics", lines=10, placeholder="Enter lyrics with segments...")
309
- use_audio_prompt = gr.Checkbox(label="Use Audio Prompt")
310
- audio_input = gr.Audio(label="Reference Audio", type="filepath", visible=False)
311
- start_time = gr.Number(label="Start Time (sec)", value=0.0, visible=False)
312
- end_time = gr.Number(label="End Time (sec)", value=30.0, visible=False)
313
-
314
- generate_btn = gr.Button("Generate Music", variant="primary")
315
-
316
- with gr.Column():
317
- vocal_output = gr.Audio(label="Vocal Track", interactive=False)
318
- inst_output = gr.Audio(label="Instrumental Track", interactive=False)
319
-
320
- use_audio_prompt.change(
321
- lambda x: [gr.update(visible=x), gr.update(visible=x), gr.update(visible=x)],
322
- inputs=use_audio_prompt,
323
- outputs=[audio_input, start_time, end_time]
324
  )
325
-
326
- generate_btn.click(
327
- generate_music,
328
- inputs=[genre_input, lyrics_input, use_audio_prompt, audio_input, start_time, end_time],
329
- outputs=[vocal_output, inst_output]
330
- )
331
-
332
- if __name__ == "__main__":
333
- demo.launch()
 
 
1
  import gradio as gr
2
+ import subprocess
3
+ import os
4
+ import shutil
5
+ import tempfile
6
+
7
+ is_shared_ui = "fffiloni/YuE" in os.environ.get("SPACE_ID", "")  # avoid KeyError when SPACE_ID is unset
8
+
9
+ # Install required package
10
+ def install_flash_attn():
11
+ try:
12
+ print("Installing flash-attn...")
13
+ subprocess.run(
14
+ ["pip", "install", "flash-attn", "--no-build-isolation"],
15
+ check=True
16
+ )
17
+ print("flash-attn installed successfully!")
18
+ except subprocess.CalledProcessError as e:
19
+ print(f"Failed to install flash-attn: {e}")
20
+ exit(1)
21
 
22
+ # Install flash-attn
23
+ install_flash_attn()
 
 
 
 
 
 
 
24
 
25
+ from huggingface_hub import snapshot_download
 
 
 
 
 
 
26
 
27
+ # Create xcodec_mini_infer folder
28
+ folder_path = './inference/xcodec_mini_infer'
 
 
 
 
29
 
30
+ # Create the folder if it doesn't exist
31
+ if not os.path.exists(folder_path):
32
+ os.mkdir(folder_path)
33
+ print(f"Folder created at: {folder_path}")
34
+ else:
35
+ print(f"Folder already exists at: {folder_path}")
36
 
37
+ snapshot_download(
38
+ repo_id = "m-a-p/xcodec_mini_infer",
39
+ local_dir = "./inference/xcodec_mini_infer"
40
+ )
41
 
42
+ # Change to the "inference" directory
43
+ inference_dir = "./inference"
44
+ try:
45
+ os.chdir(inference_dir)
46
+ print(f"Changed working directory to: {os.getcwd()}")
47
+ except FileNotFoundError:
48
+ print(f"Directory not found: {inference_dir}")
49
+ exit(1)
50
+
51
+ def empty_output_folder(output_dir):
52
+ # List all files in the output directory
53
+ files = os.listdir(output_dir)
54
+
55
+ # Iterate over the files and remove them
56
+ for file in files:
57
+ file_path = os.path.join(output_dir, file)
58
+ try:
59
+ if os.path.isdir(file_path):
60
+ # If it's a directory, remove it recursively
61
+ shutil.rmtree(file_path)
62
+ else:
63
+ # If it's a file, delete it
64
+ os.remove(file_path)
65
+ except Exception as e:
66
+ print(f"Error deleting file {file_path}: {e}")
67
+
68
+ # Function to create a temporary file with string content
69
+ def create_temp_file(content, prefix, suffix=".txt"):
70
+ temp_file = tempfile.NamedTemporaryFile(delete=False, mode="w", prefix=prefix, suffix=suffix)
71
+ # Ensure content ends with newline and normalize line endings
72
+ content = content.strip() + "\n\n" # Add extra newline at end
73
+ content = content.replace("\r\n", "\n").replace("\r", "\n")
74
+ temp_file.write(content)
75
+ temp_file.close()
76
+
77
+ # Debug: Print file contents
78
+ print(f"\nContent written to {prefix}{suffix}:")
79
+ print(content)
80
+ print("---")
81
+
82
+ return temp_file.name
83
 
84
+ def get_last_mp3_file(output_dir):
85
+ # List all files in the output directory
86
+ files = os.listdir(output_dir)
87
+
88
+ # Filter only .mp3 files
89
+ mp3_files = [file for file in files if file.endswith('.mp3')]
90
+
91
+ if not mp3_files:
92
+ print("No .mp3 files found in the output folder.")
93
+ return None
94
+
95
+ # Get the full path for the mp3 files
96
+ mp3_files_with_path = [os.path.join(output_dir, file) for file in mp3_files]
97
+
98
+ # Sort the files based on the modification time (most recent first)
99
+ mp3_files_with_path.sort(key=lambda x: os.path.getmtime(x), reverse=True)
100
+
101
+ # Return the most recent .mp3 file
102
+ return mp3_files_with_path[0]
103
 
104
+ def infer(genre_txt_content, lyrics_txt_content, num_segments, max_new_tokens):
105
+ # Create temporary files
106
+ genre_txt_path = create_temp_file(genre_txt_content, prefix="genre_")
107
+ lyrics_txt_path = create_temp_file(lyrics_txt_content, prefix="lyrics_")
 
 
 
 
108
 
109
+ print(f"Genre TXT path: {genre_txt_path}")
110
+ print(f"Lyrics TXT path: {lyrics_txt_path}")
 
 
 
111
 
112
+ # Ensure the output folder exists
113
+ output_dir = "./output"
114
  os.makedirs(output_dir, exist_ok=True)
115
+ print(f"Output folder ensured at: {output_dir}")
116
+
117
+ empty_output_folder(output_dir)
118
+
119
+ # Command and arguments with optimized settings
120
+ command = [
121
+ "python", "infer.py",
122
+ "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-cot",
123
+ "--stage2_model", "m-a-p/YuE-s2-1B-general",
124
+ "--genre_txt", f"{genre_txt_path}",
125
+ "--lyrics_txt", f"{lyrics_txt_path}",
126
+ "--run_n_segments", f"{num_segments}",
127
+ "--stage2_batch_size", "4",
128
+ "--output_dir", f"{output_dir}",
129
+ "--cuda_idx", "0",
130
+ "--max_new_tokens", f"{max_new_tokens}",
131
+ "--disable_offload_model"
132
+ ]
133
 
134
+ # Set up environment variables for CUDA with optimized settings
135
+ env = os.environ.copy()
136
+ env.update({
137
+ "CUDA_VISIBLE_DEVICES": "0",
138
+ "CUDA_HOME": "/usr/local/cuda",
139
+ "PATH": f"/usr/local/cuda/bin:{env.get('PATH', '')}",
140
+ "LD_LIBRARY_PATH": f"/usr/local/cuda/lib64:{env.get('LD_LIBRARY_PATH', '')}"
141
+ })
142
+
143
+ # Execute the command
144
+ try:
145
+ subprocess.run(command, check=True, env=env)
146
+ print("Command executed successfully!")
147
+
148
+ # Check and print the contents of the output folder
149
+ output_files = os.listdir(output_dir)
150
+ if output_files:
151
+ print("Output folder contents:")
152
+ for file in output_files:
153
+ print(f"- {file}")
154
+
155
+ last_mp3 = get_last_mp3_file(output_dir)
156
+
157
+ if last_mp3:
158
+ print("Last .mp3 file:", last_mp3)
159
+ return last_mp3
160
+ else:
161
+ return None
 
 
 
 
 
 
 
 
 
 
 
162
  else:
163
+ print("Output folder is empty.")
164
+ return None
165
+ except subprocess.CalledProcessError as e:
166
+ print(f"Error occurred: {e}")
167
+ return None
168
+ finally:
169
+ # Clean up temporary files
170
+ os.remove(genre_txt_path)
171
+ os.remove(lyrics_txt_path)
172
+ print("Temporary files deleted.")
173
+
174
+ # Gradio
175
+
176
+ with gr.Blocks() as demo:
177
+ with gr.Column():
178
+ gr.Markdown("# YuE: Open Music Foundation Models for Full-Song Generation")
179
+ gr.HTML("""
180
+ <div style="display:flex;column-gap:4px;">
181
+ <a href="https://github.com/multimodal-art-projection/YuE">
182
+ <img src='https://img.shields.io/badge/GitHub-Repo-blue'>
183
+ </a>
184
+ <a href="https://map-yue.github.io">
185
+ <img src='https://img.shields.io/badge/Project-Page-green'>
186
+ </a>
187
+ <a href="https://huggingface.co/spaces/fffiloni/YuE?duplicate=true">
188
+ <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/duplicate-this-space-sm.svg" alt="Duplicate this Space">
189
+ </a>
190
+ </div>
191
+ """)
192
+ with gr.Row():
193
+ with gr.Column():
194
+ genre_txt = gr.Textbox(label="Genre")
195
+ lyrics_txt = gr.Textbox(label="Lyrics")
196
+
197
+ with gr.Column():
198
+ if is_shared_ui:
199
+ num_segments = gr.Number(label="Number of Segments", value=2, interactive=True)
200
+ max_new_tokens = gr.Slider(label="Max New Tokens", minimum=500, maximum=3000, step=500, value=1500, interactive=True)
201
+ else:
202
+ num_segments = gr.Number(label="Number of Song Segments", value=2, interactive=True)
203
+ max_new_tokens = gr.Slider(label="Max New Tokens", minimum=500, maximum=24000, step=500, value=3000, interactive=True)
204
+ submit_btn = gr.Button("Submit")
205
+ music_out = gr.Audio(label="Audio Result")
206
+
207
+ gr.Examples(
208
+ examples = [
209
+ [
210
+ "female blues airy vocal bright vocal piano sad romantic guitar jazz",
211
+ """[verse]
212
+ In the quiet of the evening, shadows start to fall
213
+ Whispers of the night wind echo through the hall
214
+ Lost within the silence, I hear your gentle voice
215
+ Guiding me back homeward, making my heart rejoice
216
+
217
+ [chorus]
218
+ Don't let this moment fade, hold me close tonight
219
+ With you here beside me, everything's alright
220
+ Can't imagine life alone, don't want to let you go
221
+ Stay with me forever, let our love just flow
222
+ """
223
+ ],
224
+ [
225
+ "rap piano street tough piercing vocal hip-hop synthesizer clear vocal male",
226
+ """[verse]
227
+ Woke up in the morning, sun is shining bright
228
+ Chasing all my dreams, gotta get my mind right
229
+ City lights are fading, but my vision's clear
230
+ Got my team beside me, no room for fear
231
+ Walking through the streets, beats inside my head
232
+ Every step I take, closer to the bread
233
+ People passing by, they don't understand
234
+ Building up my future with my own two hands
235
+
236
+ [chorus]
237
+ This is my life, and I'm aiming for the top
238
+ Never gonna quit, no, I'm never gonna stop
239
+ Through the highs and lows, I'mma keep it real
240
+ Living out my dreams with this mic and a deal
241
+ """
242
+ ]
243
+ ],
244
+ inputs = [genre_txt, lyrics_txt]
245
  )
 
 
246
 
247
+ submit_btn.click(
248
+ fn = infer,
249
+ inputs = [genre_txt, lyrics_txt, num_segments, max_new_tokens],
250
+ outputs = [music_out]
 
 
251
  )
252
+ demo.queue().launch(show_api=False, show_error=True)
 
 
inference/codecmanipulator.py ADDED
@@ -0,0 +1,203 @@
 
 
1
+ import json
2
+ import numpy as np
3
+ import einops
4
+
5
+
6
+ class CodecManipulator(object):
7
+ r"""
8
+ **mm tokenizer v0.1**
9
+ see codeclm/hf/mm_tokenizer_v0.1_hf/id2vocab.json
10
+
11
+ text tokens:
12
+ llama tokenizer 0~31999
13
+
14
+ special tokens: "32000": "<EOD>", "32001": "<SOA>", "32002": "<EOA>", "32003": "<SOI>", "32004": "<EOI>", "32005": "<SOV>", "32006": "<EOV>", "32007": "<s_local>", "32008": "<e_local>", "32009": "<s_global>", "32010": "<e_global>", "32011": "<semantic>", "32012": "<acoustic>", "32013": "<low_level>", "32014": "<dac_16k>", "32015": "<dac_44k>", "32016": "<xcodec>", "32017": "<placeholder>", "32018": "<semantic_mert>", "32019": "<semantic_hubert>", "32020": "<visual>", "32021": "<semanticodec>"
15
+
16
+ mm tokens:
17
+ dac_16k: 4 codebook, 1024 vocab, 32022 - 36117
18
+ dac_44k: 9 codebook, 1024 vocab, 36118 - 45333
19
+ xcodec: 12 codebook, 1024 vocab, 45334 - 57621
20
+ semantic mert: 1024, 57622 - 58645
21
+ semantic hubert: 512, 58646 - 59157
22
+ visual: 64000, not included in v0.1
23
+ semanticodec 100tps 16384: semantic=16384, 59158 - 75541, acoustic=8192, 75542 - 83733
24
+ """
25
+ def __init__(self, codec_type, quantizer_begin=None, n_quantizer=None, teacher_forcing=False, data_feature="codec"):
26
+ self.codec_type = codec_type
27
+ self.mm_v0_2_cfg = {
28
+ "dac16k": {"codebook_size": 1024, "num_codebooks": 4, "global_offset": 32022, "sep": ["<dac_16k>"], "fps": 50},
29
+ "dac44k": {"codebook_size": 1024, "num_codebooks": 9, "global_offset": 36118, "sep": ["<dac_44k>"]},
30
+ "xcodec": {"codebook_size": 1024, "num_codebooks": 12, "global_offset": 45334, "sep": ["<xcodec>"], "fps": 50},
31
+ "mert": {"codebook_size": 1024, "global_offset": 57622, "sep": ["<semantic_mert>"]},
32
+ "hubert": {"codebook_size": 512, "global_offset": 58646, "sep": ["<semantic_hubert>"]},
33
+ "semantic/s": {"codebook_size": 16384, "num_codebooks": 1, "global_offset": 59158, "sep": ["<semanticodec>", "<semantic>"]},
34
+ "semantic/a": {"codebook_size": 8192, "num_codebooks": 1, "global_offset": 75542, "sep": ["<semanticodec>", "<acoustic>"]},
35
+ "semanticodec": {"codebook_size": [16384, 8192], "num_codebooks": 2, "global_offset": 59158, "sep": ["<semanticodec>"], "fps": 50},
36
+ "special_tokens": {
37
+ '<EOD>': 32000, '<SOA>': 32001, '<EOA>': 32002, '<SOI>': 32003, '<EOI>': 32004, '<SOV>': 32005, '<EOV>': 32006, '<s_local>': 32007, '<e_local>': 32008, '<s_global>': 32009, '<e_global>': 32010, '<semantic>': 32011, '<acoustic>': 32012, '<stage_1>': 32013, '<dac_16k>': 32014, '<dac_44k>': 32015, '<xcodec>': 32016, '<stage_2>': 32017, '<semantic_mert>': 32018, '<semantic_hubert>': 32019, '<visual>': 32020, '<semanticodec>': 32021
38
+ },
39
+ "metadata": {
40
+ "len": 83734,
41
+ "text_range": [0, 31999],
42
+ "special_range": [32000, 32021],
43
+ "mm_range": [32022, 83733]
44
+ },
45
+ "codec_range": {
46
+ "dac16k": [32022, 36117],
47
+ "dac44k": [36118, 45333],
48
+ "xcodec": [45334, 57621],
49
+ # "hifi16k": [53526, 57621],
50
+ "mert": [57622, 58645],
51
+ "hubert": [58646, 59157],
52
+ "semantic/s": [59158, 75541],
53
+ "semantic/a": [75542, 83733],
54
+ "semanticodec": [59158, 83733]
55
+ }
56
+ }
57
+ self.sep = self.mm_v0_2_cfg[self.codec_type]["sep"]
58
+ self.sep_ids = [self.mm_v0_2_cfg["special_tokens"][s] for s in self.sep]
59
+ self.codebook_size = self.mm_v0_2_cfg[self.codec_type]["codebook_size"]
60
+ self.num_codebooks = self.mm_v0_2_cfg[self.codec_type]["num_codebooks"]
61
+ self.global_offset = self.mm_v0_2_cfg[self.codec_type]["global_offset"]
62
+ self.fps = self.mm_v0_2_cfg[self.codec_type]["fps"] if "fps" in self.mm_v0_2_cfg[self.codec_type] else None
63
+
64
+ self.quantizer_begin = quantizer_begin if quantizer_begin is not None else 0
65
+ self.n_quantizer = n_quantizer if n_quantizer is not None else self.num_codebooks
66
+ self.teacher_forcing = teacher_forcing
67
+ self.data_feature = data_feature
68
+
69
+
70
+ def offset_tok_ids(self, x, global_offset=0, codebook_size=2048, num_codebooks=4):
71
+ """
72
+ x: (K, T)
73
+ """
74
+ if isinstance(codebook_size, int):
75
+ assert x.max() < codebook_size, f"max(x)={x.max()}, codebook_size={codebook_size}"
76
+ elif isinstance(codebook_size, list):
77
+ for i, cs in enumerate(codebook_size):
78
+ assert x[i].max() < cs, f"max(x)={x[i].max()}, codebook_size={cs}, layer_id={i}"
79
+ else:
80
+ raise ValueError(f"codebook_size={codebook_size}")
81
+ assert x.min() >= 0, f"min(x)={x.min()}"
82
+ assert x.shape[0] == num_codebooks or x.shape[0] == self.n_quantizer, \
83
+ f"x.shape[0]={x.shape[0]}, num_codebooks={num_codebooks}, n_quantizer={self.n_quantizer}"
84
+
85
+ _x = x.copy()
86
+ _x = _x.astype(np.uint32)
87
+ cum_offset = 0
88
+ quantizer_begin = self.quantizer_begin
89
+ quantizer_end = quantizer_begin+self.n_quantizer
90
+ for k in range(self.quantizer_begin, quantizer_end): # k: quantizer_begin to quantizer_end - 1
91
+ if isinstance(codebook_size, int):
92
+ _x[k] += global_offset + k * codebook_size
93
+ elif isinstance(codebook_size, list):
94
+ _x[k] += global_offset + cum_offset
95
+ cum_offset += codebook_size[k]
96
+ else:
97
+ raise ValueError(f"codebook_size={codebook_size}")
98
+ return _x[quantizer_begin:quantizer_end]
99
+
100
+ def unoffset_tok_ids(self, x, global_offset=0, codebook_size=2048, num_codebooks=4):
101
+ """
102
+ x: (K, T)
103
+ """
104
+ if isinstance(codebook_size, int):
105
+ assert x.max() < global_offset + codebook_size * num_codebooks, f"max(x)={x.max()}, codebook_size={codebook_size}"
106
+ elif isinstance(codebook_size, list):
107
+ assert x.max() < global_offset + sum(codebook_size), f"max(x)={x.max()}, codebook_size={codebook_size}"
108
+ assert x.min() >= global_offset, f"min(x)={x.min()}, global_offset={global_offset}"
109
+ assert x.shape[0] == num_codebooks or x.shape[0] == self.n_quantizer, \
110
+ f"x.shape[0]={x.shape[0]}, num_codebooks={num_codebooks}, n_quantizer={self.n_quantizer}"
111
+
112
+ _x = x.copy()
113
+ _x = _x.astype(np.uint32)
114
+ cum_offset = 0
115
+ quantizer_begin = self.quantizer_begin
116
+ quantizer_end = quantizer_begin+self.n_quantizer
117
+ for k in range(quantizer_begin, quantizer_end):
118
+ if isinstance(codebook_size, int):
119
+ _x[k-quantizer_begin] -= global_offset + k * codebook_size
120
+ elif isinstance(codebook_size, list):
121
+ _x[k-quantizer_begin] -= global_offset + cum_offset
122
+ cum_offset += codebook_size[k]
123
+ else:
124
+ raise ValueError(f"codebook_size={codebook_size}")
125
+ return _x
126
+
127
+ def flatten(self, x):
128
+ if len(x.shape) > 2:
129
+ x = x.squeeze()
130
+ assert x.shape[0] == self.num_codebooks or x.shape[0] == self.n_quantizer, \
131
+ f"x.shape[0]={x.shape[0]}, num_codebooks={self.num_codebooks}, n_quantizer={self.n_quantizer}"
132
+ return einops.rearrange(x, 'K T -> (T K)')
133
+
134
+ def unflatten(self, x, n_quantizer=None):
135
+ x = x.squeeze()
136
+ assert len(x.shape) == 1
137
+ assert x.shape[0] % self.num_codebooks == 0 or x.shape[0] % self.n_quantizer == 0, \
138
+ f"x.shape[0]={x.shape[0]}, num_codebooks={self.num_codebooks}, n_quantizer={self.n_quantizer}"
139
+ if n_quantizer!=self.num_codebooks:
140
+ return einops.rearrange(x, '(T K) -> K T', K=n_quantizer)
141
+ return einops.rearrange(x, '(T K) -> K T', K=self.num_codebooks)
142
+
143
+ # def check_codec_type_from_path(self, path):
144
+ # if self.codec_type == "hifi16k":
145
+ # assert "academicodec_hifi_16k_320d_large_uni" in path
146
+
147
+ def get_codec_type_from_range(self, ids):
148
+ ids_range = [ids.min(), ids.max()]
149
+ codec_range = self.mm_v0_2_cfg["codec_range"]
150
+ for codec_type, r in codec_range.items():
151
+ if ids_range[0] >= r[0] and ids_range[1] <= r[1]:
152
+ return codec_type
153
+ raise ValueError(f"ids_range={ids_range}, codec_range={codec_range}")
154
+
155
+ def npy2ids(self, npy):
156
+ if isinstance(npy, str):
157
+ data = np.load(npy)
158
+ elif isinstance(npy, np.ndarray):
159
+ data = npy
160
+ else:
161
+ raise ValueError(f"not supported type: {type(npy)}")
162
+ # data = data.squeeze()
163
+
164
+ assert len(data.shape)==2, f'data shape: {data.shape} is not (n_codebook, seq_len)'
165
+ data = self.offset_tok_ids(
166
+ data,
167
+ global_offset=self.global_offset,
168
+ codebook_size=self.codebook_size,
169
+ num_codebooks=self.num_codebooks,
170
+ )
171
+ data = self.flatten(data)
172
+ codec_range = self.get_codec_type_from_range(data)
173
+ assert codec_range == self.codec_type, f"get_codec_type_from_range(data)={codec_range}, self.codec_type={self.codec_type}"
174
+ data = data.tolist()
175
+ return data
176
+
177
+ def ids2npy(self, token_ids):
178
+ # make sure token_ids starts with codebook 0
179
+ if isinstance(self.codebook_size, int):
180
+ codebook_0_range = (self.global_offset + self.quantizer_begin*self.codebook_size, self.global_offset + (self.quantizer_begin+1)*self.codebook_size)
181
+ elif isinstance(self.codebook_size, list):
182
+ codebook_0_range = (self.global_offset, self.global_offset + self.codebook_size[0])
183
+ assert token_ids[0] >= codebook_0_range[0] \
184
+ and token_ids[0] < codebook_0_range[1], f"token_ids[0]={token_ids[self.quantizer_begin]}, codebook_0_range={codebook_0_range}"
185
+ data = np.array(token_ids)
186
+ data = self.unflatten(data, n_quantizer=self.n_quantizer)
187
+ data = self.unoffset_tok_ids(
188
+ data,
189
+ global_offset=self.global_offset,
190
+ codebook_size=self.codebook_size,
191
+ num_codebooks=self.num_codebooks,
192
+ )
193
+ return data
194
+
195
+ def npy_to_json_str(self, npy_path):
196
+ data = self.npy2ids(npy_path)
197
+ return json.dumps({"text": data, "src": npy_path, "codec": self.codec_type})
198
+
199
+ def sep(self):
200
+ return ''.join(self.sep)
201
+
202
+ def sep_ids(self):
203
+ return self.sep_ids
inference/infer.py ADDED
@@ -0,0 +1,459 @@
 
 
1
+ import os
2
+ import sys
3
+ sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'xcodec_mini_infer'))
4
+ sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'xcodec_mini_infer', 'descriptaudiocodec'))
5
+ import argparse
6
+ import torch
7
+ import numpy as np
8
+ import json
9
+ from omegaconf import OmegaConf
10
+ import torchaudio
11
+ from torchaudio.transforms import Resample
12
+ import soundfile as sf
13
+
14
+ import uuid
15
+ from tqdm import tqdm
16
+ from einops import rearrange
17
+ from codecmanipulator import CodecManipulator
18
+ from mmtokenizer import _MMSentencePieceTokenizer
19
+ from transformers import AutoTokenizer, AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
20
+ import glob
21
+ import time
22
+ import copy
23
+ from collections import Counter
24
+ from models.soundstream_hubert_new import SoundStream
25
+ from vocoder import build_codec_model, process_audio
26
+ from post_process_audio import replace_low_freq_with_energy_matched
27
+ import re
28
+
29
+
30
+ parser = argparse.ArgumentParser()
31
+ # Model Configuration:
32
+ parser.add_argument("--stage1_model", type=str, default="m-a-p/YuE-s1-7B-anneal-en-cot", help="The model checkpoint path or identifier for the Stage 1 model.")
33
+ parser.add_argument("--stage2_model", type=str, default="m-a-p/YuE-s2-1B-general", help="The model checkpoint path or identifier for the Stage 2 model.")
34
+ parser.add_argument("--max_new_tokens", type=int, default=3000, help="The maximum number of new tokens to generate in one pass during text generation.")
35
+ parser.add_argument("--run_n_segments", type=int, default=2, help="The number of segments to process during the generation.")
36
+ parser.add_argument("--stage2_batch_size", type=int, default=4, help="The batch size used in Stage 2 inference.")
37
+ # Prompt
38
+ parser.add_argument("--genre_txt", type=str, required=True, help="The file path to a text file containing genre tags that describe the musical style or characteristics (e.g., instrumental, genre, mood, vocal timbre, vocal gender). This is used as part of the generation prompt.")
39
+ parser.add_argument("--lyrics_txt", type=str, required=True, help="The file path to a text file containing the lyrics for the music generation. These lyrics will be processed and split into structured segments to guide the generation process.")
40
+ parser.add_argument("--use_audio_prompt", action="store_true", help="If set, the model will use an audio file as a prompt during generation. The audio file should be specified using --audio_prompt_path.")
41
+ parser.add_argument("--audio_prompt_path", type=str, default="", help="The file path to an audio file to use as a reference prompt when --use_audio_prompt is enabled.")
42
+ parser.add_argument("--prompt_start_time", type=float, default=0.0, help="The start time in seconds to extract the audio prompt from the given audio file.")
43
+ parser.add_argument("--prompt_end_time", type=float, default=30.0, help="The end time in seconds to extract the audio prompt from the given audio file.")
44
+ # Output
45
+ parser.add_argument("--output_dir", type=str, default="./output", help="The directory where generated outputs will be saved.")
46
+ parser.add_argument("--keep_intermediate", action="store_true", help="If set, intermediate outputs will be saved during processing.")
47
+ parser.add_argument("--disable_offload_model", action="store_true", help="If set, the model will not be offloaded from the GPU to CPU after Stage 1 inference.")
48
+ parser.add_argument("--cuda_idx", type=int, default=0)
49
+ # Config for xcodec and upsampler
50
+ parser.add_argument('--basic_model_config', default='./xcodec_mini_infer/final_ckpt/config.yaml', help='YAML files for xcodec configurations.')
51
+ parser.add_argument('--resume_path', default='./xcodec_mini_infer/final_ckpt/ckpt_00360000.pth', help='Path to the xcodec checkpoint.')
52
+ parser.add_argument('--config_path', type=str, default='./xcodec_mini_infer/decoders/config.yaml', help='Path to Vocos config file.')
53
+ parser.add_argument('--vocal_decoder_path', type=str, default='./xcodec_mini_infer/decoders/decoder_131000.pth', help='Path to Vocos decoder weights.')
54
+ parser.add_argument('--inst_decoder_path', type=str, default='./xcodec_mini_infer/decoders/decoder_151000.pth', help='Path to Vocos decoder weights.')
55
+ parser.add_argument('-r', '--rescale', action='store_true', help='Rescale output to avoid clipping.')
56
+
57
+
58
+ args = parser.parse_args()
59
+ if args.use_audio_prompt and not args.audio_prompt_path:
60
+ raise FileNotFoundError("Please provide an audio prompt file path via '--audio_prompt_path' when 'use_audio_prompt' is enabled!")
61
+ stage1_model = args.stage1_model
62
+ stage2_model = args.stage2_model
63
+ cuda_idx = args.cuda_idx
64
+ max_new_tokens = args.max_new_tokens
65
+ stage1_output_dir = os.path.join(args.output_dir, f"stage1")
66
+ stage2_output_dir = stage1_output_dir.replace('stage1', 'stage2')
67
+ os.makedirs(stage1_output_dir, exist_ok=True)
68
+ os.makedirs(stage2_output_dir, exist_ok=True)
69
+
70
+ # load tokenizer and model
71
+ device = torch.device(f"cuda:{cuda_idx}" if torch.cuda.is_available() else "cpu")
72
+
73
+ # Now you can use `device` to move your tensors or models to the GPU (if available)
74
+ print(f"Using device: {device}")
75
+
76
+ mmtokenizer = _MMSentencePieceTokenizer("./mm_tokenizer_v0.2_hf/tokenizer.model")
77
+ model = AutoModelForCausalLM.from_pretrained(
78
+ stage1_model,
79
+ torch_dtype=torch.bfloat16,
80
+ attn_implementation="flash_attention_2", # To enable flashattn, you have to install flash-attn
81
+ )
82
+ model.to(device)
83
+ model.eval()
84
+
85
+ codectool = CodecManipulator("xcodec", 0, 1)
86
+ codectool_stage2 = CodecManipulator("xcodec", 0, 8)
87
+ model_config = OmegaConf.load(args.basic_model_config)
88
+ codec_model = eval(model_config.generator.name)(**model_config.generator.config).to(device)
89
+ parameter_dict = torch.load(args.resume_path, map_location='cpu')
90
+ codec_model.load_state_dict(parameter_dict['codec_model'])
91
+ codec_model.to(device)
92
+ codec_model.eval()
93
+
94
+ class BlockTokenRangeProcessor(LogitsProcessor):
95
+ def __init__(self, start_id, end_id):
96
+ self.blocked_token_ids = list(range(start_id, end_id))
97
+
98
+ def __call__(self, input_ids, scores):
99
+ scores[:, self.blocked_token_ids] = -float("inf")
100
+ return scores
101
+
102
+ def load_audio_mono(filepath, sampling_rate=16000):
103
+ audio, sr = torchaudio.load(filepath)
104
+ # Convert to mono
105
+ audio = torch.mean(audio, dim=0, keepdim=True)
106
+ # Resample if needed
107
+ if sr != sampling_rate:
108
+ resampler = Resample(orig_freq=sr, new_freq=sampling_rate)
109
+ audio = resampler(audio)
110
+ return audio
111
+
112
+ def split_lyrics(lyrics):
113
+ pattern = r"\[(\w+)\](.*?)\n(?=\[|\Z)"
114
+ segments = re.findall(pattern, lyrics, re.DOTALL)
115
+ structured_lyrics = [f"[{seg[0]}]\n{seg[1].strip()}\n\n" for seg in segments]
116
+ return structured_lyrics
117
+
118
+ # Call the function and print the result
119
+ stage1_output_set = []
120
+ # Tips:
121
+ # genre tags support instrumental, genre, mood, vocal timbre, and vocal gender
122
+ # all kinds of tags are needed
123
+ with open(args.genre_txt) as f:
124
+ genres = f.read().strip()
125
+ with open(args.lyrics_txt) as f:
126
+ lyrics = split_lyrics(f.read())
127
+ # instruction
128
+ full_lyrics = "\n".join(lyrics)
129
+ prompt_texts = [f"Generate music from the given lyrics segment by segment.\n[Genre] {genres}\n{full_lyrics}"]
130
+ prompt_texts += lyrics
131
+
132
+
133
+ random_id = uuid.uuid4()
134
+ output_seq = None
135
+ # Here is suggested decoding config
136
+ top_p = 0.93
137
+ temperature = 1.0
138
+ repetition_penalty = 1.2
139
+ # special tokens
140
+ start_of_segment = mmtokenizer.tokenize('[start_of_segment]')
141
+ end_of_segment = mmtokenizer.tokenize('[end_of_segment]')
142
+ # Format text prompt
143
+ run_n_segments = min(args.run_n_segments+1, len(lyrics))
144
+ for i, p in enumerate(tqdm(prompt_texts[:run_n_segments])):
145
+ section_text = p.replace('[start_of_segment]', '').replace('[end_of_segment]', '')
146
+ guidance_scale = 1.5 if i <=1 else 1.2
147
+ if i==0:
148
+ continue
149
+ if i==1:
150
+ if args.use_audio_prompt:
151
+ audio_prompt = load_audio_mono(args.audio_prompt_path)
152
+ audio_prompt.unsqueeze_(0)
153
+ with torch.no_grad():
154
+ raw_codes = codec_model.encode(audio_prompt.to(device), target_bw=0.5)
155
+ raw_codes = raw_codes.transpose(0, 1)
156
+ raw_codes = raw_codes.cpu().numpy().astype(np.int16)
157
+ # Format audio prompt
158
+ code_ids = codectool.npy2ids(raw_codes[0])
159
+ audio_prompt_codec = code_ids[int(args.prompt_start_time *50): int(args.prompt_end_time *50)] # 50 is tps of xcodec
160
+ audio_prompt_codec_ids = [mmtokenizer.soa] + codectool.sep_ids + audio_prompt_codec + [mmtokenizer.eoa]
161
+ sentence_ids = mmtokenizer.tokenize("[start_of_reference]") + audio_prompt_codec_ids + mmtokenizer.tokenize("[end_of_reference]")
162
+ head_id = mmtokenizer.tokenize(prompt_texts[0]) + sentence_ids
163
+ else:
164
+ head_id = mmtokenizer.tokenize(prompt_texts[0])
165
+ prompt_ids = head_id + start_of_segment + mmtokenizer.tokenize(section_text) + [mmtokenizer.soa] + codectool.sep_ids
166
+ else:
167
+ prompt_ids = end_of_segment + start_of_segment + mmtokenizer.tokenize(section_text) + [mmtokenizer.soa] + codectool.sep_ids
168
+
169
+ prompt_ids = torch.as_tensor(prompt_ids).unsqueeze(0).to(device)
170
+ input_ids = torch.cat([raw_output, prompt_ids], dim=1) if i > 1 else prompt_ids
171
+ # Use window slicing in case output sequence exceeds the context of model
172
+ max_context = 16384-max_new_tokens-1
173
+ if input_ids.shape[-1] > max_context:
174
+ print(f'Section {i}: output length {input_ids.shape[-1]} exceeding context length {max_context}, now using the last {max_context} tokens.')
175
+ input_ids = input_ids[:, -(max_context):]
176
+ with torch.no_grad():
177
+ output_seq = model.generate(
178
+ input_ids=input_ids,
179
+ max_new_tokens=max_new_tokens,
180
+ min_new_tokens=100,
181
+ do_sample=True,
182
+ top_p=top_p,
183
+ temperature=temperature,
184
+ repetition_penalty=repetition_penalty,
185
+ eos_token_id=mmtokenizer.eoa,
186
+ pad_token_id=mmtokenizer.eoa,
187
+ logits_processor=LogitsProcessorList([BlockTokenRangeProcessor(0, 32002), BlockTokenRangeProcessor(32016, 32016)]),
188
+ guidance_scale=guidance_scale,
189
+ )
190
+ if output_seq[0][-1].item() != mmtokenizer.eoa:
191
+ tensor_eoa = torch.as_tensor([[mmtokenizer.eoa]]).to(model.device)
192
+ output_seq = torch.cat((output_seq, tensor_eoa), dim=1)
193
+ if i > 1:
194
+ raw_output = torch.cat([raw_output, prompt_ids, output_seq[:, input_ids.shape[-1]:]], dim=1)
195
+ else:
196
+ raw_output = output_seq
197
+
198
+ # save raw output and check sanity
199
+ ids = raw_output[0].cpu().numpy()
200
+ soa_idx = np.where(ids == mmtokenizer.soa)[0].tolist()
201
+ eoa_idx = np.where(ids == mmtokenizer.eoa)[0].tolist()
202
+ if len(soa_idx)!=len(eoa_idx):
203
+ raise ValueError(f'invalid pairs of soa and eoa, Num of soa: {len(soa_idx)}, Num of eoa: {len(eoa_idx)}')
204
+
205
+ vocals = []
206
+ instrumentals = []
207
+ range_begin = 1 if args.use_audio_prompt else 0
208
+ for i in range(range_begin, len(soa_idx)):
209
+ codec_ids = ids[soa_idx[i]+1:eoa_idx[i]]
210
+ if codec_ids[0] == 32016:
211
+ codec_ids = codec_ids[1:]
212
+ codec_ids = codec_ids[:2 * (codec_ids.shape[0] // 2)]
213
+ vocals_ids = codectool.ids2npy(rearrange(codec_ids,"(n b) -> b n", b=2)[0])
214
+ vocals.append(vocals_ids)
215
+ instrumentals_ids = codectool.ids2npy(rearrange(codec_ids,"(n b) -> b n", b=2)[1])
216
+ instrumentals.append(instrumentals_ids)
217
+ vocals = np.concatenate(vocals, axis=1)
218
+ instrumentals = np.concatenate(instrumentals, axis=1)
219
+ vocal_save_path = os.path.join(stage1_output_dir, f"cot_{genres.replace(' ', '-')}_tp{top_p}_T{temperature}_rp{repetition_penalty}_maxtk{max_new_tokens}_vocal_{random_id}".replace('.', '@')+'.npy')
220
+ inst_save_path = os.path.join(stage1_output_dir, f"cot_{genres.replace(' ', '-')}_tp{top_p}_T{temperature}_rp{repetition_penalty}_maxtk{max_new_tokens}_instrumental_{random_id}".replace('.', '@')+'.npy')
221
+ np.save(vocal_save_path, vocals)
222
+ np.save(inst_save_path, instrumentals)
223
+ stage1_output_set.append(vocal_save_path)
224
+ stage1_output_set.append(inst_save_path)
225
+
226
+
227
+ # offload the stage-1 model to free GPU memory before loading stage 2
228
+ if not args.disable_offload_model:
229
+ model.cpu()
230
+ del model
231
+ torch.cuda.empty_cache()
232
+
233
+ print("Stage 2 inference...")
234
+ model_stage2 = AutoModelForCausalLM.from_pretrained(
235
+ stage2_model,
236
+ torch_dtype=torch.float16,
237
+ attn_implementation="flash_attention_2"
238
+ )
239
+ model_stage2.to(device)
240
+ model_stage2.eval()
241
+
242
+ def stage2_generate(model, prompt, batch_size=16):
243
+ codec_ids = codectool.unflatten(prompt, n_quantizer=1)
244
+ codec_ids = codectool.offset_tok_ids(
245
+ codec_ids,
246
+ global_offset=codectool.global_offset,
247
+ codebook_size=codectool.codebook_size,
248
+ num_codebooks=codectool.num_codebooks,
249
+ ).astype(np.int32)
250
+
251
+ # Prepare prompt_ids based on batch size or single input
252
+ if batch_size > 1:
253
+ codec_list = []
254
+ for i in range(batch_size):
255
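+ # each batch element covers 300 codec frames, i.e. 6 seconds of audio at the 50 Hz xcodec frame rate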
+ idx_begin = i * 300
256
+ idx_end = (i + 1) * 300
257
+ codec_list.append(codec_ids[:, idx_begin:idx_end])
258
+
259
+ codec_ids = np.concatenate(codec_list, axis=0)
260
+ prompt_ids = np.concatenate(
261
+ [
262
+ np.tile([mmtokenizer.soa, mmtokenizer.stage_1], (batch_size, 1)),
263
+ codec_ids,
264
+ np.tile([mmtokenizer.stage_2], (batch_size, 1)),
265
+ ],
266
+ axis=1
267
+ )
268
+ else:
269
+ prompt_ids = np.concatenate([
270
+ np.array([mmtokenizer.soa, mmtokenizer.stage_1]),
271
+ codec_ids.flatten(), # Flatten the 2D array to 1D
272
+ np.array([mmtokenizer.stage_2])
273
+ ]).astype(np.int32)
274
+ prompt_ids = prompt_ids[np.newaxis, ...]
275
+
276
+ codec_ids = torch.as_tensor(codec_ids).to(device)
277
+ prompt_ids = torch.as_tensor(prompt_ids).to(device)
278
+ len_prompt = prompt_ids.shape[-1]
279
+
280
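+ # block ids outside the stage-2 codec token range so only residual-codebook tokens can be sampled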
+ block_list = LogitsProcessorList([BlockTokenRangeProcessor(0, 46358), BlockTokenRangeProcessor(53526, mmtokenizer.vocab_size)])
281
+
282
+ # Teacher forcing generate loop
283
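+ # feed stage-1 frames one at a time; the model must emit exactly 7 new tokens per frame (enforced by the assert below)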
+ for frames_idx in range(codec_ids.shape[1]):
284
+ cb0 = codec_ids[:, frames_idx:frames_idx+1]
285
+ prompt_ids = torch.cat([prompt_ids, cb0], dim=1)
286
+ input_ids = prompt_ids
287
+
288
+ with torch.no_grad():
289
+ stage2_output = model.generate(input_ids=input_ids,
290
+ min_new_tokens=7,
291
+ max_new_tokens=7,
292
+ eos_token_id=mmtokenizer.eoa,
293
+ pad_token_id=mmtokenizer.eoa,
294
+ logits_processor=block_list,
295
+ )
296
+
297
+ assert stage2_output.shape[1] - prompt_ids.shape[1] == 7, f"output new tokens={stage2_output.shape[1]-prompt_ids.shape[1]}"
298
+ prompt_ids = stage2_output
299
+
300
+ # Return output based on batch size
301
+ if batch_size > 1:
302
+ output = prompt_ids.cpu().numpy()[:, len_prompt:]
303
+ output_list = [output[i] for i in range(batch_size)]
304
+ output = np.concatenate(output_list, axis=0)
305
+ else:
306
+ output = prompt_ids[0].cpu().numpy()[len_prompt:]
307
+
308
+ return output
309
+
310
+ def stage2_inference(model, stage1_output_set, stage2_output_dir, batch_size=4):
311
+ stage2_result = []
312
+ for i in tqdm(range(len(stage1_output_set))):
313
+ output_filename = os.path.join(stage2_output_dir, os.path.basename(stage1_output_set[i]))
314
+
315
+ if os.path.exists(output_filename):
316
+ print(f'{output_filename} already processed by stage 2, skipping.')
317
+ continue
318
+
319
+ # Load the prompt
320
+ prompt = np.load(stage1_output_set[i]).astype(np.int32)
321
+
322
+ # Trim to a whole number of 6-second segments (50 codec frames per second)
323
+ output_duration = prompt.shape[-1] // 50 // 6 * 6
324
+ num_batch = output_duration // 6
325
+
326
+ if num_batch <= batch_size:
327
+ # If num_batch is less than or equal to batch_size, we can infer the entire prompt at once
328
+ output = stage2_generate(model, prompt[:, :output_duration*50], batch_size=num_batch)
329
+ else:
330
+ # If num_batch is greater than batch_size, process in chunks of batch_size
331
+ segments = []
332
+ num_segments = (num_batch // batch_size) + (1 if num_batch % batch_size != 0 else 0)
333
+
334
+ for seg in range(num_segments):
335
+ start_idx = seg * batch_size * 300
336
+ # Ensure the end_idx does not exceed the available length
337
+ end_idx = min((seg + 1) * batch_size * 300, output_duration*50) # Adjust the last segment
338
+ current_batch_size = batch_size if seg != num_segments-1 or num_batch % batch_size == 0 else num_batch % batch_size
339
+ segment = stage2_generate(
340
+ model,
341
+ prompt[:, start_idx:end_idx],
342
+ batch_size=current_batch_size
343
+ )
344
+ segments.append(segment)
345
+
346
+ # Concatenate all the segments
347
+ output = np.concatenate(segments, axis=0)
348
+
349
+ # Process the ending part of the prompt
350
+ if output_duration*50 != prompt.shape[-1]:
351
+ ending = stage2_generate(model, prompt[:, output_duration*50:], batch_size=1)
352
+ output = np.concatenate([output, ending], axis=0)
353
+ output = codectool_stage2.ids2npy(output)
354
+
355
+ # Fix invalid codes (a dirty workaround that may slightly harm audio quality)
356
+ # We are looking for a better solution
357
+ fixed_output = copy.deepcopy(output)
358
+ for i, line in enumerate(output):
359
+ for j, element in enumerate(line):
360
+ if element < 0 or element > 1023:
361
+ counter = Counter(line)
362
+ most_frequent = sorted(counter.items(), key=lambda x: x[1], reverse=True)[0][0]
363
+ fixed_output[i, j] = most_frequent
364
+ # save output
365
+ np.save(output_filename, fixed_output)
366
+ stage2_result.append(output_filename)
367
+ return stage2_result
368
+
369
+ stage2_result = stage2_inference(model_stage2, stage1_output_set, stage2_output_dir, batch_size=args.stage2_batch_size)
370
+ print(stage2_result)
371
+ print('Stage 2 DONE.\n')
372
+ # convert audio tokens to audio
373
+ def save_audio(wav: torch.Tensor, path, sample_rate: int, rescale: bool = False):
374
+ folder_path = os.path.dirname(path)
375
+ if not os.path.exists(folder_path):
376
+ os.makedirs(folder_path)
377
+ limit = 0.99
378
+ max_val = wav.abs().max()
379
+ wav = wav * min(limit / max_val, 1) if rescale else wav.clamp(-limit, limit)
380
+ torchaudio.save(str(path), wav, sample_rate=sample_rate, encoding='PCM_S', bits_per_sample=16)
381
+ # reconstruct tracks
382
+ recons_output_dir = os.path.join(args.output_dir, "recons")
383
+ recons_mix_dir = os.path.join(recons_output_dir, 'mix')
384
+ os.makedirs(recons_mix_dir, exist_ok=True)
385
+ tracks = []
386
+ for npy in stage2_result:
387
+ codec_result = np.load(npy)
388
+ decodec_rlt=[]
389
+ with torch.no_grad():
390
+ decoded_waveform = codec_model.decode(torch.as_tensor(codec_result.astype(np.int16), dtype=torch.long).unsqueeze(0).permute(1, 0, 2).to(device))
391
+ decoded_waveform = decoded_waveform.cpu().squeeze(0)
392
+ decodec_rlt.append(torch.as_tensor(decoded_waveform))
393
+ decodec_rlt = torch.cat(decodec_rlt, dim=-1)
394
+ save_path = os.path.join(recons_output_dir, os.path.splitext(os.path.basename(npy))[0] + ".mp3")
395
+ tracks.append(save_path)
396
+ save_audio(decodec_rlt, save_path, 16000)
397
+ # mix tracks
398
+ for inst_path in tracks:
399
+ try:
400
+ if (inst_path.endswith('.wav') or inst_path.endswith('.mp3')) \
401
+ and 'instrumental' in inst_path:
402
+ # find pair
403
+ vocal_path = inst_path.replace('instrumental', 'vocal')
404
+ if not os.path.exists(vocal_path):
405
+ continue
406
+ # mix
407
+ recons_mix = os.path.join(recons_mix_dir, os.path.basename(inst_path).replace('instrumental', 'mixed'))
408
+ instrumental_stem, sr = sf.read(inst_path)
409
+ vocal_stem, _ = sf.read(vocal_path)
410
+ mix_stem = vocal_stem + instrumental_stem
411
+ sf.write(recons_mix, mix_stem, sr)
412
+ except Exception as e:
413
+ print(e)
414
+
415
+ # vocoder to upsample audios
416
+ vocal_decoder, inst_decoder = build_codec_model(args.config_path, args.vocal_decoder_path, args.inst_decoder_path)
417
+ vocoder_output_dir = os.path.join(args.output_dir, 'vocoder')
418
+ vocoder_stems_dir = os.path.join(vocoder_output_dir, 'stems')
419
+ vocoder_mix_dir = os.path.join(vocoder_output_dir, 'mix')
420
+ os.makedirs(vocoder_mix_dir, exist_ok=True)
421
+ os.makedirs(vocoder_stems_dir, exist_ok=True)
422
+ for npy in stage2_result:
423
+ if 'instrumental' in npy:
424
+ # Process instrumental
425
+ instrumental_output = process_audio(
426
+ npy,
427
+ os.path.join(vocoder_stems_dir, 'instrumental.mp3'),
428
+ args.rescale,
429
+ args,
430
+ inst_decoder,
431
+ codec_model
432
+ )
433
+ else:
434
+ # Process vocal
435
+ vocal_output = process_audio(
436
+ npy,
437
+ os.path.join(vocoder_stems_dir, 'vocal.mp3'),
438
+ args.rescale,
439
+ args,
440
+ vocal_decoder,
441
+ codec_model
442
+ )
443
+ # mix tracks
444
+ try:
445
+ mix_output = instrumental_output + vocal_output
446
+ vocoder_mix = os.path.join(vocoder_mix_dir, os.path.basename(recons_mix))
447
+ save_audio(mix_output, vocoder_mix, 44100, args.rescale)
448
+ print(f"Created mix: {vocoder_mix}")
449
+ except RuntimeError as e:
450
+ print(e)
451
+ print(f"mix {vocoder_mix} failed! inst: {instrumental_output.shape}, vocal: {vocal_output.shape}")
452
+
453
+ # Post process
454
+ replace_low_freq_with_energy_matched(
455
+ a_file=recons_mix, # 16kHz
456
+ b_file=vocoder_mix, # 48kHz
457
+ c_file=os.path.join(args.output_dir, os.path.basename(recons_mix)),
458
+ cutoff_freq=5500.0
459
+ )
inference/mm_tokenizer_v0.2_hf/tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee5c7cbf32da93989f14d9ba635e3e1d1ab2cc88a92908a5ed0f149375f6ee49
3
+ size 1761962
inference/mmtokenizer.py ADDED
@@ -0,0 +1,367 @@
1
+ from abc import ABC
2
+ from abc import abstractmethod
3
+
4
+
5
+ class AbstractTokenizer(ABC):
6
+ """Abstract class for tokenizer."""
7
+
8
+ def __init__(self, name):
9
+ self.name = name
10
+ super().__init__()
11
+
12
+ @property
13
+ @abstractmethod
14
+ def vocab_size(self):
15
+ pass
16
+
17
+ @property
18
+ @abstractmethod
19
+ def vocab(self):
20
+ """Dictionary from vocab text token to id token."""
21
+ pass
22
+
23
+ @property
24
+ @abstractmethod
25
+ def inv_vocab(self):
26
+ """Dictionary from vocab id token to text token."""
27
+ pass
28
+
29
+ @abstractmethod
30
+ def tokenize(self, text):
31
+ pass
32
+
33
+ def detokenize(self, token_ids):
34
+ raise NotImplementedError('detokenizer is not implemented for {} '
35
+ 'tokenizer'.format(self.name))
36
+
37
+ @property
38
+ def cls(self):
39
+ raise NotImplementedError('CLS is not provided for {} '
40
+ 'tokenizer'.format(self.name))
41
+
42
+ @property
43
+ def sep(self):
44
+ raise NotImplementedError('SEP is not provided for {} '
45
+ 'tokenizer'.format(self.name))
46
+
47
+ @property
48
+ def pad(self):
49
+ raise NotImplementedError('PAD is not provided for {} '
50
+ 'tokenizer'.format(self.name))
51
+
52
+ @property
53
+ def eod(self):
54
+ raise NotImplementedError('EOD is not provided for {} '
55
+ 'tokenizer'.format(self.name))
56
+
57
+ @property
58
+ def mask(self):
59
+ raise NotImplementedError('MASK is not provided for {} '
60
+ 'tokenizer'.format(self.name))
61
+
62
+
63
+ class _SentencePieceTokenizer(AbstractTokenizer):
64
+ """SentencePieceTokenizer-Megatron wrapper"""
65
+
66
+ def __init__(self, model_file, vocab_extra_ids=0):
67
+ name = 'SentencePieceTokenizer'
68
+ super().__init__(name)
69
+
70
+ import sentencepiece
71
+ self.tokenizer = sentencepiece.SentencePieceProcessor(model_file=model_file)
72
+ self._initialize(vocab_extra_ids)
73
+
74
+ def _populate_vocab(self):
75
+ self._vocab = {}
76
+ self._inv_vocab = {}
77
+
78
+ for i in range(len(self.tokenizer)):
79
+ t = self.tokenizer.id_to_piece(i)
80
+ self._inv_vocab[i] = t
81
+ self._vocab[t] = i
82
+
83
+ def _initialize(self, vocab_extra_ids):
84
+ self._populate_vocab()
85
+ self._special_tokens = {}
86
+ self._inv_special_tokens = {}
87
+
88
+ self._t5_tokens = []
89
+
90
+ def _add_special_token(t):
91
+ if t not in self._vocab:
92
+ next_id = len(self._vocab)
93
+ self._vocab[t] = next_id
94
+ self._inv_vocab[next_id] = t
95
+ self._special_tokens[t] = self._vocab[t]
96
+ self._inv_special_tokens[self._vocab[t]] = t
97
+
98
+ _add_special_token('<CLS>')
99
+ self._cls_id = self._vocab['<CLS>']
100
+ _add_special_token('<SEP>')
101
+ self._sep_id = self._vocab['<SEP>']
102
+ _add_special_token('<EOD>')
103
+ self._eod_id = self._vocab['<EOD>']
104
+ _add_special_token('<MASK>')
105
+ self._mask_id = self._vocab['<MASK>']
106
+
107
+ pad_id = self.tokenizer.pad_id()
108
+ try:
109
+ pad_token = self.tokenizer.id_to_piece(pad_id)
110
+ except IndexError:
111
+ pad_token = '<PAD>'
112
+ _add_special_token(pad_token)
113
+ self._pad_id = self._vocab[pad_token]
114
+
115
+ bos_id = self.tokenizer.bos_id()
116
+ try:
117
+ bos_token = self.tokenizer.id_to_piece(bos_id)
118
+ except IndexError:
119
+ bos_token = '<BOS>'
120
+ _add_special_token(bos_token)
121
+ self._bos_id = self._vocab[bos_token]
122
+
123
+ eos_id = self.tokenizer.eos_id()
124
+ try:
125
+ eos_token = self.tokenizer.id_to_piece(eos_id)
126
+ except IndexError:
127
+ eos_token = '<EOS>'
128
+ _add_special_token(eos_token)
129
+ self._eos_id = self._vocab[eos_token]
130
+
131
+ for i in range(vocab_extra_ids):
132
+ t = "<extra_id_{}>".format(i)
133
+ _add_special_token(t)
134
+ self._t5_tokens += [t]
135
+
136
+ @property
137
+ def vocab_size(self):
138
+ return len(self._vocab)
139
+
140
+ @property
141
+ def vocab(self):
142
+ return self._vocab
143
+
144
+ @property
145
+ def inv_vocab(self):
146
+ return self._inv_vocab
147
+
148
+ @property
149
+ def decoder(self):
150
+ return self._inv_vocab
151
+
152
+ @property
153
+ def encoder(self):
154
+ return self._vocab
155
+
156
+ # From:
157
+ # https://github.com/NVIDIA/NeMo/blob/c8fa217e811d60d11d014827c7f3845ff6c99ae7/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py#L89
158
+ def tokenize(self, text):
159
+ ids = []
160
+ idx = 0
161
+
162
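+ # repeatedly find the earliest special token in the remaining text, encode the plain text before it with SentencePiece, then append the special token id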
+ while 1:
163
+ indices = {}
164
+ for token in self._special_tokens:
165
+ try:
166
+ indices[token] = text[idx:].index(token)
167
+ except ValueError:
168
+ continue
169
+ if len(indices) == 0:
170
+ break
171
+
172
+ next_token = min(indices, key=indices.get)
173
+ next_idx = idx + indices[next_token]
174
+
175
+ ids.extend(self.tokenizer.encode_as_ids(text[idx:next_idx]))
176
+ ids.append(self._special_tokens[next_token])
177
+ idx = next_idx + len(next_token)
178
+
179
+ ids.extend(self.tokenizer.encode_as_ids(text[idx:]))
180
+ return ids
181
+
182
+ # From:
183
+ # https://github.com/NVIDIA/NeMo/blob/c8fa217e811d60d11d014827c7f3845ff6c99ae7/nemo/collections/common/tokenizers/sentencepiece_tokenizer.py#L125
184
+ def detokenize(self, ids):
185
+ text = ""
186
+ last_i = 0
187
+
188
+ for i, id in enumerate(ids):
189
+ if id in self._inv_special_tokens:
190
+ text += self.tokenizer.decode_ids(ids[last_i:i]) + " "
191
+ text += self._inv_special_tokens[id] + " "
192
+ last_i = i + 1
193
+
194
+ text += self.tokenizer.decode_ids(ids[last_i:])
195
+ return text
196
+
197
+ @property
198
+ def cls(self):
199
+ return self._cls_id
200
+
201
+ @property
202
+ def sep(self):
203
+ return self._sep_id
204
+
205
+ @property
206
+ def pad(self):
207
+ return self._pad_id
208
+
209
+ @property
210
+ def bos_token_id(self):
211
+ return self._bos_id
212
+
213
+ @property
214
+ def bos(self):
215
+ return self._bos_id
216
+
217
+ @property
218
+ def eod(self):
219
+ return self._eod_id
220
+
221
+ @property
222
+ def eos_token_id(self):
223
+ return self._eos_id
224
+
225
+ @property
226
+ def eos(self):
227
+ return self._eos_id
228
+
229
+ @property
230
+ def mask(self):
231
+ return self._mask_id
232
+
233
+ @property
234
+ def additional_special_tokens_ids(self):
235
+ return [self.vocab[k] for k in self._t5_tokens]
236
+
237
+ class _MMSentencePieceTokenizer(_SentencePieceTokenizer):
238
+ """SentencePieceTokenizer-Megatron wrapper"""
239
+
240
+ def __init__(self, model_file, vocab_extra_ids=0):
241
+ super().__init__(model_file, vocab_extra_ids)
242
+
243
+
244
+ def _initialize(self, vocab_extra_ids):
245
+ self._populate_vocab()
246
+ self._special_tokens = {}
247
+ self._inv_special_tokens = {}
248
+
249
+ self._t5_tokens = []
250
+
251
+ def _add_special_token(t):
252
+ if t not in self._vocab:
253
+ next_id = len(self._vocab)
254
+ self._vocab[t] = next_id
255
+ self._inv_vocab[next_id] = t
256
+ self._special_tokens[t] = self._vocab[t]
257
+ self._inv_special_tokens[self._vocab[t]] = t
258
+
259
+ _add_special_token('<CLS>')
260
+ self._cls_id = self._vocab['<CLS>']
261
+ _add_special_token('<SEP>')
262
+ self._sep_id = self._vocab['<SEP>']
263
+ _add_special_token('<EOD>')
264
+ self._eod_id = self._vocab['<EOD>']
265
+ _add_special_token('<MASK>')
266
+ self._mask_id = self._vocab['<MASK>']
267
+
268
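+ # extra special tokens for the multimodal setup: audio boundaries (<SOA>/<EOA>), track/segment markers, and the two generation stages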
+ _add_special_token('<SOA>')
269
+ self._soa_id = self._vocab['<SOA>']
270
+ _add_special_token('<EOA>')
271
+ self._eoa_id = self._vocab['<EOA>']
272
+ _add_special_token('<SOV>')
273
+ self._sov_id = self._vocab['<SOV>']
274
+ _add_special_token('<EOV>')
275
+ self._eov_id = self._vocab['<EOV>']
276
+ _add_special_token('<SOI>')
277
+ self._soi_id = self._vocab['<SOI>']
278
+ _add_special_token('<EOI>')
279
+ self._eoi_id = self._vocab['<EOI>']
280
+ _add_special_token('<s_local>')
281
+ self._s_local_id = self._vocab['<s_local>']
282
+ _add_special_token('<e_local>')
283
+ self._e_local_id = self._vocab['<e_local>']
284
+ _add_special_token('<s_global>')
285
+ self._s_global_id = self._vocab['<s_global>']
286
+ _add_special_token('<e_global>')
287
+ self._e_global_id = self._vocab['<e_global>']
288
+ _add_special_token('<stage_1>')
289
+ self._stage_1_id = self._vocab['<stage_1>']
290
+ _add_special_token('<stage_2>')
291
+ self._stage_2_id = self._vocab['<stage_2>']
292
+ pad_id = self.tokenizer.pad_id()
293
+ try:
294
+ pad_token = self.tokenizer.id_to_piece(pad_id)
295
+ except IndexError:
296
+ pad_token = '<PAD>'
297
+ _add_special_token(pad_token)
298
+ self._pad_id = self._vocab[pad_token]
299
+
300
+ bos_id = self.tokenizer.bos_id()
301
+ try:
302
+ bos_token = self.tokenizer.id_to_piece(bos_id)
303
+ except IndexError:
304
+ bos_token = '<BOS>'
305
+ _add_special_token(bos_token)
306
+ self._bos_id = self._vocab[bos_token]
307
+
308
+ eos_id = self.tokenizer.eos_id()
309
+ try:
310
+ eos_token = self.tokenizer.id_to_piece(eos_id)
311
+ except IndexError:
312
+ eos_token = '<EOS>'
313
+ _add_special_token(eos_token)
314
+ self._eos_id = self._vocab[eos_token]
315
+
316
+ for i in range(vocab_extra_ids):
317
+ t = "<extra_id_{}>".format(i)
318
+ _add_special_token(t)
319
+ self._t5_tokens += [t]
320
+
321
+ @property
322
+ def soa(self):
323
+ return self._soa_id
324
+
325
+ @property
326
+ def eoa(self):
327
+ return self._eoa_id
328
+
329
+ @property
330
+ def sov(self):
331
+ return self._sov_id
332
+
333
+ @property
334
+ def eov(self):
335
+ return self._eov_id
336
+
337
+ @property
338
+ def soi(self):
339
+ return self._soi_id
340
+
341
+ @property
342
+ def eoi(self):
343
+ return self._eoi_id
344
+
345
+ @property
346
+ def s_local(self):
347
+ return self._s_local_id
348
+
349
+ @property
350
+ def e_local(self):
351
+ return self._e_local_id
352
+
353
+ @property
354
+ def s_global(self):
355
+ return self._s_global_id
356
+
357
+ @property
358
+ def e_global(self):
359
+ return self._e_global_id
360
+
361
+ @property
362
+ def stage_1(self):
363
+ return self._stage_1_id
364
+
365
+ @property
366
+ def stage_2(self):
367
+ return self._stage_2_id
inference/prompt_examples/genre.txt ADDED
@@ -0,0 +1 @@
1
+ inspiring female uplifting pop airy vocal electronic bright vocal vocal
inference/prompt_examples/lyrics.txt ADDED
@@ -0,0 +1,39 @@
1
+ [verse]
2
+ Staring at the sunset, colors paint the sky
3
+ Thoughts of you keep swirling, can't deny
4
+ I know I let you down, I made mistakes
5
+ But I'm here to mend the heart I didn't break
6
+
7
+ [chorus]
8
+ Every road you take, I'll be one step behind
9
+ Every dream you chase, I'm reaching for the light
10
+ You can't fight this feeling now
11
+ I won't back down
12
+ You know you can't deny it now
13
+ I won't back down
14
+
15
+ [verse]
16
+ They might say I'm foolish, chasing after you
17
+ But they don't feel this love the way we do
18
+ My heart beats only for you, can't you see?
19
+ I won't let you slip away from me
20
+
21
+ [chorus]
22
+ Every road you take, I'll be one step behind
23
+ Every dream you chase, I'm reaching for the light
24
+ You can't fight this feeling now
25
+ I won't back down
26
+ You know you can't deny it now
27
+ I won't back down
28
+
29
+ [bridge]
30
+ No, I won't back down, won't turn around
31
+ Until you're back where you belong
32
+ I'll cross the oceans wide, stand by your side
33
+ Together we are strong
34
+
35
+ [outro]
36
+ Every road you take, I'll be one step behind
37
+ Every dream you chase, love's the tie that binds
38
+ You can't fight this feeling now
39
+ I won't back down
requirements.txt CHANGED
@@ -1,8 +1,8 @@
1
- torch
 
2
  omegaconf
3
- torchaudio
4
  einops
5
- numpy
6
  transformers
7
  sentencepiece
8
  tqdm
@@ -10,3 +10,6 @@ tensorboard
10
  descript-audiotools>=0.7.2
11
  descript-audio-codec
12
  scipy==1.10.1
 
 
 
 
1
+ torch==2.2.0
2
+ torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118
3
  omegaconf
 
4
  einops
5
+ numpy<2
6
  transformers
7
  sentencepiece
8
  tqdm
 
10
  descript-audiotools>=0.7.2
11
  descript-audio-codec
12
  scipy==1.10.1
13
+ huggingface-hub==0.25.2
14
+ wheel
15
+ #https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu11torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl