okezieowen committed on
Commit 8062b24 · verified · 1 Parent(s): 3ad86e1

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ assets/Emmanuel_English.wav filter=lfs diff=lfs merge=lfs -text
38
+ assets/Emmanuel_Hausa.wav filter=lfs diff=lfs merge=lfs -text
39
+ assets/Eniola_English.wav filter=lfs diff=lfs merge=lfs -text
40
+ assets/Eniola_Hausa.wav filter=lfs diff=lfs merge=lfs -text
41
+ assets/Eniola_Igbo.wav filter=lfs diff=lfs merge=lfs -text
42
+ assets/Eniola_Yoruba.wav filter=lfs diff=lfs merge=lfs -text
43
+ assets/Lovelyn_English.wav filter=lfs diff=lfs merge=lfs -text
44
+ assets/Lovelyn_Igbo.wav filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,654 @@
1
+ ---
2
+ library_name: transformers
3
+ tags:
4
+ - voice-cloning
5
+ - audio
6
+ - unsloth
7
+ - orpheus
8
+ - snac
9
+ - hypaai
10
+ - lora
11
+ - merged
12
+ license: apache-2.0
13
+ datasets:
14
+ - hypaai/Hypa_Fleurs
15
+ - MrDragonFox/Elise
16
+ - canopylabs/zac-sample-dataset
17
+ - google/fleurs
18
+ language:
19
+ - ig
20
+ - yo
21
+ - ha
22
+ - en
23
+ base_model:
24
+ - canopylabs/orpheus-3b-0.1-ft
25
+ pipeline_tag: text-to-speech
26
+ ---
27
+
28
+ # Hypa_Orpheus-3b-0.1-ft (merged 16-bit)
29
+
30
+ A merged, 16-bit, memory-efficient fine-tuned version of [`canopylabs/orpheus-3b-0.1-ft`](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft), optimized with Unsloth and LoRA for expressive multilingual TTS, especially in low-resource African languages. This model provides the following capabilities:
31
+
32
+ * Text-to-Speech generation
33
+ * Speech synthesis for under-represented accents
34
+ * Voice cloning & emotion synthesis
35
+ * Research on multilingual low-resource voice AI
36
+
37
+ Check out the model at [`HypaChat`](https://hypachat.com/)
38
+
39
+ ---
40
+
41
+ ## Model Details
42
+
43
+ ### Model Summary
44
+
45
+ This model was trained on a parallel text-speech dataset totaling over 300 hours (75k samples) of Nigerian-accented and low-resource language audio (Igbo, Yoruba, Hausa). A key part of the dataset comes from AfroVoices' transcription of real-world YouTube data (denoted as Random speaker, ~100+ hrs).
46
+ To preserve and enhance multilingual capabilities while avoiding catastrophic forgetting, we included synthetic speech-text data sampled from the original 8 Orpheus voices using the default emotional prompts.
47
+ The final training set also included new speakers like:
48
+ * Eniola (40 hrs) – Female, bold, clear
49
+ * Moyo (40 hrs) – Female, professional, articulate
50
+ * Lovelyn (35 hrs) – Female, warm, shy
51
+ * Precious (30 hrs) – Female, friendly, gentle
52
+
53
+ This model achieves state-of-the-art performance on low-resource multilingual TTS tasks across African languages (see training stats below).
54
+
55
+ ### Base Model Details
56
+
57
+ The default Orpheus-TTS model released by [`Canopy Labs`](https://canopylabs.ai/releases/orpheus_can_speak_any_language) supports the following voices and emotions:
58
+
59
+ Voices: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, and `zoe`.
60
+
61
+ Emotions: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, and `<gasp>`.
62
+
63
+ Through synthetic data generation and inclusion, our fine-tuned model also retains these voices and emotions. Please visit the base model's card for more info on voices and emotions.
64
+
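+ For reference, a prompt pairs one of these voices with the text to be spoken, and emotion tags go inline in the text itself. A minimal sketch (the voice, text, and tag placement are illustrative, not prescriptive):
+
+ ```python
+ # Hypothetical example: "voice: text" prompt format with an inline emotion tag
+ voice = "tara"  # or any other supported voice, e.g. "Eniola", "Lovelyn"
+ text = "It is so lovely to finally meet you <chuckle> after all this time."
+ prompt = f"{voice}: {text}"  # same format used in the Usage section below
+ ```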
65
+ ### Our Model Sample Generations
66
+
67
+ 🎧 Listen to samples generated by Hypa Orpheus-TTS
68
+
69
+ <div style="margin-top: 20px;">
70
+ <table style="width: 100%; border-collapse: collapse;">
71
+ <thead>
72
+ <tr>
73
+ <th style="border: 1px solid #ddd; padding: 8px; width: 30%;">Text Input</th>
74
+ <th style="border: 1px solid #ddd; padding: 8px; width: 40%;">Audio Output</th>
75
+ <th style="border: 1px solid #ddd; padding: 8px; width: 10%;">Language</th>
76
+ <th style="border: 1px solid #ddd; padding: 8px; width: 10%;">Voice</th>
77
+ </tr>
78
+ </thead>
79
+ <tbody>
80
+ <tr>
81
+ <td style="border: 1px solid #ddd; padding: 8px;">I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe.</td>
82
+ <td style="border: 1px solid #ddd; padding: 8px;">
83
+ <audio controls style="width: 100%;">
84
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Emmanuel_English.wav" type="audio/wav">
85
+ </audio>
86
+ </td>
87
+ <td style="border: 1px solid #ddd; padding: 8px;">English</td>
88
+ <td style="border: 1px solid #ddd; padding: 8px;">Emmanuel</td>
89
+ </tr>
90
+ <tr>
91
+ <td style="border: 1px solid #ddd; padding: 8px;">Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya?</td>
92
+ <td style="border: 1px solid #ddd; padding: 8px;">
93
+ <audio controls style="width: 100%;">
94
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Emmanuel_Hausa.wav" type="audio/wav">
95
+ </audio>
96
+ </td>
97
+ <td style="border: 1px solid #ddd; padding: 8px;">Hausa</td>
98
+ <td style="border: 1px solid #ddd; padding: 8px;">Emmanuel</td>
99
+ </tr>
100
+ <tr>
101
+ <td style="border: 1px solid #ddd; padding: 8px;">Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya?</td>
102
+ <td style="border: 1px solid #ddd; padding: 8px;">
103
+ <audio controls style="width: 100%;">
104
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Eniola_Hausa.wav" type="audio/wav">
105
+ </audio>
106
+ </td>
107
+ <td style="border: 1px solid #ddd; padding: 8px;">Hausa</td>
108
+ <td style="border: 1px solid #ddd; padding: 8px;">Eniola</td>
109
+ </tr>
110
+ <tr>
111
+ <td style="border: 1px solid #ddd; padding: 8px;">Èmi máa se o��njẹ fún àwọn àlejò l'ọ́la mo sì nílò láti mọ bí wọn ti ńṣe aioli. Ṣe o lè fún mi ni àwọn ìlànà ìdáná ẹlẹ́sẹẹsẹ?</td>
112
+ <td style="border: 1px solid #ddd; padding: 8px;">
113
+ <audio controls style="width: 100%;">
114
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Eniola_Yoruba.wav" type="audio/wav">
115
+ </audio>
116
+ </td>
117
+ <td style="border: 1px solid #ddd; padding: 8px;">Yoruba</td>
118
+ <td style="border: 1px solid #ddd; padding: 8px;">Eniola</td>
119
+ </tr>
120
+ <tr>
121
+ <td style="border: 1px solid #ddd; padding: 8px;">I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe.</td>
122
+ <td style="border: 1px solid #ddd; padding: 8px;">
123
+ <audio controls style="width: 100%;">
124
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Eniola_English.wav" type="audio/wav">
125
+ </audio>
126
+ </td>
127
+ <td style="border: 1px solid #ddd; padding: 8px;">English</td>
128
+ <td style="border: 1px solid #ddd; padding: 8px;">Eniola</td>
129
+ </tr>
130
+ <tr>
131
+ <td style="border: 1px solid #ddd; padding: 8px;">M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka?</td>
132
+ <td style="border: 1px solid #ddd; padding: 8px;">
133
+ <audio controls style="width: 100%;">
134
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Eniola_Igbo.wav" type="audio/wav">
135
+ </audio>
136
+ </td>
137
+ <td style="border: 1px solid #ddd; padding: 8px;">Igbo</td>
138
+ <td style="border: 1px solid #ddd; padding: 8px;">Eniola</td>
139
+ </tr>
140
+ <tr>
141
+ <td style="border: 1px solid #ddd; padding: 8px;">M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka?</td>
142
+ <td style="border: 1px solid #ddd; padding: 8px;">
143
+ <audio controls style="width: 100%;">
144
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Lovelyn_Igbo.wav" type="audio/wav">
145
+ </audio>
146
+ </td>
147
+ <td style="border: 1px solid #ddd; padding: 8px;">Igbo</td>
148
+ <td style="border: 1px solid #ddd; padding: 8px;">Lovelyn</td>
149
+ </tr>
150
+ <tr>
151
+ <td style="border: 1px solid #ddd; padding: 8px;">I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe.</td>
152
+ <td style="border: 1px solid #ddd; padding: 8px;">
153
+ <audio controls style="width: 100%;">
154
+ <source src="https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/Lovelyn_English.wav" type="audio/wav">
155
+ </audio>
156
+ </td>
157
+ <td style="border: 1px solid #ddd; padding: 8px;">English</td>
158
+ <td style="border: 1px solid #ddd; padding: 8px;">Lovelyn</td>
159
+ </tr>
160
+ </tbody>
161
+ </table>
162
+ </div>
163
+
164
+ ---
165
+
166
+ ## Training Details
167
+
168
+ ### Training Summary
169
+
170
+ * Base model: canopylabs/orpheus-3b-0.1-ft
171
+ * Training engine: Unsloth + LoRA
172
+ * LoRA config: r=1024, alpha=1024, dropout=0.0, full attention + FFN adaptation (see the sketch after this list)
173
+ * Quantization: 4-bit (bnb) for training; final model is highly memory-efficient
174
+ * Total steps: 18,014 (1 epoch)
175
+ * Batch size: 1 × 4 (grad accum)
176
+ * GPU: A100 40GB (max 55% VRAM used)
177
+
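+ For clarity, below is a rough sketch of how a LoRA setup with these hyperparameters is typically expressed in Unsloth. The target modules and remaining arguments are assumptions for illustration, not the exact training script:
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ # Load the base model in 4-bit (bnb), as used during training
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="canopylabs/orpheus-3b-0.1-ft",
+     max_seq_length=2048,
+     load_in_4bit=True,
+ )
+
+ # "Full attention + FFN adaptation" -> all attention and MLP projections
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=1024,
+     lora_alpha=1024,
+     lora_dropout=0.0,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+     bias="none",
+     use_gradient_checkpointing="unsloth",
+ )
+ ```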
178
+
179
+ <div style="margin-top: 20px;">
180
+ <table style="border-collapse: collapse;">
181
+ <thead>
182
+ <tr>
183
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Step</th>
184
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Training Loss</th>
185
+ <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Validation Loss</th>
186
+ </tr>
187
+ </thead>
188
+ <tbody>
189
+ <tr>
190
+ <td style="border: 1px solid #ddd; padding: 8px;">5,000</td>
191
+ <td style="border: 1px solid #ddd; padding: 8px;">3.9496</td>
192
+ <td style="border: 1px solid #ddd; padding: 8px;">3.8790</td>
193
+ </tr>
194
+ <tr>
195
+ <td style="border: 1px solid #ddd; padding: 8px;">10,000</td>
196
+ <td style="border: 1px solid #ddd; padding: 8px;">3.8863</td>
197
+ <td style="border: 1px solid #ddd; padding: 8px;">3.79497</td>
198
+ </tr>
199
+ <tr>
200
+ <td style="border: 1px solid #ddd; padding: 8px;">15,000</td>
201
+ <td style="border: 1px solid #ddd; padding: 8px;">3.8544</td>
202
+ <td style="border: 1px solid #ddd; padding: 8px;">3.75323</td>
203
+ </tr>
204
+ </tbody>
205
+ </table>
206
+ </div>
207
+
208
+ ### Dataset Summary
209
+
210
+ * Sources:
211
+ * ✅ Manually aligned YouTube transcriptions (aka Random)
212
+ * ✅ Synthetic voice generation from Orpheus TTS
213
+ * ✅ Parallel text-audio pairs for African-English, Igbo, Yoruba, Hausa
214
+ * Total Hours: 300+ (multi-accent)
215
+ * Key Speakers: 45+ unique voices (see speaker distribution chart below)
216
+
217
+ ![image/png](https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit/resolve/main/assets/stats.png)
218
+
219
+ We plan to open-source the full dataset shortly, similar to the [Hypa_Fleurs](https://huggingface.co/datasets/hypaai/Hypa_Fleurs) initiative.
220
+
221
+ ---
222
+
223
+ ## Licensing and Citation
224
+
225
+ This model is released under an [Open Source License](./LICENSE) (apache-2.0). Please refer to the LICENSE file for full details.
226
+
227
+ When using this model in your work, please cite both this model as well as the base [`canopylabs/orpheus-3b-0.1-ft`](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft) model as follows:
228
+
229
+ ```bibtex
230
+ @misc{canopylabsorpheus,
231
+ title={Orpheus-3b-0.1-ft: A Multilingual Text-to-Speech Model},
232
+ author={Canopy Labs},
233
+ year={2025},
234
+ publisher={Hugging Face},
235
+ howpublished={\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}},
236
+ note={Fine-tuned version of Orpheus for expressive TTS}
237
+ }
238
+
239
+ @misc{hypaorpheus4bit,
240
+ title={Hypa_Orpheus-3b-0.1-ft (LoRA-4bit)},
241
+ author={Hypa AI},
242
+ year={2025},
243
+ note={Fine-tuned Orpheus TTS on African languages},
244
+ url={https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit}
245
+ }
246
+ ```
247
+
248
+ ---
249
+
250
+ ## Acknowledgements
251
+
252
+ - **Canopy Labs Team:** For creating the foundational model and open-sourcing it.
253
+ - **AfroVoices Experts:** For their translation expertise and high-quality datasets.
254
+ - **Community Support:** We thank all supporters, contributors, and users.
255
+
256
+ ---
257
+
258
+ ## Contact and Contributions
259
+
260
+ For any questions, issues, or contributions, please open an issue in this repository or contact [[email protected]](mailto:[email protected]). Contributions are welcome!
261
+
262
+ ---
263
+
264
+ ## Closing Remarks
265
+
266
+ By making Hypa_Orpheus available, we hope to empower research and development in multilingual speech technologies for African languages.
267
+
268
+ Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not just technologically advanced but are also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
269
+
270
+ AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to celebrate the richness of African linguistic diversity on a global stage.
271
+
272
+ ---
273
+
274
+ ## Usage
275
+
276
+ ### Unsloth Inference
277
+
278
+ Install the needed packages.
279
+
280
+ ```python
281
+ %%capture
282
+ import os
283
+ if "COLAB_" not in "".join(os.environ.keys()):
284
+ !pip install unsloth
285
+ else:
286
+ # Do this only in Colab notebooks! Otherwise use pip install unsloth
287
+ !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
288
+ !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
289
+ !pip install --no-deps unsloth
290
+ !pip install snac
291
+ ```
292
+
293
+ Download the models (both the SNAC encoder/decoder and our fine-tuned Hypa_Orpheus).
294
+
295
+ ```python
296
+ import torch
297
+ from snac import SNAC
298
+ from unsloth import FastLanguageModel
299
+
300
+ dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
301
+ load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
302
+
303
+ model, tokenizer = FastLanguageModel.from_pretrained(
304
+ model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit",
305
+ max_seq_length= 2048, # Choose any for long context!
306
+ dtype = dtype,
307
+ load_in_4bit = load_in_4bit,
308
+ #token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
309
+ )
310
+
311
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
312
+ snac_model = snac_model.to("cuda")
313
+ ```
314
+
315
+ Create your text prompt, select the voice, and pass it through the model.
316
+
317
+ ```python
318
+ prompts = [
319
+ """Mo nífẹ̀ẹ́sí láti ṣe Ph.D sùgbọ́n mi ò ì tíì pinnu ẹ̀ka tí màá ṣe. Àwọn anfaani tí óń dé oríṣiríṣi àwọn olùgbọ́ káàkiri àgbáyé wo ni mo ní""",
320
+ ]
321
+ chosen_voice = "Eniola" # None for single-speaker
322
+
323
+
324
+ FastLanguageModel.for_inference(model) # Enable native 2x faster inference
325
+ snac_model.to("cpu")# Moving snac_model cuda to cpu
326
+
327
+ prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
328
+
329
+ all_input_ids = []
330
+
331
+ for prompt in prompts_:
332
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
333
+ all_input_ids.append(input_ids)
334
+
335
+ start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
336
+ end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
337
+
338
+ all_modified_input_ids = []
339
+ for input_ids in all_input_ids:
340
+ modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
341
+ all_modified_input_ids.append(modified_input_ids)
342
+
343
+ all_padded_tensors = []
344
+ all_attention_masks = []
345
+ max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
346
+ for modified_input_ids in all_modified_input_ids:
347
+ padding = max_length - modified_input_ids.shape[1]
348
+ padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
349
+ attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
350
+ all_padded_tensors.append(padded_tensor)
351
+ all_attention_masks.append(attention_mask)
352
+
353
+ all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
354
+ all_attention_masks = torch.cat(all_attention_masks, dim=0)
355
+
356
+ input_ids = all_padded_tensors.to("cuda")
357
+ attention_mask = all_attention_masks.to("cuda")
358
+ generated_ids = model.generate(
359
+ input_ids=input_ids,
360
+ attention_mask=attention_mask,
361
+ max_new_tokens=1200,
362
+ do_sample=True,
363
+ temperature=0.6,
364
+ top_p=0.95,
365
+ repetition_penalty=1.1,
366
+ num_return_sequences=1,
367
+ eos_token_id=128258,
368
+ use_cache = True
369
+ )
370
+ token_to_find = 128257
371
+ token_to_remove = 128258
372
+
373
+ token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
374
+
375
+ if len(token_indices[1]) > 0:
376
+ last_occurrence_idx = token_indices[1][-1].item()
377
+ cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
378
+ else:
379
+ cropped_tensor = generated_ids
380
+
381
+ mask = cropped_tensor != token_to_remove
382
+
383
+ processed_rows = []
384
+
385
+ for row in cropped_tensor:
386
+ masked_row = row[row != token_to_remove]
387
+ processed_rows.append(masked_row)
388
+
389
+ code_lists = []
390
+
391
+ for row in processed_rows:
392
+ row_length = row.size(0)
393
+ new_length = (row_length // 7) * 7
394
+ trimmed_row = row[:new_length]
395
+ trimmed_row = [t - 128266 for t in trimmed_row]
396
+ code_lists.append(trimmed_row)
397
+
398
+
399
+ def redistribute_codes(code_list):
400
+ layer_1 = []
401
+ layer_2 = []
402
+ layer_3 = []
403
+ for i in range((len(code_list)+1)//7):
404
+ layer_1.append(code_list[7*i])
405
+ layer_2.append(code_list[7*i+1]-4096)
406
+ layer_3.append(code_list[7*i+2]-(2*4096))
407
+ layer_3.append(code_list[7*i+3]-(3*4096))
408
+ layer_2.append(code_list[7*i+4]-(4*4096))
409
+ layer_3.append(code_list[7*i+5]-(5*4096))
410
+ layer_3.append(code_list[7*i+6]-(6*4096))
411
+ codes = [torch.tensor(layer_1).unsqueeze(0),
412
+ torch.tensor(layer_2).unsqueeze(0),
413
+ torch.tensor(layer_3).unsqueeze(0)]
414
+
415
+ # codes = [c.to("cuda") for c in codes]
416
+ audio_hat = snac_model.decode(codes)
417
+ return audio_hat
418
+
419
+ my_samples = []
420
+ for code_list in code_lists:
421
+ samples = redistribute_codes(code_list)
422
+ my_samples.append(samples)
423
+ from IPython.display import display, Audio
424
+ if len(prompts) != len(my_samples):
425
+ raise Exception("Number of prompts and samples do not match")
426
+ else:
427
+ for i in range(len(my_samples)):
428
+ print(prompts[i])
429
+ samples = my_samples[i]
430
+ display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
431
+ # Clean up to save RAM
432
+ del my_samples,samples
433
+ ```
434
+
435
+ ### Standard Inference
436
+
437
+ Install the needed packages.
438
+
439
+ ```python
440
+ %%capture
441
+ !pip install snac ipywebrtc
442
+ ```
443
+
444
+ Download the Models (SNAC & Hypa_Orpheus)
445
+
446
+ ```python
447
+ import torch
448
+ from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
449
+ from snac import SNAC
450
+
451
+ # Load the pre-trained SNAC model (it is kept on the CPU by default).
452
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
453
+ snac_model = snac_model #.to("cpu")
454
+
455
+ print("We have loaded the Encoder/Decoder model to the cpu, to use vram - use the gpu for faster inference")
456
+
457
+ # Loading the Orpheus Model and Tokenizer, moving the model to the GPU for faster inference
458
+ model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
459
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
460
+ model.cuda()
461
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
462
+ ```
463
+
464
+ Create Prompt(s) and Select Voice & Emotions as needed.
465
+
466
+ ```python
467
+ # List of supported voices in Orpheus-TTS
468
+ voices = [
469
+ "Eniola", "tara", # Female, conversational, clear
470
+ "Moyo", "leah", # Female, warm, gentle
471
+ "Gift", "jess", # Female, energetic, youthful
472
+ "Prince", "leo", # Male, authoritative, deep
473
+ "Emmanuel", "dan", # Male, friendly, casual
474
+ "Cynthia", "mia", # Female, professional, articulate
475
+ "Kolade", "zac", # Male, enthusiastic, dynamic
476
+ "Lovelyn", "zoe" # Female, calm, soothing
477
+ ]
478
+
479
+ # List of supported emotion tags in Orpheus-TTS
480
+ emotions = [
481
+ "<laugh>", # Laughter
482
+ "<chuckle>", # Soft chuckle
483
+ "<sigh>", # Sighing
484
+ "<cough>", # Coughing
485
+ "<sniffle>", # Sniffling
486
+ "<groan>", # Groaning
487
+ "<yawn>", # Yawning
488
+ "<gasp>" # Gasping
489
+ ]
490
+
491
+ # Creating Prompts
492
+ prompts = [
493
+ "Hey there my name is Eniola 9000, and I'm a speech generation model that can sound like a person.",
494
+ # "I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning!",
495
+ # "I live in San Francisco, and have, uhm let's see, 3 billion 7 hundred ... well, lets just say a lot of parameters.",
496
+ ]
497
+
498
+ chosen_voice = "Eniola" # "tara" # see github for other voices
499
+ prompts = [f"{chosen_voice}: " + p for p in prompts] # Creating the prompts (as a batch)
500
+ print(prompts)
501
+ ```
502
+
503
+ Tokenize the prompt(s) into input IDs, pad them, and create attention masks.
504
+
505
+ ```python
506
+ # Tokenizing each prompt into input IDs.
507
+ all_input_ids = []
508
+ for prompt in prompts:
509
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
510
+ all_input_ids.append(input_ids)
511
+
512
+ # Adds special tokens to mark the beginning and end of each prompt
513
+ start_token = torch.tensor([[128259]], dtype=torch.int64) # Start of human
514
+ end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
515
+
516
+ all_modified_input_ids = []
517
+ for input_ids in all_input_ids:
518
+ modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
519
+ all_modified_input_ids.append(modified_input_ids)
520
+
521
+ # Padding All sequences to same length and creating corresponding attention masks
522
+ all_padded_tensors = []
523
+ all_attention_masks = []
524
+ max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
525
+ for modified_input_ids in all_modified_input_ids:
526
+ padding = max_length - modified_input_ids.shape[1]
527
+ # Left Padding
528
+ padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
529
+ attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
530
+ all_padded_tensors.append(padded_tensor)
531
+ all_attention_masks.append(attention_mask)
532
+
533
+ all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
534
+ all_attention_masks = torch.cat(all_attention_masks, dim=0)
535
+
536
+ # Moving all padded sequences to GPU for Faster computation
537
+ input_ids = all_padded_tensors.to("cuda")
538
+ attention_mask = all_attention_masks.to("cuda")
539
+ ```
540
+
541
+ Generate output tokens from the model and parse them as speech.
542
+
543
+ ```python
544
+ print("*** Model.generate is slow - see vllm implementation on github for realtime streaming and inference")
545
+ print("*** Increase/decrease inference params for more expressive less stable generations")
546
+
547
+ # Generating Output Tokens
548
+ with torch.no_grad():
549
+ generated_ids = model.generate(
550
+ input_ids=input_ids,
551
+ attention_mask=attention_mask,
552
+ max_new_tokens=1200,
553
+ do_sample=True,
554
+ temperature=0.6,
555
+ top_p=0.95,
556
+ repetition_penalty=1.1,
557
+ num_return_sequences=1,
558
+ eos_token_id=128258,
559
+ )
560
+
561
+ # Processing Generated Tokens (Parse Output as speech)
562
+ token_to_find = 128257 # Start of Audio token (relevant output)
563
+ token_to_remove = 128258 # End/ Terminal Token (End of Audio/ relevant output)
564
+
565
+ token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
566
+ print(token_indices)
567
+
568
+ # Slices the tensor to exclude unwanted tokens.
569
+ if len(token_indices[1]) > 0:
570
+ last_occurrence_idx = token_indices[1][-1].item()
571
+ cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
572
+ else:
573
+ cropped_tensor = generated_ids
574
+
575
+ # mask = cropped_tensor != token_to_remove
576
+
577
+ # Storing the cleaned-up token sequences
578
+ processed_rows = []
579
+ for row in cropped_tensor:
580
+ masked_row = row[row != token_to_remove]
581
+ processed_rows.append(masked_row)
582
+
583
+ # Preparing (Audio Codes) the token sequences for audio decoding by trimming and adjusting token values.
584
+ code_lists = []
585
+ for row in processed_rows:
586
+ row_length = row.size(0) # Determines the length of the token sequence.
587
+ new_length = (row_length // 7) * 7 # Ensures the sequence length is a multiple of 7, as required by the decoder.
588
+ trimmed_row = row[:new_length]
589
+ trimmed_row = [t - 128266 for t in trimmed_row] # Adjusts token values to match the expected input range for the decoder.
590
+ code_lists.append(trimmed_row)
591
+ ```
592
+
593
+ Decode Outputs with SNAC Decoder
594
+
595
+ ```python
596
+ # Processes the token sequences into the format expected by the SNAC decoder:
597
+ def redistribute_codes(code_list):
598
+ """ Reorganizes the flattened token list into three separate layers, adjusting each token's value to align with the decoder's expectations"""
599
+ layer_1 = [] # Coarsest layer
600
+ layer_2 = [] # Intermediate layer
601
+ layer_3 = [] # Finest layer
602
+
603
+ num_groups = (len(code_list) + 1) // 7 #Calculate the number of complete 7-token groups in the code_list
604
+ for i in range(num_groups):
605
+ idx = 7 * i # starting index for the current group
606
+ # Layer 1 receives the first token of the group
607
+ layer_1.append(code_list[idx])
608
+
609
+ # Layer 2 receives the second token, adjusted by subtracting 4096
610
+ layer_2.append(code_list[idx + 1] - 4096)
611
+
612
+ # Layer 3 receives the third and fourth tokens, adjusted by subtracting 8192 and 12288 respectively
613
+ layer_3.append(code_list[idx+2]-(2*4096))
614
+ layer_3.append(code_list[idx+3]-(3*4096))
615
+
616
+ # Layer 2 receives the fifth token, adjusted by subtracting 16384
617
+ layer_2.append(code_list[idx+4]-(4*4096))
618
+
619
+ # Layer 3 receives the sixth and seventh tokens, adjusted by subtracting 20480 and 24576 respectively
620
+ layer_3.append(code_list[idx+5]-(5*4096))
621
+ layer_3.append(code_list[idx+6]-(6*4096))
622
+
623
+ codes = [
624
+ torch.tensor(layer_1).unsqueeze(0), # Shape: (1, len(layer_1))
625
+ torch.tensor(layer_2).unsqueeze(0), # Shape: (1, len(layer_2))
626
+ torch.tensor(layer_3).unsqueeze(0) # Shape: (1, len(layer_3))
627
+ ] # Convert the lists to PyTorch tensors and add a batch dimension
628
+ audio_hat = snac_model.decode(codes) # Decode the structured codes into an audio waveform using the SNAC model
629
+ return audio_hat
630
+
631
+ my_samples = []
632
+ for code_list in code_lists:
633
+ samples = redistribute_codes(code_list) # Generates audio samples from the processed token sequences
634
+ my_samples.append(samples)
635
+
636
+ # Display Audio
637
+ from IPython.display import display, Audio
638
+
639
+ if len(prompts) != len(my_samples):
640
+ raise Exception("Number of prompts and samples do not match")
641
+ else:
642
+ for i in range(len(my_samples)):
643
+ print(prompts[i])
644
+ samples = my_samples[i]
645
+ display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
646
+ ```
647
+
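+ Outside a notebook, you may want to write the decoded waveform to disk instead of using `IPython.display`. A minimal sketch using the standard-library `wave` module (the output path is arbitrary; 24 kHz mono matches the SNAC decoder):
+
+ ```python
+ import wave
+ import numpy as np
+
+ audio = my_samples[0].detach().squeeze().to("cpu").numpy()  # float32 waveform from SNAC
+ audio_int16 = (audio * 32767).astype(np.int16)              # convert to 16-bit PCM
+
+ with wave.open("output.wav", "wb") as wav_file:
+     wav_file.setnchannels(1)      # mono
+     wav_file.setsampwidth(2)      # 16-bit samples
+     wav_file.setframerate(24000)  # SNAC operates at 24 kHz
+     wav_file.writeframes(audio_int16.tobytes())
+ ```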
648
+ - **Repository:** [N/A]
649
+ - **Paper:** [N/A]
650
+ - **Demo:** [N/A]
651
+
652
+ This Llama-based model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
653
+
654
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
assets/.DS_Store ADDED
Binary file (6.15 kB).
 
assets/Emmanuel_English.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7f7ed6302c725516d040d03e03331d2a4777026037e97099a337f20319880034
3
+ size 487468
assets/Emmanuel_Hausa.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29da60c6dffcdfa4fe402489bdb3040bc3ce0085bbae0f7e9b52d1b9d1a2b000
3
+ size 688172
assets/Eniola_English.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:edd3e58ea65a92bf691db4ca20bd68cc5fc5e6e5cef715cc798279b978f3e5aa
3
+ size 344108
assets/Eniola_Hausa.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:40dd9ee114f53b1bdb3f32d3be21476fcd758de376d4d4252e45d00a5ccfa8ba
3
+ size 548908
assets/Eniola_Igbo.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a05aabf5f577c4b1703a10d4999a1872068d16e6013492d28bf2d83950c856a6
3
+ size 483372
assets/Eniola_Yoruba.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8c97b3527e108805280eddde2be8c090eeb8507917db334c645933567fb3b155
3
+ size 479276
assets/Lovelyn_English.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:40a2640afa40eb2f0aa6a8db18143a070f17986bd8ac7e70a099ec3f558d4a01
3
+ size 348204
assets/Lovelyn_Igbo.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dc38cc3a3e73c3bf9bbf207f68e8983e9f93da5f57f14ed8e8aea4bc997e47d8
3
+ size 434220
assets/stats.png ADDED
config.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 128000,
8
+ "eos_token_id": 128009,
9
+ "head_dim": 128,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 3072,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 8192,
14
+ "max_position_embeddings": 131072,
15
+ "mlp_bias": false,
16
+ "model_type": "llama",
17
+ "num_attention_heads": 24,
18
+ "num_hidden_layers": 28,
19
+ "num_key_value_heads": 8,
20
+ "pad_token_id": 128004,
21
+ "pretraining_tp": 1,
22
+ "rms_norm_eps": 1e-05,
23
+ "rope_scaling": {
24
+ "factor": 32.0,
25
+ "high_freq_factor": 4.0,
26
+ "low_freq_factor": 1.0,
27
+ "original_max_position_embeddings": 8192,
28
+ "rope_type": "llama3"
29
+ },
30
+ "rope_theta": 500000.0,
31
+ "tie_word_embeddings": true,
32
+ "torch_dtype": "bfloat16",
33
+ "transformers_version": "4.51.3",
34
+ "unsloth_fixed": true,
35
+ "unsloth_version": "2025.3.19",
36
+ "use_cache": true,
37
+ "vocab_size": 156940
38
+ }
generation_config.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 128000,
4
+ "do_sample": true,
5
+ "eos_token_id": 128009,
6
+ "max_length": 131072,
7
+ "pad_token_id": 128004,
8
+ "temperature": 0.6,
9
+ "top_p": 0.9,
10
+ "transformers_version": "4.51.3"
11
+ }
handler.py ADDED
@@ -0,0 +1,302 @@
1
+ """
2
+ # Orpheus TTS Handler - Explanation & Deployment Guide
3
+
4
+ This guide explains how to properly deploy the Orpheus TTS model with the custom
5
+ handler on Hugging Face Inference Endpoints.
6
+
7
+ ## The Problem
8
+
9
+ Based on the error messages you're seeing:
10
+ 1. Connection is working (you get responses)
11
+ 2. But responses contain text rather than audio data
12
+ 3. The response format is the standard HF format: [{"generated_text": "..."}]
13
+
14
+ This indicates that your endpoint is running the standard text generation handler
15
+ rather than the custom audio generation handler you've defined.
16
+
17
+ ## Step 1: Properly package your handler
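+ Once the custom handler below is active, the endpoint returns JSON containing a
+ base64-encoded 24 kHz WAV ("audio_b64") instead of "generated_text". A minimal
+ client-side sketch (the endpoint URL and token are placeholders for your own deployment):
+
+ ```python
+ import base64
+ import requests
+
+ ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
+ headers = {"Authorization": "Bearer hf_...", "Content-Type": "application/json"}
+
+ payload = {
+     "inputs": "Hello there, this is a quick test.",
+     "parameters": {"voice": "Eniola", "temperature": 0.6, "top_p": 0.95},
+ }
+ resp = requests.post(ENDPOINT_URL, headers=headers, json=payload).json()
+
+ # Write the returned base64 WAV to disk
+ with open("endpoint_output.wav", "wb") as f:
+     f.write(base64.b64decode(resp["audio_b64"]))
+ ```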
18
+
19
+ Create a `handler.py` file with your custom handler code:
20
+ """
21
+
22
+ import os
23
+ import torch
24
+ import numpy as np
25
+ from transformers import AutoModelForCausalLM, AutoTokenizer
26
+ from snac import SNAC
27
+ import logging
28
+
29
+ # Set up logging
30
+ logging.basicConfig(level=logging.INFO)
31
+ logger = logging.getLogger(__name__)
32
+
33
+ class EndpointHandler:
34
+ def __init__(self, path=""):
35
+ logger.info("Initializing Orpheus TTS handler")
36
+ # Load the Orpheus model and tokenizer
37
+ self.model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
38
+ self.model = AutoModelForCausalLM.from_pretrained(
39
+ self.model_name,
40
+ torch_dtype=torch.bfloat16
41
+ )
42
+
43
+ # Move model to GPU if available
44
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
45
+ self.model.to(self.device)
46
+ logger.info(f"Model loaded on {self.device}")
47
+
48
+ # Load tokenizer
49
+ self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
50
+ logger.info("Tokenizer loaded")
51
+
52
+ # Load SNAC model for audio decoding
53
+ try:
54
+ self.snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
55
+ self.snac_model.to(self.device)
56
+ logger.info("SNAC model loaded")
57
+ except Exception as e:
58
+ logger.error(f"Error loading SNAC: {str(e)}")
59
+ raise
60
+
61
+ # Special tokens
62
+ self.start_token = torch.tensor([[128259]], dtype=torch.int64) # Start of human
63
+ self.end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
64
+ self.padding_token = 128263
65
+ self.start_audio_token = 128257 # Start of Audio token
66
+ self.end_audio_token = 128258 # End of Audio token
67
+
68
+ self._warmed_up = False
69
+
70
+ logger.info("Handler initialization complete")
71
+
72
+ def preprocess(self, data):
73
+ """
74
+ Preprocess input data before inference
75
+ """
76
+ logger.info(f"Preprocessing data: {type(data)}")
77
+
78
+ # Handle health check
79
+ if data == "ping" or (isinstance(data, dict) and data.get("inputs") == "ping"):
80
+ logger.info("Health check detected")
81
+ return {"health_check": True}
82
+
83
+ # HF Inference API format: 'inputs' is the text, 'parameters' contains the config
84
+ if isinstance(data, dict) and "inputs" in data:
85
+ # Standard HF format
86
+ text = data["inputs"]
87
+ parameters = data.get("parameters", {})
88
+ else:
89
+ # Direct access (fallback)
90
+ text = data
91
+ parameters = {}
92
+
93
+ # Extract parameters from request
94
+ voice = parameters.get("voice", "tara")
95
+ temperature = float(parameters.get("temperature", 0.6))
96
+ top_p = float(parameters.get("top_p", 0.95))
97
+ max_new_tokens = int(parameters.get("max_new_tokens", 1200))
98
+ repetition_penalty = float(parameters.get("repetition_penalty", 1.1))
99
+
100
+ # Format prompt with voice
101
+ prompt = f"{voice}: {text}"
102
+ logger.info(f"Formatted prompt with voice {voice}")
103
+
104
+ # Tokenize
105
+ input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
106
+
107
+ # Add special tokens
108
+ modified_input_ids = torch.cat([self.start_token, input_ids, self.end_tokens], dim=1)
109
+
110
+ # No need for padding as we're processing a single sequence
111
+ input_ids = modified_input_ids.to(self.device)
112
+ attention_mask = torch.ones_like(input_ids)
113
+
114
+ return {
115
+ "input_ids": input_ids,
116
+ "attention_mask": attention_mask,
117
+ "temperature": temperature,
118
+ "top_p": top_p,
119
+ "max_new_tokens": max_new_tokens,
120
+ "repetition_penalty": repetition_penalty,
121
+ "health_check": False
122
+ }
123
+
124
+ def inference(self, inputs):
125
+ """
126
+ Run model inference on the preprocessed inputs
127
+ """
128
+ # Handle health check
129
+ if inputs.get("health_check", False):
130
+ return {"status": "ok"}
131
+
132
+ # Extract parameters
133
+ input_ids = inputs["input_ids"]
134
+ attention_mask = inputs["attention_mask"]
135
+ temperature = inputs["temperature"]
136
+ top_p = inputs["top_p"]
137
+ max_new_tokens = inputs["max_new_tokens"]
138
+ repetition_penalty = inputs["repetition_penalty"]
139
+
140
+ logger.info(f"Running inference with max_new_tokens={max_new_tokens}")
141
+
142
+ # Generate output tokens
143
+ with torch.no_grad():
144
+ generated_ids = self.model.generate(
145
+ input_ids=input_ids,
146
+ attention_mask=attention_mask,
147
+ max_new_tokens=max_new_tokens,
148
+ do_sample=True,
149
+ temperature=temperature,
150
+ top_p=top_p,
151
+ repetition_penalty=repetition_penalty,
152
+ num_return_sequences=1,
153
+ eos_token_id=self.end_audio_token,
154
+ )
155
+
156
+ logger.info(f"Generation complete, output shape: {generated_ids.shape}")
157
+ return generated_ids
158
+
159
+ def postprocess(self, generated_ids):
160
+ """
161
+ Process generated tokens into audio
162
+ """
163
+ # Handle health check response
164
+ if isinstance(generated_ids, dict) and "status" in generated_ids:
165
+ return generated_ids
166
+
167
+ logger.info("Postprocessing generated tokens")
168
+
169
+ # Find Start of Audio token
170
+ token_indices = (generated_ids == self.start_audio_token).nonzero(as_tuple=True)
171
+
172
+ if len(token_indices[1]) > 0:
173
+ last_occurrence_idx = token_indices[1][-1].item()
174
+ cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
175
+ logger.info(f"Found start audio token at position {last_occurrence_idx}")
176
+ else:
177
+ cropped_tensor = generated_ids
178
+ logger.warning("No start audio token found")
179
+
180
+ # Remove End of Audio tokens
181
+ processed_rows = []
182
+ for row in cropped_tensor:
183
+ masked_row = row[row != self.end_audio_token]
184
+ processed_rows.append(masked_row)
185
+
186
+ # Prepare audio codes
187
+ code_lists = []
188
+ for row in processed_rows:
189
+ row_length = row.size(0)
190
+ # Ensure length is multiple of 7 for SNAC
191
+ new_length = (row_length // 7) * 7
192
+ trimmed_row = row[:new_length]
193
+ trimmed_row = [t.item() - 128266 for t in trimmed_row] # Adjust token values
194
+ code_lists.append(trimmed_row)
195
+
196
+ # Generate audio from codes
197
+ audio_samples = []
198
+ for code_list in code_lists:
199
+ logger.info(f"Processing code list of length {len(code_list)}")
200
+ if len(code_list) > 0:
201
+ audio = self.redistribute_codes(code_list)
202
+ audio_samples.append(audio)
203
+ else:
204
+ logger.warning("Empty code list, no audio to generate")
205
+
206
+ if not audio_samples:
207
+ logger.error("No audio samples generated")
208
+ return {"error": "No audio samples generated"}
209
+
210
+ # Return first (and only) audio sample
211
+ audio_sample = audio_samples[0].detach().squeeze().cpu().numpy()
212
+
213
+ # Convert to base64 for transmission
214
+ import base64
215
+ import io
216
+ import wave
217
+
218
+ # Convert float32 array to int16 for WAV format
219
+ audio_int16 = (audio_sample * 32767).astype(np.int16)
220
+
221
+ # Create WAV in memory
222
+ with io.BytesIO() as wav_io:
223
+ with wave.open(wav_io, 'wb') as wav_file:
224
+ wav_file.setnchannels(1) # Mono
225
+ wav_file.setsampwidth(2) # 16-bit
226
+ wav_file.setframerate(24000) # 24kHz
227
+ wav_file.writeframes(audio_int16.tobytes())
228
+ wav_data = wav_io.getvalue()
229
+
230
+ # Encode as base64
231
+ audio_b64 = base64.b64encode(wav_data).decode('utf-8')
232
+ logger.info(f"Audio encoded as base64, length: {len(audio_b64)}")
233
+
234
+ return {
235
+ "audio_sample": audio_sample,
236
+ "audio_b64": audio_b64,
237
+ "sample_rate": 24000
238
+ }
239
+
240
+ def redistribute_codes(self, code_list):
241
+ """
242
+ Reorganize tokens for SNAC decoding
243
+ """
244
+ layer_1 = [] # Coarsest layer
245
+ layer_2 = [] # Intermediate layer
246
+ layer_3 = [] # Finest layer
247
+
248
+ num_groups = len(code_list) // 7
249
+ for i in range(num_groups):
250
+ idx = 7 * i
251
+ layer_1.append(code_list[idx])
252
+ layer_2.append(code_list[idx + 1] - 4096)
253
+ layer_3.append(code_list[idx + 2] - (2 * 4096))
254
+ layer_3.append(code_list[idx + 3] - (3 * 4096))
255
+ layer_2.append(code_list[idx + 4] - (4 * 4096))
256
+ layer_3.append(code_list[idx + 5] - (5 * 4096))
257
+ layer_3.append(code_list[idx + 6] - (6 * 4096))
258
+
259
+ codes = [
260
+ torch.tensor(layer_1).unsqueeze(0).to(self.device),
261
+ torch.tensor(layer_2).unsqueeze(0).to(self.device),
262
+ torch.tensor(layer_3).unsqueeze(0).to(self.device)
263
+ ]
264
+
265
+ # Decode audio
266
+ audio_hat = self.snac_model.decode(codes)
267
+ return audio_hat
268
+
269
+ def __call__(self, data):
270
+ """
271
+ Main entry point for the handler
272
+ """
273
+ # Run warmup only once, the first time __call__ is triggered
274
+ if not self._warmed_up:
275
+ self._warmup()
276
+
277
+ try:
278
+ logger.info(f"Received request: {type(data)}")
279
+
280
+ # Check if we need to handle the health check route
281
+ if data == "ping" or (isinstance(data, dict) and data.get("inputs") == "ping"):
282
+ logger.info("Processing health check request")
283
+ return {"status": "ok"}
284
+
285
+ preprocessed_inputs = self.preprocess(data)
286
+ model_outputs = self.inference(preprocessed_inputs)
287
+ response = self.postprocess(model_outputs)
288
+ return response
289
+ except Exception as e:
290
+ logger.error(f"Error processing request: {str(e)}")
291
+ import traceback
292
+ logger.error(traceback.format_exc())
293
+ return {"error": str(e)}
294
+
295
+ def _warmup(self):
296
+ try:
297
+ dummy_prompt = "tara: Hello"
298
+ input_ids = self.tokenizer(dummy_prompt, return_tensors="pt").input_ids.to(self.device)
299
+ _ = self.model.generate(input_ids=input_ids, max_new_tokens=5)
300
+ self._warmed_up = True
301
+ except Exception as e:
302
+ print(f"[WARMUP ERROR] {str(e)}")
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8213aa17429d0aa9e5efa37fdcb305683aec81f781d44397e4fbf3fb3e361188
3
+ size 4991037968
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:151d8b600614a200e322528b791155a75ae6f703a2a7ffb269e974e4cdb1c219
3
+ size 1610725592
model.safetensors.index.json ADDED
@@ -0,0 +1,261 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 6601734144
4
+ },
5
+ "weight_map": {
6
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
7
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
17
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
19
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
26
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
29
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
31
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
38
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
41
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
43
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
50
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
53
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
55
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
62
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
65
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
67
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
74
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
77
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
79
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
86
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
89
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
91
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
98
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
101
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
103
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
110
+ "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
113
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
115
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
122
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
125
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
126
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
127
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
128
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
129
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
130
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
131
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
132
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
133
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
134
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
135
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
136
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
137
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
138
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
139
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
140
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
141
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
142
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
143
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
144
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
145
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
146
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
147
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
148
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
149
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
150
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
151
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
152
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
153
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
154
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
155
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
156
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
157
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
158
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
159
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
160
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
161
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
162
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
163
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
164
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
165
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
166
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
167
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
168
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
169
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
170
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
171
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
172
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
173
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
174
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
175
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
176
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
177
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
178
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
179
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
180
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
181
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
182
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
183
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
184
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
185
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
186
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
187
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
188
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
189
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
190
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
191
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
192
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
193
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
194
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
195
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
196
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
197
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
198
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
199
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
202
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
203
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
205
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
206
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
208
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
209
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
210
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
211
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
213
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
214
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
216
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
217
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
218
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
219
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
220
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
221
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
223
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
224
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
225
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
227
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
228
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
230
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
231
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
232
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
233
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
235
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
236
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
238
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
239
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
241
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
242
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
243
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
245
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
246
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
247
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
249
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
250
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
252
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
253
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
254
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
257
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
258
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
259
+ "model.norm.weight": "model-00002-of-00002.safetensors"
260
+ }
261
+ }
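
The `weight_map` above only records which of the two safetensors shards holds each tensor; `transformers` resolves it automatically via `AutoModelForCausalLM.from_pretrained`. For reference, a minimal sketch of resolving the index by hand with the `safetensors` library, assuming the repo has been downloaded to a local directory (the path below is illustrative):

```python
import json
from safetensors.torch import load_file

# Illustrative local path -- substitute wherever this repo was downloaded.
model_dir = "./Hypa_Orpheus-3b-0.1-ft-merged-16bit"

# The index maps every parameter name to the shard file that stores it.
with open(f"{model_dir}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Load each referenced shard once and merge into a single state dict.
state_dict = {}
for shard in sorted(set(weight_map.values())):
    state_dict.update(load_file(f"{model_dir}/{shard}"))

print(state_dict["model.norm.weight"].shape)
```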
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch>=2.0.0
+ transformers>=4.30.0
+ snac>=0.1.0
+ numpy>=1.20.0
+ protobuf
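
The five pins above are the minimal runtime stack for inference: PyTorch, Transformers, the SNAC audio codec, NumPy, and protobuf. A quick, hedged environment check against those names (taken verbatim from requirements.txt):

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from requirements.txt above.
for pkg in ("torch", "transformers", "snac", "numpy", "protobuf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is missing -- install it before running inference")
```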
special_tokens_map.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "additional_special_tokens": [
+ "<|audio|>"
+ ],
+ "bos_token": {
+ "content": "<|begin_of_text|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|eot_id|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<|finetune_right_pad_id|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
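
special_tokens_map.json declares the control tokens (`<|begin_of_text|>`, `<|eot_id|>`, `<|finetune_right_pad_id|>`) plus an extra `<|audio|>` marker. A minimal sketch of inspecting them with `transformers`; the repo id below is an assumption, so point it at the actual Hub path or a local checkout:

```python
from transformers import AutoTokenizer

# Illustrative repo id -- replace with the real Hub path or a local directory.
tok = AutoTokenizer.from_pretrained("hypaai/Hypa_Orpheus-3b-0.1-ft")

print(tok.bos_token)                   # <|begin_of_text|>
print(tok.eos_token)                   # <|eot_id|>
print(tok.pad_token)                   # <|finetune_right_pad_id|>
print(tok.additional_special_tokens)   # ['<|audio|>']
```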
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc3fecb199b4170636dbfab986d25f628157268d37b861f9cadaca60b1353bce
+ size 22849547
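
tokenizer.json is tracked by Git LFS (the pointer above, roughly 22 MB), so a plain clone without LFS support yields only this stub. Fetching the file through `huggingface_hub` resolves the LFS object transparently; a minimal sketch, with the repo id again an assumption:

```python
from huggingface_hub import hf_hub_download

# Illustrative repo id -- LFS-backed files are downloaded as real content, not pointers.
path = hf_hub_download(
    repo_id="hypaai/Hypa_Orpheus-3b-0.1-ft",
    filename="tokenizer.json",
)
print(path)  # local cache path of the resolved tokenizer.json
```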
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff