
SpeakSpace-Assistant-v1-3B
Alpha AI (www.alphaai.biz) fine-tuned `canopylabs/orpheus-3b-0.1-ft` to create SpeakSpace-Assistant-v1-3B, an English-only, single-speaker voice assistant model. The fine-tune uses custom voice recordings plus the Elise dataset (~3 hours of single-speaker English speech). Transcripts were augmented with emotion/expression tags such as `<sigh>` and `<laughs>`, which were added as special tokens to the Orpheus tokenizer.
⚠️ Important: This model is intended for research, prototyping, and internal product demos. Do not use it to impersonate a real person without explicit consent. Review base-model and dataset licenses before commercial use.
TL;DR
- Base: `canopylabs/orpheus-3b-0.1-ft` (~3B params).
- Data: Custom Alpha AI dataset + `MrDragonFox/Elise` (English, ~3 hours).
- Objective: Produce natural, expressive speech with inline emotion cues (`<laughs>`, `<sigh>`).
- Language: English only.
- Repo: Suggested as `alpha-ai/SpeakSpace-Assistant-v1-3B`.
Intended Use & Limitations
Intended use:
- Internal voice assistants and demos.
- Research on expressive TTS and emotion-tag-conditioned speech.
- Applications where transcripts include small expressive markers.
Limitations:
- Not multi-speaker or multilingual.
- Quality limited by dataset size (~3 hrs + custom data).
- Requires Orpheus vocoder/decoder to convert tokens to waveform.
- Do not deploy for impersonation without explicit consent.
Model Details
- Family: Orpheus 3B (decoder-based speech model).
- Tokenizer: Extended with special tokens (`<laughs>`, `<sigh>`); see the sketch after this list.
- Fine-tuning: Supervised fine-tuning on audio–transcript pairs.
- Output: Discrete audio tokens; decode with Orpheus vocoder.
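As a minimal sketch of the tokenizer extension (the exact procedure Alpha AI used is not documented here; the repo id and token list are taken from this card):

```python
from transformers import AutoTokenizer

# Load the base Orpheus tokenizer and register the expression tags as
# additional special tokens so they are never split into sub-word pieces.
tokenizer = AutoTokenizer.from_pretrained("canopylabs/orpheus-3b-0.1-ft")
tokenizer.add_special_tokens({"additional_special_tokens": ["<laughs>", "<sigh>"]})

# If new token ids were added, the model's input embeddings must be resized
# to match, e.g. model.resize_token_embeddings(len(tokenizer)).
```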
Data
Sources:
- Alpha AI custom speech dataset.
- MrDragonFox/Elise (~3 hrs English single-speaker).
Preprocessing:
- Aligned utterances with transcripts.
- Expression tags inserted inline.
- Special tokens added to tokenizer.
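For illustration, a preprocessed training example might look like the record below (field names and the file path are assumptions, not the actual columns used in the Alpha AI pipeline):

```python
# Hypothetical shape of one aligned, tag-augmented training example.
example = {
    "audio": "clips/utterance_000123.wav",  # path to the aligned audio clip (illustrative)
    "text": "Oh no, not again. <sigh> Let's reschedule the meeting.",  # transcript with inline tag
}
```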
Prompt & Input Format
The model accepts text input with optional inline expression tags:
Hello! <laughs> I can help with your schedule today.
Workflow: tokenize → generate audio tokens → decode via vocoder.
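A minimal sketch of the text-to-audio-token step, assuming the checkpoint loads as a standard Hugging Face causal LM (the repo id is the one suggested above; prompt formatting and generation settings are illustrative, and the vocoder step is only indicated in comments):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "alpha-ai/SpeakSpace-Assistant-v1-3B"  # suggested repo id from this card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# 1) Tokenize text with optional inline expression tags.
prompt = "Hello! <laughs> I can help with your schedule today."
inputs = tokenizer(prompt, return_tensors="pt")

# 2) Generate discrete audio tokens autoregressively.
with torch.no_grad():
    audio_tokens = model.generate(
        **inputs, max_new_tokens=1200, do_sample=True, temperature=0.7
    )

# 3) The generated ids are audio codes, not text: pass them through the Orpheus
#    vocoder/decoder used by the base model to obtain a waveform (not shown).
```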
Training Summary
- Objective: Predict audio tokens from transcripts (with expression markers).
- Loss: Causal LM loss.
- Optimizer: AdamW or 8-bit AdamW (exact settings to be added).
- Hyperparameters: Learning rate, batch size, gradient accumulation steps, and seed are to be filled in with the actual values (see the placeholder sketch below).
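As a placeholder sketch only (none of the values below are the hyperparameters of the actual run; they must be replaced once the real configuration is recorded), a supervised fine-tuning setup with `transformers` could look like:

```python
from transformers import TrainingArguments

# All numeric values are illustrative placeholders, not the settings used
# to train SpeakSpace-Assistant-v1-3B.
training_args = TrainingArguments(
    output_dir="speakspace-assistant-v1-3b",
    per_device_train_batch_size=1,       # placeholder
    gradient_accumulation_steps=8,       # placeholder
    learning_rate=2e-5,                  # placeholder
    num_train_epochs=3,                  # placeholder
    optim="adamw_torch",                 # or "adamw_bnb_8bit" for 8-bit AdamW
    bf16=True,
    logging_steps=10,
    seed=42,                             # placeholder
)
```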
Evaluation
Recommended:
- MOS (Mean Opinion Score): naturalness & expressiveness.
- Speaker similarity: ABX or MOS vs. ground truth.
- Intelligibility: WER via ASR (see the sketch after this list).
- Emotion accuracy: Human rating of `<laughs>` and `<sigh>` cues.
Add quantitative results when available.
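A minimal sketch of the intelligibility check, assuming generated clips are transcribed with an off-the-shelf ASR system (e.g. a Whisper checkpoint) and scored with the `jiwer` package; the strings below are illustrative:

```python
import re
from jiwer import wer

def normalize(text: str) -> str:
    # Drop expression tags and punctuation, lowercase, collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())

reference = "Hello! <laughs> I can help with your schedule today."
hypothesis = "hello i can help with your schedule today"  # illustrative ASR output

print(f"WER: {wer(normalize(reference), normalize(hypothesis)):.2%}")
```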
Safety & Responsible Use
- Use only with documented consent for training voices.
- Guard against impersonation risks.
- Consider watermarking or metadata tagging for provenance (see the sketch after this list).
- Do not generalize the voice beyond the training speaker's identity.
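For the provenance point above, one lightweight option is a metadata sidecar written next to each generated clip; the file name and fields below are assumptions, not an established standard:

```python
import json
import time

# Hypothetical provenance record stored alongside a generated audio file.
metadata = {
    "model": "alpha-ai/SpeakSpace-Assistant-v1-3B",
    "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "prompt": "Hello! <laughs> I can help with your schedule today.",
    "consent_documented": True,  # set according to your own consent records
}
with open("output_000123.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```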
License & Attribution
- Base model: `canopylabs/orpheus-3b-0.1-ft` (review base license).
- Dataset: `MrDragonFox/Elise` (check dataset license).
- Fine-tune: Ensure compatibility of licenses.
Suggested citation:
SpeakSpace-Assistant-v1-3B — fine-tune of canopylabs/orpheus-3b-0.1-ft on Alpha AI custom dataset + MrDragonFox/Elise.
Acknowledgements
- canopylabs — Orpheus base model.
- MrDragonFox — Elise dataset.
- Alpha AI research & engineering team.
Contact
Questions, issues, or collaborations:
- Open a discussion on the Hugging Face repo.
- Enterprise contact (Alpha AI): www.alphaai.biz | [email protected]
- Enterprise contact (SpeakSpace): www.speakspace.co | [email protected]