SpeakSpace-Assistant-v1-3B

Alpha AI (www.alphaai.biz) fine-tuned canopylabs/orpheus-3b-0.1-ft to create SpeakSpace-Assistant-v1-3B — an English-only, single-speaker voice assistant model. The fine-tune uses custom voice recordings plus the Elise dataset (~3 hours, single-speaker English speech). Transcripts were augmented with emotion/expression tags like <sigh> and <laughs>, added as special tokens in the Orpheus tokenizer.

⚠️ Important: This model is intended for research, prototyping, and internal product demos. Do not use it to impersonate a real person without explicit consent. Review base-model and dataset licenses before commercial use.


TL;DR

  • Base: canopylabs/orpheus-3b-0.1-ft (~3B params).
  • Data: Custom Alpha AI dataset + MrDragonFox/Elise (English, ~3 hours).
  • Objective: Produce natural, expressive speech with inline emotion cues (<laughs>, <sigh>).
  • Language: English only.
  • Repo: alpha-ai/SpeakSpace-Assistant-v1-3B (suggested ID).

Intended Use & Limitations

Intended use:

  • Internal voice assistants and demos.
  • Research on expressive TTS and emotion-tag-conditioned speech.
  • Applications where transcripts include small expressive markers.

Limitations:

  • Not multi-speaker or multilingual.
  • Quality limited by dataset size (~3 hrs + custom data).
  • Requires Orpheus vocoder/decoder to convert tokens to waveform.
  • Do not deploy for impersonation without explicit consent.

Model Details

  • Family: Orpheus 3B (decoder-based speech model).
  • Tokenizer: Extended with special tokens (<laughs>, <sigh>).
  • Fine-tuning: Supervised training on audio–transcript pairs.
  • Output: Discrete audio tokens; decode with the Orpheus vocoder.
  • Weights: Safetensors, F16 (~3.3B parameters).
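
The tag-registration step is not published as code; below is a minimal sketch of how it could look with the transformers API, assuming the tags are not already in the base vocabulary. The repo IDs are taken from this card; everything else is illustrative.

```python
# Illustrative sketch only: register the expression tags as special
# tokens on the base Orpheus tokenizer and resize the embedding matrix
# to cover the new IDs (the actual fine-tuning code was not published).
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "canopylabs/orpheus-3b-0.1-ft"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<laughs>", "<sigh>"]}
)
model.resize_token_embeddings(len(tokenizer))  # cover any newly added IDs
```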

Data

Sources:

  • Custom Alpha AI voice recordings (single-speaker English).
  • MrDragonFox/Elise: ~3 hours of single-speaker English speech.

Preprocessing:

  • Aligned utterances with transcripts.
  • Expression tags inserted inline.
  • Special tokens added to tokenizer.
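
A quick sanity check on that last step is sketched below; it assumes the suggested repo ID and that the published tokenizer carries the tags. Setting add_special_tokens=False keeps BOS/EOS markers out of the count.

```python
# Hypothetical post-preprocessing check: each expression tag should
# tokenize to exactly one special-token ID, not a run of subwords.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alpha-ai/SpeakSpace-Assistant-v1-3B")

for tag in ["<laughs>", "<sigh>"]:
    ids = tokenizer(tag, add_special_tokens=False)["input_ids"]
    assert len(ids) == 1, f"{tag} split into {len(ids)} pieces"
```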

Prompt & Input Format

Model accepts text input with optional inline expressions:

Hello! <laughs> I can help with your schedule today.

Workflow: tokenize → generate audio tokens → decode via vocoder.
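
A minimal inference sketch of that workflow, assuming the suggested repo ID and the standard transformers generate() loop. The decode_to_waveform() helper is a placeholder for the Orpheus vocoder step, which follows the base model's reference code rather than anything defined here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "alpha-ai/SpeakSpace-Assistant-v1-3B"  # suggested repo ID
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)

prompt = "Hello! <laughs> I can help with your schedule today."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    audio_tokens = model.generate(**inputs, max_new_tokens=1024)

# decode_to_waveform() is hypothetical: it stands in for mapping the
# generated audio-token IDs back to codec codes and running the Orpheus
# vocoder/decoder to produce a waveform.
# waveform = decode_to_waveform(audio_tokens)
```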


Training Summary

  • Objective: Predict audio tokens from transcripts (with expression markers).
  • Loss: Causal LM loss.
  • Optimizer: AdamW or AdamW-8bit (exact settings not published).
  • Hyperparameters: Learning rate, batch size, gradient accumulation, and seed are not yet documented; see the placeholder sketch below.
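
For orientation only, a hypothetical transformers Trainer configuration is sketched below. Every numeric value is a placeholder, not the setting actually used for this model.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="speakspace-assistant-v1-3b",
    learning_rate=2e-5,                # placeholder, not the real value
    per_device_train_batch_size=1,     # placeholder
    gradient_accumulation_steps=8,     # placeholder
    num_train_epochs=3,                # placeholder
    bf16=True,                         # assumption
    optim="adamw_torch",               # or "adamw_bnb_8bit" for the 8-bit variant
    seed=42,                           # placeholder
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()  # causal LM loss over text + audio token sequences
```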

Evaluation

Recommended:

  • MOS (Mean Opinion Score): naturalness & expressiveness.
  • Speaker similarity: ABX or MOS vs. ground truth.
  • Intelligibility: WER via ASR.
  • Emotion accuracy: Human rating of <laughs>, <sigh> cues.

Add quantitative results when available.
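
As an example of the intelligibility check, here is a small WER sketch using the jiwer package. The ASR hypothesis string is hard-coded for illustration; in practice it would come from transcribing generated audio with an off-the-shelf ASR model.

```python
import re
from jiwer import wer

reference = "Hello! <laughs> I can help with your schedule today."
# Strip expression tags and punctuation, since ASR output contains neither.
clean = re.sub(r"<[^>]+>", " ", reference)
clean = re.sub(r"[^\w\s]", "", clean).lower()

hypothesis = "hello i can help with your schedule today"  # stand-in ASR output
print(f"WER: {wer(clean, hypothesis):.2%}")
```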


Safety & Responsible Use

  • Use only with documented consent for training voices.
  • Guard against impersonation risks.
  • Consider watermarking or metadata tagging for provenance.
  • The model learns a single speaker's voice; do not present it as any other identity.

License & Attribution

  • Base model: canopylabs/orpheus-3b-0.1-ft (review base license).
  • Dataset: MrDragonFox/Elise (check dataset license).
  • Fine-tune: Ensure compatibility of licenses.

Suggested citation:

SpeakSpace-Assistant-v1-3B — fine-tune of canopylabs/orpheus-3b-0.1-ft on Alpha AI custom dataset + MrDragonFox/Elise.

Acknowledgements

  • canopylabs — Orpheus base model.
  • MrDragonFox — Elise dataset.
  • Alpha AI research & engineering team.

Contact

Questions, issues, or collaborations: contact Alpha AI via www.alphaai.biz.
