Model Card for Emotive Icelandic

EmotiveIcelandic is an extended version of parler-tts/parler-tts-mini-multilingual-v1.1, fine-tuned on Icelandic emotional speech.
Features of the output are controlled through a natural-language description prompt. EmotiveIcelandic is trained on all the existing ParlerTTS features, in addition
to a description of the emotional content and intensity of the utterance (see Training Data below).

See the snippet in the Usage section to get started with the model.

Model Details

Model Description

This multilingual checkpoint of ParlerTTS is fine-tuned on talromur3_with_prompts, a prompt-annotated, high-quality Icelandic emotive speech corpus. The model output is controlled through a natural-language description of the utterance-level pitch, speech monotony, speech quality, reverberation, speaking rate and emotional content.

  • Model type: Text-To-Speech
  • Language(s) (NLP): Icelandic
  • License: CC-by-4.0
  • Finetuned from model: parler-tts/parler-tts-mini-multilingual-v1.1

Usage

Use the code below to get started with the model.

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model, the tokenizer for the text to be spoken, and the
# tokenizer for the description prompt (taken from the text encoder).
model = ParlerTTSForConditionalGeneration.from_pretrained("atlithor/EmotiveIcelandic").to(device)
tokenizer = AutoTokenizer.from_pretrained("atlithor/EmotiveIcelandic")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# `prompt` is the Icelandic text to synthesize; `description` controls
# the speaker, emotion, and recording characteristics.
prompt = "Þetta er frábær hugmynd!" # E: this is a great idea!
description = "Ingrid sounds very happy in this utterance. The recording is of very high quality, with Ingrid's voice sounding clear and very close up."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform and write it to disk at the model's sampling rate.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("ingrid_happy.wav", audio_arr, model.config.sampling_rate)

Training Details

Training Data

The model is trained on talromur3_with_prompts, a high-quality emotive Icelandic speech corpus.
The corpus has been annotated with natural-language descriptions. It is multi-speaker, consisting of seven named voices; including a speaker's name in the description prompt gives consistent synthesis in that voice.

Speaker name    Speaker gender
Astrid          female
Freya           female
Ingrid          female
Frida           female
Leif            male
Anders          male
Bjorn           male

The training utterances cover six emotion classes:

  1. Neutral
  2. Happy
  3. Sad
  4. Angry
  5. Surprised
  6. Helpful (child-directed)

All non-neutral utterances are also assigned an emotional intensity label from 1 (very low) to 5 (very high). The emotional intensity
can also be specified in the description prompt, e.g.:

  • Freya sounds extremely happy
  • Leif is somewhat surprised
  • Anders sounds happy
  • Ingrid comes across as quite sad
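
Descriptions like the ones above can also be assembled programmatically. The following is a minimal sketch, not the procedure used to annotate the corpus: the function name build_description and the INTENSITY_ADVERBS mapping from the 1-5 labels to adverbs are illustrative assumptions. Only the speaker names, the emotion classes, and the general phrasing pattern come from this card.

```python
# Speaker names and emotion classes as listed in this model card.
SPEAKERS = {"Astrid", "Freya", "Ingrid", "Frida", "Leif", "Anders", "Bjorn"}
EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised", "helpful"}

# ASSUMPTION: hypothetical mapping from intensity label (1-5) to an adverb.
# The wording actually used in the training descriptions may differ.
INTENSITY_ADVERBS = {1: "slightly", 2: "somewhat", 3: "", 4: "very", 5: "extremely"}


def build_description(speaker, emotion, intensity=None):
    """Return a one-sentence description prompt (hypothetical helper)."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if intensity is not None:
        if emotion == "neutral":
            # Only non-neutral utterances carry an intensity label.
            raise ValueError("neutral utterances have no intensity label")
        if intensity not in INTENSITY_ADVERBS:
            raise ValueError("intensity must be an integer from 1 to 5")
    adverb = INTENSITY_ADVERBS[intensity] if intensity is not None else ""
    # Drop the adverb when it is empty so no double space is produced.
    phrase = " ".join(word for word in (adverb, emotion) if word)
    return f"{speaker} sounds {phrase} in this utterance."


print(build_description("Freya", "happy", 5))
print(build_description("Anders", "happy"))
```

The returned string can be passed as the description in the Usage snippet, optionally followed by a sentence about recording quality.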

Citation

Coming later

