πŸ‘¨β€πŸ³ SAUTE: Speaker-Aware Utterance Embedding Unit

SAUTE is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines EDU-level utterance embeddings, speaker-sensitive memory, and efficient linear attention to encode rich conversational context with minimal overhead.


🧠 Overview

SAUTE is tailored for:

  • πŸ—£οΈ Multi-turn conversations
  • πŸ‘₯ Multi-speaker interactions
  • 🧡 Long-range dialog dependencies

It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
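To make the complexity argument concrete, the toy PyTorch snippet below contrasts standard softmax attention, which materializes an n × n score matrix, with a kernelized linear-attention formulation that only accumulates a d × d summary. The feature map and dimensions here are illustrative and are not taken from the SAUTE implementation.

import torch
import torch.nn.functional as F

n, d = 1024, 64                                   # tokens, head dimension
Q, K, V = (torch.randn(n, d) for _ in range(3))

# Full self-attention: an explicit (n, n) score matrix -> quadratic in dialogue length
scores = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)
out_full = scores @ V                             # (n, d)

# Linear attention: phi(Q) @ (phi(K)^T V) needs only a (d, d) summary -> linear in length
phi = lambda x: F.elu(x) + 1.0                    # illustrative positive feature map
summary = phi(K).T @ V                            # (d, d), akin to SAUTE's per-speaker memory
norm = phi(Q) @ phi(K).sum(dim=0)                 # (n,)
out_linear = (phi(Q) @ summary) / norm.unsqueeze(-1).clamp(min=1e-6)

The (d, d) summary grows with the hidden size, not with the dialogue length, which is what keeps the memory and compute cost linear in the number of tokens.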


🧱 Architecture

πŸ” SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.

  • EDU-Level Encoder: Mean-pooled BERT outputs per utterance.
  • Speaker Memory: Outer-product-based accumulation per speaker.
  • Contextualization Layer: Integrates memory summaries with current token representations.

Figure: SAUTE architecture overview (saute-architecture).
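As a rough illustration of how these pieces fit together, here is a minimal PyTorch sketch. The function names, the key/value choice, and the exact memory update are assumptions for exposition and may differ from the released model.

import torch

def edu_embedding(token_states, attention_mask):
    # EDU-level encoder: mean-pool BERT token states over one utterance
    mask = attention_mask.unsqueeze(-1).float()                # (seq_len, 1)
    return (token_states * mask).sum(dim=0) / mask.sum().clamp(min=1.0)

def build_speaker_memory(edu_embeddings, speakers, hidden_size):
    # Speaker memory: accumulate an outer-product summary per speaker
    memory = {}
    for emb, speaker in zip(edu_embeddings, speakers):
        slot = memory.setdefault(speaker, torch.zeros(hidden_size, hidden_size))
        memory[speaker] = slot + torch.outer(emb, emb)         # (hidden, hidden)
    return memory

def contextualize(token_states, speaker, memory):
    # Contextualization layer: read the speaker's memory with tokens as queries
    context = token_states @ memory[speaker]                   # (seq_len, hidden)
    return token_states + context                              # residual injection

In the actual model these operations sit inside trained layers with learned projections and normalization; the sketch only captures the data flow.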


πŸš€ Key Features

  • 🧠 Speaker-Aware Memory: Structured per-speaker representation of dialogue context.
  • ⚑ Linear Attention: Efficient and scalable to long dialogues.
  • 🧩 Pretrained Transformer Compatible: Plugs into frozen or fine-tuned BERT models (a frozen-backbone sketch follows this list).
  • πŸͺΆ Lightweight: Adds only ~4M parameters, fewer than a 2-layer transformer, while delivering strong MLM accuracy gains.
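For the frozen-backbone setting, a typical pattern is to disable gradients on the BERT encoder and train only the SAUTE layers on top. The snippet below is a generic sketch of that setup, not the repository's training script; the `saute_layers` name is hypothetical.

from transformers import BertModel

# Frozen-backbone setup: disable gradients on BERT so only the SAUTE layers train
bert = BertModel.from_pretrained("bert-base-uncased")
for param in bert.parameters():
    param.requires_grad = False

# The trainable SAUTE layers would then consume bert(...)'s token states, e.g.
# optimizer = torch.optim.AdamW(saute_layers.parameters(), lr=1e-4)  # hypothetical names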

πŸ“ˆ Performance (on SODA, Masked Language Modeling)

| Model | Avg MLM Acc (%) | Best MLM Acc (%) |
|---|---|---|
| BERT-base (frozen) | 33.45 | 45.89 |
| + 1-layer Transformer | 68.20 | 76.69 |
| + 2-layer Transformer | 71.81 | 79.54 |
| + 1-layer SAUTE (Ours) | 72.05 | 80.40 |
| + 3-layer Transformer | 73.50 | 80.84 |
| + 3-layer SAUTE (Ours) | 75.65 | 85.55 |

SAUTE achieves the highest average and best-case MLM accuracy while using fewer parameters than the multi-layer transformer baselines.


πŸ“š Citation / Paper

πŸ“„ SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)


πŸ”§ How to Use

from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare example inputs: tokenize utterances and list each utterance's speaker
utterances = ["Hey, how was your day?", "Pretty good, thanks for asking!"]
speaker_names = ["Alice", "Bob"]  # one speaker per utterance (illustrative format)
encoded = tokenizer(utterances, padding=True, return_tensors="pt")

# Encode the dialogue with speaker-aware context
outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    speaker_names=speaker_names
)