# SAUTE: Speaker-Aware Utterance Embedding Unit
SAUTE is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines elementary discourse unit (EDU)-level utterance embeddings, speaker-sensitive memory, and efficient linear attention to encode rich conversational context with minimal overhead.
## Overview
SAUTE is tailored for:
- Multi-turn conversations
- Multi-speaker interactions
- Long-range dialog dependencies
It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
## Architecture

SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.

- EDU-Level Encoder: Mean-pooled BERT outputs per utterance.
- Speaker Memory: Outer-product-based accumulation per speaker.
- Contextualization Layer: Integrates memory summaries with current token representations.
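The three components above can be sketched as follows. This is an illustrative NumPy mock-up of the mechanism as described, not the released implementation; function names and shapes are assumptions. The key point is that each speaker's memory is a fixed-size d×d matrix, so reading it costs O(d²) per token regardless of dialogue length, which is where the linear (rather than quadratic) scaling comes from.

```python
import numpy as np

def mean_pool(token_embs: np.ndarray) -> np.ndarray:
    """EDU-level utterance embedding: mean over token vectors (T x d -> d)."""
    return token_embs.mean(axis=0)

def build_speaker_memory(utt_embs, speakers, d):
    """Outer-product accumulation per speaker: M_s = sum_i u_i u_i^T (d x d).

    The memory size is fixed per speaker, independent of how many
    utterances that speaker produced.
    """
    memory = {}
    for u, s in zip(utt_embs, speakers):
        memory.setdefault(s, np.zeros((d, d)))
        memory[s] += np.outer(u, u)
    return memory

def read_memory(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Linear-attention-style read-out: q @ M_s, O(d^2) per token."""
    return query @ memory
```

Because the per-speaker summaries are additive, the memory can be updated incrementally as the dialogue grows, without revisiting earlier utterances.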
## Key Features

- Speaker-Aware Memory: Structured per-speaker representation of dialogue context.
- Linear Attention: Efficient and scalable to long dialogues.
- Pretrained Transformer Compatible: Can plug into frozen or fine-tuned BERT models.
- Lightweight: Strong MLM performance improvements with roughly 4M fewer parameters than a 2-layer transformer.
## Performance (on SODA, Masked Language Modeling)

| Model | Avg MLM Acc (%) | Best MLM Acc (%) |
|---|---|---|
| BERT-base (frozen) | 33.45 | 45.89 |
| + 1-layer Transformer | 68.20 | 76.69 |
| + 2-layer Transformer | 71.81 | 79.54 |
| + 1-layer SAUTE (Ours) | 72.05 | 80.40 |
| + 3-layer Transformer | 73.50 | 80.84 |
| + 3-layer SAUTE (Ours) | 75.65 | 85.55 |
SAUTE achieves the best accuracy using fewer parameters than multi-layer transformers.
## Citation / Paper

SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)
## How to Use

Minimal example. The input-construction lines below are illustrative; check the repository for the exact input format expected by the model.

```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (illustrative: one string and one speaker name per utterance)
dialogue = ["Hi, how are you?", "Doing well, thanks!"]
speaker_names = ["Alice", "Bob"]
encoded = tokenizer(dialogue, padding=True, return_tensors="pt")

outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    speaker_names=speaker_names,
)
```