# SAUTE: Speaker-Aware Utterance Embedding Unit
SAUTE is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines elementary discourse unit (EDU)-level utterance embeddings, speaker-sensitive memory, and efficient linear attention to encode rich conversational context with minimal overhead.
## Overview
SAUTE is tailored for:
- Multi-turn conversations
- Multi-speaker interactions
- Long-range dialog dependencies
It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
## Architecture

SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.

- EDU-Level Encoder: Mean-pooled BERT outputs per utterance.
- Speaker Memory: Outer-product-based accumulation per speaker.
- Contextualization Layer: Integrates memory summaries with current token representations.
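The three components above can be sketched as follows. This is an illustrative NumPy mock-up of the mechanism as described, not the released implementation; function names and shapes are assumptions. The key point is that each speaker's memory is a fixed-size d×d matrix, so reading it costs O(d²) per token regardless of dialogue length, which is where the linear (rather than quadratic) scaling comes from.

```python
import numpy as np

def mean_pool(token_embs: np.ndarray) -> np.ndarray:
    """EDU-level utterance embedding: mean over token vectors (T x d -> d)."""
    return token_embs.mean(axis=0)

def build_speaker_memory(utt_embs, speakers, d):
    """Outer-product accumulation per speaker: M_s = sum_i u_i u_i^T (d x d).

    The memory size is fixed per speaker, independent of how many
    utterances that speaker produced.
    """
    memory = {}
    for u, s in zip(utt_embs, speakers):
        memory.setdefault(s, np.zeros((d, d)))
        memory[s] += np.outer(u, u)
    return memory

def read_memory(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Linear-attention-style read-out: q @ M_s, O(d^2) per token."""
    return query @ memory
```

Because the per-speaker summaries are additive, the memory can be updated incrementally as the dialogue grows, without revisiting earlier utterances.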
## Key Features

- Speaker-Aware Memory: Structured per-speaker representation of dialogue context.
- Linear Attention: Efficient and scalable to long dialogues.
- Pretrained Transformer Compatible: Can plug into frozen or fine-tuned BERT models.
- Lightweight: Strong MLM performance improvements with roughly 4M fewer parameters than a 2-layer transformer.
## Performance (on SODA, Masked Language Modeling)

| Model | Avg MLM Acc (%) | Best MLM Acc (%) |
|---|---|---|
| BERT-base (frozen) | 33.45 | 45.89 |
| + 1-layer Transformer | 68.20 | 76.69 |
| + 2-layer Transformer | 71.81 | 79.54 |
| + 1-layer SAUTE (Ours) | 72.05 | 80.40 |
| + 3-layer Transformer | 73.50 | 80.84 |
| + 3-layer SAUTE (Ours) | 75.65 | 85.55 |
SAUTE achieves the best accuracy using fewer parameters than multi-layer transformers.
## Citation / Paper

SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)
## How to Use

Minimal example. The input-construction lines below are illustrative; check the repository for the exact input format expected by the model.

```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (illustrative: one string and one speaker name per utterance)
dialogue = ["Hi, how are you?", "Doing well, thanks!"]
speaker_names = ["Alice", "Bob"]
encoded = tokenizer(dialogue, padding=True, return_tensors="pt")

outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    speaker_names=speaker_names,
)
```