Attention Masks and Pad Tokens in Transformer Generation: Research Questions
Core Problem Statement
When running transformer models (specifically Llama-3.2-1B-Instruct) for text generation, we encounter warnings about missing attention masks and pad tokens, even for single input sequences. This leads to inconsistent generation outputs despite identical inputs.
Warning Messages Observed
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.
Key Research Questions
1. Why do single inputs require attention masks?
Initial Assumption: Single sequences without padding shouldn't need attention masks.
Observed Reality: Even single inputs produce different generation outputs when the attention mask is missing.
2. What is the relationship between pad tokens and attention masks?
Question: How do pad_token_id and attention_mask work together in the generation process?
3. Why does pad_token_id = eos_token_id cause issues?
Specific Issue: When padding token equals end-of-sequence token, what ambiguity does this create?
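The third question can be checked directly on the tokenizer. A minimal sketch, using the local model path from the setup further below:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/Llama-3.2-1B-Instruct")

# Llama-style tokenizers typically ship no dedicated pad token, so generate()
# falls back to reusing the EOS id, which is what the warning reports.
print("pad_token:", tok.pad_token, tok.pad_token_id)
print("eos_token:", tok.eos_token, tok.eos_token_id)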
Code Analysis
Current Implementation (Problematic)
def chat_current(system_prompt: str, user_prompt: str) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    # Only returns the input_ids tensor
    input_ids = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(lm.device)

    with torch.inference_mode():
        output_ids = lm.generate(
            input_ids,  # Missing: attention_mask, pad_token_id
            max_new_tokens=2048,
            do_sample=True,
            temperature=0.2,
            repetition_penalty=1.1,
            top_k=100,
            top_p=0.95,
        )
    return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
Fixed Implementation
def chat_fixed(system_prompt: str, user_prompt: str) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    # Returns a dictionary with input_ids AND attention_mask
    inputs = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # KEY CHANGE: get both components
    )
    input_ids = inputs["input_ids"].to(lm.device)
    attention_mask = inputs["attention_mask"].to(lm.device)

    with torch.inference_mode():
        output_ids = lm.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,  # Explicit attention guidance
            pad_token_id=tok.eos_token_id,  # Explicit pad token
            max_new_tokens=2048,
            do_sample=True,
            temperature=0.2,
            repetition_penalty=1.1,
            top_k=100,
            top_p=0.95,
        )
    return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
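A minimal usage sketch (the prompts are illustrative placeholders; tok and lm come from the setup below):

answer = chat_fixed(
    system_prompt="You are a helpful assistant.",
    user_prompt="What is the capital of France?",
)
print(answer)

Passing pad_token_id explicitly also suppresses the "Setting pad_token_id to eos_token_id" notice, since generate() no longer has to pick a fallback on its own.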
Model and Tokenizer Setup
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "models/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)

# Critical: set a pad token if the tokenizer does not provide one
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

lm = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
Observed Behavioral Differences
Input Structure Analysis
# A single input contains multiple components:
messages = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "What is the capital of France?"},
]

# After apply_chat_template, this becomes one token sequence:
# [system_tokens, user_tokens, assistant_start_token]
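To see the concrete sequence, the templated prompt can be decoded back to text. A short sketch, assuming the tok object and messages list from above:

ids = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(ids.shape)           # (1, prompt_length) -- a single sequence, no padding
print(tok.decode(ids[0]))  # shows the special tokens and role headers inserted by the template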
Technical Hypotheses for Investigation
Hypothesis 1: Internal Masking Ambiguity
When attention_mask is missing, the model cannot distinguish between (see the sketch after this list):
- Real input tokens that should influence generation
- Structural tokens (system prompts, role markers)
- Token boundaries between different message roles
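For a single, unpadded prompt the missing mask can be written out explicitly. A minimal sketch, with input_ids as produced by apply_chat_template above:

import torch

# Single unpadded sequence: every position is a real token, so the explicit
# mask generate() needs is simply all ones.
attention_mask = torch.ones_like(input_ids)

# In a padded batch, the mask instead has to be inferred from the pad token:
#   attention_mask = (input_ids != tok.pad_token_id).long()
# That inference is exactly what becomes ambiguous once pad_token_id == eos_token_id.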
Hypothesis 2: EOS Token Dual Purpose Confusion
When pad_token_id == eos_token_id, the model faces ambiguity:
# Same token (128001) serves dual purposes:
# 1. End of sequence marker
# 2. Padding token for batch processing
# Model cannot infer which purpose applies in context
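A toy illustration of that ambiguity (hypothetical token ids; 128001 stands in for the shared pad/EOS id):

import torch

EOS = 128001                      # also used as the pad id

# Two prompts of different length, right-padded with the EOS/pad id:
batch = torch.tensor([
    [101, 102, 103, EOS],         # was this EOS padding, or did the text really end here?
    [101, 102, EOS,  EOS],        # one real EOS plus one pad? two pads? impossible to tell
])

# The only mask generate() could infer on its own:
inferred = (batch != EOS).long()
print(inferred)
# tensor([[1, 1, 1, 0],
#         [1, 1, 0, 0]])
# Real EOS tokens and padding get masked identically, which is why the warning
# says the attention mask "cannot be inferred" in this configuration.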
Hypothesis 3: Autoregressive Generation Context Boundary Issues
During generation, the model needs to know (see the decoding sketch after this list):
- Which input tokens provide valid context for next token prediction
- Where the "prompt" ends and "generation" begins
- How to weight attention across different input components
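To make this bookkeeping concrete, below is a hand-rolled greedy decoding loop. It is a simplified sketch, not what model.generate() does internally (no KV cache, batch size one, tok and lm from the setup above):

import torch

def greedy_decode(lm, input_ids, attention_mask, max_new_tokens=20):
    # The prompt/generation boundary is simply the original prompt length.
    prompt_len = input_ids.shape[-1]
    for _ in range(max_new_tokens):
        with torch.inference_mode():
            logits = lm(input_ids=input_ids, attention_mask=attention_mask).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        # Every newly generated token is a real token, so the mask grows with a 1.
        attention_mask = torch.cat([attention_mask, torch.ones_like(next_id)], dim=-1)
        if next_id.item() == tok.eos_token_id:                    # assumes batch size 1
            break
    return input_ids[:, prompt_len:]                              # continuation only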
Research Objectives
Primary Questions
- Mechanism Analysis: How exactly does missing attention_mask affect the internal attention computation?
- Consistency Impact: Why do identical inputs produce different outputs without proper masking?
- Single vs Batch Behavior: What differences exist between single-sequence and batched-sequence processing? (See the batched-generation sketch after this list.)
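For the single-vs-batch comparison, here is a hedged sketch of how a padded batch is typically prepared for a decoder-only model: left padding plus an explicit mask. The prompt strings are illustrative and skip the chat template for brevity; tok and lm come from the setup above.

import torch

# Decoder-only models are usually left-padded so the last position of every row
# is a real token rather than padding.
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

prompts = ["What is the capital of France?", "Name three prime numbers."]
batch = tok(prompts, return_tensors="pt", padding=True).to(lm.device)

with torch.inference_mode():
    out = lm.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # marks which positions are padding
        pad_token_id=tok.eos_token_id,
        max_new_tokens=64,
    )
print(tok.batch_decode(out[:, batch["input_ids"].shape[-1]:], skip_special_tokens=True))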
Secondary Questions
- Model-Specific Behavior: Do different transformer architectures handle missing attention masks differently?
- Generation Parameter Interaction: How do attention mask issues interact with sampling parameters (temperature, top_p, etc.)?
- Performance Impact: What computational overhead does proper attention masking add?
Key Technical Areas for Deep Research
Attention Mechanism Internals
- How attention weights are computed with and without explicit masks (see the sketch after this list)
- Impact on multi-head attention distributions
- Interaction with causal masking in autoregressive models
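A self-contained sketch of the underlying arithmetic: the padding mask and the causal mask jointly decide which scores are set to -inf before the softmax. This is a simplified single-head version for illustration, not the library's actual implementation:

import torch
import torch.nn.functional as F

def masked_attention(q, k, v, attention_mask):
    # q, k, v: (batch, seq, dim); attention_mask: (batch, seq) with 1 = real token, 0 = pad
    seq = q.shape[1]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)        # (batch, seq, seq)

    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))    # no attending to the future
    padding = attention_mask[:, None, :].bool()                    # broadcast over query positions
    scores = scores.masked_fill(~(causal & padding), float("-inf"))

    return F.softmax(scores, dim=-1) @ v                           # (batch, seq, dim)

# Without an explicit attention_mask only the causal part applies, so padded or
# otherwise irrelevant key positions still receive attention probability mass.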
Tokenizer Behavior
- How apply_chat_template constructs input sequences
- Default attention mask generation behavior
- Role of special tokens in attention computation
Generation Process
- How model.generate() handles missing parameters
- Internal assumptions and fallback behaviors
- Impact on sampling and beam search algorithms
Expected Research Outcomes
Understanding of:
- Exact mechanism causing output inconsistency
- Best practices for single sequence generation
- Relationship between attention masking and generation quality
- Guidelines for production transformer deployment
References for Deep Research
- Hugging Face Transformers documentation on attention masks
- Technical blogs on transformer attention mechanisms (2024)
- Community discussions on pad token vs attention mask differences
- Official model documentation for Llama architecture attention handling