---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- qa-metrics
- call-center
- multi-head
- distilbert
- transcript-analysis
- customer-service
- quality-assurance
- child-helplines
- crisis-support
- social-impact
- swahili
- east-africa
language:
- en
datasets:
- custom
- openchs/synthetic_helpline_qa_scoring_v1
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: qa-helpline-distilbert-v1
  results:
  - task:
      type: text-classification
      name: Quality Assurance Multi-Head Classification
    metrics:
    - type: accuracy
      value: 0.85
      name: Overall Accuracy
    - type: f1
      value: 0.82
      name: Weighted F1 Score
widget:
- text: >-
    Hello, thank you for calling our helpline. My name is Sarah, how can I help
    you today? I understand your concern completely. Let me check that
    information for you right away. Please hold for just a moment. Thank you for
    holding. I've found the solution and can help you now. Is there anything
    else I can assist with? Thank you for calling, have a wonderful day!
base_model:
- distilbert/distilbert-base-uncased
---

# QA Multi-Head DistilBERT for Helpline Quality Assessment

## Model Description

This is a fine-tuned DistilBERT model designed for **multi-head quality assurance (QA) classification** of call center and helpline transcripts. Developed by **BITZ IT Consulting** as part of an AI pipeline for **child helplines and crisis support services** in East Africa, this model evaluates transcript quality across six key dimensions with 17 specific sub-metrics.

The model addresses a critical operational challenge in helpline services: most calls between agents and callers go unmonitored because manual quality assurance takes so much effort. Supervisors traditionally have to listen to entire call recordings to evaluate performance, which makes comprehensive QA impractical at scale. Automated QA scoring reduces this supervisory burden and enables systematic evaluation of call quality across all interactions, supporting consistent service standards and targeted agent development.

## Model Architecture

- **Base Model**: DistilBERT (distilbert-base-uncased)
- **Architecture**: Multi-head classifier with 6 specialized output heads
- **Input**: Call center/helpline transcripts (max 512 tokens)
- **Output**: Binary predictions for 17 quality assurance sub-metrics
- **Training**: Fine-tuned on domain-specific helpline and call center data

## QA Heads and Sub-metrics

| Head | Sub-metrics | Count | Description |
|------|-------------|--------|-------------|
| **Opening** | Use of call opening phrase | 1 | Evaluates proper call initiation protocols |
| **Listening** | Non-interruption, empathy, paraphrasing, politeness, confidence | 5 | Assesses active listening and communication skills |
| **Proactiveness** | Extra issue solving, satisfaction confirmation, follow-up | 3 | Measures proactive service approach |
| **Resolution** | Information accuracy, language use, consultation, process adherence, clarity | 5 | Evaluates problem-solving effectiveness |
| **Hold** | Hold explanation, gratitude for waiting | 2 | Assesses proper hold procedures |
| **Closing** | Proper closing phrase | 1 | Evaluates professional call conclusion |

**Total Sub-metrics**: 17 across 6 main QA dimensions

## Social Impact and Use Case

This model is specifically designed to support **child helplines and crisis intervention services** in East Africa. It addresses several critical challenges:


- **Consistent Care**: Ensures uniform quality standards across different operators
- **Training Support**: Provides objective feedback for helpline staff development
- **Scalable Monitoring**: Enables quality assurance at scale for under-resourced services

The model is part of a broader AI pipeline that includes ASR (Automatic Speech Recognition), translation, entity recognition, case classification, and summarization components, all focused on protecting vulnerable populations.

## Model Performance

### Overall Performance
- **Overall Accuracy**: ~87.5%
- **Average F1 Score**: ~91.2%
- **Training Approach**: Multi-task learning with BCEWithLogitsLoss per head
- **Evaluation**: Comprehensive metrics across all QA dimensions

### Detailed Per-Head Performance

| Head | Accuracy | Precision | Recall | F1 Score | Performance Level |
|------|----------|-----------|--------|----------|-------------------|
| **Closing** | 100.0% | 100.0% | 100.0% | 100.0% | Perfect |
| **Resolution** | 90.5% | 98.5% | 98.5% | 98.5% | Excellent |
| **Hold** | 90.5% | 66.7% | 100.0% | 80.0% | Good |
| **Proactiveness** | 85.7% | 91.7% | 95.7% | 93.6% | Good |
| **Opening** | 85.7% | 85.7% | 85.7% | 85.7% | Good |
| **Listening** | 71.4% | 98.5% | 93.1% | 95.7% | Mixed Performance |


### Performance Insights

- **Strongest Performance**: The Closing and Resolution heads achieve the highest F1 scores (100.0% and 98.5%)
- **Consistent Performance**: The Opening and Proactiveness heads show balanced precision and recall
- **High Precision**: Five of the six heads exceed 85% precision; Hold is the exception at 66.7%
- **Listening Head**: Lower accuracy (71.4%) alongside a high F1 score (95.7%) indicates that the model usually identifies listening behaviors when they are present, with some false negatives (93.1% recall)
- **Hold Head**: High accuracy but lower precision (66.7%) suggests the head over-predicts positives: it catches every positive case (100% recall) at the cost of some false positives
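
A gap between accuracy and F1 on a multi-label head such as Listening can arise when accuracy is counted as an exact match over all of the head's sub-metrics while F1 is computed per label. This card does not state how either metric was computed, so the sketch below only illustrates that assumption on made-up labels; `exact_match_accuracy` and `micro_f1` are hypothetical helper names, not part of the released code.

```python
from typing import List

def exact_match_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of transcripts where every sub-metric of the head is predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """F1 computed over all individual (transcript, sub-metric) labels pooled together."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up data: 4 transcripts x 5 sub-metrics, with a single missed label.
y_true = [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 0, 1, 1], [1, 1, 1, 1, 1]]
y_pred = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 1], [1, 1, 0, 1, 1], [1, 1, 1, 1, 1]]

print(exact_match_accuracy(y_true, y_pred))  # 0.75
print(round(micro_f1(y_true, y_pred), 3))    # 0.973
```

One wrong label out of twenty lowers exact-match accuracy by a full transcript while barely moving the pooled F1, which is the pattern seen in the Listening row.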


## Installation and Usage

### Quick Start

```bash
pip install transformers torch
```

### Model Classes

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertPreTrainedModel, AutoTokenizer

class MultiHeadQAClassifier(DistilBertPreTrainedModel):
    """
    Multi-head QA classifier for call center quality assessment.
    Each head corresponds to a different QA metric with specific sub-metrics.
    """
    
    def __init__(self, config):
        super().__init__(config)
        
        # QA heads configuration
        self.heads_config = getattr(config, 'heads_config', {
            "opening": 1,
            "listening": 5,
            "proactiveness": 3,
            "resolution": 5,
            "hold": 2,
            "closing": 1
        })
        
        self.bert = DistilBertModel(config)
        classifier_dropout = getattr(config, 'classifier_dropout', 0.1)
        self.dropout = nn.Dropout(classifier_dropout)

        # Multiple classification heads
        self.classifiers = nn.ModuleDict({
            head_name: nn.Linear(config.hidden_size, num_labels)
            for head_name, num_labels in self.heads_config.items()
        })
        
        # Initialize weights
        self.post_init()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.last_hidden_state[:, 0])  # [CLS] token

        logits = {}
        losses = {}
        total_loss = 0

        for head_name, classifier in self.classifiers.items():
            head_logits = classifier(pooled_output)
            logits[head_name] = torch.sigmoid(head_logits)  # Probabilities, despite the "logits" key name

            # Calculate loss if labels provided
            if labels is not None and head_name in labels:
                loss_fn = nn.BCEWithLogitsLoss()
                loss = loss_fn(head_logits, labels[head_name])
                losses[head_name] = loss.item()
                total_loss += loss

        return {
            "logits": logits,
            "loss": total_loss if labels is not None else None,
            "losses": losses if labels is not None else None
        }
```

### Inference Function

```python
def predict_qa_metrics(text: str, model, tokenizer, threshold: float = 0.5, device=None):
    """
    Predict QA metrics for a helpline transcript and print a formatted report.
    
    Args:
        text: Input transcript text
        model: Loaded MultiHeadQAClassifier model
        tokenizer: DistilBERT tokenizer
        threshold: Classification threshold (default: 0.5)
        device: Device to use for inference
    
    Returns:
        Dictionary with predictions and probabilities for each QA metric
    """
    if device is None:
        device = next(model.parameters()).device
    
    model.eval()
    
    # Sub-metric labels for formatted output
    HEAD_SUBMETRIC_LABELS = {
        "opening": ["Use of call opening phrase"],
        "listening": [
            "Caller was not interrupted",
            "Empathizes with the caller", 
            "Paraphrases or rephrases the issue",
            "Uses 'please' and 'thank you'",
            "Does not hesitate or sound unsure"
        ],
        "proactiveness": [
            "Willing to solve extra issues",
            "Confirms satisfaction with action points",
            "Follows up on case updates"
        ],
        "resolution": [
            "Gives accurate information",
            "Correct language use",
            "Consults if unsure",
            "Follows correct steps",
            "Explains solution process clearly"
        ],
        "hold": [
            "Explains before placing on hold",
            "Thanks caller for holding"
        ],
        "closing": ["Proper call closing phrase used"]
    }

    # Tokenize input
    encoding = tokenizer(
        text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)
    
    # Forward pass
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs["logits"]
    
    # Format results
    results = {}
    print(f"📞 Transcript: {text}\n")
    
    total_positive = 0
    total_metrics = 0
    
    for head_name, probs in logits.items():
        probs_np = probs.cpu().numpy()[0]
        submetrics = HEAD_SUBMETRIC_LABELS.get(head_name, [f"Submetric {i+1}" for i in range(len(probs_np))])
        
        print(f"🔹 {head_name.upper()}:")
        head_results = []
        
        for prob, submetric in zip(probs_np, submetrics):
            prediction = prob > threshold
            indicator = "✓" if prediction else "✗"
            
            if prediction:
                total_positive += 1
            total_metrics += 1
            
            result_item = {
                "submetric": submetric,
                "probability": float(prob),
                "prediction": bool(prediction),
                "indicator": indicator
            }
            head_results.append(result_item)
            
            print(f"  ➤ {submetric}: P={prob:.3f} → {indicator}")
        
        results[head_name] = head_results
    
    # Overall summary: share of sub-metrics predicted positive (a QA score, not model accuracy)
    overall_accuracy = (total_positive / total_metrics) * 100
    print(f"\nOverall Score: {total_positive}/{total_metrics} ({overall_accuracy:.1f}%)")
    
    results["summary"] = {
        "total_positive": total_positive,
        "total_metrics": total_metrics,
        "accuracy": overall_accuracy
    }
    
    return results
```

### Complete Usage Example

```python
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
MODEL_NAME = "openchs/qa-helpline-distilbert-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MultiHeadQAClassifier.from_pretrained(MODEL_NAME)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Example helpline transcript
transcript = """
Hello, thank you for calling our child helpline. My name is Sarah, how can I help you today? 
I understand your concern completely and I want to help you through this difficult situation. 
Let me check what resources we have available for you. Please hold for just a moment while I 
look into this. Thank you for holding. I've found several support options that can help. 
Is there anything else I can assist you with today? Thank you for reaching out to us, 
and please don't hesitate to call again if you need further support.
"""

# Run prediction
results = predict_qa_metrics(transcript, model, tokenizer, threshold=0.5, device=device)

# Access specific results
opening_results = results["opening"]
listening_results = results["listening"]
overall_summary = results["summary"]
```
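
The dictionary returned by `predict_qa_metrics` can be reduced to one score per head for dashboards or reports. The following is a minimal sketch, assuming the result structure shown in the inference function above; `head_pass_rates` is a hypothetical helper and the probabilities in the example are invented.

```python
from typing import Any, Dict

def head_pass_rates(results: Dict[str, Any]) -> Dict[str, float]:
    """Return, per QA head, the fraction of sub-metrics predicted as present."""
    rates = {}
    for head, items in results.items():
        if head == "summary":  # the summary entry is not a QA head
            continue
        passed = sum(1 for item in items if item["prediction"])
        rates[head] = passed / len(items)
    return rates

# Invented example in the same shape as the predict_qa_metrics output.
example_results = {
    "opening": [
        {"submetric": "Use of call opening phrase", "probability": 0.93, "prediction": True},
    ],
    "hold": [
        {"submetric": "Explains before placing on hold", "probability": 0.81, "prediction": True},
        {"submetric": "Thanks caller for holding", "probability": 0.32, "prediction": False},
    ],
    "summary": {"total_positive": 2, "total_metrics": 3, "accuracy": 66.7},
}

print(head_pass_rates(example_results))  # {'opening': 1.0, 'hold': 0.5}
```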

### FastAPI Integration

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional

app = FastAPI(title="QA Helpline Metrics API")

class TranscriptInput(BaseModel):
    text: str
    threshold: Optional[float] = 0.5

@app.post("/predict")
async def predict_transcript_quality(input_data: TranscriptInput):
    try:
        results = predict_qa_metrics(
            text=input_data.text,
            model=model,
            tokenizer=tokenizer,
            threshold=input_data.threshold
        )
        return {"success": True, "predictions": results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```

## Training Details

### Training Data
- **Domain**: Child helplines and crisis support transcripts
- **Languages**: English
- **Size**: Custom dataset with balanced QA metric annotations, containing no PII
- **Preprocessing**: Text normalization and quality filtering (no PII removal was needed because the data contains no PII)

### Training Configuration
- **Base Model**: distilbert-base-uncased
- **Optimizer**: AdamW (lr=2e-5)
- **Loss Function**: BCEWithLogitsLoss (per head)
- **Batch Size**: 4
- **Max Length**: 512 tokens
- **Epochs**: 5
- **Training Framework**: PyTorch + Transformers
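
As a rough illustration of the per-head loss, the sketch below reproduces in plain Python what the `forward` method above does with `nn.BCEWithLogitsLoss`: a mean binary cross-entropy within each head, summed across heads. The function names and the example logits are assumptions made for illustration; actual training uses the PyTorch implementation.

```python
import math
from typing import Dict, List

def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy for one head, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def multi_head_loss(head_logits: Dict[str, List[float]],
                    head_targets: Dict[str, List[float]]) -> float:
    """Sum of the per-head mean losses, mirroring total_loss in forward()."""
    return sum(bce_with_logits(head_logits[h], head_targets[h]) for h in head_logits)

# Illustrative values only: two heads with one and two sub-metrics.
logits = {"opening": [2.0], "hold": [0.5, -1.0]}
targets = {"opening": [1.0], "hold": [1.0, 0.0]}
print(round(multi_head_loss(logits, targets), 4))
```

Because each head is averaged before summing, a head with five sub-metrics carries the same weight in the total as a head with one.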

### Data Preprocessing Pipeline
- Text cleaning and normalization
- Token length validation
- Quality assurance checks

## Limitations and Considerations

### Technical Limitations
- **Context Length**: Limited to 512 tokens (longer transcripts need chunking)
- **Language Bias**: Trained primarily on English transcripts
- **Domain Specificity**: Optimized for helpline/call center contexts
- **Binary Classification**: Each sub-metric is binary (present/absent)
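
For transcripts longer than 512 tokens, one possible chunking approach is a sliding window over the token IDs, scoring each window separately and then merging the per-window probabilities. This is a minimal sketch under stated assumptions, not the method used by the authors: `chunk_token_ids` and `aggregate_max` are hypothetical helpers, the window of 510 leaves room for the `[CLS]` and `[SEP]` tokens, and taking the maximum across chunks is only one aggregation choice.

```python
from typing import Dict, List

def chunk_token_ids(token_ids: List[int], max_len: int = 510, stride: int = 128) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens that overlap by stride tokens."""
    if max_len <= stride:
        raise ValueError("max_len must be larger than stride")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return chunks

def aggregate_max(chunk_probs: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Merge per-chunk probabilities by taking the element-wise maximum per sub-metric."""
    merged: Dict[str, List[float]] = {}
    for probs in chunk_probs:
        for head, values in probs.items():
            if head not in merged:
                merged[head] = list(values)
            else:
                merged[head] = [max(a, b) for a, b in zip(merged[head], values)]
    return merged

print(chunk_token_ids(list(range(10)), max_len=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The maximum suits sub-metrics that ask whether a behavior occurred anywhere in the call (for example, an opening or closing phrase). Sub-metrics that describe the whole call, such as the caller not being interrupted, may be better served by a mean or minimum. The Transformers tokenizers also accept `return_overflowing_tokens=True` together with `stride`, which can replace the manual windowing.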

### Ethical Considerations
- **Human-in-the-Loop**: Designed to assist and complement, not replace, human judgment
- **Privacy**: Trained on a custom dataset that contains no PII
- **Bias Monitoring**: Regular evaluation for demographic and linguistic bias
- **Sensitive Context**: Special care needed when evaluating crisis support calls

### Performance Considerations
- Some heads show room for improvement, for example Listening accuracy (71.4%) and Hold precision (66.7%)
- Model performance may vary with transcript quality and length
- Threshold tuning recommended based on specific use case requirements
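
Threshold tuning can be sketched as a small search over candidate thresholds on a labeled validation set, keeping the value that maximizes F1 for a given sub-metric. The helpers below are hypothetical, the candidate grid is an arbitrary choice, and the probabilities and labels are invented; the comparison `p > threshold` matches the rule used in `predict_qa_metrics`.

```python
from typing import List, Sequence

def f1_at_threshold(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 score for one sub-metric when predicting positive if prob > threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p > threshold
        if pred and y == 1:
            tp += 1
        elif pred and y == 0:
            fp += 1
        elif not pred and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs: Sequence[float], labels: Sequence[int],
                   candidates: Sequence[float] = (0.3, 0.4, 0.5, 0.6, 0.7)) -> float:
    """Return the candidate with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Invented validation data for a sub-metric that produces false positives at 0.5.
val_probs = [0.55, 0.62, 0.75, 0.90, 0.20]
val_labels = [0, 0, 1, 1, 0]

print(best_threshold(val_probs, val_labels))  # 0.7
```

A head with high recall and low precision, like Hold in the table above, is the kind of case where raising the threshold may help; the chosen value can then be passed as `threshold` to `predict_qa_metrics`.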

## Intended Use Cases

### Primary Applications
- **Helpline Quality Assurance**: Automated initial assessment of call quality
- **Agent Training**: Provide structured feedback for skill development
- **Service Monitoring**: Consistent evaluation across different operators
- **Performance Analytics**: Track quality trends and improvement areas

### Social Impact Applications
- **Child Protection**: Ensure quality standards in child helpline services
- **Crisis Support**: Maintain high standards in mental health and crisis calls
- **Capacity Building**: Training support for under-resourced helpline services

## Out of Scope Uses
- **Standalone Decision Making**: Should not be used without human oversight
- **General Text Classification**: Not optimized for non-helpline contexts
- **Real-time Critical Decisions**: Not suitable for immediate intervention decisions
- **Legal/Medical Advice Evaluation**: Not designed for professional advice assessment

## Model Developers

**BITZ IT Consulting** - AI Solutions for Social Impact

**Team:**
- **Data Engineering Lead**: Rogendo
- **Data Analysis**: Shemmiriam  
- **Quality Assurance**: Nelsonadagi
- **ML Engineering**: Collaborative team effort

**Mission**: Developing AI solutions that protect vulnerable populations and improve access to critical support services across East Africa.

## Evaluation and Monitoring

### Performance Tracking
- Regular evaluation on held-out test sets
- Cross-validation across different helpline types
- Continuous monitoring for performance degradation
- A/B testing for threshold optimization

### Bias and Fairness
- Demographic bias assessment
- Language performance parity monitoring
- Cultural appropriateness evaluation
- Regular stakeholder feedback incorporation

## Contributing and Support

### Community Contributions
- Feedback on model performance in different contexts
- Contributions to multilingual support (especially East African languages)
- Performance improvements and optimization suggestions
- Documentation and usage examples

### Research Collaboration
We welcome collaboration with:
- Child protection organizations
- Crisis support services
- Academic researchers in NLP and social good
- Other organizations serving vulnerable populations

## Citation

```bibtex
@misc{qa_helpline_distilbert_2025,
  title={QA Multi-Head DistilBERT for Helpline Quality Assessment},
  author={BITZ IT Consulting Team},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/openchs/qa-helpline-distilbert-v1}},
  note={AI for Social Impact: Child Helplines and Crisis Support in East Africa}
}
```


## Model Card Contact

**Organization**: BITZ IT Consulting  
**Support**: Technical questions and collaboration inquiries welcome

**Repository Issues**: https://huggingface.co/openchs/qa-helpline-distilbert-v1/discussions

---
**Making Technology Work for Those Who Need It Most**