---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- qa-metrics
- call-center
- multi-head
- distilbert
- transcript-analysis
- customer-service
- quality-assurance
- child-helplines
- crisis-support
- social-impact
- swahili
- east-africa
language:
- en
datasets:
- custom
- openchs/synthetic_helpline_qa_scoring_v1
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: qa-helpline-distilbert-v1
  results:
  - task:
      type: text-classification
      name: Quality Assurance Multi-Head Classification
    metrics:
    - type: accuracy
      value: 0.85
      name: Overall Accuracy
    - type: f1
      value: 0.82
      name: Weighted F1 Score
widget:
- text: >-
    Hello, thank you for calling our helpline. My name is Sarah, how can I help
    you today? I understand your concern completely. Let me check that
    information for you right away. Please hold for just a moment. Thank you for
    holding. I've found the solution and can help you now. Is there anything
    else I can assist with? Thank you for calling, have a wonderful day!
base_model:
- distilbert/distilbert-base-uncased
---
# QA Multi-Head DistilBERT for Helpline Quality Assessment
## Model Description
This is a fine-tuned DistilBERT model designed for **multi-head quality assurance (QA) classification** of call center and helpline transcripts. Developed by **BITZ IT Consulting** as part of an AI pipeline for **child helplines and crisis support services** in East Africa, this model evaluates transcript quality across six key dimensions with 17 specific sub-metrics.
The model addresses a critical operational challenge in helpline services: most helpline calls between agents and callers go unmonitored due to the overwhelming manual effort required for quality assurance. Supervisors traditionally must listen to entire call recordings to evaluate performance, making comprehensive QA virtually impossible at scale. By automating this process through AI-powered QA scoring, this model significantly reduces the supervisory burden and enables systematic evaluation of call quality across all interactions, ensuring consistent service standards and targeted agent development.
## Model Architecture
- **Base Model**: DistilBERT (distilbert-base-uncased)
- **Architecture**: Multi-head classifier with 6 specialized output heads
- **Input**: Call center/helpline transcripts (max 512 tokens)
- **Output**: Binary predictions for 17 quality assurance sub-metrics
- **Training**: Fine-tuned on domain-specific helpline and call center data
## QA Heads and Sub-metrics
| Head | Sub-metrics | Count | Description |
|------|-------------|--------|-------------|
| **Opening** | Use of call opening phrase | 1 | Evaluates proper call initiation protocols |
| **Listening** | Non-interruption, empathy, paraphrasing, politeness, confidence | 5 | Assesses active listening and communication skills |
| **Proactiveness** | Extra issue solving, satisfaction confirmation, follow-up | 3 | Measures proactive service approach |
| **Resolution** | Information accuracy, language use, consultation, process adherence, clarity | 5 | Evaluates problem-solving effectiveness |
| **Hold** | Hold explanation, gratitude for waiting | 2 | Assesses proper hold procedures |
| **Closing** | Proper closing phrase | 1 | Evaluates professional call conclusion |
**Total Sub-metrics**: 17 across 6 main QA dimensions
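The table above maps directly onto the model's per-head output sizes; as a quick reference, a minimal sketch mirroring the `heads_config` dictionary used in the model class below:
```python
# Number of binary sub-metric outputs per QA head; the values sum to 17.
HEADS_CONFIG = {
    "opening": 1,
    "listening": 5,
    "proactiveness": 3,
    "resolution": 5,
    "hold": 2,
    "closing": 1,
}
assert sum(HEADS_CONFIG.values()) == 17
```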
## Social Impact and Use Case
This model is specifically designed to support **child helplines and crisis intervention services** in East Africa. It addresses several critical challenges:
- **Consistent Care**: Ensures uniform quality standards across different operators
- **Training Support**: Provides objective feedback for helpline staff development
- **Scalable Monitoring**: Enables quality assurance at scale for under-resourced services
The model is part of a broader AI pipeline that includes ASR (Automatic Speech Recognition), translation, entity recognition, case classification, and summarization components, all focused on protecting vulnerable populations.
## Model Performance
### Overall Performance
- **Overall Accuracy**: ~87.5%
- **Average F1 Score**: ~91.2%
- **Training Approach**: Multi-task learning with BCEWithLogitsLoss per head
- **Evaluation**: Comprehensive metrics across all QA dimensions
### Per-Head Performance
| Head | Accuracy | Precision | Recall | F1 Score | Performance Level |
|------|----------|-----------|---------|----------|------------------|
| **Closing** | 100.0% | 100.0% | 100.0% | 100.0% | Perfect |
| **Resolution** | 90.5% | 98.5% | 98.5% | 98.5% | Excellent |
| **Hold** | 90.5% | 66.7% | 100.0% | 80.0% | Good |
| **Proactiveness** | 85.7% | 91.7% | 95.7% | 93.6% | Good |
| **Opening** | 85.7% | 85.7% | 85.7% | 85.7% | Good |
| **Listening** | 71.4% | 98.5% | 93.1% | 95.7% | Mixed Performance |
### Performance Insights
- **Strongest Performance**: Closing and Resolution heads achieve near-perfect scores
- **Consistent Performance**: Opening and Proactiveness show balanced precision and recall
- **High Precision Models**: Most heads demonstrate excellent precision (>85%)
- **Listening Head**: Lower accuracy (71.4%) but a high F1 score (95.7%) indicates the model reliably identifies listening behaviors when they are present; the accuracy gap comes from a small number of missed cases (false negatives)
- **Hold Head**: High accuracy and 100% recall but lower precision (66.7%) indicate lenient positive predictions: the model catches every positive case but produces some false positives
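As a sanity check on the table above, the Hold head's F1 follows directly from its precision and recall (a worked example only, not additional evaluation data):
```python
# Hold head: precision ≈ 66.7%, recall = 100%  ->  F1 = 2PR / (P + R)
precision, recall = 0.667, 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ≈ 0.80, matching the reported 80.0% F1
```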
## Installation and Usage
### Quick Start
```bash
pip install transformers torch
```
### Model Classes
```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertPreTrainedModel, AutoTokenizer


class MultiHeadQAClassifier(DistilBertPreTrainedModel):
    """
    Multi-head QA classifier for call center quality assessment.
    Each head corresponds to a different QA metric with specific sub-metrics.
    """

    def __init__(self, config):
        super().__init__(config)
        # QA heads configuration
        self.heads_config = getattr(config, 'heads_config', {
            "opening": 1,
            "listening": 5,
            "proactiveness": 3,
            "resolution": 5,
            "hold": 2,
            "closing": 1
        })
        self.bert = DistilBertModel(config)
        classifier_dropout = getattr(config, 'classifier_dropout', 0.1)
        self.dropout = nn.Dropout(classifier_dropout)
        # Multiple classification heads, one linear layer per QA dimension
        self.classifiers = nn.ModuleDict({
            head_name: nn.Linear(config.hidden_size, num_labels)
            for head_name, num_labels in self.heads_config.items()
        })
        # Initialize weights
        self.post_init()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.last_hidden_state[:, 0])  # [CLS] token
        logits = {}
        losses = {}
        total_loss = 0
        for head_name, classifier in self.classifiers.items():
            head_logits = classifier(pooled_output)
            logits[head_name] = torch.sigmoid(head_logits)  # convert to probabilities
            # Calculate loss if labels are provided for this head
            if labels is not None and head_name in labels:
                loss_fn = nn.BCEWithLogitsLoss()
                loss = loss_fn(head_logits, labels[head_name])
                losses[head_name] = loss.item()
                total_loss += loss
        return {
            "logits": logits,
            "loss": total_loss if labels is not None else None,
            "losses": losses if labels is not None else None
        }
```
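Because the classification heads are custom, `from_pretrained` relies on `heads_config` being stored in the model config. A minimal sketch of how that attribute could be attached before saving (an illustrative snippet, not the exact training script used for this model):
```python
from transformers import DistilBertConfig

config = DistilBertConfig.from_pretrained("distilbert-base-uncased")
config.heads_config = {
    "opening": 1, "listening": 5, "proactiveness": 3,
    "resolution": 5, "hold": 2, "closing": 1,
}
model = MultiHeadQAClassifier(config)
# ... fine-tune the model ...
model.save_pretrained("qa-helpline-distilbert-v1")  # config.json retains heads_config
```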
### Inference Function
```python
def predict_qa_metrics(text: str, model, tokenizer, threshold: float = 0.5, device=None):
    """
    Predict QA metrics for a helpline transcript with beautiful output formatting.

    Args:
        text: Input transcript text
        model: Loaded MultiHeadQAClassifier model
        tokenizer: DistilBERT tokenizer
        threshold: Classification threshold (default: 0.5)
        device: Device to use for inference

    Returns:
        Dictionary with predictions and probabilities for each QA metric
    """
    if device is None:
        device = next(model.parameters()).device
    model.eval()

    # Sub-metric labels for formatted output
    HEAD_SUBMETRIC_LABELS = {
        "opening": ["Use of call opening phrase"],
        "listening": [
            "Caller was not interrupted",
            "Empathizes with the caller",
            "Paraphrases or rephrases the issue",
            "Uses 'please' and 'thank you'",
            "Does not hesitate or sound unsure"
        ],
        "proactiveness": [
            "Willing to solve extra issues",
            "Confirms satisfaction with action points",
            "Follows up on case updates"
        ],
        "resolution": [
            "Gives accurate information",
            "Correct language use",
            "Consults if unsure",
            "Follows correct steps",
            "Explains solution process clearly"
        ],
        "hold": [
            "Explains before placing on hold",
            "Thanks caller for holding"
        ],
        "closing": ["Proper call closing phrase used"]
    }

    # Tokenize input
    encoding = tokenizer(
        text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    # Forward pass
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs["logits"]

    # Format results
    results = {}
    print(f"Transcript: {text}\n")
    total_positive = 0
    total_metrics = 0
    for head_name, probs in logits.items():
        probs_np = probs.cpu().numpy()[0]
        submetrics = HEAD_SUBMETRIC_LABELS.get(
            head_name, [f"Submetric {i+1}" for i in range(len(probs_np))]
        )
        print(f"{head_name.upper()}:")
        head_results = []
        for prob, submetric in zip(probs_np, submetrics):
            prediction = prob > threshold
            indicator = "✔" if prediction else "✘"
            if prediction:
                total_positive += 1
            total_metrics += 1
            result_item = {
                "submetric": submetric,
                "probability": float(prob),
                "prediction": bool(prediction),
                "indicator": indicator
            }
            head_results.append(result_item)
            print(f"  - {submetric}: P={prob:.3f} → {indicator}")
        results[head_name] = head_results

    # Overall summary
    overall_accuracy = (total_positive / total_metrics) * 100
    print(f"\nOverall Score: {total_positive}/{total_metrics} ({overall_accuracy:.1f}%)")
    results["summary"] = {
        "total_positive": total_positive,
        "total_metrics": total_metrics,
        "accuracy": overall_accuracy
    }
    return results
```
### Complete Usage Example
```python
from transformers import AutoTokenizer
import torch
# Load model and tokenizer
MODEL_NAME = "openchs/qa-helpline-distilbert-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MultiHeadQAClassifier.from_pretrained(MODEL_NAME)
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# Example helpline transcript
transcript = """
Hello, thank you for calling our child helpline. My name is Sarah, how can I help you today?
I understand your concern completely and I want to help you through this difficult situation.
Let me check what resources we have available for you. Please hold for just a moment while I
look into this. Thank you for holding. I've found several support options that can help.
Is there anything else I can assist you with today? Thank you for reaching out to us,
and please don't hesitate to call again if you need further support.
"""
# Run prediction
results = predict_qa_metrics(transcript, model, tokenizer, threshold=0.5, device=device)
# Access specific results
opening_results = results["opening"]
listening_results = results["listening"]
overall_summary = results["summary"]
```
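For scoring many transcripts at once, the model also accepts batched inputs. A minimal batching sketch (assumes the `model`, `tokenizer`, and `device` loaded above; `transcripts` is an illustrative list):
```python
transcripts = [transcript, "Hello, thank you for calling. How can I help you today?"]

enc = tokenizer(transcripts, return_tensors="pt", padding=True,
                truncation=True, max_length=512).to(device)
with torch.no_grad():
    batch_logits = model(input_ids=enc["input_ids"],
                         attention_mask=enc["attention_mask"])["logits"]

# Each head returns a (batch_size, num_submetrics) tensor of probabilities
listening_preds = (batch_logits["listening"] > 0.5).int().tolist()
print(listening_preds)  # one list of 0/1 predictions per transcript
```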
### FastAPI Integration
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional

app = FastAPI(title="QA Helpline Metrics API")


class TranscriptInput(BaseModel):
    text: str
    threshold: Optional[float] = 0.5


@app.post("/predict")
async def predict_transcript_quality(input_data: TranscriptInput):
    try:
        results = predict_qa_metrics(
            text=input_data.text,
            model=model,
            tokenizer=tokenizer,
            threshold=input_data.threshold
        )
        return {"success": True, "predictions": results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
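Assuming the service above is saved in a module named `app_module.py` (a hypothetical name) and started with `uvicorn app_module:app --port 8000`, it can be called from any HTTP client; a minimal sketch using `requests`:
```python
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Hello, thank you for calling our helpline...", "threshold": 0.5},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["predictions"]["summary"])
```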
## Training Details
### Training Data
- **Domain**: Child helplines and crisis support transcripts
- **Languages**: English
- **Size**: Custom dataset with balanced QA metric annotations; contains no PII
- **Preprocessing**: Text normalization and quality filtering (no PII removal needed, as the source data contains no PII)
### Training Configuration
- **Base Model**: distilbert-base-uncased
- **Optimizer**: AdamW (lr=2e-5)
- **Loss Function**: BCEWithLogitsLoss (per head)
- **Batch Size**: 4
- **Max Length**: 512 tokens
- **Epochs**: 5
- **Training Framework**: PyTorch + Transformers
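A minimal training-loop sketch consistent with the configuration above (assumes a hypothetical `train_loader` that yields tokenized batches with a dict of per-head float label tensors; illustrative only, not the exact training script):
```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(5):
    for batch in train_loader:  # hypothetical DataLoader with batch_size=4
        optimizer.zero_grad()
        out = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels={k: v.to(device) for k, v in batch["labels"].items()},
        )
        out["loss"].backward()  # summed BCEWithLogitsLoss over the six heads
        optimizer.step()
```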
### Data Preprocessing Pipeline
- Text cleaning and normalization
- Token length validation
- Quality assurance checks
## Limitations and Considerations
### Technical Limitations
- **Context Length**: Limited to 512 tokens; longer transcripts need chunking (see the sketch after this list)
- **Language Bias**: Primary training on English
- **Domain Specificity**: Optimized for helpline/call center contexts
- **Binary Classification**: Each sub-metric is binary (present/absent)
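A simple way to handle transcripts longer than 512 tokens is to score overlapping chunks and aggregate per-sub-metric probabilities, for example by taking the maximum. This is a hedged sketch of one possible strategy, not part of the released pipeline:
```python
import torch

def predict_long_transcript(text, model, tokenizer, device, max_length=512, stride=128):
    """Score a long transcript in overlapping chunks, keeping the max probability per sub-metric."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    window = max_length - 2                  # leave room for [CLS]/[SEP]
    step = window - stride
    aggregated = {}
    for start in range(0, max(len(token_ids), 1), step):
        chunk_text = tokenizer.decode(token_ids[start:start + window])
        enc = tokenizer(chunk_text, return_tensors="pt", truncation=True,
                        padding="max_length", max_length=max_length)
        with torch.no_grad():
            logits = model(input_ids=enc["input_ids"].to(device),
                           attention_mask=enc["attention_mask"].to(device))["logits"]
        for head, probs in logits.items():
            aggregated[head] = probs if head not in aggregated else torch.maximum(aggregated[head], probs)
    return aggregated  # same structure as a single-call outputs["logits"]
```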
### Ethical Considerations
- **Human-in-the-Loop**: Designed to assist and complement human judgment, not replace it
- **Privacy**: Trained on custom data containing no PII
- **Bias Monitoring**: Regular evaluation for demographic and linguistic bias
- **Sensitive Context**: Special care needed when evaluating crisis support calls
### Performance Considerations
- Some heads (Listening, Proactiveness, Resolution) show room for improvement
- Model performance may vary with transcript quality and length
- Threshold tuning is recommended based on specific use-case requirements (see the sweep sketch below)
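A minimal threshold-sweep sketch for choosing a per-head operating point on a labelled validation set (assumes hypothetical arrays `val_probs` and `val_labels` of shape `(num_examples, num_submetrics)` for one head):
```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    """Return the threshold in [0.1, 0.9] that maximizes micro-averaged F1 for one QA head."""
    candidates = np.arange(0.10, 0.91, 0.05)
    scores = [f1_score(val_labels, (val_probs >= t).astype(int), average="micro")
              for t in candidates]
    return float(candidates[int(np.argmax(scores))])
```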
## Intended Use Cases
### Primary Applications
- **Helpline Quality Assurance**: Automated initial assessment of call quality
- **Agent Training**: Provide structured feedback for skill development
- **Service Monitoring**: Consistent evaluation across different operators
- **Performance Analytics**: Track quality trends and improvement areas
### Social Impact Applications
- **Child Protection**: Ensure quality standards in child helpline services
- **Crisis Support**: Maintain high standards in mental health and crisis calls
- **Language Accessibility**: N/A
- **Capacity Building**: Training support for under-resourced helpline services
## Out of Scope Uses
- **Standalone Decision Making**: Should not be used without human oversight
- **General Text Classification**: Not optimized for non-helpline contexts
- **Real-time Critical Decisions**: Not suitable for immediate intervention decisions
- **Legal/Medical Advice Evaluation**: Not designed for professional advice assessment
## Model Developers
**BITZ IT Consulting** - AI Solutions for Social Impact
**Team:**
- **Data Engineering Lead**: Rogendo
- **Data Analysis**: Shemmiriam
- **Quality Assurance**: Nelsonadagi
- **ML Engineering**: Collaborative team effort
**Mission**: Developing AI solutions that protect vulnerable populations and improve access to critical support services across East Africa.
## Evaluation and Monitoring
### Performance Tracking
- Regular evaluation on held-out test sets
- Cross-validation across different helpline types
- Continuous monitoring for performance degradation
- A/B testing for threshold optimization
### Bias and Fairness
- Demographic bias assessment
- Language performance parity monitoring
- Cultural appropriateness evaluation
- Regular stakeholder feedback incorporation
## Contributing and Support
### Community Contributions
- Feedback on model performance in different contexts
- Contributions to multilingual support (especially East African languages)
- Performance improvements and optimization suggestions
- Documentation and usage examples
### Research Collaboration
We welcome collaboration with:
- Child protection organizations
- Crisis support services
- Academic researchers in NLP and social good
- Other organizations serving vulnerable populations
## Citation
```bibtex
@misc{qa_helpline_distilbert_2025,
title={QA Multi-Head DistilBERT for Helpline Quality Assessment},
author={BITZ IT Consulting Team},
year={2025},
publisher={Hugging Face},
journal={Hugging Face Model Hub},
howpublished={\url{https://huggingface.co/openchs/qa-helpline-distilbert-v1}},
note={AI for Social Impact: Child Helplines and Crisis Support in East Africa}
}
```
## Model Card Contact
**Organization**: BITZ IT Consulting
**Support**: Technical questions and collaboration inquiries welcome
**Repository Issues**: https://huggingface.co/openchs/qa-helpline-distilbert-v1/discussions
---
**Making Technology Work for Those Who Need It Most**