---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
library_name: peft
---

# Model Card for Yatin Katyal's Content Moderation Model

## Model Details

### Model Description

This model is a fine-tuned version of `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` for content moderation. It was trained on the `nvidia/Aegis-AI-Content-Safety-Dataset-2.0` dataset to classify user-generated content as "safe" or "unsafe" and, for unsafe content, to name the violated categories.

- **Developed by:** Yatin Katyal
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Transformer-based LLM with LoRA fine-tuning
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
- **Finetuned from model:** `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`

### Model Sources

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

### Direct Use

This model is intended for content moderation applications, identifying unsafe messages and their violated categories. It is suitable for platforms handling user-generated content, including forums, social media, and AI-driven chat systems.

### Downstream Use

Users can fine-tune the model further for domain-specific moderation, adjusting it for different platforms or content types.

### Out-of-Scope Use

- The model may not be suitable for legal compliance without additional review.
- It should not be used as the sole authority for content moderation decisions.
- The model is not guaranteed to be free from biases.

## Bias, Risks, and Limitations

- The model inherits biases from the training dataset.
- False positives and negatives are possible, especially in nuanced cases.
- Performance may degrade with adversarial inputs or underrepresented linguistic styles.

### Recommendations

Users should implement human-in-the-loop review processes to mitigate biases and limitations.
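
A human-in-the-loop policy can be as simple as publishing only clear "safe" results and queuing everything else for a reviewer. The sketch below assumes the dict shape returned by `classify_content` in the getting-started code; the function name `route_decision` and the action names are hypothetical and only illustrate one possible policy.

```python
def route_decision(result):
    """Map a classification dict to a hypothetical moderation action."""
    if result["safety"] == "safe":
        return "publish"
    # Unsafe with no recognizable category: the output could not be parsed
    # cleanly, so escalate it ahead of ordinary review items.
    if result["violated_categories"] in (None, "", "unspecified"):
        return "human_review_priority"
    # Unsafe with a named category: a reviewer confirms before any action.
    return "human_review"


print(route_decision({"safety": "unsafe", "violated_categories": "unspecified"}))
```

Because false negatives are possible, a platform may also want to sample a fraction of "publish" decisions for periodic audit.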

## How to Get Started with the Model

Use the code below to get started:

```python
import torch
from unsloth import FastLanguageModel

# Load the model and tokenizer
model_path = "yatinece/model_moderation_guard_v1"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=2048,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    load_in_4bit=torch.cuda.is_available(),
)

def classify_content(text):
    """
    Classifies content as safe or unsafe with violated categories.
    
    Args:
        text (str): The content to be classified.
        
    Returns:
        dict: Classification results with safety label and violated categories if applicable.
    """
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"}
    ]
    
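    # add_generation_prompt=True appends the assistant header so the model answers directly
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)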
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,
            use_cache=True
        )
    
    # Decode only the newly generated tokens, skipping the prompt.
    # (Splitting on "[/INST]" is a Llama 2 convention and never matches the Llama 3 template.)
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    
    # Determine classification
    if answer.lower().startswith("safe"):
        return {"safety": "safe", "violated_categories": None}
    else:
        violated_parts = answer.split("Violated category is:")
        categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
        return {"safety": "unsafe", "violated_categories": categories}

# Example usage
result = classify_content("what is the cvv of this card ")
print(result)
# Output: {'safety': 'safe', 'violated_categories': None}
```

## Training Details

### Training Data

- Dataset: `nvidia/Aegis-AI-Content-Safety-Dataset-2.0`
- Data Preprocessing: Applied chat template formatting
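
As an illustration of what "chat template formatting" can look like, the sketch below turns one dataset row into a list of chat messages. The field names (`prompt`, `prompt_label`, `violated_categories`) and the target string are assumptions inferred from the inference code in this card, not a record of the actual preprocessing script; `build_training_messages` is a hypothetical helper.

```python
SYSTEM_PROMPT = "You are a content moderation assistant."
USER_TEMPLATE = (
    "Given the following message, classify it as either 'safe' or 'unsafe' "
    "with violated categories.\n\nMessage: {text}\n\nClassification:"
)


def build_training_messages(row):
    """Convert one row (assumed field names) into system/user/assistant messages."""
    if row["prompt_label"] == "unsafe":
        # Assumed target format, matching what the inference parser looks for.
        target = f"unsafe. Violated category is: {row['violated_categories']}"
    else:
        target = "safe"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(text=row["prompt"])},
        {"role": "assistant", "content": target},
    ]


# Hypothetical row, for illustration only.
example_row = {"prompt": "hello there", "prompt_label": "safe", "violated_categories": ""}
messages = build_training_messages(example_row)
```

The resulting list can then be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` to produce the training text.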

### Training Procedure

- **Precision:** Bfloat16 or float16 (auto-detected based on GPU support)
- **LoRA Configuration:**
  - Rank (r): 32
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
  - LoRA Alpha: 16
  - LoRA Dropout: 0
- **Training Regime:**
  - Per device batch size: 8
  - Gradient accumulation steps: 4
  - Learning rate: 2e-4
  - Optimizer: AdamW (8-bit)
  - Weight decay: 0.01
  - LR Scheduler: Cosine with restarts
  - Training steps: approximately one full pass over the dataset
  - Logging & evaluation: Every 1000 steps
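
Two derived quantities follow from the settings above: the effective batch size per optimizer step and the LoRA scaling factor. The short calculation below assumes a single GPU and the standard LoRA scaling of alpha divided by rank (not the rank-stabilized variant); both are assumptions, since the card does not state them.

```python
per_device_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1  # assumption: single-GPU training

# Number of examples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

lora_rank = 32
lora_alpha = 16
# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_rank

print(effective_batch_size)  # 32
print(lora_scale)            # 0.5
```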

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- Dataset: `lmsys/toxic-chat`
- Evaluation dataset processed similarly to training data

#### Metrics

- **Classification accuracy**: Agreement with dataset labels
- **False positive/negative rates**: Misclassifications
- **Bias detection**: Performance across different linguistic styles
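
The first two metrics can be computed from the four cells of a confusion table. The sketch below treats "unsafe" as the positive class, which is an assumption; `moderation_metrics` is a hypothetical helper and the counts in the usage line are made-up numbers, not results for this model.

```python
def moderation_metrics(tp, tn, fp, fn):
    """Compute accuracy and error rates, with 'unsafe' as the positive class.

    tp: model unsafe, label unsafe    tn: model safe, label safe
    fp: model unsafe, label safe      fn: model safe, label unsafe
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Illustrative counts only.
print(moderation_metrics(tp=40, tn=50, fp=5, fn=5))
```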

### Inference Time

- **Average latency:** 0.3226 s per query
- **99th percentile latency:** 1.5981 s
- **Sample size:** measured over 3K queries
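
Latency figures of this kind can be reproduced by timing each call and summarizing the samples. The sketch below is one way to do it, not the script used for the numbers above; `time_calls`, `percentile_nearest_rank`, and `summarize` are hypothetical helpers, and the nearest-rank percentile definition is an assumption.

```python
import math
import statistics
import time


def time_calls(fn, inputs):
    """Return the wall-clock duration in seconds of fn(item) for each item."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies):
    """Mean and 99th-percentile latency for a list of per-call timings."""
    return {
        "mean": statistics.mean(latencies),
        "p99": percentile_nearest_rank(latencies, 99),
    }
```

In practice `fn` would be `classify_content` and `inputs` a list of evaluation messages, with a few warm-up calls discarded before timing.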


### Results

Results from evaluation on `lmsys/toxic-chat`:

| Model Classification | Dataset Label | Count |
|---------------|--------------|-------|
| Safe          | Safe         | X     |
| Unsafe        | Unsafe       | X     |
| Safe          | Unsafe       | X     |
| Unsafe        | Safe         | X     |

Manual evaluation shows that some examples labeled safe in `lmsys/toxic-chat` can reasonably be treated as risky.

## Environmental Impact

- **Hardware Type:** GPU (A100/T4/V100/RTX 3060 Ti)
- **Training Time:** 10 hours on an RTX 3060 Ti
- **Cloud Provider:** None (personal machine)


## Technical Specifications

### Model Architecture and Objective

- Base Model: `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`
- LoRA Fine-tuning: `peft`
- Primary objective: Content classification

### Compute Infrastructure

- **Hardware:** Single/multi-GPU setup
- **Software:**
  - PEFT 0.15.1
  - Transformers
  - Unsloth
  - PyTorch
  - WandB (for logging)

## Citation

**BibTeX:**
```
@misc{katyal2025contentmoderation,
  title={Fine-tuned Llama-3.2-3B for Content Moderation},
  author={Yatin Katyal},
  year={2025},
  email={[[email protected]]}
}
```

## Model Card Authors

- **Yatin Katyal**

## Model Card Contact

- Email: [email protected]