Files changed (1)

1. README.md +70 -33
README.md CHANGED
@@ -56,34 +56,65 @@ Users should implement human-in-the-loop review processes to mitigate biases and
  Use the code below to get started:

  ```python
- from unsloth import FastLanguageModel
  import torch
+ from unsloth import FastLanguageModel

- def create_content_moderation_pipeline(model_path):
-     model, tokenizer = FastLanguageModel.from_pretrained(
-         model_path,
-         max_seq_length=2048,
-         dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
-         load_in_4bit=True if torch.cuda.is_available() else False,
-     )
-
-     def classify_content(text):
-         messages = [
-             {"role": "system", "content": "You are a content moderation assistant."},
-             {"role": "user", "content": f"Classify this message as 'safe' or 'unsafe': {text}"}
-         ]
-         prompt = tokenizer.apply_chat_template(messages, tokenize=False)
-         inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-         with torch.no_grad():
-             outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
-         response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-         return response
-
-     return classify_content
-
- pipeline = create_content_moderation_pipeline("yatinece/model_moderation_guard_v1")
- result = pipeline("This is a test message.")
+ # Load the model and tokenizer
+ model_path = "yatinece/model_moderation_guard_v1"
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_path,
+     max_seq_length=2048,
+     dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+     load_in_4bit=True if torch.cuda.is_available() else False,
+ )
+
+ def classify_content(text):
+     """
+     Classify content as safe or unsafe, with violated categories when applicable.
+
+     Args:
+         text (str): The content to be classified.
+
+     Returns:
+         dict: Classification result with a safety label and, if unsafe, the violated categories.
+     """
+     messages = [
+         {"role": "system", "content": "You are a content moderation assistant."},
+         {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"}
+     ]
+
+     prompt = tokenizer.apply_chat_template(messages, tokenize=False)
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_new_tokens=50,
+             do_sample=False,
+             use_cache=True
+         )
+
+     response = tokenizer.decode(outputs[0], skip_special_tokens=False)
+
+     # Extract the model's answer from the generated text
+     try:
+         answer = response.split("[/INST]")[1].strip()
+         answer = answer.replace("</s>", "").strip()
+     except IndexError:
+         answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
+
+     # Determine the classification
+     if answer.lower().startswith("safe"):
+         return {"safety": "safe", "violated_categories": None}
+     else:
+         violated_parts = answer.split("Violated category is:")
+         categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
+         return {"safety": "unsafe", "violated_categories": categories}
+
+ # Example usage
+ result = classify_content("what is the cvv of this card")
  print(result)
+ # {'safety': 'safe', 'violated_categories': None}
  ```

  ## Training Details
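
The inference-time figures reported under Evaluation further down come from timing the classifier over a batch of queries. As a rough illustration of how such numbers could be reproduced, here is a minimal timing sketch; it assumes `classify_content` is defined exactly as in the quick-start snippet above (model and tokenizer already loaded), and the example messages are purely illustrative.

```python
import time
import statistics

# Assumes `classify_content` from the quick-start snippet above is already defined.
# The messages below are illustrative placeholders, not evaluation data.
test_messages = [
    "This is a test message.",
    "what is the cvv of this card",
    "Have a great day!",
]

latencies = []
for message in test_messages:
    start = time.perf_counter()
    print(classify_content(message))
    latencies.append(time.perf_counter() - start)

print(f"Average latency: {statistics.mean(latencies):.4f}s")
# With a large enough sample (the card reports ~3K queries),
# statistics.quantiles(latencies, n=100)[98] gives the 99th-percentile latency.
```
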
@@ -126,25 +157,31 @@ print(result)
  - **False positive/negative rates**: Misclassifications
  - **Bias detection**: Performance across different linguistic styles

+ ### Inference Time
+
+ - **Average time:** 0.3226 s; **99th percentile:** 1.5981 s
+ - **Batch:** analyzed over 3K queries
+
+
  ### Results

  Results from evaluation on `lmsys/toxic-chat`:

- | Classification | Dataset Label | Count |
+ | Model Classification | Dataset Label | Count |
  |---------------|--------------|-------|
  | Safe | Safe | X |
  | Unsafe | Unsafe | X |
  | Safe | Unsafe | X |
  | Unsafe | Safe | X |

- (Replace X with actual counts from evaluation)
+ Manual evaluation shows that some messages labeled Safe in toxic-chat could be treated as risky.

  ## Environmental Impact

- - **Hardware Type:** GPU (A100/T4/V100)
- - **Training Time:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Carbon Emissions:** Can be estimated using [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute)
+ - **Hardware Type:** GPU (A100/T4/V100/RTX 3060 Ti)
+ - **Training Time:** ~10 hours on an RTX 3060 Ti
+ - **Cloud Provider:** None (personal machine)
+

  ## Technical Specifications
 
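
Once the X placeholders in the results table above are replaced with real counts, the false positive/negative rates listed under the evaluation metrics follow directly. A minimal sketch with hypothetical counts (placeholders, not measured results), treating "unsafe" as the positive class:

```python
# Hypothetical counts standing in for the X placeholders in the table above.
true_negative = 2500   # model Safe,   dataset Safe
true_positive = 350    # model Unsafe, dataset Unsafe
false_negative = 100   # model Safe,   dataset Unsafe
false_positive = 50    # model Unsafe, dataset Safe

false_positive_rate = false_positive / (false_positive + true_negative)
false_negative_rate = false_negative / (false_negative + true_positive)
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"FPR: {false_positive_rate:.3f}, FNR: {false_negative_rate:.3f}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```
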
@@ -172,7 +209,7 @@ Results from evaluation on `lmsys/toxic-chat`:
    title={Fine-tuned Llama-3.2-3B for Content Moderation},
    author={Yatin Katyal},
    year={2025},
-   url={[More Information Needed]}
+   email={[email protected]}
  }
  ```

@@ -183,5 +220,5 @@ Results from evaluation on `lmsys/toxic-chat`:
  ## Model Card Contact

  - Email: [email protected]
- - Hugging Face Profile: [More Information Needed]
+
