Files changed (1)

1. README.md +70 -33
README.md CHANGED
@@ -56,34 +56,65 @@ Users should implement human-in-the-loop review processes to mitigate biases and
  Use the code below to get started:

  ```python
- from unsloth import FastLanguageModel
  import torch
+ from unsloth import FastLanguageModel

- def create_content_moderation_pipeline(model_path):
-     model, tokenizer = FastLanguageModel.from_pretrained(
-         model_path,
-         max_seq_length=2048,
-         dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
-         load_in_4bit=True if torch.cuda.is_available() else False,
-     )
-
-     def classify_content(text):
-         messages = [
-             {"role": "system", "content": "You are a content moderation assistant."},
-             {"role": "user", "content": f"Classify this message as 'safe' or 'unsafe': {text}"}
-         ]
-         prompt = tokenizer.apply_chat_template(messages, tokenize=False)
-         inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-         with torch.no_grad():
-             outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
-         response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-         return response
-
-     return classify_content
-
- pipeline = create_content_moderation_pipeline("yatinece/model_moderation_guard_v1")
- result = pipeline("This is a test message.")
+ # Load the model and tokenizer
+ model_path = "yatinece/model_moderation_guard_v1"
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_path,
+     max_seq_length=2048,
+     dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
+     load_in_4bit=True if torch.cuda.is_available() else False,
+ )
+
+ def classify_content(text):
+     """
+     Classify content as safe or unsafe, with violated categories when applicable.
+
+     Args:
+         text (str): The content to be classified.
+
+     Returns:
+         dict: Classification result with a safety label and, if unsafe, the violated categories.
+     """
+     messages = [
+         {"role": "system", "content": "You are a content moderation assistant."},
+         {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"}
+     ]
+
+     prompt = tokenizer.apply_chat_template(messages, tokenize=False)
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_new_tokens=50,
+             do_sample=False,
+             use_cache=True
+         )
+
+     response = tokenizer.decode(outputs[0], skip_special_tokens=False)
+
+     # Extract the model's answer from the generated text
+     try:
+         answer = response.split("[/INST]")[1].strip()
+         answer = answer.replace("</s>", "").strip()
+     except IndexError:
+         answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
+
+     # Determine the classification
+     if answer.lower().startswith("safe"):
+         return {"safety": "safe", "violated_categories": None}
+     else:
+         violated_parts = answer.split("Violated category is:")
+         categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
+         return {"safety": "unsafe", "violated_categories": categories}
+
+ # Example usage
+ result = classify_content("what is the cvv of this card")
  print(result)
+ # {'safety': 'safe', 'violated_categories': None}
  ```

  ## Training Details
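
The inference-time figures reported under Evaluation further down come from timing the classifier over a batch of queries. As a rough illustration of how such numbers could be reproduced, here is a minimal timing sketch; it assumes `classify_content` is defined exactly as in the quick-start snippet above (model and tokenizer already loaded), and the example messages are purely illustrative.

```python
import time
import statistics

# Assumes `classify_content` from the quick-start snippet above is already defined.
# The messages below are illustrative placeholders, not evaluation data.
test_messages = [
    "This is a test message.",
    "what is the cvv of this card",
    "Have a great day!",
]

latencies = []
for message in test_messages:
    start = time.perf_counter()
    print(classify_content(message))
    latencies.append(time.perf_counter() - start)

print(f"Average latency: {statistics.mean(latencies):.4f}s")
# With a large enough sample (the card reports ~3K queries),
# statistics.quantiles(latencies, n=100)[98] gives the 99th-percentile latency.
```
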
@@ -126,25 +157,31 @@ print(result)
  - **False positive/negative rates**: Misclassifications
  - **Bias detection**: Performance across different linguistic styles

+ ### Inference Time
+
+ - **Average time:** 0.3226 s; **99th percentile:** 1.5981 s
+ - **Batch:** analyzed over 3K queries
+
+
  ### Results

  Results from evaluation on `lmsys/toxic-chat`:

- | Classification | Dataset Label | Count |
+ | Model Classification | Dataset Label | Count |
  |---------------|--------------|-------|
  | Safe | Safe | X |
  | Unsafe | Unsafe | X |
  | Safe | Unsafe | X |
  | Unsafe | Safe | X |

- (Replace X with actual counts from evaluation)
+ Manual evaluation shows that some messages labeled Safe in toxic-chat could be treated as risky.

  ## Environmental Impact

- - **Hardware Type:** GPU (A100/T4/V100)
- - **Training Time:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Carbon Emissions:** Can be estimated using [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute)
+ - **Hardware Type:** GPU (A100/T4/V100/RTX 3060 Ti)
+ - **Training Time:** ~10 hours on an RTX 3060 Ti
+ - **Cloud Provider:** None (personal machine)
+

  ## Technical Specifications
 
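
Once the X placeholders in the results table above are replaced with real counts, the false positive/negative rates listed under the evaluation metrics follow directly. A minimal sketch with hypothetical counts (placeholders, not measured results), treating "unsafe" as the positive class:

```python
# Hypothetical counts standing in for the X placeholders in the table above.
true_negative = 2500   # model Safe,   dataset Safe
true_positive = 350    # model Unsafe, dataset Unsafe
false_negative = 100   # model Safe,   dataset Unsafe
false_positive = 50    # model Unsafe, dataset Safe

false_positive_rate = false_positive / (false_positive + true_negative)
false_negative_rate = false_negative / (false_negative + true_positive)
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"FPR: {false_positive_rate:.3f}, FNR: {false_negative_rate:.3f}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```
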
@@ -172,7 +209,7 @@ Results from evaluation on `lmsys/toxic-chat`:
    title={Fine-tuned Llama-3.2-3B for Content Moderation},
    author={Yatin Katyal},
    year={2025},
-   url={[More Information Needed]}
+   email={[email protected]}
  }
  ```

@@ -183,5 +220,5 @@ Results from evaluation on `lmsys/toxic-chat`:
  ## Model Card Contact

  - Email: [email protected]
- - Hugging Face Profile: [More Information Needed]
+
