Personify 67M

This model is a fine-tuned version of distilbert-base-uncased that identifies one of four predefined individuals from how they introduce themselves or are referred to in text. It was trained on the custom dataset qingy2024/RCJ-Dataset.

The primary goal of this model is to classify input text into one of the following categories, representing different individuals or groups:

  • 0: Qing_Group
  • 1: Ruo_Group
  • 2: Fwooter_Group
  • 3: Jimmy_Group

This model can be useful in specific contexts, like chat moderation or user tagging, where identifying these individuals from their known aliases is required.

Model Details

  • Base Model: distilbert-base-uncased
  • Model Size: ~67M parameters (F32, Safetensors)
  • Fine-tuning Dataset: qingy2024/RCJ-Dataset
  • Language: English (en)
  • Task: Text Classification
  • Number of Labels: 4
  • Label Mapping:
    • 0: Qing_Group (Associated aliases: Chunkamo, Qing, Chungust)
    • 1: Ruo_Group (Associated aliases: Woundamo, Ruo, Ruoyun, NN, Neonark, Wounder)
    • 2: Fwooter_Group (Associated aliases: Fwooter, Fwattomo, Baby, Babu, Fwatty)
    • 3: Jimmy_Group (Associated aliases: Jimmy, GG, Jimmamo, Gart goo, Slam dunk, Little Tiub, Thin teeks, Yeet yart slam dunk)
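
For reference, this mapping is stored in the model's id2label / label2id config fields (standard transformers config attributes); a minimal equivalent in Python:

# Equivalent dictionaries for the mapping above.
id2label = {0: "Qing_Group", 1: "Ruo_Group", 2: "Fwooter_Group", 3: "Jimmy_Group"}
label2id = {label: idx for idx, label in id2label.items()}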

How to Use 🚀

You can use this model directly with the transformers library for inference.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name_or_path = "qingy2024/personify-67m"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)

# Example texts
texts = [
    "hello, i am qing",
    "hi rcj I am ruo",
    "i am gg",
    "call me baby for short"
]

# Perform inference
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=-1)
    predicted_class_id = torch.argmax(probabilities, dim=-1).item()
    
    predicted_label = model.config.id2label[predicted_class_id]
    confidence = probabilities[0][predicted_class_id].item()
    
    print(f"Input: '{text}'")
    print(f"Predicted Label ID: {predicted_class_id}")
    print(f"Predicted Label: {predicted_label}")
    print(f"Confidence: {confidence:.4f}")
    print("---")

# Expected Output (will vary slightly):
# Input: 'hello, i am qing'
# Predicted Label ID: 0
# Predicted Label: Qing_Group
# Confidence: 0.99XX
# ---
# Input: 'hi rcj I am ruo'
# Predicted Label ID: 1
# Predicted Label: Ruo_Group
# Confidence: 0.99XX
# ---
# Input: 'i am gg'
# Predicted Label ID: 3
# Predicted Label: Jimmy_Group
# Confidence: 0.99XX
# ---
# Input: 'call me baby for short'
# Predicted Label ID: 2
# Predicted Label: Fwooter_Group
# Confidence: 0.99XX
# ---
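
Alternatively, the same predictions can be obtained with the high-level pipeline API; a minimal sketch:

from transformers import pipeline

# The text-classification pipeline returns the top label and score per input.
classifier = pipeline("text-classification", model="qingy2024/personify-67m")

print(classifier(["hello, i am qing", "i am gg"]))
# e.g. [{'label': 'Qing_Group', 'score': 0.99...}, {'label': 'Jimmy_Group', 'score': 0.99...}]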

Training Procedure 🛠️

This model was fine-tuned using the Hugging Face transformers library and the Trainer API.

  • Dataset: qingy2024/RCJ-Dataset, containing example sentences paired with person labels.
  • Preprocessing: Texts were tokenized with the distilbert-base-uncased tokenizer.
  • Training Script: A custom Python script built on transformers.Trainer (a minimal sketch follows this list).
  • Key Hyperparameters (Example):
    • Learning Rate: 2e-5
    • Batch Size: 16
    • Number of Epochs: 3
    • Optimizer: AdamW
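
A minimal sketch of how such a fine-tuning run could look with Trainer. The column names ("text", "label") and split names are assumptions about the dataset schema, not details from the original training script:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed dataset schema: a "train" split with "text" and "label" columns.
dataset = load_dataset("qingy2024/RCJ-Dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4,
    id2label={0: "Qing_Group", 1: "Ruo_Group", 2: "Fwooter_Group", 3: "Jimmy_Group"},
)

args = TrainingArguments(
    output_dir="personify-67m",
    learning_rate=2e-5,              # values from the hyperparameter list above
    per_device_train_batch_size=16,
    num_train_epochs=3,              # AdamW is the Trainer default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
)
trainer.train()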

Evaluation Results 📊

Performance metrics on a held-out test split of qingy2024/RCJ-Dataset (a sketch of the metric computation follows the list):

  • Accuracy: 1.0
  • F1-score (weighted): 1.0
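
A minimal sketch of how these metrics can be computed with scikit-learn. The arrays here are illustrative placeholders, not real model outputs:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Illustrative only: replace with real predictions and gold labels from the
# test split, e.g. preds = trainer.predict(tokenized["test"]).predictions.argmax(-1)
labels = np.array([0, 1, 3, 2])
preds = np.array([0, 1, 3, 2])

print("Accuracy:", accuracy_score(labels, preds))
print("F1 (weighted):", f1_score(labels, preds, average="weighted"))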

Intended Uses & Limitations ⚠️

Intended Uses:

  • This model is primarily intended for identifying one of the four specific individuals/groups within a controlled environment where their aliases are known and frequently used (e.g., a specific online community, chat platform, or game).
  • It can be used for automated tagging, content routing, or user-specific interactions based on text input.

Limitations:

  • Specificity: The model is only trained to recognize the predefined set of individuals and their associated aliases. It will not generalize to other individuals or aliases not present in the training data.
  • Context Dependence: Performance may degrade if the input text significantly deviates from the style and phrasing present in the qingy2024/RCJ-Dataset.
  • Ambiguity: Some aliases might be common words or phrases, potentially leading to misclassifications if the context is not clear.
  • Not for General Person Recognition: This is NOT a general-purpose Named Entity Recognition (NER) model or a face/biometric recognition system. It only works with the specific textual aliases it was trained on.
  • Data Quality: The model's performance is heavily reliant on the quality and representativeness of the qingy2024/RCJ-Dataset.

Bias, Risks, and Ethical Considerations

  • Misidentification: There's a risk of misidentifying individuals, which could lead to incorrect actions or assumptions if the model is used in an automated decision-making process.
  • Over-reliance: Users should be cautious about over-relying on the model's predictions without human oversight, especially in sensitive applications.
  • Privacy: If the aliases are linked to real-world identities, ensure that the use of this model complies with privacy regulations and user consent.
  • Fairness: While the training data is synthetic for this specific task, always be mindful that biases in data can lead to biased model behavior.

Disclaimer

This model was created for a specific, limited use case and for experimental purposes. Please use it responsibly and be aware of its limitations.


Model Card Authors: qingy2024

Contact: qingy2024
