CLIP-based Facial Expression Recognition (FER-2013)
This is a fine-tuned version of openai/clip-vit-large-patch14 for the task of Facial Expression Recognition (FER).
This model was trained on the FER-2013 dataset and can classify a facial image into one of seven emotions: angry, disgust, fear, happy, neutral, sad, and surprise. It was created using a transfer learning approach where the pre-trained CLIP vision encoder was frozen, and a new linear classification head was trained on top of it to recognize the emotion classes.
Model Description
- Base Model: openai/clip-vit-large-patch14
- Task: Image Classification (Facial Expression Recognition)
- Framework: PyTorch
- Dataset: FER-2013
- Final Accuracy (Test Set): 72%
Intended Uses & Limitations
Intended Uses
This model is intended for academic research and as a baseline for developing more advanced emotion recognition systems. Potential applications include:
- Analyzing sentiment in user-submitted media (e.g., product review videos).
- Content analysis for social science research on emotion portrayal in images.
- A building block for assistive technology applications.
Limitations and Bias
This model inherits the limitations of its training data, the FER-2013 dataset.
- Dataset Bias: The FER-2013 dataset is known to have biases in its representation of age, gender, and race. As a result, the model's performance may be inconsistent across different demographic groups. It is not recommended for use in production systems that affect individuals without thorough bias evaluation and mitigation.
- Posed vs. Natural Expressions: The dataset primarily contains posed, front-facing, and often exaggerated expressions. The model will likely perform worse on real-world images that feature subtle, natural, or non-frontal expressions.
- Ambiguity of Emotion: Emotion is subjective and context-dependent. A static image cannot capture the full story. The model's predictions are based on learned visual patterns from the dataset and should not be considered an objective measure of a person's true emotional state.
- Misuse Potential: This model should NOT be used for applications that involve making automated judgments about an individual's character, truthfulness, or employability. It is not suitable for surveillance or any application that could have a significant adverse impact on people's lives.
How to Use
To use this model, you first need to define its custom architecture, then load the saved weights from this repository.
1. Installation
```bash
pip install transformers torch safetensors Pillow huggingface_hub requests
```
2. Prediction Script
This runnable script downloads the model from this repository and predicts the emotion for an example image from the web.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import AutoProcessor, CLIPVisionModel
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# --- Configuration ---
# Repository ID for the model on the Hugging Face Hub
REPO_ID = "syntheticbot/clip-face-expression"
FILENAME = "model.safetensors"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# The class names must be in alphabetical order, as used during training
CLASS_NAMES = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
NUM_CLASSES = len(CLASS_NAMES)

# --- Define the Model Architecture ---
# This class must match the architecture of the saved model
class ClipClassifier(nn.Module):
    def __init__(self, vision_model, num_classes):
        super(ClipClassifier, self).__init__()
        self.vision_model = vision_model
        # The base model's config provides the hidden size for the classifier
        self.classifier = nn.Linear(vision_model.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        image_features = outputs.pooler_output
        logits = self.classifier(image_features)
        return logits

# --- Load Model and Processor ---
print("Loading model and processor...")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(DEVICE)

# Instantiate the custom classifier
model = ClipClassifier(vision_model, NUM_CLASSES).to(DEVICE)

# Download the model weights from the Hub and load them
print(f"Downloading model from {REPO_ID}...")
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
state_dict = load_file(model_path, device=DEVICE)
model.load_state_dict(state_dict)
model.eval()

# --- Run Prediction on an Example Image ---
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Carol_Burnett_1958.JPG/250px-Carol_Burnett_1958.JPG"
try:
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
except Exception as e:
    print(f"Could not load image from URL: {e}")
    exit()

print("Processing image and making prediction...")
with torch.no_grad():
    processed_image = processor(images=image, return_tensors="pt")['pixel_values'].to(DEVICE)
    logits = model(processed_image)
    probabilities = F.softmax(logits, dim=1)
    top_prob, top_idx = torch.max(probabilities, 1)

predicted_class = CLASS_NAMES[top_idx.item()]
print(f"\nPredicted Emotion: {predicted_class}")
print(f"Confidence: {top_prob.item() * 100:.2f}%")
```
Training Procedure
The model was trained using a transfer learning approach. The openai/clip-vit-large-patch14 vision encoder was used as a frozen feature extractor. A single linear layer was added as a classification head and trained on the FER-2013 dataset.
Hyperparameters
- Learning Rate: 1e-3
- Batch Size: 128
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
- Number of Epochs: 10
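The training script itself is not included in this repository; the snippet below is only a minimal sketch of the procedure described above (frozen encoder, single linear head, Adam at 1e-3, cross-entropy loss, 10 epochs). The `train_loader` yielding `(pixel_values, labels)` batches preprocessed with the CLIP processor is a hypothetical placeholder, and `ClipClassifier` refers to the class defined in the "How to Use" section.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Sketch of the described training setup; `train_loader` is assumed to yield
# (pixel_values, labels) batches already preprocessed with the CLIP processor.
device = "cuda" if torch.cuda.is_available() else "cpu"
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

# Freeze the CLIP vision encoder; only the linear head receives gradients.
for param in vision_model.parameters():
    param.requires_grad = False

model = ClipClassifier(vision_model, num_classes=7).to(device)  # class from "How to Use"
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for pixel_values, labels in train_loader:
        pixel_values, labels = pixel_values.to(device), labels.to(device)
        logits = model(pixel_values)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```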
Evaluation Results
The model was evaluated on the FER-2013 test set, which contains 7,178 images.
Overall Accuracy: 72%
Classification Report
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| angry | 0.67 | 0.62 | 0.65 | 958 |
| disgust | 0.68 | 0.64 | 0.66 | 111 |
| fear | 0.54 | 0.51 | 0.53 | 1024 |
| happy | 0.89 | 0.93 | 0.91 | 1774 |
| neutral | 0.68 | 0.74 | 0.71 | 1233 |
| sad | 0.61 | 0.61 | 0.61 | 1247 |
| surprise | 0.83 | 0.76 | 0.79 | 831 |
| accuracy | | | 0.72 | 7178 |
| macro avg | 0.70 | 0.69 | 0.69 | 7178 |
| weighted avg | 0.72 | 0.72 | 0.72 | 7178 |
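The layout of the report above matches scikit-learn's `classification_report`. A comparable report can be reproduced with a sketch like the following, assuming scikit-learn is installed, that `model`, `processor`, `DEVICE`, and `CLASS_NAMES` are loaded as in the prediction script, and that the FER-2013 test split is stored locally in class-named folders such as `fer2013/test/angry/` (a hypothetical layout).

```python
import torch
from pathlib import Path
from PIL import Image
from sklearn.metrics import classification_report

# Assumes `model`, `processor`, DEVICE, and CLASS_NAMES from the prediction script,
# and a local FER-2013 test split laid out as fer2013/test/<class_name>/<image>.
test_root = Path("fer2013/test")  # hypothetical local path
y_true, y_pred = [], []

model.eval()
with torch.no_grad():
    for label_idx, class_name in enumerate(CLASS_NAMES):
        for image_path in sorted((test_root / class_name).glob("*")):
            image = Image.open(image_path).convert("RGB")
            pixel_values = processor(images=image, return_tensors="pt")["pixel_values"].to(DEVICE)
            y_pred.append(model(pixel_values).argmax(dim=1).item())
            y_true.append(label_idx)

print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=2))
```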
Citation
Citing this Model
```bibtex
@misc{syntheticbot_2024_ferclip,
  author = {syntheticbot},
  title = {CLIP-based Facial Expression Recognition (FER-2013)},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/syntheticbot/clip-face-expression}}
}
```
Citing the Original CLIP Model
```bibtex
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International conference on machine learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}
```