CLIP-based Facial Expression Recognition (FER-2013)
This is a fine-tuned version of openai/clip-vit-large-patch14 for the task of Facial Expression Recognition (FER).
This model was trained on the FER-2013 dataset and can classify a facial image into one of seven emotions: angry, disgust, fear, happy, neutral, sad, and surprise. It was created using a transfer learning approach where the pre-trained CLIP vision encoder was frozen, and a new linear classification head was trained on top of it to recognize the emotion classes.
Model Description
- Base Model: openai/clip-vit-large-patch14
- Task: Image Classification (Facial Expression Recognition)
- Framework: PyTorch
- Dataset: FER-2013
- Final Accuracy (Test Set): 72%
Intended Uses & Limitations
Intended Uses
This model is intended for academic research and as a baseline for developing more advanced emotion recognition systems. Potential applications include:
- Analyzing sentiment in user-submitted media (e.g., product review videos).
- Content analysis for social science research on emotion portrayal in images.
- A building block for assistive technology applications.
Limitations and Bias
This model inherits the limitations of its training data, the FER-2013 dataset.
- Dataset Bias: The FER-2013 dataset is known to have biases in its representation of age, gender, and race. As a result, the model's performance may be inconsistent across different demographic groups. It is not recommended for use in production systems that affect individuals without thorough bias evaluation and mitigation.
- Posed vs. Natural Expressions: The dataset primarily contains posed, front-facing, and often exaggerated expressions. The model will likely perform worse on real-world images that feature subtle, natural, or non-frontal expressions.
- Ambiguity of Emotion: Emotion is subjective and context-dependent. A static image cannot capture the full story. The model's predictions are based on learned visual patterns from the dataset and should not be considered an objective measure of a person's true emotional state.
- Misuse Potential: This model should NOT be used for applications that involve making automated judgments about an individual's character, truthfulness, or employability. It is not suitable for surveillance or any application that could have a significant adverse impact on people's lives.
How to Use
To use this model, you first need to define its custom architecture, then load the saved weights from this repository.
1. Installation
```bash
pip install transformers torch safetensors Pillow huggingface_hub requests
```
2. Prediction Script
This runnable script downloads the model from this repository and predicts the emotion for an example image from the web.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import AutoProcessor, CLIPVisionModel
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# --- Configuration ---
# Repository ID for the model on the Hugging Face Hub
REPO_ID = "syntheticbot/clip-face-expression"
FILENAME = "model.safetensors"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# The class names must be in alphabetical order, as used during training
CLASS_NAMES = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
NUM_CLASSES = len(CLASS_NAMES)

# --- Define the Model Architecture ---
# This class must match the architecture of the saved model
class ClipClassifier(nn.Module):
    def __init__(self, vision_model, num_classes):
        super(ClipClassifier, self).__init__()
        self.vision_model = vision_model
        # The base model's config provides the hidden size for the classifier
        self.classifier = nn.Linear(vision_model.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        image_features = outputs.pooler_output
        logits = self.classifier(image_features)
        return logits

# --- Load Model and Processor ---
print("Loading model and processor...")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(DEVICE)

# Instantiate the custom classifier
model = ClipClassifier(vision_model, NUM_CLASSES).to(DEVICE)

# Download the model weights from the Hub and load them
print(f"Downloading model from {REPO_ID}...")
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
state_dict = load_file(model_path, device=DEVICE)
model.load_state_dict(state_dict)
model.eval()

# --- Run Prediction on an Example Image ---
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Carol_Burnett_1958.JPG/250px-Carol_Burnett_1958.JPG"
try:
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
except Exception as e:
    print(f"Could not load image from URL: {e}")
    exit()

print("Processing image and making prediction...")
with torch.no_grad():
    processed_image = processor(images=image, return_tensors="pt")['pixel_values'].to(DEVICE)
    logits = model(processed_image)
    probabilities = F.softmax(logits, dim=1)
    top_prob, top_idx = torch.max(probabilities, 1)

predicted_class = CLASS_NAMES[top_idx.item()]
print(f"\nPredicted Emotion: {predicted_class}")
print(f"Confidence: {top_prob.item() * 100:.2f}%")
```
Training Procedure
The model was trained using a transfer learning approach. The openai/clip-vit-large-patch14 vision encoder was used as a frozen feature extractor. A single linear layer was added as a classification head and trained on the FER-2013 dataset.
Hyperparameters
- Learning Rate: 1e-3
- Batch Size: 128
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
- Number of Epochs: 10
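The training script itself is not included in this repository; the snippet below is only a minimal sketch of the procedure described above (frozen encoder, single linear head, Adam at 1e-3, cross-entropy loss, 10 epochs). The `train_loader` yielding `(pixel_values, labels)` batches preprocessed with the CLIP processor is a hypothetical placeholder, and `ClipClassifier` refers to the class defined in the "How to Use" section.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Sketch of the described training setup; `train_loader` is assumed to yield
# (pixel_values, labels) batches already preprocessed with the CLIP processor.
device = "cuda" if torch.cuda.is_available() else "cpu"
vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

# Freeze the CLIP vision encoder; only the linear head receives gradients.
for param in vision_model.parameters():
    param.requires_grad = False

model = ClipClassifier(vision_model, num_classes=7).to(device)  # class from "How to Use"
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for pixel_values, labels in train_loader:
        pixel_values, labels = pixel_values.to(device), labels.to(device)
        logits = model(pixel_values)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```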
Evaluation Results
The model was evaluated on the FER-2013 test set, which contains 7,178 images.
Overall Accuracy: 72%
Classification Report
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| angry | 0.67 | 0.62 | 0.65 | 958 |
| disgust | 0.68 | 0.64 | 0.66 | 111 |
| fear | 0.54 | 0.51 | 0.53 | 1024 |
| happy | 0.89 | 0.93 | 0.91 | 1774 |
| neutral | 0.68 | 0.74 | 0.71 | 1233 |
| sad | 0.61 | 0.61 | 0.61 | 1247 |
| surprise | 0.83 | 0.76 | 0.79 | 831 |
| accuracy | | | 0.72 | 7178 |
| macro avg | 0.70 | 0.69 | 0.69 | 7178 |
| weighted avg | 0.72 | 0.72 | 0.72 | 7178 |
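The layout of the report above matches scikit-learn's `classification_report`. A comparable report can be reproduced with a sketch like the following, assuming scikit-learn is installed, that `model`, `processor`, `DEVICE`, and `CLASS_NAMES` are loaded as in the prediction script, and that the FER-2013 test split is stored locally in class-named folders such as `fer2013/test/angry/` (a hypothetical layout).

```python
import torch
from pathlib import Path
from PIL import Image
from sklearn.metrics import classification_report

# Assumes `model`, `processor`, DEVICE, and CLASS_NAMES from the prediction script,
# and a local FER-2013 test split laid out as fer2013/test/<class_name>/<image>.
test_root = Path("fer2013/test")  # hypothetical local path
y_true, y_pred = [], []

model.eval()
with torch.no_grad():
    for label_idx, class_name in enumerate(CLASS_NAMES):
        for image_path in sorted((test_root / class_name).glob("*")):
            image = Image.open(image_path).convert("RGB")
            pixel_values = processor(images=image, return_tensors="pt")["pixel_values"].to(DEVICE)
            y_pred.append(model(pixel_values).argmax(dim=1).item())
            y_true.append(label_idx)

print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=2))
```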
Citation
Citing this Model
```bibtex
@misc{syntheticbot_2024_ferclip,
  author = {syntheticbot},
  title = {CLIP-based Facial Expression Recognition (FER-2013)},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/syntheticbot/clip-face-expression}}
}
```
Citing the Original CLIP Model
```bibtex
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International conference on machine learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}
```