PIDIT: Political Ideology Detection in Italian Texts

A Multi-Task BERT + ALBERTO Model for Gender and Ideology Prediction 🇮🇹

This tf.keras model combines two pre-trained encoders, BERT and ALBERTO, to perform multi-task classification on Italian-language texts.
It predicts three targets from the same input text:

- Author gender (binary classification)
- Binary ideology (e.g., progressive vs conservative)
- Multiclass ideology (4 ideological classes)

✨ Architecture

- TFBertModel from bert-base-italian-uncased (frozen)
- TFAutoModel from alberto-base-uncased (frozen)
- Concatenated outputs + dense layers
- Three output heads:
  - gender: Dense(1, activation="sigmoid")
  - ideology_binary: Dense(1, activation="sigmoid")
  - ideology_multiclass: Dense(4, activation="softmax")
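
As a rough illustration of how the pieces listed above fit together, the following NumPy sketch pushes random stand-in vectors through a concatenation, one shared dense layer, and the three heads. It is not the trained model: the 768-dimensional encoder outputs, the single 128-unit shared layer, the ReLU activation, and all names (`forward_heads`, `params`) are assumptions made for illustration, and the weights are random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def forward_heads(bert_vec, alberto_vec, params):
    """Concatenate two sentence vectors and apply a shared layer plus three heads."""
    merged = np.concatenate([bert_vec, alberto_vec], axis=-1)
    hidden = np.maximum(0.0, merged @ params["shared_w"] + params["shared_b"])  # ReLU (assumed)
    return {
        "gender": sigmoid(hidden @ params["gender_w"] + params["gender_b"]),
        "ideology_binary": sigmoid(hidden @ params["binary_w"] + params["binary_b"]),
        "ideology_multiclass": softmax(hidden @ params["multi_w"] + params["multi_b"]),
    }

ENCODER_DIM = 768   # assumed encoder output size
SHARED_UNITS = 128  # assumed width of the shared dense layer
rng = np.random.default_rng(0)

# Random weights only; the real model's trained weights are not reproduced here.
layer_sizes = {
    "shared": (2 * ENCODER_DIM, SHARED_UNITS),
    "gender": (SHARED_UNITS, 1),
    "binary": (SHARED_UNITS, 1),
    "multi": (SHARED_UNITS, 4),
}
params = {}
for name, (n_in, n_out) in layer_sizes.items():
    params[f"{name}_w"] = rng.normal(scale=0.02, size=(n_in, n_out))
    params[f"{name}_b"] = np.zeros(n_out)

batch_size = 2
out = forward_heads(
    rng.normal(size=(batch_size, ENCODER_DIM)),  # stand-in for the BERT output
    rng.normal(size=(batch_size, ENCODER_DIM)),  # stand-in for the ALBERTO output
    params,
)
print({name: values.shape for name, values in out.items()})
```

The point of the sketch is the output shapes: one probability per example for each sigmoid head, and a four-way distribution per example for the softmax head.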

📥 Input

The model takes 6 input tensors:

bert_input_ids, bert_token_type_ids, bert_attention_mask
alberto_input_ids, alberto_token_type_ids, alberto_attention_mask

All tensors have shape (batch_size, max_length).
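
Before calling the model it can help to confirm that all six tensors are present and share one two-dimensional shape. The helper below is a minimal sketch of such a check; `check_inputs` and `EXPECTED_KEYS` are hypothetical names that are not part of the model repository, and the dummy arrays only stand in for real tokenizer output.

```python
import numpy as np

EXPECTED_KEYS = (
    "bert_input_ids", "bert_token_type_ids", "bert_attention_mask",
    "alberto_input_ids", "alberto_token_type_ids", "alberto_attention_mask",
)

def check_inputs(inputs):
    """Return the common (batch_size, max_length) shape or raise on a mismatch."""
    missing = [key for key in EXPECTED_KEYS if key not in inputs]
    if missing:
        raise KeyError(f"missing input tensors: {missing}")
    shapes = {key: tuple(np.shape(inputs[key])) for key in EXPECTED_KEYS}
    if len(set(shapes.values())) != 1:
        raise ValueError(f"input tensors disagree on shape: {shapes}")
    shape = shapes[EXPECTED_KEYS[0]]
    if len(shape) != 2:
        raise ValueError(f"expected (batch_size, max_length), got {shape}")
    return shape

# Dummy batch of two sequences padded to 250 tokens.
dummy = {key: np.zeros((2, 250), dtype=np.int32) for key in EXPECTED_KEYS}
print(check_inputs(dummy))
```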

🚀 Usage

Load model and tokenizers

from huggingface_hub import snapshot_download
from transformers import TFBertModel, TFAutoModel
import tensorflow as tf

# Download the model locally
model_path = snapshot_download("leeeov4/PIDIT")

# Load the model
model = tf.keras.models.load_model(model_path, custom_objects={
    "TFBertModel": TFBertModel,
    "TFAutoModel": TFAutoModel
})

# Load the tokenizers from their subfolders in the model repository
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="bert_tokenizer")
alberto_tokenizer = AutoTokenizer.from_pretrained("leeeov4/PIDIT", subfolder="alberto_tokenizer")

Preprocessing Example

def preprocess_text(text, max_length=250):
    bert_tokens = bert_tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='tf')
    alberto_tokens = alberto_tokenizer(text, max_length=max_length, padding='max_length', truncation=True, return_tensors='tf')

    return {
        'bert_input_ids': bert_tokens['input_ids'],
        'bert_token_type_ids': bert_tokens['token_type_ids'],
        'bert_attention_mask': bert_tokens['attention_mask'],
        'alberto_input_ids': alberto_tokens['input_ids'],
        'alberto_token_type_ids': alberto_tokens['token_type_ids'],
        'alberto_attention_mask': alberto_tokens['attention_mask']
    }
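
Hugging Face tokenizers also accept a list of strings, so the same preprocessing can be applied to several texts at once. The variant below is a sketch rather than part of the released code: `preprocess_batch` is a hypothetical helper, and it takes the two tokenizers as arguments instead of reading them from global variables.

```python
def preprocess_batch(texts, bert_tok, alberto_tok, max_length=250, return_tensors="tf"):
    """Tokenize a list of texts with both tokenizers and build the six model inputs."""
    kwargs = dict(
        max_length=max_length,
        padding="max_length",   # pad every sequence to max_length
        truncation=True,        # cut sequences longer than max_length
        return_tensors=return_tensors,
    )
    encodings = {
        "bert": bert_tok(list(texts), **kwargs),
        "alberto": alberto_tok(list(texts), **kwargs),
    }
    features = {}
    for prefix, encoding in encodings.items():
        for name in ("input_ids", "token_type_ids", "attention_mask"):
            features[f"{prefix}_{name}"] = encoding[name]
    return features
```

With the tokenizers loaded as shown earlier, a call such as `preprocess_batch(["primo testo", "secondo testo"], bert_tokenizer, alberto_tokenizer)` would yield tensors of shape (2, 250).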

Inference

text = "Oggi, sabato 31 dicembre, alle ore 9.34, nel Monastero Mater Ecclesiae in Vaticano, il Signore ha chiamato a Sé il Santo Padre Emerito Benedetto XVI."
inputs = preprocess_text(text)
outputs = model.predict(inputs)

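# predict returns one array per head: gender, ideology_binary, ideology_multiclass
gender_prob = outputs[0][0][0]            # shape (1, 1) -> scalar probability
ideology_binary_prob = outputs[1][0][0]   # shape (1, 1) -> scalar probability
ideology_multiclass_probs = outputs[2][0] # shape (1, 4) -> four class probabilities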

print("Predicted gender (male probability):", gender_prob)
print("Predicted binary ideology (left probability):", ideology_binary_prob)
print("Multiclass ideology distribution (left, right, moderate left, moderate right):", ideology_multiclass_probs)
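
To turn the raw probabilities into labels, one option is to threshold the two sigmoid outputs and take the argmax of the softmax output. The sketch below follows the label order used in the print statements above; the 0.5 threshold, the complementary labels "female" and "right", and the `decode_outputs` helper itself are assumptions rather than part of the released code. The example runs on hand-written arrays, not on real model output.

```python
import numpy as np

# Class order taken from the print statement above.
MULTICLASS_LABELS = ("left", "right", "moderate left", "moderate right")

def decode_outputs(outputs, threshold=0.5):
    """Map the three output arrays to one dict of labels per example."""
    gender_probs = np.asarray(outputs[0]).reshape(-1)
    binary_probs = np.asarray(outputs[1]).reshape(-1)
    multi_probs = np.asarray(outputs[2])
    decoded = []
    for gender_p, binary_p, multi_p in zip(gender_probs, binary_probs, multi_probs):
        decoded.append({
            "gender": "male" if gender_p >= threshold else "female",        # assumed complement label
            "ideology_binary": "left" if binary_p >= threshold else "right",  # assumed complement label
            "ideology_multiclass": MULTICLASS_LABELS[int(np.argmax(multi_p))],
        })
    return decoded

# Hand-written probabilities shaped like model.predict output for a single text.
fake_outputs = [
    np.array([[0.8]]),
    np.array([[0.3]]),
    np.array([[0.1, 0.2, 0.6, 0.1]]),
]
print(decode_outputs(fake_outputs))
# [{'gender': 'male', 'ideology_binary': 'right', 'ideology_multiclass': 'moderate left'}]
```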