Category Labelling?
Hey, thanks for providing these great models!
I've noticed that the labels output by the model range from 0 to 20. However, the CAP categories range from 1 to 23, with categories 11 and 22 being nonexistent. I am assuming that you simply indexed the existing categories in ascending order, starting from 0, and added the "LABEL_" prefix afterwards. This would make it so that LABEL_0 is CAP category 1, LABEL_10 is CAP 12 (as 11 does not exist), LABEL_20 is CAP 23 (as 22 does not exist either), etc.
However, I cannot find any documentation on this, so I would appreciate it if you could confirm or point me in the direction of the correct code translation scheme. Tbh, it is rather confusing when first encountered and makes for a very disappointing first model evaluation, so maybe a hint on the model page would be helpful?
I realized the "LABEL_" prefix was only added in the id2label config of the model. So if my above assumptions about labeling are correct, this would be the code to turn the label output into actual CAP codes (minus the LABEL_ prefix):
from transformers import AutoModelForSequenceClassification
# translates model output indices to CAP Major Policy Codes
CAP_NUM_DICT = {
    0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7',
    7: '8', 8: '9', 9: '10', 10: '12', 11: '13', 12: '14', 13: '15',
    14: '16', 15: '17', 16: '18', 17: '19', 18: '20', 19: '21', 20: '23',
}
# load the model
cap_model = AutoModelForSequenceClassification.from_pretrained("poltextlab/xlm-roberta-large-english-media-cap-v3")
cap_model.config.id2label = CAP_NUM_DICT # replace the labels with the CAP Major Policy Codes
cap_model.config.label2id = {value: key for key, value in CAP_NUM_DICT.items()} # same for the label2id (reversing the dictionary)
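For anyone who would rather not type the dictionary by hand: the same mapping can be derived from the rule described above. This is only a sketch, assuming the codes really are 1 to 23 with 11 and 22 unused and indexed in ascending order; `build_cap_mapping` and `MISSING_CODES` are made-up names, not part of the model or of transformers.

```python
# Hypothetical helper: derive the index -> CAP code mapping from the rule
# "codes 1..23, skipping 11 and 22, indexed in ascending order from 0".
# The rule itself is an assumption taken from this thread.
MISSING_CODES = frozenset({11, 22})


def build_cap_mapping(first=1, last=23, missing=MISSING_CODES):
    """Return {model_index: cap_code_as_string} for the existing CAP codes."""
    codes = [str(code) for code in range(first, last + 1) if code not in missing]
    return dict(enumerate(codes))


cap_mapping = build_cap_mapping()

# Reverse lookup, analogous to label2id in the model config.
cap_to_index = {code: index for index, code in cap_mapping.items()}
```

The result can be assigned to `config.id2label` in the same way as the hand-written dictionary.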
Dear Tim König,
We will update the model cards accordingly. Thank you for noting this issue. The translation table from the CAP model results to the CAP codes is the following:
0: '1',
1: '2',
2: '3',
3: '4',
4: '5',
5: '6',
6: '7',
7: '8',
8: '9',
9: '10',
10: '12',
11: '13',
12: '14',
13: '15',
14: '16',
15: '17',
16: '18',
17: '19',
18: '20',
19: '21',
20: '23'
Please let us know if you have any other questions.
Best Regards,
poltextLAB
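For readers who get "LABEL_<n>" strings back from the model rather than raw indices, here is a minimal sketch of applying the table above. It assumes labels have the form "LABEL_<n>" as described earlier in this thread; `label_to_cap` and the sample labels are hypothetical and only for illustration.

```python
# CAP codes in model-index order, following the translation table in this
# thread (codes 11 and 22 do not exist).
CAP_CODES = [str(code) for code in range(1, 24) if code not in (11, 22)]


def label_to_cap(label):
    """Translate 'LABEL_<n>' (or a bare index n) into a CAP code string."""
    # Take whatever follows the last underscore; a bare index passes through.
    index = int(str(label).rsplit("_", 1)[-1])
    if not 0 <= index < len(CAP_CODES):
        raise ValueError(f"unexpected model label: {label!r}")
    return CAP_CODES[index]


# Made-up example labels, not real model output.
sample_labels = ["LABEL_0", "LABEL_10", "LABEL_20"]
sample_codes = [label_to_cap(label) for label in sample_labels]
```

Raising on out-of-range indices is a design choice here, so that a mismatch between model and table is noticed instead of silently mislabelled.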
Thanks for the quick reply! On a related note, I noticed that some of the coded media datasets provided by the CAP use additional, media-specific codes. While it is perfectly reasonable to not include these codes in the model, I would also suggest a quick note here, seeing how these are media-specific models.
Dear Tim König,
Thank you for noting this. We didn't include media labels for this specific model because we wanted the coding system to be as consistent as possible across models. We named this one a 'media' model because it was fine-tuned using only media data from the CAP dataset. We will update the model card to make this clearer.
Best Regards,
poltextLAB