nathanlane committed
Commit b0ab2f3 · verified · 1 Parent(s): b5bc2b2

Upload industrial policy classifier model (hub_ready) with automated model card

README.md CHANGED
@@ -1,3 +1,169 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: apache-2.0
+ base_model: bert-base-uncased
+ tags:
+ - text-classification
+ - industrial-policy
+ - economics
+ - policy-analysis
+ - bert
+ - government-policy
+ - trade-policy
+ language:
+ - en
+ pipeline_tag: text-classification
+ widget:
+ - text: "Government provides subsidies to promote renewable energy development"
+   example_title: "IP goal Example"
+ - text: "Company announces quarterly earnings report"
+   example_title: "No IP goal Example"
+ - text: "The document mentions policy changes"
+   example_title: "Not enough information Example"
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ library_name: transformers
+ ---
+
+ # Industrial Policy Classification Model
+
+ This model classifies text documents according to whether they describe industrial policy goals. It was fine-tuned from bert-base-uncased on a dataset of annotated policy documents and measures.
+
+ ## Model Description
+
+ This is a BERT-based text classification model trained to identify industrial policy intentions in text. It classifies text into three categories:
+
+ - **IP goal** (0): The text describes an industrial policy objective or intervention
+ - **No IP goal** (1): The text does not describe an industrial policy objective
+ - **Not enough information** (2): There is insufficient information to determine policy intent
+
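+ The hosted `config.json` maps these classes to generic `LABEL_0`/`LABEL_1`/`LABEL_2` ids. Below is a minimal sketch of attaching the human-readable names at load time; the `id2label`/`label2id` overrides are standard `transformers` config kwargs, and the mapping follows the list above:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+ model_name = "industrialpolicygroup/industrialpolicy-classifier"
+ id2label = {0: "IP goal", 1: "No IP goal", 2: "Not enough information"}
+
+ # Overrides the generic LABEL_* names shipped in config.json
+ model = AutoModelForSequenceClassification.from_pretrained(
+     model_name,
+     id2label=id2label,
+     label2id={v: k for k, v in id2label.items()},
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+ print(classifier("Tax credits are introduced to expand domestic battery manufacturing"))
+ # e.g. [{'label': 'IP goal', 'score': ...}]
+ ```
+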
+ ## Intended Use
+
+ This model is designed for research purposes: analyzing policy documents, government measures, and related texts to identify industrial policy intentions. It can be used by:
+
+ - Economics researchers studying industrial policy
+ - Policy analysts examining government interventions
+ - Data scientists working on policy text classification
+ - Government agencies analyzing policy effectiveness
+
+ ## Model Performance
+
+ - **Accuracy**: 0.941
+ - **F1 Score**: 0.941
+ - **Precision**: 0.941
+ - **Recall**: 0.941
+ - **Test Loss**: 0.2886
+
+ *Metrics evaluated on the held-out test set (full-precision values in `final_test_set_result.csv`)*
+
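+ Since the reported recall coincides exactly with accuracy (as in the shipped `final_test_set_result.csv`), these figures look like weighted averages; that is an inference, not a statement from the training code. A minimal sketch of recomputing them with scikit-learn:
+
+ ```python
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ # Illustrative stand-ins; the real evaluation would use the held-out test split
+ y_true = [0, 1, 2, 0, 1]
+ y_pred = [0, 1, 2, 0, 0]
+
+ accuracy = accuracy_score(y_true, y_pred)
+ precision, recall, f1, _ = precision_recall_fscore_support(
+     y_true, y_pred, average="weighted")
+ print(f"acc={accuracy:.3f} p={precision:.3f} r={recall:.3f} f1={f1:.3f}")
+ ```
+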
+ ## Training Data
+
+ The model was trained on annotated policy documents, including:
+
+ - Expert-annotated policy measures from multiple countries
+ - Government trade and industrial policy documents
+ - WTO and multilateral organization policy entries
+ - Economic policy texts spanning different sectors and time periods
+
+ The training dataset includes documents from countries across income levels to support robust performance in different economic contexts.
+
+ ## Training Procedure
+
+ ### Model Architecture
+
+ - **Base model**: bert-base-uncased
+ - **Architecture**: BertForSequenceClassification
+ - **Number of labels**: 3
+ - **Fine-tuning approach**: Full model fine-tuning with a classification head
+
+ ### Training Configuration
+
+ - **Optimization**: Hyperparameter tuning with Optuna (see the sketch below)
+ - **Data balancing**: Oversampling applied to handle class imbalance
+ - **Validation strategy**: Stratified splits with income-based validation
+ - **Cross-validation**: Held-out income groups to test generalization
+
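+ The card does not record the exact search space or trial budget, so the following is a hedged sketch of Optuna tuning via the `transformers` `Trainer.hyperparameter_search` API (requires `pip install optuna`); `train_ds`/`val_ds` are assumed pre-tokenized datasets and the ranges are illustrative:
+
+ ```python
+ from transformers import (AutoModelForSequenceClassification, Trainer,
+                           TrainingArguments)
+
+ def model_init():
+     # Fresh model per trial, matching the 3-label head described above
+     return AutoModelForSequenceClassification.from_pretrained(
+         "bert-base-uncased", num_labels=3)
+
+ def hp_space(trial):
+     # Illustrative search space; not the one actually used for this model
+     return {
+         "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
+         "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
+         "per_device_train_batch_size": trial.suggest_categorical(
+             "per_device_train_batch_size", [8, 16, 32]),
+     }
+
+ trainer = Trainer(
+     model_init=model_init,
+     args=TrainingArguments(output_dir="hp_tuning", eval_strategy="epoch"),
+     train_dataset=train_ds,  # assumed: tokenized, possibly oversampled, train split
+     eval_dataset=val_ds,     # assumed: stratified validation split
+ )
+ best_run = trainer.hyperparameter_search(
+     direction="maximize", backend="optuna", hp_space=hp_space, n_trials=20)
+ print(best_run.hyperparameters)
+ ```
+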
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Load model and tokenizer
+ model_name = "industrialpolicygroup/industrialpolicy-classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Create classification pipeline
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ # Example usage
+ text = "Government provides subsidies to promote renewable energy development"
+ result = classifier(text)
+ print(result)
+
+ # Expected output format:
+ # [{'label': 'LABEL_0', 'score': 0.95}]
+ #
+ # Label mappings:
+ # LABEL_0 = IP goal, LABEL_1 = No IP goal, LABEL_2 = Not enough information
+ ```
+
+ ## Limitations and Bias
+
+ - The model is trained primarily on English text
+ - Performance may vary on policy domains not well represented in the training data
+ - The model reflects its annotation guidelines and may not capture every nuance of industrial policy
+ - The model may be biased towards the types of policy language present in the training data
+ - Highly specialized policy areas may require domain adaptation
+
+ ## Evaluation and Validation
+
+ The model's evaluation included:
+
+ - Standard train/validation/test splits
+ - Income-based validation across country groups
+ - Cross-domain evaluation on different policy types
+ - Comparison with traditional machine learning baselines (see the sketch below)
+
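+ The card does not name the traditional baselines, so here is an illustrative TF-IDF + logistic regression sketch of the kind of comparison meant; the example texts and labels are hypothetical stand-ins for the annotated corpus:
+
+ ```python
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.pipeline import make_pipeline
+
+ # Hypothetical stand-ins for the annotated policy corpus
+ texts = [
+     "Subsidies introduced for domestic steel producers",
+     "Quarterly earnings exceeded analyst estimates",
+     "The document mentions policy changes",
+ ]
+ labels = [0, 1, 2]  # 0 = IP goal, 1 = No IP goal, 2 = Not enough information
+
+ baseline = make_pipeline(
+     TfidfVectorizer(ngram_range=(1, 2)),
+     LogisticRegression(max_iter=1000),
+ )
+ baseline.fit(texts, labels)
+ print(baseline.predict(["New tariffs protect the semiconductor industry"]))
+ ```
+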
+ ## Ethical Considerations
+
+ This model is intended for research and analysis purposes. Users should be aware that:
+
+ - Policy classification can have implications for economic research and policy recommendations
+ - The model's outputs should be interpreted by domain experts
+ - Results should be validated against human expert judgment for critical applications
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @article{industrialpolicy2025,
+   title={Measuring Industrial Policy Using Natural Language Processing},
+   author={Lane, Nathaniel and [Additional Authors]},
+   journal={[Journal Name]},
+   year={2025}
+ }
+ ```
+
+ ## Model Details
+
+ - **Developed by**: Industrial Policy Group
+ - **Model type**: Text classification (BERT-based)
+ - **Language**: English
+ - **License**: Apache 2.0
+ - **Fine-tuned from**: bert-base-uncased
+
+ ## Technical Specifications
+
+ - **Input**: Text (up to 512 tokens; longer inputs should be truncated, as sketched below)
+ - **Output**: Classification probabilities for 3 classes
+ - **Framework**: PyTorch + Transformers
+ - **Model size**: ~110M parameters
+
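+ A minimal sketch of raw (non-pipeline) inference respecting the 512-token limit; `torch.softmax` over the logits yields the three class probabilities:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_name = "industrialpolicygroup/industrialpolicy-classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ text = "A long policy document ... " * 100  # deliberately longer than 512 tokens
+ inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
+
+ with torch.no_grad():
+     logits = model(**inputs).logits    # shape: (1, 3)
+ probs = torch.softmax(logits, dim=-1)  # probabilities over the 3 classes
+ print(probs)
+ ```
+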
+ ## Contact
+
+ For questions about this model or the research, please contact the Industrial Policy Group.
+
+ ---
+
+ *Model card auto-generated on 2025-06-19 11:24:30 from model files*
+ *Source model: bert-base-uncased-3_classes-finetuned_hub_ready_20250617_151525*
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
final_test_set_result.csv ADDED
@@ -0,0 +1,2 @@
+ test_loss,test_model_preparation_time,test_accuracy,test_f1,test_precision,test_recall,test_runtime,test_samples_per_second,test_steps_per_second
+ 0.2885808050632477,0.0013,0.9409090909090909,0.9409010797631487,0.9409677026089238,0.9409090909090909,16.0815,27.361,3.42
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:849c193a9fa5dc6ecb572b0dcb3ca516b5ff05bafc271a3a9e801735873a818c
+ size 437961724
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:db96411a66f890b4c49f005ff274c91c418a46159a861b66b101e7efbc3d310f
+ size 4088
vocab.txt ADDED
The diff for this file is too large to render. See raw diff