	---
	license: apache-2.0
	base_model: bert-base-uncased
	tags:
	- text-classification
	- industrial-policy
	- economics
	- policy-analysis
	- bert
	- government-policy
	- trade-policy
	language:
	- en
	pipeline_tag: text-classification
	widget:
	- text: "Government provides subsidies to promote renewable energy development"
	  example_title: "IP goal Example"
	- text: "Company announces quarterly earnings report"
	  example_title: "No IP goal Example"
	- text: "The document mentions policy changes"
	  example_title: "Not enough information Example"
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	library_name: transformers
	---

	# Industrial Policy Classification Model v1.0

	This model classifies text documents to determine whether they describe industrial policy goals. It was fine-tuned from bert-base-uncased on a dataset of policy documents and measures.

	Accompanies the paper:

	Juhász, Réka, Lane, Nathan J., Oehlsen, Emily, and Perez, Veronica C. (2025). Measuring Industrial Policy: A Text-Based Approach. National Bureau of Economic Research. Available at: https://www.nber.org/papers/w33895

	The output data is available at: industrialpolicydata.com

	## Model Description

	This is a BERT-based text classification model trained to identify industrial policy intentions in text. The model can classify text into 3 categories:

	- IP goal (0): Text describes an industrial policy objective or intervention
	- No IP goal (1): Text does not describe an industrial policy objective
	- Not enough information (2): Insufficient information to determine policy intent


	The model was trained on expert-annotated policy documents. The input data for this project was provided in 2023 by the Global Trade Alert project; see Global Trade Alert (2025), available at: https://www.globaltradealert.org/

	## Intended Use

	This model is designed for research purposes to analyze policy documents, government measures, and related texts to identify industrial policy intentions. It can be used by:

	- Economics researchers studying industrial policy
	- Policy analysts examining government interventions
	- Data scientists working with policy text classification
	- Government agencies analyzing policy effectiveness

	## Model Performance

	- Accuracy: 0.941
	- F1 Score: 0.941
	- Precision: 0.941
	- Recall: 0.941
	- Test Loss: 0.2886

	Metrics were evaluated on a held-out test set.
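
The card does not state how precision, recall, and F1 were averaged across the three classes. One observation that may help interpret the identical values: for single-label classification, micro-averaged precision, recall, and F1 all equal accuracy. The sketch below illustrates this with made-up labels, not the model's test set; the function name `micro_metrics` is hypothetical.

```python
from collections import Counter

def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall and F1, plus accuracy, for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # true class t was missed
    total_tp = sum(tp[c] for c in labels)
    total_fp = sum(fp[c] for c in labels)
    total_fn = sum(fn[c] for c in labels)
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = total_tp / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only:
# 0 = IP goal, 1 = No IP goal, 2 = Not enough information
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 0, 0, 1]
metrics = micro_metrics(y_true, y_pred)
print(metrics)
```

Every misclassification counts once as a false positive and once as a false negative, which is why the four numbers coincide under micro averaging. Macro or weighted averaging would generally give different values for precision and F1.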

	## Training Data

	The model was trained on expert-annotated policy documents. The input data for this project was provided by the Global Trade Alert project.

	## Training Procedure

	### Model Architecture
	- Base model: bert-base-uncased
	- Architecture: BertForSequenceClassification
	- Number of labels: 3
	- Fine-tuning approach: Full model fine-tuning with classification head

	### Training Configuration
	- Optimization: Hyperparameter tuning with Optuna
	- Data balancing: Oversampling to address class imbalance
	- Validation strategy: Stratified splits, plus income-based validation
	- Cross-validation: Validation across country income groups to test generalization
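
The card says oversampling was applied but not which variant. A minimal sketch of random oversampling follows, assuming minority-class examples are duplicated at random until every class matches the largest one. The name `random_oversample` and the toy rows are hypothetical, and this is not necessarily the authors' procedure.

```python
import random
from collections import defaultdict

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until all classes are equally frequent."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())
    out = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((text, label) for text in items + extra)
    rng.shuffle(out)
    return [t for t, _ in out], [l for _, l in out]

# Toy data for illustration only.
texts = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 2]
balanced_texts, balanced_labels = random_oversample(texts, labels)
print(sorted(balanced_labels))
```

Oversampling of this kind is normally applied to the training split only, so that duplicated examples do not leak into validation or test data.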

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	# Load model and tokenizer
	model_name = "industrialpolicygroup/industrialpolicy-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Create classification pipeline
	classifier = pipeline(
	    "text-classification",
	    model=model,
	    tokenizer=tokenizer,
	)

	# Example usage
	text = "Government provides subsidies to promote renewable energy development"
	result = classifier(text)
	print(result)

	# Expected output format:
	# [{'label': 'LABEL_0', 'score': 0.95}]
	#
	# Label mappings (see Model Description):
	# LABEL_0: IP goal
	# LABEL_1: No IP goal
	# LABEL_2: Not enough information
	```
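
The pipeline reports generic names such as `LABEL_0`. A small helper, sketched here under the assumption that the hosted config keeps those default names, maps them to the three categories listed under Model Description. `ID2NAME` and `name_predictions` are hypothetical names and not part of the released model.

```python
# Category names taken from the Model Description section of this card.
ID2NAME = {
    "LABEL_0": "IP goal",
    "LABEL_1": "No IP goal",
    "LABEL_2": "Not enough information",
}

def name_predictions(results):
    """Replace generic pipeline labels with readable category names."""
    return [
        {"label": ID2NAME.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

# Illustrative pipeline output (format only, not a real model score).
example = [{"label": "LABEL_0", "score": 0.95}]
print(name_predictions(example))
```

With the `classifier` defined in the snippet above, a call such as `name_predictions(classifier(text, truncation=True, max_length=512))` should also keep long inputs within the 512-token limit; unknown labels are passed through unchanged.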

	## Limitations and Bias

	- The model is trained primarily on English text from the Global Trade Alert project
	- Performance may vary on policy domains not well-represented in training data
	- The model reflects the annotation guidelines and may not capture all nuances of industrial policy
	- The model may be biased toward the types of policy language present in the training data
	- May require domain adaptation for highly specialized policy areas

	## Evaluation and Validation

	The model was evaluated using:
	- Standard train/validation/test splits
	- Income-based validation across country groups
	- Cross-domain evaluation on different policy types
	- Comparison with traditional machine learning baselines
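
The card does not describe how the income-based validation was run. One common way to test generalization across groups is to hold out each group in turn. The sketch below assumes each record carries a hypothetical `income_group` field, and it is an illustration, not the authors' procedure.

```python
def leave_one_group_out(records, group_key="income_group"):
    """Yield (held_out_group, train_records, test_records) for each distinct group."""
    groups = sorted({r[group_key] for r in records})
    for held_out in groups:
        train = [r for r in records if r[group_key] != held_out]
        test = [r for r in records if r[group_key] == held_out]
        yield held_out, train, test

# Toy records for illustration only.
records = [
    {"text": "measure A", "label": 0, "income_group": "high"},
    {"text": "measure B", "label": 1, "income_group": "high"},
    {"text": "measure C", "label": 0, "income_group": "middle"},
    {"text": "measure D", "label": 2, "income_group": "low"},
]
splits = list(leave_one_group_out(records))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

A model trained on each `train` portion and scored on the matching `test` portion shows whether performance holds up on an income group it never saw during training.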

	## Ethical Considerations

	This model is intended for research and analysis purposes. Users should be aware that:
	- Policy classification can have implications for economic research and policy recommendations
	- The model's outputs should be interpreted by domain experts
	- Results should be validated against human expert judgment for critical applications

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@techreport{NBERw33895,
	title = "Measuring Industrial Policy: A Text-Based Approach",
	author = "Juhász, Réka and Lane, Nathan J and Oehlsen, Emily and Perez, Veronica C",
	institution = "National Bureau of Economic Research",
	type = "Working Paper",
	series = "Working Paper Series",
	number = "33895",
	year = "2025",
	month = "June",
	doi = {10.3386/w33895},
	URL = "http://www.nber.org/papers/w33895",
	abstract = {Since the 18th century, policymakers have debated the merits of industrial policy (IP). Yet, economists lack basic facts about its use due to measurement challenges. We propose a new approach to IP measurement based on information contained in policy text. We show how off-the-shelf supervised machine learning tools can be used to categorize industrial policies at scale. Using this approach, we validate longstanding concerns with earlier approaches to measurement which conflate IP with other types of policy. We apply our methodology to a global database of commercial policy descriptions, and provide a first look at IP use at the country, industry, and year levels (2010-2022). The new data on IP suggest that i) IP is on the rise; ii) modern IP tends to use subsidies and export promotion measures as opposed to tariffs; iii) rich countries heavily dominate IP use; iv) IP tends to target sectors with an established comparative advantage, particularly in high-income countries.},
	}
	```

	## Model Details

	- Developed by: Industrial Policy Group
	- Model type: Text Classification (BERT-based)
	- Language: English
	- License: Apache 2.0
	- Fine-tuned from: bert-base-uncased

	## Technical Specifications

	### Architecture Details
	- Model Type: BERT
	- Architecture Class: BertForSequenceClassification
	- Transformers Version: 4.52.4

	### Model Dimensions
	- Vocabulary Size: 30,522
	- Hidden Size: 768
	- Number of Attention Heads: 12
	- Number of Hidden Layers: 12
	- Intermediate Size: 3,072
	- Max Position Embeddings: 512

	### Dropout and Initialization Settings
	- Hidden Dropout Probability: 0.1
	- Attention Dropout Probability: 0.1
	- Layer Norm Epsilon: 1e-12
	- Initializer Range: 0.02

	### Classification Configuration
	- Number of Labels: 3
	- Problem Type: single_label_classification
	- Padding Token ID: 0
	- Position Embedding Type: absolute
	- Torch Dtype: float32
	- Use Cache: True

	### Model Size and Requirements
	- Model Size: ~109M parameters (~418MB on disk)
	- Input: Text (up to 512 tokens)
	- Output: Classification probabilities for 3 classes
	- Framework: PyTorch + Transformers
	- Precision: float32
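
The parameter count and file size follow from the dimensions listed under Model Dimensions. A quick back-of-the-envelope check, assuming the standard BERT layout (two token-type embeddings, a pooler layer, and a linear classification head over 3 labels), none of which the card states explicitly:

```python
vocab, hidden, layers, inter, max_pos, num_labels = 30522, 768, 12, 3072, 512, 3
type_vocab = 2  # assumption: standard BERT segment embeddings

layer_norm = 2 * hidden  # weight and bias vectors
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V and output projections
feed_forward = (hidden * inter + inter) + (inter * hidden + hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

total_params = embeddings + encoder + pooler + classifier
size_mib = total_params * 4 / 2**20  # 4 bytes per float32 weight
print(total_params, round(size_mib))
```

Under these assumptions the total is 109,484,547 parameters, or about 418 MiB at float32, which agrees with the figures listed above.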

	## Citations for source data

	Global Trade Alert (2025). Global Trade Alert Database. Available at: https://www.globaltradealert.org/


	## Contact

	For questions about this model or the research, please contact the Industrial Policy Group.

	---

	Model card auto-generated on 2025-06-19 14:07:03 from model files
	Source model: bert-base-uncased-3_classes-finetuned_hub_ready_20250617_151525