Boilerplate Detection for Financial Text
This model identifies boilerplate (formulaic, repetitive) language in financial analyst reports and distinguishes it from substantive business content.
Model Description
The model uses a frozen sentence transformer (all-mpnet-base-v2) combined with a lightweight classification head to identify boilerplate text segments. The training data come from analyst reports published between 2000 and 2020, with boilerplate defined as segments that are frequently repeated across reports from the same brokerage house. To construct the training dataset, we sampled reports and counted how often each segment was repeated. A segment is labeled as a positive (boilerplate) example if it is among the top 10% most frequently repeated segments and is used at least five times by the same broker within the same year. Negative examples are segments with no repetition, selected at random from each broker-year sample.
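The labeling rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' data pipeline: the function name `label_segments`, the parameter `n_negatives_per_group`, the toy records, and the choice to rank the top 10% over distinct segments within each broker-year group are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def label_segments(records, top_fraction=0.10, min_count=5,
                   n_negatives_per_group=3, seed=0):
    """Split segments into positive (boilerplate) and negative examples.

    records: iterable of (broker, year, segment) tuples.
    Returns two lists of (broker, year, segment) tuples.
    """
    # Count how often each segment appears within each broker-year sample.
    counts = defaultdict(Counter)
    for broker, year, segment in records:
        counts[(broker, year)][segment] += 1

    rng = random.Random(seed)
    positives, negatives = [], []
    for (broker, year), counter in counts.items():
        ranked = counter.most_common()  # most frequent first
        # Assumption: the top 10% is taken over distinct segments in the group.
        n_top = max(1, int(len(ranked) * top_fraction))
        for segment, count in ranked[:n_top]:
            if count >= min_count:
                positives.append((broker, year, segment))
        # Negatives: segments that were never repeated, sampled at random.
        unique = [segment for segment, count in ranked if count == 1]
        k = min(n_negatives_per_group, len(unique))
        for segment in rng.sample(unique, k):
            negatives.append((broker, year, segment))
    return positives, negatives


# Toy data (made up): one disclaimer repeated six times plus nine one-off sentences.
records = [("BrokerA", 2010, "disclaimer text")] * 6
records += [("BrokerA", 2010, f"analysis sentence {i}") for i in range(9)]
positives, negatives = label_segments(records)
print(positives)
print(len(negatives))
```

In this toy group the repeated disclaimer is the only segment that is both in the top 10% and used at least five times, so it is the only positive. The negatives are drawn from the nine unrepeated sentences.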
The architecture combines mean-pooled embeddings from the sentence transformer with a simple 3-layer neural network (768 → 16 → 8 → 2) for classification.
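To make the layer sizes concrete, here is a minimal PyTorch sketch of mean pooling followed by a 768 → 16 → 8 → 2 head. It is not the released `BoilerplateDetector` code: the class name `HeadSketch`, the ReLU activations, and the absence of dropout are assumptions, and random tensors stand in for the frozen transformer's token embeddings.

```python
import torch
import torch.nn as nn


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # (B, 1)
    return summed / counts


class HeadSketch(nn.Module):
    """Hypothetical 3-layer head: 768 -> 16 -> 8 -> 2 (activations assumed)."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 2),
        )

    def forward(self, token_embeddings, attention_mask):
        return self.net(mean_pool(token_embeddings, attention_mask))


# Random stand-in for transformer output: batch of 2, 5 tokens, 768 dims.
torch.manual_seed(0)
token_embeddings = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
head = HeadSketch()
logits = head(token_embeddings, attention_mask)
print(logits.shape)  # torch.Size([2, 2])
```

Because the sentence transformer is frozen, only a small head like this one would need to be trained, which keeps the number of trainable parameters low.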
Usage
Since this model uses a custom architecture, you need to use the direct loading approach rather than the pipeline interface:
import sys
import huggingface_hub
from transformers import AutoTokenizer
import torch
# Load model components
model_path = huggingface_hub.snapshot_download('maifeng/boilerplate_detection')
sys.path.insert(0, model_path)
from modeling_boilerplate import BoilerplateDetector, BoilerplateConfig
# Initialize model
config = BoilerplateConfig.from_pretrained('maifeng/boilerplate_detection')
model = BoilerplateDetector.from_pretrained('maifeng/boilerplate_detection')
tokenizer = AutoTokenizer.from_pretrained('maifeng/boilerplate_detection')
# Move model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()
# Classify texts
texts = [
    "The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors. This material is not intended as an offer or solicitation for the purchase or sale of any security or other financial instrument.",
    "Morgan Stanley & Co. LLC and its affiliates disclaim any and all liability relating to these materials, including, without limitation, any express or implied representations or warranties for statements or errors contained in, or omissions from, these materials.",
    "And while we acknowledge the company has made significant progress on the cost side, Harman will have to consistently execute on those cost cutting initiatives for the next several quarters to help prop-up its low-price and low-margin customized business.",
    "Microsoft's Azure cloud revenue grew 29% year-over-year in constant currency, with particular strength in AI services where usage increased 180% quarter-over-quarter. The company signed 15 new enterprise AI contracts worth over $100 million each during the quarter."
]
# Classification threshold (default 0.5, can be adjusted based on precision/recall requirements)
threshold = 0.5
results = []
for text in texts:
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to device
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
    boilerplate_prob = probs[1].item()
    label = 'BOILERPLATE' if boilerplate_prob > threshold else 'NOT_BOILERPLATE'
    results.append({'text': text, 'label': label, 'boilerplate_probability': boilerplate_prob})
for result in results:
    print(f"{result['label']:>15}: {result['boilerplate_probability']:.3f} - {result['text'][:80]}...")
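The threshold comment in the example above mentions trading off precision against recall. One way to explore that trade-off is sketched below, assuming you have boilerplate probabilities for a small hand-labeled validation set. The helper `precision_recall_at` is hypothetical, and the probabilities and labels are made up for illustration; they are not outputs of this model.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall for the BOILERPLATE class at a given threshold."""
    predicted = [p > threshold for p in probs]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    # Convention: report 1.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up validation data: probabilities and gold labels (1 = boilerplate).
val_probs = [0.95, 0.80, 0.60, 0.40, 0.10]
val_labels = [1, 1, 0, 1, 0]

for t in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(val_probs, val_labels, t)
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

A higher threshold flags fewer segments as boilerplate, which tends to raise precision at the cost of recall. A lower threshold does the opposite.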
Citation
If you find the model useful, please cite:
@article{li2025dissecting,
title={Dissecting Corporate Culture Using Generative AI},
author={Li, Kai and Mai, Feng and Shen, Rui and Yang, Chelsea and Zhang, Tengfei},
journal={Review of Financial Studies},
year={2025}
}