alea-institute committed · verified
Commit 9b2d083 · 1 Parent(s): 1e5de17

Update README and config files - README.md

Files changed (1)
  1. README.md +184 -50
README.md CHANGED
@@ -48,61 +48,130 @@ The model demonstrates strong knowledge of contract-specific terminology and com

 ## Usage

- You can use this model for masked language modeling with the following code:

 ```python
- from transformers import AutoModelForMaskedLM, AutoTokenizer
- import torch

- # Load model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")
- model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")

- # Example 1: Credit agreement
- text = "<|cls|> This<|mask|> Credit Agreement is hereby <|sep|>"
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)

- # Get top predictions for masked token
- masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
- probs = outputs.logits[0, masked_index].softmax(dim=0)
- top_5 = torch.topk(probs, 5)

- print("Example 1 - Top 5 predictions:")
- for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
-     token = tokenizer.decode(idx).strip()
-     print(f"{i+1}. {token} ({score.item():.3f})")

 # Output:
- # Example 1 - Top 5 predictions:
- # 1. Revolving (0.158)
- # 2. Credit (0.030)
- # 3. Loan (0.029)
- # 4. Term (0.016)
- # 5. New (0.016)
-
- # Example 2: Confidentiality agreement
- text = "<|cls|> The<|mask|> Agreement contains the entire understanding between <|sep|>"
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)
-
- # Get top predictions for masked token
- masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
- probs = outputs.logits[0, masked_index].softmax(dim=0)
- top_5 = torch.topk(probs, 5)
-
- print("\nExample 2 - Top 5 predictions:")
- for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
-     token = tokenizer.decode(idx).strip()
-     print(f"{i+1}. {token} ({score.item():.3f})")

 # Output:
- # Example 2 - Top 5 predictions:
- # 1. Confidentiality (0.207)
- # 2. Purchase (0.109)
- # 3. Subscription (0.039)
- # 4. Credit (0.037)
- # 5. Entire (0.026)

 ```

 ## Training
@@ -113,17 +182,21 @@ It leverages the KL3M tokenizer which provides 9-17% more efficient tokenization

 ## Special Tokens

- This model uses custom special tokens which must be used explicitly:

- - CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
 - MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
- - SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
 - PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
 - BOS token: `<|start|>` (ID: 0) - Beginning of sequence
 - EOS token: `<|end|>` (ID: 1) - End of sequence
 - UNK token: `<|unk|>` (ID: 3) - Unknown token

- For best results, you should **explicitly add** the CLS and SEP tokens to your input text, as shown in the examples. While the tokenizer can add these automatically in some cases, explicitly adding them ensures proper context for the model's predictions.

 ## Contract-Specific Capabilities

@@ -135,6 +208,56 @@ The model shows particularly strong performance in identifying contract-specific

 This demonstrates the model's specialized knowledge of contract structures and terminology following the additional training steps on contract data.

 ## Limitations

 While compact and specialized for contracts, this model has some limitations:
@@ -148,7 +271,7 @@ While compact and specialized for contracts, this model has some limitations:
 ## References

 - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models]() (Forthcoming)

 ## Citation

@@ -165,6 +288,17 @@ If you use this model in your research, please cite:
 }
 ```

 ## License

 This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
 
 ## Usage

+ You can use this model for masked language modeling with the simple pipeline API:

 ```python
+ from transformers import pipeline

+ # Load the fill-mask pipeline with the model
+ fill_mask = pipeline('fill-mask', model="alea-institute/kl3m-doc-pico-contracts-001")

+ # Example: Contract clause heading
+ # Note the mask token placement - directly adjacent to "AND" without space
+ text = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"
+ results = fill_mask(text)

+ # Display predictions
+ print("Top predictions:")
+ for i, result in enumerate(results[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")

+ # Output:
+ # Top predictions:
+ # 1. WARRANTIES (score: 0.7282)
+ # 2. Warranties (score: 0.1674)
+ # 3. REPRESENTATIONS (score: 0.0377)
+ # 4. warrants (score: 0.0127)
+ # 5. WARR (score: 0.0074)
+ ```
+
+ You can try additional examples:
+
+ ```python
+ # Example: Defined term
+ text2 = "<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"
+ results2 = fill_mask(text2)
+
+ # Display predictions
+ print("\nDefined Term Example - Top 5 predictions:")
+ for i, result in enumerate(results2[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")
+
+ # Output:
+ # Defined Term Example - Top 5 predictions:
+ # 1. Date (score: 0.9735)
+ # 2. Time (score: 0.0075)
+ # 3. date (score: 0.0053)
+ # 4. Dates (score: 0.0024)
+ # 5. Agreement (score: 0.0011)
+
+ # Example: Regulatory context
+ text3 = "<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"
+ results3 = fill_mask(text3)
+
+ # Display predictions
+ print("\nRegulatory Example - Top 5 predictions:")
+ for i, result in enumerate(results3[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")

 # Output:
+ # Regulatory Example - Top 5 predictions:
+ # 1. Lending (score: 0.2906)
+ # 2. Control (score: 0.1233)
+ # 3. ment (score: 0.0519)
+ # 4. Credit (score: 0.0454)
+ # 5. Disabilities (score: 0.0451)
+ ```

+ Note how this model strongly outperforms the more general kl3m-doc-pico-001 model on contract terminology, with much higher confidence in its predictions for terms like "WARRANTIES" and "Date".
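+
+ A minimal sketch of that comparison, assuming both checkpoints are available from the Hugging Face Hub, runs the same masked prompt through each model and prints the top prediction:
+
+ ```python
+ from transformers import pipeline
+
+ # Same contract-clause prompt as above, with <|mask|> placed directly after "AND"
+ prompt = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"
+
+ # Compare the contract-tuned checkpoint against the general pico checkpoint
+ for model_id in ["alea-institute/kl3m-doc-pico-contracts-001", "alea-institute/kl3m-doc-pico-001"]:
+     fill_mask = pipeline("fill-mask", model=model_id)
+     top = fill_mask(prompt)[0]
+     print(f"{model_id}: {top['token_str']} (score: {top['score']:.4f})")
+ ```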
+
+ For feature extraction and document similarity analysis, the pipeline API is also recommended:
+
+ ```python
+ from transformers import pipeline
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # Load the feature-extraction pipeline
+ extractor = pipeline('feature-extraction', model="alea-institute/kl3m-doc-pico-contracts-001", return_tensors=True)
+
+ # Example legal documents
+ texts = [
+     # Court Complaint
+     "<|cls|> IN THE UNITED STATES DISTRICT COURT FOR THE EASTERN DISTRICT OF PENNSYLVANIA\n\nJOHN DOE,\nPlaintiff,\n\nvs.\n\nACME CORPORATION,\nDefendant.\n\nCIVIL ACTION NO. 21-12345\n\nCOMPLAINT\n\nPlaintiff John Doe, by and through his undersigned counsel, hereby files this Complaint against Defendant Acme Corporation, and in support thereof, alleges as follows: <|sep|>",
+
+     # Consumer Terms
+     "<|cls|> TERMS AND CONDITIONS\n\nLast Updated: April 10, 2025\n\nThese Terms and Conditions (\"Terms\") govern your access to and use of the Service. By accessing or using the Service, you agree to be bound by these Terms. If you do not agree to these Terms, you may not access or use the Service. These Terms constitute a legally binding agreement between you and the Company. <|sep|>",
+
+     # Credit Agreement
+     "<|cls|> CREDIT AGREEMENT\n\nDated as of April 10, 2025\n\nAmong\n\nACME BORROWER INC.,\nas the Borrower,\n\nBANK OF FINANCE,\nas Administrative Agent,\n\nand\n\nTHE LENDERS PARTY HERETO\n\nThis CREDIT AGREEMENT (\"Agreement\") is entered into as of April 10, 2025, among ACME BORROWER INC., a Delaware corporation (the \"Borrower\"), each lender from time to time party hereto (collectively, the \"Lenders\"), and BANK OF FINANCE, as Administrative Agent. <|sep|>"
+ ]
+
+ # Strategy 1: CLS token embeddings
+ cls_embeddings = []
+ for text in texts:
+     features = extractor(text)
+     # Get the CLS token (first token) embedding
+     features_array = features[0].numpy() if hasattr(features[0], 'numpy') else features[0]
+     cls_embedding = features_array[0]
+     cls_embeddings.append(cls_embedding)
+
+ # Calculate similarity between documents using CLS tokens
+ cls_similarity = cosine_similarity(np.vstack(cls_embeddings))
+ print("\nDocument similarity (CLS token):")
+ print(np.round(cls_similarity, 3))
 # Output:
+ # [[1.    0.660 0.626]
+ #  [0.660 1.    0.798]
+ #  [0.626 0.798 1.   ]]
+
+ # Strategy 2: Mean pooling
+ mean_embeddings = []
+ for text in texts:
+     features = extractor(text)
+     # Average over all tokens
+     features_array = features[0].numpy() if hasattr(features[0], 'numpy') else features[0]
+     mean_embedding = np.mean(features_array, axis=0)
+     mean_embeddings.append(mean_embedding)
+
+ # Calculate similarity using mean pooling
+ mean_similarity = cosine_similarity(np.vstack(mean_embeddings))
+ print("\nDocument similarity (Mean pooling):")
+ print(np.round(mean_similarity, 3))
+ # Output:
+ # [[1.    0.733 0.825]
+ #  [0.733 1.    0.678]
+ #  [0.825 0.678 1.   ]]
 ```

 ## Training
 
 ## Special Tokens

+ This model includes the following special tokens:

+ - CLS token: `<|cls|>` (ID: 5) - Used for the beginning of input text
 - MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
+ - SEP token: `<|sep|>` (ID: 4) - Used for the end of input text
 - PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
 - BOS token: `<|start|>` (ID: 0) - Beginning of sequence
 - EOS token: `<|end|>` (ID: 1) - End of sequence
 - UNK token: `<|unk|>` (ID: 3) - Unknown token

+ Important usage notes:
+
+ When using the MASK token for predictions, be aware that this model uses a **space-prefixed BPE tokenizer**. The `<|mask|>` token should be placed **immediately** after the previous token, with **no** space, because most tokens in this tokenizer encode a leading space. For example, use `"word<|mask|>"` rather than `"word <|mask|>"`.
+
+ This space-aware placement is crucial for getting accurate predictions, as demonstrated in the test examples.
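+
+ A minimal sketch along these lines, assuming the tokenizer loads via `AutoTokenizer`, can be used to inspect both placements and confirm the special-token IDs listed above:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")
+
+ # Compare how the mask is tokenized with and without a preceding space
+ for text in ["REPRESENTATIONS AND<|mask|>", "REPRESENTATIONS AND <|mask|>"]:
+     print(f"{text!r} -> {tokenizer.tokenize(text)}")
+
+ # Check the special-token IDs listed above
+ for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
+     print(f"{token} -> {tokenizer.convert_tokens_to_ids(token)}")
+ ```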

 ## Contract-Specific Capabilities

 This demonstrates the model's specialized knowledge of contract structures and terminology following the additional training steps on contract data.

+ ## Standard Test Examples
+
+ Using our standardized test examples for comparing embedding models:
+
+ ### Fill-Mask Results
+
+ 1. **Contract Clause Heading**:
+    `"<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"`
+
+    Top 5 predictions:
+    1. WARRANTIES (0.7282)
+    2. Warranties (0.1674)
+    3. REPRESENTATIONS (0.0377)
+    4. warrants (0.0127)
+    5. WARR (0.0074)
+
+    Note: This model shows extremely strong performance on contract-specific language compared to the more general kl3m-doc-pico-001 model, correctly predicting "WARRANTIES" with high confidence.
+
+ 2. **Defined Term Example**:
+    `"<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"`
+
+    Top 5 predictions:
+    1. Date (0.9735)
+    2. Time (0.0075)
+    3. date (0.0053)
+    4. Dates (0.0024)
+    5. Agreement (0.0011)
+
+ 3. **Regulation Example**:
+    `"<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"`
+
+    Top 5 predictions:
+    1. Lending (0.2906)
+    2. Control (0.1233)
+    3. ment (0.0519)
+    4. Credit (0.0454)
+    5. Disabilities (0.0451)
+
+ ### Document Similarity Results
+
+ Using the standardized document examples for embeddings:
+
+ | Document Pair | Cosine Similarity (CLS token) | Cosine Similarity (Mean pooling) |
+ |---------------|-------------------------------|----------------------------------|
+ | Court Complaint vs. Consumer Terms | 0.660 | 0.733 |
+ | Court Complaint vs. Credit Agreement | 0.626 | 0.825 |
+ | Consumer Terms vs. Credit Agreement | 0.798 | 0.678 |
+
+ The contract-specialized model shows balanced similarity measurements that effectively capture document relationships. With CLS token embeddings, it identifies the greatest similarity between Consumer Terms and Credit Agreement (0.798), while with mean pooling it shows the highest similarity between Court Complaint and Credit Agreement (0.825).
+
  ## Limitations

 While compact and specialized for contracts, this model has some limitations:

 ## References

 - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models](https://arxiv.org/abs/2504.07854)

  ## Citation

 }
 ```

+ ```bibtex
+ @misc{bommarito2025kl3mdata,
+   title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
+   author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
+   year={2025},
+   eprint={2504.07854},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
 ## License

  This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).