alea-institute committed · verified
Commit 9b2d083 · 1 Parent(s): 1e5de17

Update README and config files - README.md

Files changed (1)
  1. README.md +184 -50
README.md CHANGED
@@ -48,61 +48,130 @@ The model demonstrates strong knowledge of contract-specific terminology and com

 ## Usage

- You can use this model for masked language modeling with the following code:

 ```python
- from transformers import AutoModelForMaskedLM, AutoTokenizer
- import torch

- # Load model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")
- model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")

- # Example 1: Credit agreement
- text = "<|cls|> This<|mask|> Credit Agreement is hereby <|sep|>"
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)

- # Get top predictions for masked token
- masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
- probs = outputs.logits[0, masked_index].softmax(dim=0)
- top_5 = torch.topk(probs, 5)

- print("Example 1 - Top 5 predictions:")
- for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
-     token = tokenizer.decode(idx).strip()
-     print(f"{i+1}. {token} ({score.item():.3f})")

 # Output:
- # Example 1 - Top 5 predictions:
- # 1. Revolving (0.158)
- # 2. Credit (0.030)
- # 3. Loan (0.029)
- # 4. Term (0.016)
- # 5. New (0.016)
-
- # Example 2: Confidentiality agreement
- text = "<|cls|> The<|mask|> Agreement contains the entire understanding between <|sep|>"
- inputs = tokenizer(text, return_tensors="pt")
- outputs = model(**inputs)
-
- # Get top predictions for masked token
- masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
- probs = outputs.logits[0, masked_index].softmax(dim=0)
- top_5 = torch.topk(probs, 5)
-
- print("\nExample 2 - Top 5 predictions:")
- for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
-     token = tokenizer.decode(idx).strip()
-     print(f"{i+1}. {token} ({score.item():.3f})")

 # Output:
- # Example 2 - Top 5 predictions:
- # 1. Confidentiality (0.207)
- # 2. Purchase (0.109)
- # 3. Subscription (0.039)
- # 4. Credit (0.037)
- # 5. Entire (0.026)

 ```

 ## Training
@@ -113,17 +182,21 @@ It leverages the KL3M tokenizer which provides 9-17% more efficient tokenization

 ## Special Tokens

- This model uses custom special tokens which must be used explicitly:

- - CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
 - MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
- - SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
 - PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
 - BOS token: `<|start|>` (ID: 0) - Beginning of sequence
 - EOS token: `<|end|>` (ID: 1) - End of sequence
 - UNK token: `<|unk|>` (ID: 3) - Unknown token

- For best results, you should **explicitly add** the CLS and SEP tokens to your input text, as shown in the examples. While the tokenizer can add these automatically in some cases, explicitly adding them ensures proper context for the model's predictions.

 ## Contract-Specific Capabilities

@@ -135,6 +208,56 @@ The model shows particularly strong performance in identifying contract-specific

 This demonstrates the model's specialized knowledge of contract structures and terminology following the additional training steps on contract data.

 ## Limitations

 While compact and specialized for contracts, this model has some limitations:
@@ -148,7 +271,7 @@ While compact and specialized for contracts, this model has some limitations:
 ## References

 - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models]() (Forthcoming)

 ## Citation

@@ -165,6 +288,17 @@ If you use this model in your research, please cite:
 }
 ```

 ## License

 This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
 
 ## Usage

+ You can use this model for masked language modeling with the simple pipeline API:

 ```python
+ from transformers import pipeline

+ # Load the fill-mask pipeline with the model
+ fill_mask = pipeline('fill-mask', model="alea-institute/kl3m-doc-pico-contracts-001")

+ # Example: Contract clause heading
+ # Note the mask token placement - directly adjacent to "AND" without space
+ text = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"
+ results = fill_mask(text)

+ # Display predictions
+ print("Top predictions:")
+ for i, result in enumerate(results[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")

+ # Output:
+ # Top predictions:
+ # 1. WARRANTIES (score: 0.7282)
+ # 2. Warranties (score: 0.1674)
+ # 3. REPRESENTATIONS (score: 0.0377)
+ # 4. warrants (score: 0.0127)
+ # 5. WARR (score: 0.0074)
+ ```
+
+ You can try additional examples:
+
+ ```python
+ # Example: Defined term
+ text2 = "<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"
+ results2 = fill_mask(text2)
+
+ # Display predictions
+ print("\nDefined Term Example - Top 5 predictions:")
+ for i, result in enumerate(results2[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")
+
+ # Output:
+ # Defined Term Example - Top 5 predictions:
+ # 1. Date (score: 0.9735)
+ # 2. Time (score: 0.0075)
+ # 3. date (score: 0.0053)
+ # 4. Dates (score: 0.0024)
+ # 5. Agreement (score: 0.0011)
+
+ # Example: Regulatory context
+ text3 = "<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"
+ results3 = fill_mask(text3)
+
+ # Display predictions
+ print("\nRegulatory Example - Top 5 predictions:")
+ for i, result in enumerate(results3[:5]):
+     print(f"{i+1}. {result['token_str']} (score: {result['score']:.4f})")

 # Output:
+ # Regulatory Example - Top 5 predictions:
+ # 1. Lending (score: 0.2906)
+ # 2. Control (score: 0.1233)
+ # 3. ment (score: 0.0519)
+ # 4. Credit (score: 0.0454)
+ # 5. Disabilities (score: 0.0451)
+ ```

+ Note how this model strongly outperforms the more general kl3m-doc-pico-001 model on contract terminology, with much higher confidence in its predictions for terms like "WARRANTIES" and "Date".
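+
+ A minimal sketch of that comparison, assuming both checkpoints are available from the Hugging Face Hub, runs the same masked prompt through each model and prints the top prediction:
+
+ ```python
+ from transformers import pipeline
+
+ # Same contract-clause prompt as above, with <|mask|> placed directly after "AND"
+ prompt = "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"
+
+ # Compare the contract-tuned checkpoint against the general pico checkpoint
+ for model_id in ["alea-institute/kl3m-doc-pico-contracts-001", "alea-institute/kl3m-doc-pico-001"]:
+     fill_mask = pipeline("fill-mask", model=model_id)
+     top = fill_mask(prompt)[0]
+     print(f"{model_id}: {top['token_str']} (score: {top['score']:.4f})")
+ ```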
+
+ For feature extraction and document similarity analysis, the pipeline API is also recommended:
+
+ ```python
+ from transformers import pipeline
+ import numpy as np
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ # Load the feature-extraction pipeline
+ extractor = pipeline('feature-extraction', model="alea-institute/kl3m-doc-pico-contracts-001", return_tensors=True)
+
+ # Example legal documents
+ texts = [
+     # Court Complaint
+     "<|cls|> IN THE UNITED STATES DISTRICT COURT FOR THE EASTERN DISTRICT OF PENNSYLVANIA\n\nJOHN DOE,\nPlaintiff,\n\nvs.\n\nACME CORPORATION,\nDefendant.\n\nCIVIL ACTION NO. 21-12345\n\nCOMPLAINT\n\nPlaintiff John Doe, by and through his undersigned counsel, hereby files this Complaint against Defendant Acme Corporation, and in support thereof, alleges as follows: <|sep|>",
+
+     # Consumer Terms
+     "<|cls|> TERMS AND CONDITIONS\n\nLast Updated: April 10, 2025\n\nThese Terms and Conditions (\"Terms\") govern your access to and use of the Service. By accessing or using the Service, you agree to be bound by these Terms. If you do not agree to these Terms, you may not access or use the Service. These Terms constitute a legally binding agreement between you and the Company. <|sep|>",
+
+     # Credit Agreement
+     "<|cls|> CREDIT AGREEMENT\n\nDated as of April 10, 2025\n\nAmong\n\nACME BORROWER INC.,\nas the Borrower,\n\nBANK OF FINANCE,\nas Administrative Agent,\n\nand\n\nTHE LENDERS PARTY HERETO\n\nThis CREDIT AGREEMENT (\"Agreement\") is entered into as of April 10, 2025, among ACME BORROWER INC., a Delaware corporation (the \"Borrower\"), each lender from time to time party hereto (collectively, the \"Lenders\"), and BANK OF FINANCE, as Administrative Agent. <|sep|>"
+ ]
+
+ # Strategy 1: CLS token embeddings
+ cls_embeddings = []
+ for text in texts:
+     features = extractor(text)
+     # Get the CLS token (first token) embedding
+     features_array = features[0].numpy() if hasattr(features[0], 'numpy') else features[0]
+     cls_embedding = features_array[0]
+     cls_embeddings.append(cls_embedding)
+
+ # Calculate similarity between documents using CLS tokens
+ cls_similarity = cosine_similarity(np.vstack(cls_embeddings))
+ print("\nDocument similarity (CLS token):")
+ print(np.round(cls_similarity, 3))
 # Output:
+ # [[1.    0.660 0.626]
+ #  [0.660 1.    0.798]
+ #  [0.626 0.798 1.   ]]
+
+ # Strategy 2: Mean pooling
+ mean_embeddings = []
+ for text in texts:
+     features = extractor(text)
+     # Average over all tokens
+     features_array = features[0].numpy() if hasattr(features[0], 'numpy') else features[0]
+     mean_embedding = np.mean(features_array, axis=0)
+     mean_embeddings.append(mean_embedding)
+
+ # Calculate similarity using mean pooling
+ mean_similarity = cosine_similarity(np.vstack(mean_embeddings))
+ print("\nDocument similarity (Mean pooling):")
+ print(np.round(mean_similarity, 3))
+ # Output:
+ # [[1.    0.733 0.825]
+ #  [0.733 1.    0.678]
+ #  [0.825 0.678 1.   ]]
 ```

 ## Training
 
 ## Special Tokens

+ This model includes the following special tokens:

+ - CLS token: `<|cls|>` (ID: 5) - Used for the beginning of input text
 - MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
+ - SEP token: `<|sep|>` (ID: 4) - Used for the end of input text
 - PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
 - BOS token: `<|start|>` (ID: 0) - Beginning of sequence
 - EOS token: `<|end|>` (ID: 1) - End of sequence
 - UNK token: `<|unk|>` (ID: 3) - Unknown token

+ Important usage notes:
+
+ When using the MASK token for predictions, be aware that this model uses a **space-prefixed BPE tokenizer**. The `<|mask|>` token should be placed **immediately** after the previous token, with **no** space, because most tokens in this tokenizer encode a leading space. For example, use `"word<|mask|>"` rather than `"word <|mask|>"`.
+
+ This space-aware placement is crucial for getting accurate predictions, as demonstrated in the test examples.
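+
+ A minimal sketch along these lines, assuming the tokenizer loads via `AutoTokenizer`, can be used to inspect both placements and confirm the special-token IDs listed above:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-contracts-001")
+
+ # Compare how the mask is tokenized with and without a preceding space
+ for text in ["REPRESENTATIONS AND<|mask|>", "REPRESENTATIONS AND <|mask|>"]:
+     print(f"{text!r} -> {tokenizer.tokenize(text)}")
+
+ # Check the special-token IDs listed above
+ for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
+     print(f"{token} -> {tokenizer.convert_tokens_to_ids(token)}")
+ ```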

 ## Contract-Specific Capabilities

 This demonstrates the model's specialized knowledge of contract structures and terminology following the additional training steps on contract data.

+ ## Standard Test Examples
+
+ Using our standardized test examples for comparing embedding models:
+
+ ### Fill-Mask Results
+
+ 1. **Contract Clause Heading**:
+    `"<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"`
+
+    Top 5 predictions:
+    1. WARRANTIES (0.7282)
+    2. Warranties (0.1674)
+    3. REPRESENTATIONS (0.0377)
+    4. warrants (0.0127)
+    5. WARR (0.0074)
+
+    Note: This model shows extremely strong performance on contract-specific language compared to the more general kl3m-doc-pico-001 model, correctly predicting "WARRANTIES" with high confidence.
+
+ 2. **Defined Term Example**:
+    `"<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"`
+
+    Top 5 predictions:
+    1. Date (0.9735)
+    2. Time (0.0075)
+    3. date (0.0053)
+    4. Dates (0.0024)
+    5. Agreement (0.0011)
+
+ 3. **Regulation Example**:
+    `"<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"`
+
+    Top 5 predictions:
+    1. Lending (0.2906)
+    2. Control (0.1233)
+    3. ment (0.0519)
+    4. Credit (0.0454)
+    5. Disabilities (0.0451)
+
+ ### Document Similarity Results
+
+ Using the standardized document examples for embeddings:
+
+ | Document Pair | Cosine Similarity (CLS token) | Cosine Similarity (Mean pooling) |
+ |---------------|-------------------------------|----------------------------------|
+ | Court Complaint vs. Consumer Terms | 0.660 | 0.733 |
+ | Court Complaint vs. Credit Agreement | 0.626 | 0.825 |
+ | Consumer Terms vs. Credit Agreement | 0.798 | 0.678 |
+
+ The contract-specialized model shows balanced similarity measurements that effectively capture document relationships. With CLS token embeddings, it identifies the greatest similarity between Consumer Terms and Credit Agreement (0.798), while with mean pooling it shows the highest similarity between Court Complaint and Credit Agreement (0.825).
+
  ## Limitations

 While compact and specialized for contracts, this model has some limitations:

 ## References

 - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models](https://arxiv.org/abs/2504.07854)

  ## Citation

 }
 ```

+ ```bibtex
+ @misc{bommarito2025kl3mdata,
+   title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
+   author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
+   year={2025},
+   eprint={2504.07854},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
 ## License

  This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).