LEXPT Law SFT (CAP subset)
Dataset Summary
LEXPT Law SFT is a supervised fine-tuning corpus for U.S. case-law analysis. It provides chat-style instruction/response records derived from public-domain judicial opinions (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:
- Case skeleton extraction (posture, issues, holdings, standards, disposition)
- Variance vs. constructive amendment analysis
- Preservation/waiver and prejudice analysis
- Habeas procedural-default framing (cause–prejudice; innocence gateway)
- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)
- IRAC drafting and advocacy point-headings (petitioner/state)
- Bluebook formatting exercises
The data are curated for base+LoRA legal assistants and are compatible with tokenizer.apply_chat_template(...) (ChatML-style roles). All opinion texts are public-domain; prompts/annotations are newly authored and released under CC-BY-4.0.
Intended Use
- Fine-tuning or LoRA-adapting general LLMs for opinion-grounded legal reasoning.
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.
Use Cases (15 task templates)
- Core extraction (case skeleton): Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.
- Variance vs. constructive amendment: Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.
- Preservation / waiver: Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.
- Prejudice analysis (variance): Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.
- Habeas framing (procedural default): Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.
- Standard of review: State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.
- Argument for petitioner/appellant: Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.
- Argument for the state/appellee: Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.
- Record checklist: Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).
- Remedies: State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.
- Hypothetical preservation: Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.
- Notice pleading in informations: Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.
- Jury-instruction angle: Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.
- Bluebook formatting: Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.
- One-page IRAC: Produce an IRAC with exact headers (Issue, Rule, Application, Conclusion) summarizing the variance/notice dispute and the court’s reasoning.
Data Structure
Record Schema
Field | Type | Description |
---|---|---|
id | str | Unique identifier (e.g., ridgeway_habeas_0001). |
case_name | str | Case caption (e.g., “Ridgeway v. Hutto”). |
court | str | Court (e.g., “8th Cir.”). |
year | int | Decision year. |
jurisdiction | str | “federal” or “state”. |
prompt_type | str | One of the 15 task categories (see Use Cases). |
opinion_text | str | Public-domain opinion excerpt used as context. |
messages | list | ChatML-style messages, each an object with a role (system, user, or assistant) and a content string (see Example Record). |
source_ref | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |
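A record can be checked against the schema above with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the names `EXPECTED_FIELDS` and `validate_record` are illustrative, and the check on `jurisdiction` assumes the two values listed in the table are the only valid ones.

```python
from typing import Any

# Field names and types copied from the Record Schema table.
EXPECTED_FIELDS: dict[str, type] = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    # Assumption: only the two values named in the schema are valid.
    if record.get("jurisdiction") not in ("federal", "state"):
        problems.append("jurisdiction: expected 'federal' or 'state'")
    # Each message needs a string role and string content.
    messages = record.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not (isinstance(msg, dict)
                    and isinstance(msg.get("role"), str)
                    and isinstance(msg.get("content"), str)):
                problems.append(f"messages[{i}]: needs string 'role' and 'content'")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in a record during a bulk check.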
Example Record
{
  "id": "ridgeway_habeas_0001",
  "case_name": "Ridgeway v. Hutto",
  "court": "8th Cir.",
  "year": 1973,
  "jurisdiction": "federal",
  "prompt_type": "core_extraction",
  "opinion_text": "…public-domain opinion excerpt…",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
    },
    {
      "role": "user",
      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
    },
    {
      "role": "assistant",
      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
    }
  ],
  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
Splits
- train: update after upload
- validation: update after upload
- test (optional): update after upload
Split policy: Keep all tasks derived from the same case in a single split (train, validation, or test) to avoid leakage.
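One way to honor this policy is to assign splits per case rather than per record. The sketch below is an illustration under stated assumptions, not the procedure used to build this dataset: it assumes `case_name` identifies a case uniquely, the 80/10/10 proportions are arbitrary defaults, and `assign_split` and `split_records` are hypothetical helper names.

```python
import hashlib


def assign_split(case_name: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case to a split using a stable hash of its caption."""
    key = case_name.strip().lower().encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_records(records: list[dict]) -> dict[str, list[dict]]:
    """Group records so that every task for one case lands in the same split."""
    splits: dict[str, list[dict]] = {"train": [], "validation": [], "test": []}
    for rec in records:
        splits[assign_split(rec["case_name"])].append(rec)
    return splits
```

Hashing the caption keeps the assignment deterministic across runs, so records added for an existing case later fall into the split that case already occupies.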
How to Use
Load with 🤗 Datasets
from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft") # replace with your repo id
print(ds)
print(ds["train"][0])
Use with Chat Templates (Transformers)
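from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b") # or your base
sample = ds["train"][0]["messages"] # ds comes from the load_dataset example above
# Inference prompt: drop the reference answer (assumed to be the last message), then open a new assistant turn.
prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)
# Training text: keep every turn and do not add a generation prompt.
text = tok.apply_chat_template(sample, add_generation_prompt=False, tokenize=False)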
Curation & Construction
- Sources: public-domain opinions (e.g., CAP).
- Selection: appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
- Annotation: prompts and answers authored by legal-knowledgeable contributors; emphasis on final-answer-only style.
- Preprocessing: remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
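The whitespace/quote normalization and near-duplicate removal described above could look roughly like the following. This is a sketch only: the card does not say how de-duplication was actually done, so the use of `difflib.SequenceMatcher`, the 0.95 threshold, and the helper names `normalize` and `dedupe` are all assumptions. Note that this version collapses every run of whitespace, including paragraph breaks, to a single space.

```python
import re
from difflib import SequenceMatcher

# Curly quotes mapped to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text: str) -> str:
    """Straighten quotes and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def dedupe(snippets: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first of any group of near-identical snippets (pairwise, O(n^2))."""
    kept: list[str] = []
    for raw in snippets:
        clean = normalize(raw)
        if not any(SequenceMatcher(None, clean, k).ratio() >= threshold for k in kept):
            kept.append(clean)
    return kept
```

The pairwise comparison is fine for a few thousand excerpts; a larger corpus would likely need hashing or shingling instead.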
Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
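The formatting-adherence part of these spot checks lends itself to automation. The sketch below shows two hypothetical checks (`has_irac_headers`, `has_numbered_items`), assuming IRAC headers each start a line and that core-extraction answers use the `1) … 5)` numbering seen in the Example Record; it does not reproduce the maintainers' actual QC process.

```python
import re

IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")


def has_irac_headers(answer: str) -> bool:
    """True if the four IRAC headers each start a line and can be found in order."""
    position = 0
    for header in IRAC_HEADERS:
        match = re.compile(rf"^{header}\b", re.MULTILINE).search(answer, position)
        if match is None:
            return False
        position = match.end()  # the next header must come after this one
    return True


def has_numbered_items(answer: str, count: int = 5) -> bool:
    """True if the lines starting with 'N)' are numbered exactly 1..count, in order."""
    numbers = [int(n) for n in re.findall(r"^(\d+)\)", answer, flags=re.MULTILINE)]
    return numbers == list(range(1, count + 1))
```

Checks like these cover structure only; factual alignment with the opinion excerpt still needs human review.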
Ethical Considerations & Limitations
- Not legal advice. This dataset trains formatting and structure for legal analysis; always verify with primary sources.
- Coverage: U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
- Model risk: Misstatements of doctrine or miscitation can occur; downstream users should validate.
- Bias: Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.
Licensing
- Opinion texts: Public domain (as supplied by CAP and similar sources).
- Prompts & annotations: © 2025 sik247, released under CC-BY-4.0.
- When redistributing, include attribution: “sik247 / LEXPT Law SFT (CAP subset)”.
Citation
If you use this dataset, please cite:
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.
Maintainer
- Author/Maintainer: sik247
- Issues/requests: open a Discussion on the dataset page.
Changelog
- v1.0: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions are planned for subsequent versions.