LEXPT Law SFT (CAP subset)

Dataset Summary

LEXPT Law SFT is a supervised fine-tuning corpus for U.S. case-law analysis. It provides chat-style instruction/response records built from public-domain judicial opinions (e.g., from the Caselaw Access Project, “CAP”) paired with lawyer-authored prompts that target appellate and habeas skills:

  • Case skeleton extraction (posture, issues, holdings, standards, disposition)
  • Variance vs. constructive amendment analysis
  • Preservation/waiver and prejudice analysis
  • Habeas procedural-default framing (cause–prejudice; innocence gateway)
  • Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)
  • IRAC drafting and advocacy point-headings (petitioner/state)
  • Bluebook formatting exercises

The data are curated for training legal assistants built as a base model plus a LoRA adapter. Each record's messages list can be passed directly to tokenizer.apply_chat_template(...) (ChatML-style roles). All opinion texts are public-domain; prompts/annotations are newly authored and released under CC-BY-4.0.


Intended Use

  • Fine-tuning or LoRA-adapting general LLMs for opinion-grounded legal reasoning.
  • Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
  • Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.

Use Cases (15 task templates)

  1. Core extraction (case skeleton)
    Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.

  2. Variance vs. constructive amendment
    Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.

  3. Preservation / waiver
    Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.

  4. Prejudice analysis (variance)
    Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.

  5. Habeas framing (procedural default)
    Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.

  6. Standard of review
    State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.

  7. Argument for petitioner/appellant
    Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.

  8. Argument for the state/appellee
    Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.

  9. Record checklist
    Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).

  10. Remedies
    State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.

  11. Hypothetical preservation
    Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.

  12. Notice pleading in informations
    Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.

  13. Jury-instruction angle
    Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.

  14. Bluebook formatting
    Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.

  15. One-page IRAC
    Produce an IRAC with exact headers—Issue, Rule, Application, Conclusion—summarizing the variance/notice dispute and the court’s reasoning.


Data Structure

Record Schema

| Field | Type | Description |
|---|---|---|
| id | str | Unique identifier (e.g., ridgeway_habeas_0001). |
| case_name | str | Case caption (e.g., “Ridgeway v. Hutto”). |
| court | str | Court (e.g., “8th Cir.”). |
| year | int | Decision year. |
| jurisdiction | str | “federal” or “state”. |
| prompt_type | str | One of the 15 task categories (see Use Cases). |
| opinion_text | str | Public-domain opinion excerpt used as context. |
| messages | list | ChatML-style messages: a list of objects with "role" and "content" keys (see Example Record). |
| source_ref | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |
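
A record can be checked against this schema before training. The following is a minimal sketch, assuming the field names and types listed in the table; the helper name `validate_record` and the allowed role set (taken from the Example Record) are illustrative assumptions, not part of any released tooling.

```python
from typing import Any

# Field names and types as listed in the Record Schema table.
EXPECTED_FIELDS: dict[str, type] = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}

# Assumption: the roles that appear in the Example Record.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    if record.get("jurisdiction") not in {"federal", "state"}:
        problems.append("jurisdiction must be 'federal' or 'state'")
    messages = record.get("messages")
    for i, msg in enumerate(messages if isinstance(messages, list) else []):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role")
        elif not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}]: content must be a string")
    return problems
```

An empty return value means the record matches the table; each string otherwise names one mismatch, which makes the function convenient to run over a whole split and log.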

Example Record

{
  "id": "ridgeway_habeas_0001",
  "case_name": "Ridgeway v. Hutto",
  "court": "8th Cir.",
  "year": 1973,
  "jurisdiction": "federal",
  "prompt_type": "core_extraction",
  "opinion_text": "…public-domain opinion excerpt…",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
    },
    {
      "role": "user",
      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
    },
    {
      "role": "assistant",
      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
    }
  ],
  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}

Splits

  • train: count pending (to be updated after upload)
  • validation: count pending (to be updated after upload)
  • test (optional): count pending (to be updated after upload)

Split policy: All tasks derived from the same case stay in a single split (train, validation, or test) to avoid leakage.
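
The split policy can be sketched as a group-wise split keyed on the case. This assumes `case_name` identifies a case uniquely (a citation from `source_ref` may be a safer key in practice). The function names, the hash-based bucketing, and the 80/10/10 ratios are illustrative assumptions, not the procedure used to build the released splits.

```python
import hashlib
from collections import defaultdict


def assign_split(case_key: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case key to a split deterministically, so all its tasks stay together."""
    digest = hashlib.sha256(case_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_case(records, key="case_name"):
    """Group records into splits; every record sharing `key` lands in one split."""
    splits = defaultdict(list)
    for record in records:
        splits[assign_split(record[key])].append(record)
    return dict(splits)
```

Hashing the case key instead of shuffling records keeps the assignment stable across runs and across dataset versions, so a case never migrates between splits when new tasks are added for it.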


How to Use

Load with 🤗 Datasets

from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
print(ds)
print(ds["train"][0])
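
After loading, it can help to check how records are distributed across the task categories. A minimal sketch, assuming each record carries the `prompt_type` field from the schema; the helper name `count_prompt_types` is hypothetical, and the in-memory rows stand in for a loaded split such as `ds["train"]`.

```python
from collections import Counter


def count_prompt_types(records) -> Counter:
    """Count records per prompt_type; works on any iterable of dict-like rows."""
    return Counter(row["prompt_type"] for row in records)


# Stand-in rows; with the real dataset, pass ds["train"] instead.
rows = [
    {"prompt_type": "core_extraction"},
    {"prompt_type": "core_extraction"},
    {"prompt_type": "placeholder_task"},  # hypothetical label for illustration
]
counts = count_prompt_types(rows)
for prompt_type, n in counts.most_common():
    print(f"{prompt_type}: {n}")
```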

Use with Chat Templates (Transformers)

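from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base

sample = ds["train"][0]["messages"]  # ds comes from the loading snippet above

# Training text: render the full conversation, including the assistant answer.
train_text = tok.apply_chat_template(sample, add_generation_prompt=False, tokenize=False)

# Inference prompt: drop the reference answer, then open a new assistant turn.
prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)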
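
For evaluation on held-out opinions, the reference answer usually needs to be separated from the prompt turns before a chat template is applied. A minimal sketch, assuming each `messages` list ends with a single assistant turn as in the Example Record; `split_prompt_and_reference` is a hypothetical helper, not part of Transformers.

```python
def split_prompt_and_reference(messages):
    """Split a conversation into (prompt turns, reference answer text)."""
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the final message to be the assistant answer")
    return messages[:-1], messages[-1]["content"]


messages = [
    {"role": "system", "content": "You are a legal analysis assistant."},
    {"role": "user", "content": "List the procedural posture."},
    {"role": "assistant", "content": "1) ..."},
]
prompt_turns, reference = split_prompt_and_reference(messages)
# prompt_turns can then be rendered for generation, e.g. with
# tok.apply_chat_template(prompt_turns, add_generation_prompt=True, tokenize=False)
```

Keeping the reference text apart from the rendered prompt avoids leaking the answer into the model input during benchmarking.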

Curation & Construction

  • Sources: public-domain opinions (e.g., CAP).
  • Selection: appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
  • Annotation: prompts and answers authored by legally knowledgeable contributors; emphasis on a final-answer-only style.
  • Preprocessing: remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
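
The whitespace/quote normalization and near-duplicate removal steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline actually used: the function names, the 0.95 similarity threshold, and the choice of `difflib.SequenceMatcher` are all hypothetical, and collapsing every whitespace run to a single space also discards paragraph breaks, which a real pipeline may want to keep.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Curly quotes mapped to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Apply NFC normalization, straighten quotes, and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text).translate(QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_near_identical(snippets, threshold: float = 0.95):
    """Keep the first of any group of snippets whose similarity ratio >= threshold."""
    kept = []
    for snippet in map(normalize_text, snippets):
        if all(SequenceMatcher(None, snippet, k).ratio() < threshold for k in kept):
            kept.append(snippet)
    return kept
```

The pairwise comparison is quadratic in the number of snippets, which is acceptable for a small curated corpus; a larger one would likely need hashing or shingling instead.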

Quality Control

  • Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
  • Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
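
The formatting-adherence part of these spot checks lends itself to automation. A minimal sketch, assuming the IRAC headers named in the task templates and the `1) … 5)` numbering shown in the Example Record; the function names are hypothetical, and the checks cover format only, not factual alignment with the opinion.

```python
import re

IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")


def has_irac_headers(answer: str) -> bool:
    """True if all four IRAC headers start a line, in order."""
    position = 0
    for header in IRAC_HEADERS:
        match = re.compile(rf"^{header}\b", re.MULTILINE).search(answer, position)
        if match is None:
            return False
        position = match.end()
    return True


def has_numbered_items(answer: str, count: int = 5) -> bool:
    """True if lines starting with 1), 2), ... count) appear in order."""
    found = re.findall(r"^(\d+)\)", answer, flags=re.MULTILINE)
    return found == [str(i) for i in range(1, count + 1)]
```

Answers that fail either check can be routed to manual review rather than rejected outright, since a format miss does not by itself mean the legal content is wrong.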

Ethical Considerations & Limitations

  • Not legal advice. This dataset trains formatting and structure for legal analysis; always verify with primary sources.
  • Coverage: U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
  • Model risk: Misstatements of doctrine or miscitation can occur; downstream users should validate.
  • Bias: Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

Licensing

  • Opinion texts: Public domain (as supplied by CAP and similar sources).
  • Prompts & annotations: © 2025 sik247, released under CC-BY-4.0.
  • When redistributing, include attribution: “sik247 / LEXPT Law SFT (CAP subset)”.

Citation

If you use this dataset, please cite:

sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.

And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.


Maintainer

  • Author/Maintainer: sik247
  • Issues/requests: open a Discussion on the dataset page.

Changelog

  • v1.0: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Planned for later versions: updated split counts and additional jurisdictions.