|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: cc-by-4.0 |
|
tags: |
|
- kl3m |
|
- kl3m-002 |
|
- patent |
|
- all the patents |
|
- slm |
|
date: '2024-03-12T00:00:00.000Z' |
|
pipeline_tag: text-generation |
|
widget:
  - text: "# Patent\n\n## Title"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
|
--- |
|
|
|
# All the Patents 170m Model |
|
|
|
`kl3m-002-170m-patent` is a (very) small language model (SLM) fine-tuned from `kl3m-002-170m` to
|
generate "realistic" patent text. For more information about the base model, |
|
please see [its model page](https://huggingface.co/alea-institute/kl3m-002-170m). |
|
|
|
# All the Patents |
|
|
|
## Why? |
|
|
|
#### If a GPT-2-sized model can generate a valid set of claims, should anyone be able to monopolize the invention?
|
|
|
At their heart, patents are a temporary, sanctioned monopoly on an invention through a license to sue. This monopoly |
|
is justified by the public good created by encouraging innovation and the long-term impact of that innovation being |
|
shared in the public domain. |
|
|
|
Unfortunately, this worthy policy goal has been lost in the chaos and misuse of the patent system. |
|
|
|
One of the most common sources of frustration is the granting of "obvious" patents. While some inventions are clearly novel |
|
and non-obvious, many are not - but still slip through the examination process. These obvious but granted patents then |
|
loom large over the market, creating a "thicket" that discourages use or subsequent invention in the area of the granted |
|
patent. "Undoing" the grant of a patent is a costly and time-consuming process with possible negative consequences, and |
|
so many of these patents simply sit on the books as prior art, even if the patent holder knows they could never enforce them.
|
|
|
Congress and various stakeholders have discussed and proposed changes over time, including most recently the |
|
America Invents Act (AIA), but the problem of obvious patents persists. |
|
|
|
But what if someone were to generate all the obvious inventions and make them public? |
|
|
|
What if we shared the means of producing these obvious inventions so that everyone could help generate them on a normal CPU or consumer GPU? |
|
|
|
And what if we could then make those obvious inventions easily searchable for anyone, including PTO examiners themselves, to use? |
|
|
|
## How it Works |
|
|
|
We start with a small, GPT-2-sized language model - [kl3m-002-170m](https://273ventures.com/kl3m-the-first-legal-large-language-model/) - which was trained on a clean, copyright-free dataset.
|
This helps ensure that generations do not include copyrighted text, which would allow third parties to interfere with the project
|
via DMCA takedown requests. |
|
|
|
Next, we fine-tune this model on two simultaneous tasks: |
|
|
|
1. **Top-down drafting**: We start from the most abstract parts of the patent - the title and abstract - and then generate the detailed claims. This is a traditional next-token prediction order. |
|
|
|
```text |
|
# Patent |
|
|
|
## Title |
|
{title} |
|
|
|
## Abstract |
|
{abstract} |
|
|
|
## Claims |
|
|
|
1. {claim 1} |
|
|
|
2. {claim 2} |
|
|
|
... |
|
``` |
|
|
|
2. **Bottom-up**: We start from the most detailed part of the patent - the claims - and then generate the abstract and title. This reversed order can be thought of as similar to traditional extractive/abstractive summarization tasks. |
|
|
|
```text |
|
# Patent |
|
|
|
## Claims |
|
|
|
1. {claim 1} |
|
|
|
2. {claim 2} |
|
|
|
... |
|
|
|
## Abstract |
|
{abstract} |
|
|
|
## Title |
|
{title} |
|
``` |
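For concreteness, here is a minimal sketch of how a single patent record could be rendered into both training formats. The `patent` record and its field names are illustrative assumptions, not the actual KL3M fine-tuning pipeline:

```python
# Illustrative only: assemble both training formats from one (assumed) record.
patent = {
    "title": "Adjustable widget",
    "abstract": "A widget with an adjustable frobnicator.",
    "claims": [
        "A widget comprising a frobnicator.",
        "The widget of claim 1, wherein the frobnicator is adjustable.",
    ],
}

# Claims are numbered as in the templates above.
claims_text = "\n\n".join(
    f"{i}. {claim}" for i, claim in enumerate(patent["claims"], start=1)
)

# Task 1: top-down (title -> abstract -> claims)
top_down = (
    f"# Patent\n\n## Title\n{patent['title']}\n\n"
    f"## Abstract\n{patent['abstract']}\n\n## Claims\n\n{claims_text}"
)

# Task 2: bottom-up (claims -> abstract -> title)
bottom_up = (
    f"# Patent\n\n## Claims\n\n{claims_text}\n\n"
    f"## Abstract\n{patent['abstract']}\n\n## Title\n{patent['title']}"
)
```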
|
|
|
Once this fine-tuning is complete, we can then generate new patents using either technique by prompting the model as follows: |
|
|
|
1. **Top-down prompt**: `"# Patent\n\n## Title"` |
|
|
|
2. **Bottom-up prompt**: `"# Patent\n\n## Claims"` |
|
|
|
It's critical that generation occurs with sufficient randomness and diversity to ensure that the generated patents are not |
|
simply reproductions of the training data. This is a key area of ongoing research and development. |
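One simple guard, sketched below as an illustration (not the project's actual tooling), is to flag generations whose n-grams overlap heavily with a reference corpus; the choice of `n` and the flagging threshold are assumptions:

```python
# A minimal sketch of an n-gram overlap check between a generated patent and
# reference text. Not the project's actual deduplication tooling.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of the generation's n-grams appearing verbatim in the reference."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(reference, n)) / len(gen)

# e.g., flag a generation if overlap_ratio(gen_text, corpus_text) > 0.2
```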
|
|
|
**Much like the real process of invention, most of the "ideas" generated by this process will be either nonsense or
otherwise unpatentable. Our goal is to estimate the "hit rate" of the model and continue to improve the efficiency and
|
accessibility of the generation process so that the "cost per obvious invention" is as low as possible.** |
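As a back-of-the-envelope illustration (both numbers below are assumptions for the sake of the arithmetic, not measurements):

```python
# Illustrative arithmetic only: both inputs are assumed, not measured.
cost_per_candidate = 0.0001  # assumed cost (USD) to generate one candidate patent
hit_rate = 1 / 1000          # assumed fraction of candidates that are coherent and "obvious"

print(f"${cost_per_candidate / hit_rate:.2f} per obvious invention")  # $0.10 under these assumptions
```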
|
|
|
## Current Status |
|
|
|
This project is still in its infancy. We're doing R&D to develop prototype tools that demonstrate the feasibility and
cost of generating and sharing these obvious inventions. This R&D is currently focused on data collection,
|
data curation, model training, and model evaluation. |
|
|
|
|
|
## Generation |
|
|
|
You can generate your own examples as follows. For a "complete" patent, increase `max_new_tokens` as far as your available VRAM (or RAM, on CPU) allows.
|
|
|
```python |
|
import json |
|
from transformers import pipeline |
|
|
|
# Load the model and tokenizer on CPU |
|
p = pipeline('text-generation', 'alea-institute/kl3m-002-170m-patent', device='cpu') |
|
|
|
# Example usage on CPU |
|
text = "# Patent\n\n## Title" |
|
print( |
|
json.dumps( |
|
[ |
|
r.get("generated_text") |
|
for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32) |
|
], |
|
indent=2 |
|
) |
|
) |
|
``` |
|
|
|
```json |
|
[ |
|
"# Patent\n\n## Title\nMethod for manufacturing a temperature-controllable polyurethane composition and method", |
|
"# Patent\n\n## Title\nElectronic device\n\n## Abstract\nAn electronic device includes a display panel and a", |
|
"# Patent\n\n## Title\nMethods and devices for tissue repair using a neural network\n\n## Abstract" |
|
] |
|
``` |
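The bottom-up technique works the same way, reusing the `p` pipeline and `json` import from the example above:

```python
# Bottom-up generation: start from the claims instead of the title.
results = p(
    "# Patent\n\n## Claims",
    do_sample=True,
    temperature=0.5,
    num_return_sequences=3,
    max_new_tokens=32,
)
print(json.dumps([r.get("generated_text") for r in results], indent=2))
```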
|
|
|
### Related Material |
|
|
|
* https://www.federalregister.gov/documents/2024/02/27/2024-03967/updated-guidance-for-making-a-proper-determination-of-obviousness |
|
|
|
## License |
|
|
|
This model was originally developed by 273 Ventures and has been donated to the ALEA Institute. |
|
|
|
The model weights are released under the CC-BY 4.0 License. |
|
|
|
## Contact |
|
|
|
The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries: |
|
|
|
- GitHub: https://github.com/alea-institute/kl3m-model-research |
|
- Email: [email protected] |
|
- Website: https://aleainstitute.ai |
|
|
|
## Acknowledgments |
|
|
|
Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.
|
|
|
|
|
## Citation |
|
|
|
Tokenizer, dataset, and model publications are pending. |
|
|
|
|
|
|
 |