# Pile of Law Tokenizer
This tokenizer should be a drop-in replacement for the GPT2Tokenizer. It has the same special tokens, but was trained on 1M random samples from the train split of [the Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law).
It has a vocabulary of exactly 52,000 tokens, slightly larger than GPT-2's 50,257.
Usage:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sam-mosaic/pile-of-law-tokenizer")
```
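As a quick sanity check that it behaves like a drop-in GPT-2 replacement, one might round-trip a string and inspect the vocabulary size and special tokens; this sketch assumes `transformers` is installed and the Hugging Face Hub is reachable:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sam-mosaic/pile-of-law-tokenizer")

text = "The court granted the motion to dismiss."
ids = tokenizer(text)["input_ids"]

# Byte-level BPE is lossless, so decoding should reproduce the input exactly
assert tokenizer.decode(ids) == text

# Same special-token setup as GPT-2 (a single <|endoftext|> token),
# but a 52,000-entry vocabulary instead of GPT-2's 50,257
print(tokenizer.eos_token)
print(len(tokenizer))
```

Because the vocabularies differ in size, any model embedding layer sized for GPT-2's vocabulary would need to be resized (e.g. with `model.resize_token_embeddings(len(tokenizer))`) before training with this tokenizer.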