More readme updates
README.md CHANGED
@@ -32,17 +32,17 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125", torch_dtype=torch.bfloat16, trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
 ```
-###
-By providing the argument `num_steps`, the model will execute a pass with that amount of compute:
+### Modifying the Model's Depth at Test Time:
+By providing the argument `num_steps`, the model will execute a forward pass with that amount of compute:
 ```python
 input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 model.eval()
 model.to(device)

 model(input_ids, num_steps=32)
 ```
 The model has about 1.5B parameters in non-recurrent code, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline,
-the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting
+the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting, and different from fixed-depth transformers!
 The model is trained to accept an arbitrary number of steps. However, using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.
@@ -60,7 +60,7 @@ config = GenerationConfig(max_length=256, stop_strings=["<|end_text|>", "<|end_t
 eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)


 input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 outputs = model.generate(input_ids, config, tokenizer=tokenizer, num_steps=16)
 ```
@@ -84,7 +84,7 @@ model.generate(input_ids, config, num_steps=64, tokenizer=tokenizer)

 ### KV-cache Details
 The model requires its own KV-cache implementation `HuginnDynamicCache`, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.
-
+The current implementation will always try to inject this cache implementation, but that may break with future Hugging Face updates. If you do not use `generate`, but implement your own generation loop, use a pattern like this:

 ```python
 # first step:
@@ -98,25 +98,34 @@ outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_va
 ## Advanced Features

 ### Per-Token Adaptive Compute
+When generating, you can also use a variable amount of compute per token. The model is not trained for this, so this is a zero-shot proof of concept.
+You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl` and `argmax-stability`, via `criterion=kl`. The exit threshold can be modified via `exit_threshold=5e-4`.
+We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that passing these arguments overrides the default generation function, and not all arguments that are valid for the normal `generate` call are valid here. To make this more explicit, you can also directly call `generate_with_adaptive_compute`:
+
 ```python
-
-
+from transformers import TextStreamer
+streamer = TextStreamer(tokenizer)

-
-
-use_cache=True, past_key_values=past_key_values,
-do_sample=False, temperature=None, top_k=None, top_p=None, min_p=None,
-return_dict_in_generate=True,
-eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)
-# Note: num_steps and other model arguments CANNOT be included here, they will shadow model args at runtime
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, criterion="kl", exit_threshold=5e-4, cache_kwargs={"lookup_strategy": "latest-m4"})

-input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
-outputs = model.generate(input_ids, config, tokenizer=tokenizer)
 ```
+Your cache strategy should be set to `"latest-m4"` if using adaptive compute.

 ### KV-cache Sharing
+To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set
+the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache).
+```python
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, cache_kwargs={"lookup_strategy": "compress-s16"})
+```
+You can combine this with per-token adaptive compute; in that case your lookup strategy should be `latest-m4-compress-s16`.

-
+### Warmstart / Continuous CoT
+At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:
+```python
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer, continuous_compute=True)
+```
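For a concrete sense of the depth guideline above, the materialized-parameter count can be tabulated directly from the stated formula; this is plain arithmetic on the README's own figures (1.5B recurrent, 2B embedding plus non-recurrent layers), not a measurement:

```python
# Materialized parameters per the guideline: num_steps * 1.5B (recurrent block) + 2B (embedding + non-recurrent layers)
for num_steps in (4, 16, 32, 64):
    print(f"num_steps={num_steps:>2}: ~{num_steps * 1.5 + 2:.0f}B materialized parameters")
```

So `num_steps=4` already materializes about 8B parameters and `num_steps=64` about 98B, while the weights actually stored stay at roughly 3.5B.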
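The KV-cache hunk above references a hand-rolled generation pattern (`# first step:` ...) whose body is largely elided in this diff. Below is a rough, untested sketch of what such a greedy loop could look like. It assumes the forward pass returns a standard causal-LM output with `logits` and `past_key_values` (consistent with the hunk context `outputs = model(input_ids=input_ids, use_cache=True, past_key_values=...)`), and that passing `past_key_values=None` on the first call lets the model inject its own `HuginnDynamicCache`; the device and dtype choices are likewise illustrative:

```python
# Hedged sketch, not the README's exact snippet: greedy decoding that reuses the
# recurrent model's KV-cache across steps instead of calling model.generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125",
                                             torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
model.eval().to(device)

input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt",
                             add_special_tokens=True).to(device)

generated, past_key_values = [], None  # assumption: the first call creates the HuginnDynamicCache
with torch.no_grad():
    for _ in range(32):
        # first step: feed the full prompt; later steps: only the newest token
        outputs = model(input_ids=input_ids, use_cache=True,
                        past_key_values=past_key_values, num_steps=32)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token.item())
        if generated[-1] == 65505:  # eos_token_id used in the README's GenerationConfig
            break
        input_ids = next_token

print(tokenizer.decode(generated))
```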
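The KV-cache sharing hunk notes that cache compression can be combined with per-token adaptive compute via the `latest-m4-compress-s16` lookup strategy, but no combined call is shown. A sketch of that combination, reusing `model`, `tokenizer`, and `input_ids` from the snippet above; the argument names mirror the README's own `generate_with_adaptive_compute` examples, and the minimal `GenerationConfig` is only a stand-in for the fuller one defined earlier in the README (which also sets `stop_strings`):

```python
from transformers import GenerationConfig, TextStreamer

# Minimal stand-in config; the README's full config also sets stop_strings.
config = GenerationConfig(max_length=256, do_sample=False,
                          eos_token_id=65505, bos_token_id=65504, pad_token_id=65509)
streamer = TextStreamer(tokenizer)

# Per-token adaptive exits (kl criterion) together with a compressed recurrent KV-cache.
model.generate_with_adaptive_compute(
    input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
    continuous_compute=False, criterion="kl", exit_threshold=5e-4,
    cache_kwargs={"lookup_strategy": "latest-m4-compress-s16"},
)
```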