More readme updates
README.md CHANGED
@@ -32,17 +32,17 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125", torch_dtype=torch.bfloat16, trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
 ```
-###
-By providing the argument `num_steps`, the model will execute a pass with that amount of compute:
+### Modifying the Model's Depth at Test Time:
+By providing the argument `num_steps`, the model will execute a forward pass with that amount of compute:
 ```python
 input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 model.eval()
 model.to(device)

 model(input_ids, num_steps=32)
 ```
 The model has about 1.5B parameters in non-recurrent code, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline,
-the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting
+the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting, and different from fixed-depth transformers!
 The model is trained to accept an arbitrary number of steps. However, using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.
@@ -60,7 +60,7 @@ config = GenerationConfig(max_length=256, stop_strings=["<|end_text|>", "<|end_t
 eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)


 input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 outputs = model.generate(input_ids, config, tokenizer=tokenizer, num_steps=16)
 ```
@@ -84,7 +84,7 @@ model.generate(input_ids, config, num_steps=64, tokenizer=tokenizer)

 ### KV-cache Details
 The model requires its own KV-cache implementation `HuginnDynamicCache`, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.
-
+The current implementation will always try to inject this cache implementation, but that may break with future Hugging Face updates. If you do not use `generate`, but implement your own generation loop, use a pattern like this:

 ```python
 # first step:
@@ -98,25 +98,34 @@ outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_va
 ## Advanced Features

 ### Per-Token Adaptive Compute
+When generating, you can also use a variable amount of compute per token. The model is not trained for this, so this is a zero-shot proof of concept.
+You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl` and `argmax-stability`, via `criterion=kl`. The exit threshold can be modified via `exit_threshold=5e-4`.
+We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that passing these arguments overrides the default generation function, and not all arguments that are valid for the normal `generate` call are valid here. To make this more explicit, you can also directly call `generate_with_adaptive_compute`:
+
 ```python
-
-
+from transformers import TextStreamer
+streamer = TextStreamer(tokenizer)

-
-
-use_cache=True, past_key_values=past_key_values,
-do_sample=False, temperature=None, top_k=None, top_p=None, min_p=None,
-return_dict_in_generate=True,
-eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)
-# Note: num_steps and other model arguments CANNOT be included here, they will shadow model args at runtime
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, criterion="kl", exit_threshold=5e-4, cache_kwargs={"lookup_strategy": "latest-m4"})

-input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
-outputs = model.generate(input_ids, config, tokenizer=tokenizer)
 ```
+Your cache strategy should be set to `"latest-m4"` if using adaptive compute.

 ### KV-cache Sharing
+To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set
+the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache).
+```python
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, cache_kwargs={"lookup_strategy": "compress-s16"})
+```
+You can combine this with per-token adaptive compute; in that case your lookup strategy should be `latest-m4-compress-s16`.

-
+### Warmstart / Continuous CoT
+At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:
+```python
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer, continuous_compute=True)
+```
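For a concrete sense of the depth guideline above, the materialized-parameter count can be tabulated directly from the stated formula; this is plain arithmetic on the README's own figures (1.5B recurrent, 2B embedding plus non-recurrent layers), not a measurement:

```python
# Materialized parameters per the guideline: num_steps * 1.5B (recurrent block) + 2B (embedding + non-recurrent layers)
for num_steps in (4, 16, 32, 64):
    print(f"num_steps={num_steps:>2}: ~{num_steps * 1.5 + 2:.0f}B materialized parameters")
```

So `num_steps=4` already materializes about 8B parameters and `num_steps=64` about 98B, while the weights actually stored stay at roughly 3.5B.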
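The KV-cache hunk above references a hand-rolled generation pattern (`# first step:` ...) whose body is largely elided in this diff. Below is a rough, untested sketch of what such a greedy loop could look like. It assumes the forward pass returns a standard causal-LM output with `logits` and `past_key_values` (consistent with the hunk context `outputs = model(input_ids=input_ids, use_cache=True, past_key_values=...)`), and that passing `past_key_values=None` on the first call lets the model inject its own `HuginnDynamicCache`; the device and dtype choices are likewise illustrative:

```python
# Hedged sketch, not the README's exact snippet: greedy decoding that reuses the
# recurrent model's KV-cache across steps instead of calling model.generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125",
                                             torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
model.eval().to(device)

input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt",
                             add_special_tokens=True).to(device)

generated, past_key_values = [], None  # assumption: the first call creates the HuginnDynamicCache
with torch.no_grad():
    for _ in range(32):
        # first step: feed the full prompt; later steps: only the newest token
        outputs = model(input_ids=input_ids, use_cache=True,
                        past_key_values=past_key_values, num_steps=32)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token.item())
        if generated[-1] == 65505:  # eos_token_id used in the README's GenerationConfig
            break
        input_ids = next_token

print(tokenizer.decode(generated))
```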
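The KV-cache sharing hunk notes that cache compression can be combined with per-token adaptive compute via the `latest-m4-compress-s16` lookup strategy, but no combined call is shown. A sketch of that combination, reusing `model`, `tokenizer`, and `input_ids` from the snippet above; the argument names mirror the README's own `generate_with_adaptive_compute` examples, and the minimal `GenerationConfig` is only a stand-in for the fuller one defined earlier in the README (which also sets `stop_strings`):

```python
from transformers import GenerationConfig, TextStreamer

# Minimal stand-in config; the README's full config also sets stop_strings.
config = GenerationConfig(max_length=256, do_sample=False,
                          eos_token_id=65505, bos_token_id=65504, pad_token_id=65509)
streamer = TextStreamer(tokenizer)

# Per-token adaptive exits (kl criterion) together with a compressed recurrent KV-cache.
model.generate_with_adaptive_compute(
    input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
    continuous_compute=False, criterion="kl", exit_threshold=5e-4,
    cache_kwargs={"lookup_strategy": "latest-m4-compress-s16"},
)
```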