Update README.md
README.md CHANGED

@@ -18,7 +18,7 @@ license_link: https://ai.google.dev/gemma/terms
---

# Gemma Model Card
-
+This model card is copied from the original google/gemma-2b-it, with edits covering how to run this auto-gptq quantized version of the model. The quantized model has only been tested on a CUDA GPU.

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs)

@@ -54,32 +54,15 @@ state of the art AI models and helping foster innovation for everyone.

Below we share some code snippets on how to quickly get started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your use case.

-#### Running the model on a CPU
-
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
-model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
-```
-
-
#### Running the model on a single / multi GPU


```python
-# pip install accelerate
+# !pip install --upgrade -q transformers accelerate auto-gptq optimum
from transformers import AutoTokenizer, AutoModelForCausalLM

-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
-model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained("eralFlare/gemma-2b-it")
+model = AutoModelForCausalLM.from_pretrained("eralFlare/gemma-2b-it", device_map="auto")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
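
The second hunk ends before the generation calls, which sit below the edited lines and are untouched by this commit. For orientation, here is a minimal end-to-end sketch of the updated GPU snippet, assuming the `eralFlare/gemma-2b-it` repo id introduced in the diff and the same `generate`/`decode` calls shown in the removed CPU snippet.

```python
# Hedged end-to-end sketch of the auto-gptq quantized checkpoint (requires a CUDA GPU).
# !pip install --upgrade -q transformers accelerate auto-gptq optimum
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("eralFlare/gemma-2b-it")
# device_map="auto" lets accelerate place the quantized weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained("eralFlare/gemma-2b-it", device_map="auto")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generation and decoding mirror the snippet in the original google/gemma-2b-it card.
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```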