Text Generation
Transformers
ONNX
English
gpt_neox

Update usage to be specific to ORT+DML

#2 by pavignol2 - opened
Files changed (1)
  1. README.md +12 -79
README.md CHANGED
@@ -49,96 +49,29 @@ The ONNX model above was processed with the [Olive](https://github.com/microsoft
  [EleutherAI’s](https://www.eleuther.ai/) [Pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b) and fine-tuned
  on a [~15K record instruction corpus](https://github.com/databrickslabs/dolly/tree/master/data) generated by Databricks employees and released under a permissive license (CC-BY-SA)

- ## Usage
-
- To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` and `accelerate` libraries installed.
- In a Databricks notebook you could run:
-
- ```python
- %pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
- ```
+ `dolly-v2-7b-olive-optimized` is an optimized ONNX model of `dolly-v2-7b` generated by [Olive](https://github.com/microsoft/Olive) that is meant to be used with ONNX Runtime and DirectML.

- The instruction following pipeline can be loaded using the `pipeline` function as shown below. This loads a custom `InstructionTextGenerationPipeline`
- found in the model repo [here](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py), which is why `trust_remote_code=True` is required.
- Including `torch_dtype=torch.bfloat16` is generally recommended if this type is supported in order to reduce memory usage. It does not appear to impact output quality.
- It is also fine to remove it if there is sufficient memory.
-
- ```python
- import torch
- from transformers import pipeline
-
- generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
- ```
+ ## Usage

- You can then use the pipeline to answer instructions:
+ To use the model with the `transformers` library on a machine with ONNX Runtime and DirectML, first make sure you have the `transformers`, `accelerate`, `optimum`, `onnxruntime-directml` and `onnx` libraries installed:

  ```python
- res = generate_text("Explain to me the difference between nuclear fission and fusion.")
- print(res[0]["generated_text"])
+ pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2" "optimum>=1.8.8,<2" "onnxruntime-directml>=1.15.1,<2" "onnx>=1.14.0,<2"
  ```

- Alternatively, if you prefer to not use `trust_remote_code=True` you can download [instruct_pipeline.py](https://huggingface.co/databricks/dolly-v2-3b/blob/main/instruct_pipeline.py),
- store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
+ You can then download [instruct_pipeline.py](https://huggingface.co/microsoft/dolly-v2-7b-olive-optimized/raw/main/instruct_pipeline.py) and construct the pipeline from the loaded model and tokenizer:

  ```python
- import torch
+ from transformers import AutoTokenizer, TextStreamer
+ from optimum.onnxruntime import ORTModelForCausalLM
  from instruct_pipeline import InstructionTextGenerationPipeline
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b", padding_side="left")
- model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype=torch.bfloat16)
-
- generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
- ```
-
- ### LangChain Usage
-
- To use the pipeline with LangChain, you must set `return_full_text=True`, as LangChain expects the full text to be returned
- and the default for the pipeline is to only return the new text.
-
- ```python
- import torch
- from transformers import pipeline

- generate_text = pipeline(model="databricks/dolly-v2-7b", torch_dtype=torch.bfloat16,
- trust_remote_code=True, device_map="auto", return_full_text=True)
- ```
-
- You can create a prompt that either has only an instruction or has an instruction with context:
-
- ```python
- from langchain import PromptTemplate, LLMChain
- from langchain.llms import HuggingFacePipeline
-
- # template for an instrution with no input
- prompt = PromptTemplate(
- input_variables=["instruction"],
- template="{instruction}")
-
- # template for an instruction with input
- prompt_with_context = PromptTemplate(
- input_variables=["instruction", "context"],
- template="{instruction}\n\nInput:\n{context}")
-
- hf_pipeline = HuggingFacePipeline(pipeline=generate_text)
-
- llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
- llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
- ```
-
- Example predicting using a simple instruction:
-
- ```python
- print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())
- ```
-
- Example predicting using an instruction with context:
-
- ```python
- context = """George Washington (February 22, 1732[b] – December 14, 1799) was an American military officer, statesman,
- and Founding Father who served as the first president of the United States from 1789 to 1797."""
+ tokenizer = AutoTokenizer.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", padding_side="left")
+ model = ORTModelForCausalLM.from_pretrained("microsoft/dolly-v2-7b-olive-optimized", provider="DmlExecutionProvider", use_cache=True, use_merged=True, use_io_binding=False)

- print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
+ streamer = TextStreamer(tokenizer, skip_prompt=True)
+ generate_text = InstructionTextGenerationPipeline(model=model, streamer=streamer, tokenizer=tokenizer, max_new_tokens=128)
+ generate_text("Explain to me the difference between nuclear fission and fusion.")
  ```

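Since the new usage section targets the DirectML execution provider, it can be worth confirming that the provider is actually available in the installed `onnxruntime` build before loading the model. A minimal check, not part of the model card and using only the standard `onnxruntime` API, might look like:

```python
import onnxruntime as ort

# List the execution providers available in this onnxruntime build.
available = ort.get_available_providers()
print(available)

# The Olive-optimized model is intended to run through DirectML, so the DML
# provider should appear here when onnxruntime-directml is installed.
if "DmlExecutionProvider" not in available:
    raise RuntimeError("DmlExecutionProvider is not available; install onnxruntime-directml.")
```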
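The updated snippet streams tokens to the console through `TextStreamer`, but the pipeline call also returns the generated text. Assuming `instruct_pipeline.py` keeps the same output format shown in the section being removed (a list of dicts with a `generated_text` key), the result can be captured as well:

```python
# Run the instruction; tokens are streamed as they are generated, and the
# completed answer is also returned by the pipeline call.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")

# Same output format as in the original README usage example.
print(res[0]["generated_text"])
```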