jennyyyi committed (verified) · Commit 93ffbd3 · Parent: db21c70

Update README.md

Files changed (1): README.md (+198 -0)
README.md CHANGED
@@ -19,6 +19,17 @@ tags:
 - llama-3
 license: llama3.3
 ---
+
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-3.3-70B-Instruct
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
+
 ## Model Information
 **Built with Llama**

@@ -58,6 +69,193 @@ Where to send questions or comments about the model Instructions on how to provi

 This repository contains two versions of Llama-3.3-70B-Instruct, for use with transformers and with the original `llama` codebase.

+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/Llama-3.3-70B-Instruct"
+ number_gpus = 4
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ # Build the prompt with the model's chat template
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+
+ outputs = llm.generate(prompt, sampling_params)
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
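Not part of the original card: the sketch below illustrates one way to query such an OpenAI-compatible endpoint from Python. It assumes a server is already running locally (for example, one started with `vllm serve RedHatAI/Llama-3.3-70B-Instruct` on the default port 8000) and that the `openai` client package is installed; adjust the base URL, port, and model name for your deployment.

```python
from openai import OpenAI

# Assumption: an OpenAI-compatible vLLM server is listening on
# localhost:8000 (the default for `vllm serve`); change base_url as needed.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```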
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Llama-3.3-70B-Instruct
+ ```
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-3-70b-instruct:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-3-3-70b-instruct
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-3-3-70b-instruct
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-3.3-70B-Instruct # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-3.3-70B-Instruct # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Llama-3.3-70B-Instruct",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
 ### Use with transformers

 Starting with `transformers >= 4.45.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.
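For reference, a minimal sketch of the `pipeline`-based usage described above; it is illustrative rather than taken from this diff, and the `torch_dtype` and `device_map` settings are assumptions (a 70B model needs several GPUs or offloading).

```python
import torch
from transformers import pipeline

model_id = "RedHatAI/Llama-3.3-70B-Instruct"

# Assumptions: bfloat16 weights and automatic device placement across
# available GPUs; adjust to match your hardware.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

outputs = pipe(messages, max_new_tokens=256)
# The chat pipeline returns the full conversation; the last entry is the reply.
print(outputs[0]["generated_text"][-1])
```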