Files changed (1)
  1. README.md +165 -2
README.md CHANGED
@@ -21,8 +21,14 @@ tags:
  - quantized
license: llama3.3
---
-
- # Llama-3.3-70B-Instruct-FP8-dynamic
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-3.3-70B-Instruct-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>

## Model Overview
- **Model Architecture:** Meta-Llama-3.1
@@ -80,6 +86,163 @@ print(generated_text)

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
+ ```
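+
+ Once the container is running, you can send a quick test request to the OpenAI-compatible endpoint it exposes (a minimal sketch, assuming the server is reachable on localhost:8000 and the served model name matches the `--model` value above):
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
+     "messages": [{"role": "user", "content": "Give a one-sentence summary of FP8 quantization."}]
+   }'
+ ```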
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download the model from the Red Hat registry via docker
+ # Note: this downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-3-3-70b-instruct-fp8-dynamic:1.5
+ ```
+
+ ```bash
+ # Serve the model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-3-3-70b-instruct-fp8-dynamic
+
+ # Chat with the model
+ ilab model chat --model ~/.cache/instructlab/models/llama-3-3-70b-instruct-fp8-dynamic
+ ```
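+
+ The served model also exposes an OpenAI-compatible API, so it can be queried directly over HTTP (a minimal sketch; the address http://127.0.0.1:8000/v1 reflects the default serve settings and is an assumption, as is the served model name):
+
+ ```bash
+ # The model identifier is typically the path passed to --model-path;
+ # check the serve logs or GET /v1/models to confirm the exact name.
+ curl http://127.0.0.1:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "<served-model-name>",
+     "messages": [{"role": "user", "content": "Say hello in one sentence."}]
+   }'
+ ```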
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Set up a vLLM server with a ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach the model to the vLLM server. This is an NVIDIA template.
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: llama-3-3-70b-instruct-fp8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: llama-3-3-70b-instruct-fp8-dynamic # specify the model name; this value is used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-3-3-70b-instruct-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
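+
+ Before calling the endpoint, it can help to confirm that the InferenceService is ready and to look up its URL (a minimal sketch; the resource name matches the metadata.name set in inferenceservice.yaml above):
+
+ ```bash
+ # Wait for the InferenceService to report Ready
+ oc wait --for=condition=Ready inferenceservice/llama-3-3-70b-instruct-fp8-dynamic --timeout=15m
+
+ # Print the URL assigned to the service
+ oc get inferenceservice llama-3-3-70b-instruct-fp8-dynamic -o jsonpath='{.status.url}'
+ ```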
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "llama-3-3-70b-instruct-fp8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
## Creation

<details>