2imi9 committed · verified
Commit da575df · 1 Parent(s): 44368c3

Update README.md

Files changed (1)
  1. README.md +30 -9
README.md CHANGED
@@ -52,11 +52,6 @@ model-index:

  This model is a quantized version of OpenAI's GPT-OSS-20B using NVIDIA's advanced NVFP4 format. It follows the official NVIDIA TensorRT Model Optimizer methodology, providing superior accuracy retention compared to MXFP4 quantization while maintaining significant memory efficiency gains.

- ## Blog
- Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training:
- https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/
-
-
  ## Key Features

  - **Advanced Quantization**: Uses NVFP4 format with FP8 E4M3 scaling for enhanced precision
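The TensorRT Model Optimizer methodology mentioned in this hunk follows a standard post-training quantization flow. A minimal sketch, assuming modelopt's `mtq.quantize` API and its `NVFP4_DEFAULT_CFG` config; the one-sample calibration loop is a placeholder, not the recipe used for this checkpoint:

```python
# Hedged sketch: NVFP4 post-training quantization with NVIDIA TensorRT
# Model Optimizer (modelopt). NVFP4_DEFAULT_CFG and the tiny calibration
# loop below are illustrative assumptions.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

def forward_loop(m):
    # Calibration pass: modelopt observes activations here to derive the
    # per-block FP8 E4M3 scale factors that NVFP4 uses.
    batch = tokenizer("A short calibration sample.", return_tensors="pt").to(m.device)
    m(**batch)

# Produces "fake-quantized" weights: BF16 tensors plus NVFP4 metadata.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```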
@@ -207,6 +202,12 @@ This approach can achieve up to 98% task-specific performance recovery.
  - **Memory**: 24GB+ VRAM recommended
  - **CUDA**: 12.0+

+ ### Compatible Hardware (Software Emulation)
+ - **RTX 4090**: Ada Lovelace architecture (no native NVFP4 acceleration)
+ - **RTX 4080/4070**: Compatible via software emulation
+ - **Data Center**: H100, A100 (software emulation)
+ - **Memory**: 20GB+ VRAM for model loading
+
  ### Framework Support Status

  - **TensorRT-LLM**: NVFP4 support in active development
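The hardware tiers added in this hunk differ by CUDA compute capability. A quick runtime check, assuming datacenter Blackwell reports SM 10.x (consumer Blackwell parts may report a different number):

```python
import torch

# Native NVFP4 tensor-core paths are a Blackwell feature; Ada (SM 8.9) and
# Hopper (SM 9.0) fall back to software emulation. The >= 10 threshold is an
# assumption about how Blackwell reports its compute capability.
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 10:
    print(f"{name} (SM {major}.{minor}): native NVFP4 acceleration expected")
else:
    print(f"{name} (SM {major}.{minor}): NVFP4 via software emulation")
```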
@@ -216,10 +217,13 @@ This approach can achieve up to 98% task-specific performance recovery.

  ## Model Format Details

- - **Storage Format**: "Fake-quantized" NVFP4 (BF16 representation with quantization metadata)
- - **Deployment Format**: True NVFP4 when used with compatible inference engines
- - **File Format**: SafeTensors with quantization configuration
- - **Size**: ~39GB (fake-quantized), ~10GB (deployed NVFP4)
+ - **Storage Format**: BF16 with NVFP4 quantization metadata
+ - **File Size**: ~39GB (BF16 precision with quantization instructions)
+ - **Deployment Format**: Runtime conversion to NVFP4 by compatible inference engines
+ - **Deployed Size**: ~10GB when converted to 4-bit NVFP4 format
+ - **File Format**: SafeTensors with embedded quantization configuration
+
+ This model contains the full BF16 weights along with quantization parameters that enable inference engines like TensorRT-LLM to convert weights to true 4-bit NVFP4 format during model loading. The memory savings and performance benefits are realized at inference time, not during storage.

  ## Use Cases
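The ~39GB and ~10GB figures in this hunk follow from simple arithmetic over the parameter count. A back-of-the-envelope check, assuming NVFP4's 16-element blocks with one FP8 E4M3 scale per block:

```python
# Size arithmetic for a nominal 20B-parameter model.
params = 20e9

# Storage: BF16 is 2 bytes per parameter -> ~40 GB (matches the ~39GB above).
bf16_bytes = params * 2

# Deployment: NVFP4 stores a 4-bit value per weight plus one FP8 E4M3 scale
# per 16-weight block -> ~10 GB payload + ~1.25 GB of scales, i.e. roughly
# the ~10GB figure quoted above (exact size depends on which layers quantize).
nvfp4_bytes = params * 4 / 8 + params / 16 * 1

print(f"BF16: {bf16_bytes / 1e9:.0f} GB, NVFP4: {nvfp4_bytes / 1e9:.1f} GB")
```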
@@ -235,6 +239,23 @@ This approach can achieve up to 98% task-specific performance recovery.
  - **Memory**: ~75% reduction in deployment memory requirements
  - **Compatibility**: Works with standard transformers, optimized for NVIDIA frameworks

+ ## Limitations and Considerations
+
+ - **Current State**: Model saved in fake-quantized format for compatibility
+ - **Real Benefits**: Achieved only when deployed with NVFP4-compatible engines
+ - **Hardware Dependency**: Optimal performance requires NVIDIA Blackwell architecture
+ - **Framework Support**: Limited until inference engines implement NVFP4 support
+ - **Model Size**: Large storage footprint until deployment conversion
+
+ ## Evaluation and Benchmarking
+
+ This model maintains the capabilities of the original GPT-OSS-20B while providing memory efficiency benefits. For comprehensive evaluation, test against:
+
+ - **Language Modeling**: Perplexity on standard datasets
+ - **Downstream Tasks**: Task-specific accuracy measurements
+ - **Generation Quality**: Human evaluation of output coherence
+ - **Memory Usage**: Deployment memory requirements vs. accuracy trade-offs
+
  ## License

  This model inherits the Apache 2.0 license from the base openai/gpt-oss-20b model. Commercial use is permitted under the same terms.
 
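For the perplexity measurement suggested under Evaluation and Benchmarking, a minimal sketch; the local checkpoint path and sample text are placeholders, not the author's evaluation script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the downloaded checkpoint directory.
repo = "./gpt-oss-20b-nvfp4"
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Single-sample illustration; a real run would average loss over a dataset.
text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # HF causal LMs shift labels internally, so labels = input_ids is correct.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```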