h2o-danube3-500m-chat-GGUF

Description

This repo contains GGUF format model files for h2o-danube3-500m-chat, quantized using the llama.cpp framework.

The table below summarizes the quantized versions of h2o-danube3-500m-chat and shows the trade-off between model size, speed, and quality.

Name                               Quant method  Model size  MT-Bench AVG  Perplexity  Tokens per second
h2o-danube3-500m-chat-F16.gguf     F16           1.03 GB     3.34          9.46        1870
h2o-danube3-500m-chat-Q8_0.gguf    Q8_0          0.55 GB     3.76          9.46        2144
h2o-danube3-500m-chat-Q6_K.gguf    Q6_K          0.42 GB     3.77          9.46        2418
h2o-danube3-500m-chat-Q5_K_M.gguf  Q5_K_M        0.37 GB     3.20          9.55        2430
h2o-danube3-500m-chat-Q4_K_M.gguf  Q4_K_M        0.32 GB     3.16          9.96        2427
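
To make the trade-off easier to read, the figures in the table can be expressed relative to the F16 file. The following is a minimal sketch: the numbers are copied from the table, while the names `QUANTS` and `relative_to_f16` are hypothetical and chosen only for this illustration.

```python
# Figures copied from the table above:
# (model size in GB, MT-Bench AVG, perplexity, tokens per second).
QUANTS = {
    "F16": (1.03, 3.34, 9.46, 1870),
    "Q8_0": (0.55, 3.76, 9.46, 2144),
    "Q6_K": (0.42, 3.77, 9.46, 2418),
    "Q5_K_M": (0.37, 3.20, 9.55, 2430),
    "Q4_K_M": (0.32, 3.16, 9.96, 2427),
}


def relative_to_f16(name):
    """Return (size ratio, speed-up, perplexity increase) versus the F16 file."""
    size, _, ppl, tps = QUANTS[name]
    base_size, _, base_ppl, base_tps = QUANTS["F16"]
    return size / base_size, tps / base_tps, ppl - base_ppl


for name in QUANTS:
    size_ratio, speedup, ppl_delta = relative_to_f16(name)
    print(f"{name:7s} size x{size_ratio:.2f}  speed x{speedup:.2f}  perplexity +{ppl_delta:.2f}")
```

Read this way, Q6_K is roughly 40% of the F16 size with the same reported perplexity, while Q4_K_M is the smallest file and shows the largest perplexity increase.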

The columns in the table are:

  • Name -- model name and link
  • Quant method -- quantization method
  • Model size -- size of the model file in gigabytes
  • MT-Bench AVG -- MT-Bench benchmark score on a scale from 1 to 10. Higher is better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test. Lower is better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test. Higher is better. Speed tests were run on a single H100 GPU
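
For readers unfamiliar with the perplexity column, the metric is the exponential of the average negative log-likelihood per token. The sketch below shows only that textbook definition; it is not the llama.cpp perplexity tool, and the function name `perplexity` and its list-of-log-probabilities input are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-mean(log p)), the standard definition of the metric.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among 4 options.
example = perplexity([math.log(0.25)] * 4)
print(round(example, 6))
```

A lower value means the model assigns higher probability to the reference text, which is why lower is better in the table.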

Prompt template

<|prompt|>Why is drinking water so healthy?</s><|answer|>
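
The template above can be applied to a single user message with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and only the single-turn form shown above is covered, since the template does not specify a multi-turn layout.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user message in the single-turn template shown above."""
    return f"<|prompt|>{user_message}</s><|answer|>"


print(build_prompt("Why is drinking water so healthy?"))
```

The resulting string is what the model should receive as its raw prompt; the model's reply is generated after the `<|answer|>` tag.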

Model details

  • Format -- GGUF
  • Model size -- 514M params
  • Architecture -- llama
  • Available quantizations -- 4-bit, 5-bit, 6-bit, 8-bit, 16-bit
