snowflake-arctic-embed-m-v2.0-ONNX-quant

This is a mostly uint8 quantized version of this model: https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0/blob/main/onnx/model.onnx

However, some of the ONNX graph nodes are left unquantized for better accuracy. On average, this model roughly matches the accuracy of the full-precision ONNX model; the int8 variant included here is slightly less accurate.

I'll be using this model as the base for one with uint8 output.

Quantization method

I instrumented the model and measured the dynamic range of all the graph nodes while processing tokens from my dataset here: https://github.com/electroglyph/dataset_build

It took quite a while, because I had to process the model in batches: instrumenting the entire model at once would have required 200+ GB of RAM, which I don't have.
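The core of this measurement step can be sketched as a running min/max accumulator per graph node. This is an assumed reconstruction (the actual instrumentation code isn't published here): in practice you would expose each intermediate tensor as an extra graph output and feed the fetched arrays from each onnxruntime batch into a tracker like this. The node names and the "widest range = most sensitive" heuristic below are illustrative, not the author's exact criterion.

```python
import numpy as np

class RangeTracker:
    """Accumulates the observed min/max for each named tensor across batches."""
    def __init__(self):
        self.ranges = {}  # tensor name -> [min, max]

    def update(self, name: str, tensor: np.ndarray) -> None:
        lo, hi = float(tensor.min()), float(tensor.max())
        if name not in self.ranges:
            self.ranges[name] = [lo, hi]
        else:
            r = self.ranges[name]
            r[0] = min(r[0], lo)
            r[1] = max(r[1], hi)

    def widest(self, n: int):
        """Return the n nodes with the widest dynamic range -- one heuristic
        for picking candidates to exempt from 8-bit quantization."""
        span = lambda kv: kv[1][1] - kv[1][0]
        return sorted(self.ranges.items(), key=span, reverse=True)[:n]

# Example: pretend these arrays came from two onnxruntime batches
# (the node names here are hypothetical).
tracker = RangeTracker()
tracker.update("attn/MatMul_out", np.array([-3.0, 2.5]))
tracker.update("attn/MatMul_out", np.array([-1.0, 7.0]))
tracker.update("ffn/Gelu_out", np.array([-0.5, 0.5]))
print(tracker.widest(1))  # attn/MatMul_out spans [-3.0, 7.0]
```

Because only the running min/max per node is kept, memory stays bounded no matter how many batches are processed.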

Some languages were automatically removed from my dataset because the tokenizer couldn't completely handle them:

Unknown token (<unk>) found for lang: jpn, skipping this language.
Unknown token (<unk>) found for lang: kor, skipping this language.
Unknown token (<unk>) found for lang: zho, skipping this language.
Unknown token (<unk>) found for lang: math, skipping this language.

This left almost 1.4 million tokens for analysis.
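The filtering step above can be sketched as: tokenize each language's sample text and drop the language if any token maps to `<unk>`. The toy whitespace tokenizer below is only a stand-in for the model's real tokenizer, and the sample data is invented for illustration.

```python
UNK = "<unk>"

def simple_tokenize(text, vocab):
    # Toy whitespace tokenizer: anything outside the vocab becomes <unk>.
    # The real dataset tooling would use the model's own tokenizer instead.
    return [tok if tok in vocab else UNK for tok in text.split()]

def filter_languages(samples, vocab):
    """Keep only languages the tokenizer can handle without producing <unk>."""
    kept = {}
    for lang, text in samples.items():
        toks = simple_tokenize(text, vocab)
        if UNK in toks:
            print(f"Unknown token ({UNK}) found for lang: {lang}, skipping this language.")
            continue
        kept[lang] = toks
    return kept

vocab = {"hello", "world"}
samples = {"eng": "hello world", "zho": "你好 世界"}
print(sorted(filter_languages(samples, vocab)))  # ['eng']
```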

After the analysis I exempted the 121 most sensitive graph nodes from quantization. This increased the model size to 366 MB; a purely int8 or uint8 model typically sits around 296 MB.
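Exempting specific nodes is supported directly by onnxruntime's quantization tooling via the `nodes_to_exclude` parameter. The sketch below is a minimal, assumed version of that final step (the 121 real node names from the analysis are not reproduced here, and the exact quantization entry point the author used may differ):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Placeholder: in the real run this list held the 121 node names
# identified as most sensitive during the range analysis.
nodes_to_exclude = ["..."]

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quant.onnx",
    weight_type=QuantType.QUInt8,       # mostly-uint8 weights
    nodes_to_exclude=nodes_to_exclude,  # these nodes stay in full precision
)
```

Keeping the excluded nodes in full precision is what pushes the file size from ~296 MB to 366 MB.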

Benchmarks

I included results for one of my other models: https://huggingface.co/electroglyph/embeddinggemma-300m-ONNX-uint8

The results for this model are a bit better than those of the default int8 ONNX model, and the total of all the scores matches the f32 model.
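For context on how embeddings like these are produced from the model's raw output: arctic-embed models use CLS pooling followed by L2 normalization. The sketch below assumes the ONNX graph outputs a `last_hidden_state` tensor of shape (batch, seq_len, hidden); the toy input array stands in for a real onnxruntime output.

```python
import numpy as np

def pool_and_normalize(last_hidden_state: np.ndarray) -> np.ndarray:
    """CLS pooling + L2 normalization, as used by arctic-embed."""
    cls = last_hidden_state[:, 0, :]  # take the CLS (first) token
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)  # guard against zero norm

# Toy batch: 2 sequences, 4 tokens, hidden size 3 (a real run would pass
# the quantized model's last_hidden_state output here).
fake = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
emb = pool_and_normalize(fake)
print(np.round(np.linalg.norm(emb, axis=1), 6))  # rows are unit-norm
```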

mteb retrieval results

mteb totals

License

Arctic is licensed under Apache-2.0. The released models can be used for commercial purposes free of charge.
