Binary performance?
#7, opened by williambarberjr
Curious if you've considered benchmarking GritLM-7B against the approach taken here: https://huggingface.co/blog/embedding-quantization
It's essentially the only way I could practically use 4096 dimensions in production for retrieval over a large corpus. The model size itself would still be an issue when embedding queries in production, so pruning and/or quantizing the model itself may still be needed. But not all models play nicely with this approach of quantizing the embeddings and then re-ranking with int8-quantized vectors loaded from disk. If you were already planning to benchmark this, I'd be very interested in the results.
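For anyone unfamiliar with the idea, here is a rough NumPy sketch of the kind of pipeline I mean: a 1-bit index for the first pass, then re-ranking a small candidate set with int8 vectors. This is only my own illustration, not the blog's code and not anything from GritLM. The function names, the `oversample` factor, and the single global int8 scale are all assumptions made for the example (calibrated per-dimension ranges are a common alternative), and the random vectors stand in for real embeddings. For scale: a 4096-dim float32 vector is 16,384 bytes, the binary version is 512 bytes, and the int8 version is 4,096 bytes.

```python
import numpy as np


def quantize_binary(emb):
    # 1 bit per dimension (sign threshold at 0), packed 8 dimensions per byte.
    return np.packbits(emb > 0, axis=-1)


def quantize_int8(emb, max_abs):
    # Symmetric int8 quantization with one global scale.
    # Assumption for this sketch: max_abs is the largest absolute value
    # seen in the corpus embeddings.
    q = np.round(emb / max_abs * 127.0)
    return np.clip(q, -127, 127).astype(np.int8)


def hamming_distances(query_bits, corpus_bits):
    # XOR the packed bytes, then count the differing bits per row.
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)


def search(query, corpus_bits, corpus_int8, top_k=10, oversample=4):
    # Stage 1: cheap Hamming search over the binary index.
    dists = hamming_distances(quantize_binary(query), corpus_bits)
    n_cand = min(top_k * oversample, len(dists))
    cand = np.argpartition(dists, n_cand - 1)[:n_cand]
    # Stage 2: re-rank only the candidates, float32 query against int8 docs.
    # In production corpus_int8 could be a disk-backed array (e.g. np.memmap).
    scores = corpus_int8[cand].astype(np.float32) @ query.astype(np.float32)
    order = np.argsort(-scores)[:top_k]
    return cand[order], scores[order]


# Toy data: random vectors standing in for 4096-dim embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 4096)).astype(np.float32)
corpus_bits = quantize_binary(corpus)                     # shape (1000, 512)
corpus_int8 = quantize_int8(corpus, np.abs(corpus).max())

# A query that is a slightly perturbed copy of document 42.
query = corpus[42] + 0.1 * rng.standard_normal(4096).astype(np.float32)
ids, scores = search(query, corpus_bits, corpus_int8)
print(ids[0])
```

Whether a given model's embeddings survive the sign-threshold step well enough for the first pass to surface the right candidates is exactly the part that would need benchmarking.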