Help: How to run inference with this model in FP8

#43
opened by YukiTomita-CC

I rewrote the code from the Instruct following section of the Mistral Inference instructions as follows and ran it in Google Colab.

import torch # Added

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, dtype=torch.float8_e4m3fn) # Changed

prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar."

completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)

When the model was loaded, VRAM usage was 11.6 GB, but then the following error occurred in generate():

/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2262         # remove once script supports set_grad_enabled
   2263         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2264     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2265 
   2266 

RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
  • mistral_inference==1.3.1
  • torch==2.3.1+cu121
  • safetensors==0.4.3

Am I doing something wrong? If I could run inference in FP8, the model would fit in my 16 GB of VRAM. I would really appreciate it if someone could show me how to run inference in FP8.
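For what it's worth, the error seems to come from the embedding lookup itself rather than from mistral_inference: the token-embedding gather dispatches to index_select, which has no CUDA kernel for Float8_e4m3fn. A minimal sketch that reproduces it outside the model (assuming torch==2.3.1 with CUDA, as above):

import torch
import torch.nn.functional as F

# Cast an embedding table to FP8, as Transformer.from_folder(..., dtype=torch.float8_e4m3fn) appears to do
weight = torch.randn(32, 8, device="cuda").to(torch.float8_e4m3fn)
token_ids = torch.tensor([1, 2, 3], device="cuda")

# The lookup dispatches to index_select, which raises:
# RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
F.embedding(token_ids, weight)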

I have the same problem. I can load the model in FP8 with transformers, but it does not work with mistral_inference.
Does anyone know how to load the model with mistral_inference in FP8?
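For reference, FP8 loading with transformers can look roughly like this (just a sketch, assuming the mistralai/Mistral-Nemo-Instruct-2407 checkpoint this thread is about, transformers>=4.43 with accelerate and fbgemm-gpu installed, and a GPU supported by the FBGEMM FP8 kernels):

from transformers import AutoModelForCausalLM, AutoTokenizer, FbgemmFp8Config

model_id = "mistralai/Mistral-Nemo-Instruct-2407"  # assumption: the model this thread is about

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the linear layers to FP8 (e4m3) while loading; other weights stay in higher precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=FbgemmFp8Config(),
)

inputs = tokenizer("How expensive would it be to ask a window cleaner to clean all windows in Paris?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))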

What GPU are you using to load the quantized model with the transformers lib? An A100 or better?
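In case it helps, a quick way to check what the current session is running on (compute capability 8.9 Ada / 9.0 Hopper is generally what the FP8 tensor-core kernels target):

import torch

# Print the GPU model and its compute capability for the current CUDA device
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))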
