# Meta-Llama-3-8B-Instruct-ct2-int8
This is a CTranslate2 v4.5.0 int8 conversion of meta-llama/Meta-Llama-3-8B-Instruct, created with:

```sh
ct2-transformers-converter --model meta-llama/Meta-Llama-3-8B-Instruct --output_dir Meta-Llama-3-8B-Instruct-ct2-int8 --quantization int8
```
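The same conversion can also be run from Python through CTranslate2's converter API. A minimal sketch, assuming the original meta-llama weights are accessible (they are gated, so you need an authenticated Hugging Face login):

```python
# Sketch: reproduce the CLI conversion programmatically.
# TransformersConverter downloads/loads the original weights, and
# convert() writes the quantized CTranslate2 model to the output dir.
import ctranslate2.converters

converter = ctranslate2.converters.TransformersConverter(
    "meta-llama/Meta-Llama-3-8B-Instruct"
)
converter.convert("Meta-Llama-3-8B-Instruct-ct2-int8", quantization="int8")
```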
## Downloading
CTranslate2 does not integrate with the Hugging Face Hub, so you will need to download the model files manually:

```sh
huggingface-cli download mike-ravkine/Meta-Llama-3-8B-Instruct-ct2-int8 --local-dir Meta-Llama-3-8B-Instruct-ct2-int8/
```
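If you would rather fetch the files from Python, a minimal sketch using `huggingface_hub.snapshot_download` (assuming the `huggingface_hub` package is installed) does the same thing:

```python
# Sketch: download the converted model via the huggingface_hub API
# instead of the CLI. Assumes `pip install huggingface_hub` has been run.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="mike-ravkine/Meta-Llama-3-8B-Instruct-ct2-int8",
    local_dir="Meta-Llama-3-8B-Instruct-ct2-int8",
)
print(f"Model files downloaded to {model_dir}")
```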
## Using
Install dependencies (the extras specifier is quoted so shells like zsh don't expand the brackets):

```sh
pip install "transformers[torch]" ctranslate2
```
Sample inference code:

```python
import sys

import ctranslate2
from transformers import AutoTokenizer

model_dir = sys.argv[1]  # path to the downloaded CTranslate2 model
tokenizer_dir = "meta-llama/Meta-Llama-3-8B-Instruct"  # tokenizer comes from the original repo

print("Loading the model...")
generator = ctranslate2.Generator(model_dir, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)

dialog = [{"role": "user", "content": "What is the meaning of life, the universe and everything?"}]
max_generation_length = 512

prompt_string = tokenizer.apply_chat_template(dialog, add_generation_prompt=True, tokenize=False)
# It may seem odd to pass tokenize=False and then tokenize separately, but
# tokenize=True returns only ids, while generate_tokens() expects token strings.
prompt_tokens = tokenizer.tokenize(prompt_string)

# generate_tokens() returns an iterable of step results, so the output
# can be streamed as it is produced.
step_results = generator.generate_tokens(
    prompt_tokens,
    max_length=max_generation_length,
    sampling_temperature=0.6,
    sampling_topk=20,
    sampling_topp=1,
)

for step_result in step_results:
    word = tokenizer.decode([step_result.token_id])
    print(word, end="", flush=True)
print()
```
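Assuming the script above is saved as `generate.py` (a hypothetical filename), run it with the download directory as its only argument:

```sh
python generate.py Meta-Llama-3-8B-Instruct-ct2-int8/
```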