Issue with using the codellama-7b model
I have set up the codellama-7b model locally and used the official example, but the final result does not meet expectations. Here is the code:
import time

import torch
from transformers import CodeLlamaTokenizer, LlamaForCausalLM

codeLlama_tokenizer = CodeLlamaTokenizer.from_pretrained("./CodeLlama-7b-hf", padding_side='left')
codeLlama_model = LlamaForCausalLM.from_pretrained("./CodeLlama-7b-hf")
codeLlama_model.to(device='cuda:0', dtype=torch.bfloat16)
# Infilling prompt: the model should fill in the text at <FILL_ME>
text = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
start_time = time.time()
input_ids = codeLlama_tokenizer(text, return_tensors="pt")["input_ids"]
input_ids = input_ids.to('cuda')
# eos/pad ids come from the same tokenizer that encoded the prompt
generated_ids = codeLlama_model.generate(input_ids, max_new_tokens=200, do_sample=True, top_p=0.9, temperature=0.1, num_return_sequences=1, repetition_penalty=1.05, eos_token_id=codeLlama_tokenizer.eos_token_id, pad_token_id=codeLlama_tokenizer.pad_token_id)
# Decode only the newly generated tokens, skipping the prompt
filling = codeLlama_tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(filling)
The output of the code is:
Remove non-ascii characters from a string. """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    }
    public void setId(String id) {
        this.id = id;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getDescription() {
        return description;
    }
    public void setDescription(String description) {
        this.description = description;
    }
    public String getType() {
        return type;
    }
There are two issues with the generated code that don't meet expectations:
1. It doesn't consider the suffix and seems to ignore everything after <FILL_ME>.
2. After completing the desired part of the code, it adds a lot of unnecessary additional code.
Is this behavior normal? Is there any way to improve it?
I have the same question.
A few things to note here.
- To check whether the <FILL_ME> is taken into account, you need to make sure the input ids are properly formatted.
- The outputs we have match 1-1 with the original outputs. But when you generate with sampling and a custom temperature etc., you should expect some hallucination. Especially if the eos token is not properly set, the model will not stop early enough :/
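To make both points a bit more concrete, here is a minimal sketch in plain Python (no model needed). It assumes the prompt should contain exactly one <FILL_ME> marker, which separates the prefix from the suffix, and that the generated ids should be cut at the first stop id. The helper names `split_fill_prompt` and `truncate_at_stop`, and the token ids in the usage lines, are hypothetical and made up for illustration; they are not part of transformers.

```python
FILL_TOKEN = "<FILL_ME>"  # marker used in the prompt from the original post


def split_fill_prompt(prompt: str, fill_token: str = FILL_TOKEN) -> tuple[str, str]:
    """Split an infilling prompt into (prefix, suffix) around the fill marker.

    Raises ValueError unless the marker appears exactly once, because
    otherwise the suffix cannot be identified unambiguously.
    """
    count = prompt.count(fill_token)
    if count != 1:
        raise ValueError(f"expected exactly one {fill_token!r}, found {count}")
    prefix, suffix = prompt.split(fill_token)
    return prefix, suffix


def truncate_at_stop(token_ids: list[int], stop_ids: set[int]) -> list[int]:
    """Return token_ids up to (not including) the first id found in stop_ids."""
    for i, tok in enumerate(token_ids):
        if tok in stop_ids:
            return token_ids[:i]
    return list(token_ids)


prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
prefix, suffix = split_fill_prompt(prompt)
print(repr(prefix))
print(repr(suffix))

# Hypothetical ids: pretend 2 is the eos id and 99 is an end-of-infill id.
generated = [11, 12, 13, 99, 14, 15]
print(truncate_at_stop(generated, {2, 99}))
```

With the real tokenizer, the equivalent check would probably be to decode the `input_ids` you pass to `generate` and confirm that the suffix text (`return result`) actually shows up in them. If it does not, the model never saw the suffix.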
Thanks for opening the issue!
Regarding the unnecessary additional code: in my case it was helpful to use a repetition penalty of 0.9. Maybe that helps in your case as well! :)
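For anyone unsure what this knob does, here is a rough sketch of a CTRL-style repetition penalty, which is how I understand `repetition_penalty` to behave in transformers (treat that as an assumption and check the docs for your version). The function name `apply_repetition_penalty` and the example logits are made up for illustration. For tokens that were already seen, positive scores are divided by the penalty and negative scores are multiplied by it, so values above 1.0 discourage repeats while values below 1.0 make already-seen tokens more likely.

```python
def apply_repetition_penalty(
    logits: list[float], seen_ids: set[int], penalty: float
) -> list[float]:
    """Rescale the scores of already-seen token ids (CTRL-style penalty)."""
    out = list(logits)
    for i in seen_ids:
        score = out[i]
        # Negative scores are multiplied, positive scores are divided,
        # so penalty > 1 always lowers the score of a seen token.
        out[i] = score * penalty if score < 0 else score / penalty
    return out


logits = [2.0, -1.0, 0.5]  # hypothetical scores for a 3-token vocabulary
seen = {0, 1}              # token ids that already appeared in the sequence

print(apply_repetition_penalty(logits, seen, 2.0))  # seen tokens pushed down
print(apply_repetition_penalty(logits, seen, 0.5))  # seen tokens pushed up
```

So whether a value below 1.0 helps will depend on the prompt; it changes which tokens get sampled, but it does not by itself make the model stop.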