Issue with using the codellama-7b model

#17

by RyanAX - opened Sep 7, 2023

Discussion

RyanAX

Sep 7, 2023

I have set up the codellama-7b model locally and used the official example, but the final result does not meet expectations. Here is the code:

import time

import torch
from transformers import CodeLlamaTokenizer, LlamaForCausalLM

codeLlama_tokenizer = CodeLlamaTokenizer.from_pretrained("./CodeLlama-7b-hf", padding_side='left')
codeLlama_model = LlamaForCausalLM.from_pretrained("./CodeLlama-7b-hf")
codeLlama_model.to(device='cuda:0', dtype=torch.bfloat16)

text = '''def remove_non_ascii(s: str) -> str:
        """ <FILL_ME>
        return result
    '''

start_time = time.time()
input_ids = codeLlama_tokenizer(text, return_tensors="pt")["input_ids"]
input_ids = input_ids.to('cuda')
generated_ids = codeLlama_model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
    num_return_sequences=1,
    repetition_penalty=1.05,
    # Stop and pad tokens come from the same tokenizer used for encoding.
    eos_token_id=codeLlama_tokenizer.eos_token_id,
    pad_token_id=codeLlama_tokenizer.pad_token_id,
)
filling = codeLlama_tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(filling)

The output of the code is:

Remove non-ascii characters from a string. """
        result = ""
        for c in s:
            if ord(c) < 128:
                result += c
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    public String getType() {
        return type;
}

There are two issues with the generated code that don't meet expectations:
1. It doesn't consider suffixes and seems to ignore everything after <FILL_ME>.
2. After completing the desired part of the code, it adds a lot of unnecessary additional code.

Is this behavior normal? Is there any way to improve it?

zxyscz

Sep 12, 2023

I have the same question.

ArthurZ

Code Llama org Sep 20, 2023

A few things to note here.

To check whether the <FILL_ME> is taken into account, you need to make sure the input IDs are properly formatted.
The outputs we have match one-to-one with the original outputs. But when you generate with sampling, a custom temperature, etc., you should expect some hallucination. In particular, if the EOS token is not set properly, the model will not stop early enough :/

Thanks for opening the issue!
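
The two points above can be illustrated without loading the model. The following is a minimal sketch, not the library's implementation: it splits a prompt around `<FILL_ME>`, builds an infilling prompt in the `<PRE> {prefix} <SUF>{suffix} <MID>` layout (an assumption, based on the format used in the original Code Llama release; the exact special-token strings in the Hugging Face tokenizer may differ), and cuts a generated ID sequence at the first stop token. The helper names and the token IDs are hypothetical. In a real run, decoding `input_ids` back to text is one way to see what the model actually received.

```python
from typing import Iterable, List, Tuple

FILL_TOKEN = "<FILL_ME>"


def split_infill_prompt(text: str, fill_token: str = FILL_TOKEN) -> Tuple[str, str]:
    """Split a prompt into (prefix, suffix) around exactly one fill marker."""
    if text.count(fill_token) != 1:
        raise ValueError("expected exactly one %r in the prompt" % fill_token)
    prefix, suffix = text.split(fill_token)
    return prefix, suffix


def format_infill_prompt(prefix: str, suffix: str) -> str:
    """Assumed prefix-suffix-middle layout; the model generates the middle part."""
    return "<PRE> " + prefix + " <SUF>" + suffix + " <MID>"


def truncate_at_stop(token_ids: List[int], stop_ids: Iterable[int]) -> List[int]:
    """Keep only the tokens generated before the first stop token."""
    stops = set(stop_ids)
    for i, tok in enumerate(token_ids):
        if tok in stops:
            return token_ids[:i]
    return token_ids


if __name__ == "__main__":
    text = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
    prefix, suffix = split_infill_prompt(text)
    # If the suffix is empty or missing here, the model cannot take it into account.
    print(repr(prefix))
    print(repr(suffix))
    print(repr(format_infill_prompt(prefix, suffix)))
    # Hypothetical IDs: 2 stands in for whatever stop token the tokenizer defines.
    print(truncate_at_stop([11, 12, 2, 13, 14], stop_ids=[2]))
```

If generation runs past the intended fill, as in the Java output above, one thing to check is whether the stop token passed as `eos_token_id` is the one the model emits at the end of an infill.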

maximotus

Dec 6, 2023

Regarding the unnecessary additional code: in my case, it was helpful to use a repetition penalty of 0.9. Maybe that helps in your case as well! :)
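
For context on what this parameter does, here is a minimal sketch of the repetition-penalty rule as I understand it from the Transformers logits processor: the score of every token already present in the sequence is divided by the penalty when positive and multiplied by it when negative. Under that assumption, values above 1.0 make repeats less likely and values below 1.0 make them more likely. The function name and the toy logits are hypothetical.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float], seen_token_ids: Iterable[int], penalty: float
) -> List[float]:
    """Rescale the logits of tokens that already appear in the sequence."""
    out = list(logits)
    for tok in set(seen_token_ids):
        score = out[tok]
        # Positive scores are divided, negative scores are multiplied.
        out[tok] = score * penalty if score < 0 else score / penalty
    return out


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]  # toy scores for a 3-token vocabulary
    seen = [0, 1]              # tokens 0 and 1 were already generated
    print(apply_repetition_penalty(logits, seen, 1.05))
    print(apply_repetition_penalty(logits, seen, 0.9))
```

Comparing the two printed lists against the original logits shows which direction each setting pushes the already-seen tokens.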
