---
license: llama3.1
language:
  - el
  - en
pipeline_tag: text-generation
library_name: transformers
tags:
  - text-generation-inference
base_model:
  - ilsp/Llama-Krikri-8B-Base
---

# Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language

Following the release of Meltemi-7B on March 26th, 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs. Krikri is built on top of Llama-3.1-8B, extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Instruct, along with the base model, Llama-Krikri-8B-Base.


## Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
  - This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  - Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  - The training corpus also contains 7.8 billion math and code tokens.
  - This corpus has been processed, filtered, and deduplicated to ensure data quality. It is outlined below:
| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3%      |
| English    | 21.0 B   | 23.1%      |
| Parallel   | 5.5 B    | 6.0%       |
| Math/Code  | 7.8 B    | 8.6%       |
| **Total**  | **91 B** | **100%**   |

Chosen subsets of the 91 billion token corpus were upsampled, resulting in a size of 110 billion tokens.
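
As a quick sanity check (illustrative arithmetic only, derived from the token counts stated above), the table percentages and the overall upsampling factor implied by the 91B → 110B growth can be reproduced directly:

```python
# Illustrative arithmetic: reproduces the mix percentages from the stated
# token counts and the overall upsampling factor (91B -> 110B tokens).
corpus = {"Greek": 56.7e9, "English": 21.0e9, "Parallel": 5.5e9, "Math/Code": 7.8e9}

total = sum(corpus.values())  # ~91e9 tokens
for name, tokens in corpus.items():
    print(f"{name:>9}: {tokens / total:6.1%}")  # matches the table above

print(f"Implied overall upsampling factor: {110e9 / total:.2f}x")  # ~1.21x
```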

🚨 More information on the post-training corpus and methodology coming soon. 🚨

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")

model.to(device)

# "You are Krikri, a highly developed Artificial Intelligence model for Greek,
#  and you were trained by ILSP of Athena Research Center."
system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικά και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
# "How does a kri-kri differ from a llama?"
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Render the chat template to a prompt string, then tokenize it.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(input_prompt['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
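
Recent versions of transformers also let the chat template tokenize in a single step; the following variant (equivalent to the two-step code above, assuming such a version) is a bit more compact:

```python
# One-step alternative: apply_chat_template can tokenize directly
# and return a tensor of input ids.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```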

## How to serve with an OpenAI-compatible server via vLLM

```bash
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```

The model can then be queried from Python with the OpenAI client:

```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# "You are an advanced translation system that replies with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
)
print(response.choices[0].message.content)
```
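
The same endpoint also supports streaming responses; below is a minimal sketch reusing the client and messages from the example above:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()
```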

## Evaluation

🚨 Instruction-following and chat-capability evaluation benchmarks coming soon. 🚨

## Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.