---
license: apache-2.0
language:
  - en
  - zh
library_name: transformers
datasets:
  - BAAI/Infinity-Instruct
  - BAAI/CCI3-HQ
  - mlfoundations/dclm-baseline-1.0
  - HuggingFaceFW/fineweb-edu
  - HuggingFaceTB/cosmopedia
pipeline_tag: text-generation
---

Introduction

The Aquila-135M model is a small language model trained with a pre-training and annealing paradigm. The model was pre-trained on 1.66T bilingual (Chinese and English) tokens; in the annealing stage, we then trained on a further 100B tokens of high-quality bilingual data to obtain the final model.

We have open-sourced all bilingual datasets used in both the pre-training and annealing phases, as well as all intermediate checkpoints.

The Aquila-135M-Instruct model is fine-tuned using Infinity Instruct.

Excluding the vocabulary (embedding) parameters, Aquila-135M and SmolLM2-135M share an identical structure; the parameter count quoted here excludes the embedding part.
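
As a rough check (not part of the original card), the snippet below counts non-embedding parameters with transformers; the tied-embedding handling and the HuggingFaceTB/SmolLM2-135M comparison checkpoint are our own assumptions.

from transformers import AutoModelForCausalLM

def non_embedding_params(checkpoint):
    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
    total = sum(p.numel() for p in model.parameters())
    embed = model.get_input_embeddings().weight
    head = model.get_output_embeddings()
    # Subtract the input embeddings, and the LM head too unless it is tied to them.
    vocab = embed.numel()
    if head is not None and head.weight is not embed:
        vocab += head.weight.numel()
    return total - vocab

for ckpt in ["BAAI/Aquila-135M", "HuggingFaceTB/SmolLM2-135M"]:
    print(ckpt, f"{non_embedding_params(ckpt):,}", "non-embedding parameters")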

The entire training process was conducted using our self-developed Triton operator library, FlagGems, and our parallel training framework, FlagScale.

News

  • 2024/12/24: We have released Aquila-135M and Aquila-135M-Instruct.
  • 2024/12/24: We have released all datasets and intermediate checkpoints from training. Please feel free to use these models and data for analysis and experimentation.

Evaluation

We followed the evaluation settings of the SmolLM models and evaluated the model with the lighteval tool.
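
Most of the 0-shot benchmarks below are multiple-choice tasks that lighteval scores by comparing the model's log-likelihood for each candidate continuation. The snippet below is only a minimal sketch of that idea in plain transformers; it is not the lighteval implementation, and the example question and choices are made up.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint).eval()

def continuation_logprob(context, continuation):
    # Score only the continuation tokens, conditioned on the context.
    # (Simplified: real harnesses handle tokenization boundaries more carefully.)
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    targets = full_ids[0, ctx_len:]                        # continuation token ids
    rows = log_probs[ctx_len - 1:]                         # rows that predict the continuation
    return rows.gather(1, targets.unsqueeze(1)).sum().item()

# Hypothetical example item (not taken from any benchmark).
context = "Gravity is the force that"
choices = [" pulls objects toward each other.", " makes ice cream melt faster."]
scores = [continuation_logprob(context, c) for c in choices]
print(choices[scores.index(max(scores))])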

While Aquila-135M performs comparably to the SmolLM models on English benchmarks, it demonstrates significantly better results on Chinese benchmarks.

Among small models with a total parameter count at or below roughly 400M, Aquila-135M remains competitive in overall capability while offering significantly stronger Chinese language proficiency.

| Metrics (0-shot) | Aquila-135M (Triton) | Aquila-135M (CUDA) | SmolLM-135M | SmolLM2-135M | gpt2-medium-360M | TinyMistral-248M | TinyMistral-248M-2.5 | OpenELM-270M | Wide-Sheared-LLaMA-290M | opt-350m | MobileLLM-350M | pythia-410m | SmolLM-360M | SmolLM2-360M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HellaSwag | 41.19 | 41.12 | 41.15 | 42.10 | 37.08 | 27.06 | 26.80 | 45.74 | 24.94 | 36.08 | 26.28 | 39.22 | 51.73 | 54.66 |
| ARC (Average) | 44.76 | 44.15 | 42.34 | 43.93 | 34.34 | 29.71 | 27.63 | 35.74 | 26.20 | 31.91 | 27.72 | 35.14 | 49.95 | 53.24 |
| PIQA | 66.38 | 67.52 | 68.28 | 68.44 | 66.38 | 57.40 | 53.92 | 69.75 | 50.60 | 64.36 | 50.27 | 67.19 | 71.55 | 71.98 |
| MMLU (cloze) | 31.07 | 30.67 | 30.26 | 31.58 | 27.75 | 25.82 | 25.59 | 27.89 | 24.75 | 26.58 | 24.86 | 28.88 | 34.32 | 36.09 |
| CommonsenseQA | 32.10 | 31.70 | 32.02 | 32.92 | 31.70 | 24.57 | 21.46 | 35.71 | 16.54 | 32.10 | 17.53 | 31.45 | 36.61 | 38.74 |
| TriviaQA | 6.65 | 7.02 | 4.24 | 4.03 | 2.36 | 0.50 | 0.08 | 1.34 | 0.00 | 1.38 | 0.00 | 2.06 | 9.19 | 16.92 |
| Winogrande | 51.07 | 51.70 | 51.22 | 50.99 | 49.49 | 49.25 | 49.01 | 52.41 | 49.72 | 51.54 | 49.41 | 49.96 | 53.12 | 52.49 |
| OpenBookQA | 34.40 | 34.40 | 33.80 | 34.60 | 31.40 | 29.40 | 27.40 | 30.60 | 26.00 | 27.80 | 24.80 | 28.40 | 37.20 | 37.00 |
| GSM8K (5-shot) | 2.12 | 2.12 | 1.00 | 1.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.81 |
| SIQA | 41.81 | 42.32 | 41.15 | 41.45 | 41.30 | 41.86 | 39.71 | 42.73 | 39.76 | 42.37 | 37.10 | 42.02 | 43.45 | 41.61 |
| CEval | 29.22 | 29.82 | 28.28 | 26.41 | 25.40 | 25.38 | 26.89 | 26.69 | 26.37 | 26.67 | 25.68 | 27.97 | 27.66 | 28.51 |
| CMMLU | 29.48 | 29.63 | 26.01 | 26.66 | 27.20 | 26.67 | 25.57 | 26.25 | 26.33 | 26.93 | 25.61 | 26.91 | 27.06 | 27.39 |
| Average-English | 35.16 | 35.27 | 34.55 | 35.16 | 32.18 | 28.56 | 27.16 | 34.19 | 25.85 | 31.41 | 25.80 | 32.43 | 38.71 | 40.55 |
| Average-Chinese | 29.35 | 29.73 | 27.15 | 26.54 | 26.30 | 26.03 | 26.23 | 26.47 | 26.35 | 26.80 | 25.65 | 27.44 | 27.36 | 27.95 |
| Average | 32.25 | 32.50 | 30.85 | 30.85 | 29.24 | 27.29 | 26.70 | 30.33 | 26.10 | 29.11 | 25.72 | 29.94 | 33.04 | 34.25 |

For the comparison models, evaluations were conducted in our local environment, so the scores may differ slightly from those reported in the original papers.
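
The summary rows appear to be unweighted means: Average-English over the ten English benchmarks (HellaSwag through SIQA), Average-Chinese over CEval and CMMLU, and Average as the mean of those two category averages. A minimal check against the Aquila-135M (Triton) column, reproducing the reported figures up to rounding:

# Reproducing the summary rows for the Aquila-135M (Triton) column (up to rounding).
english = [41.19, 44.76, 66.38, 31.07, 32.10, 6.65, 51.07, 34.40, 2.12, 41.81]  # HellaSwag ... SIQA
chinese = [29.22, 29.48]                                                        # CEval, CMMLU

avg_en = sum(english) / len(english)   # ~35.16 -> Average-English
avg_zh = sum(chinese) / len(chinese)   # 29.35  -> Average-Chinese
avg_all = (avg_en + avg_zh) / 2        # ~32.25 -> Average (mean of the two category averages)
print(avg_en, avg_zh, avg_all)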

How to use

Base Model

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M"

device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

input_text = "什么是引力?"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

input_text = "What is gravity?"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

Instruct Model

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "BAAI/Aquila-135M-Instruct"

device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "什么是引力?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))

messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0]))
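
Depending on the transformers version, apply_chat_template can also tokenize the conversation directly rather than returning a prompt string first; a minimal variant of the example above:

# Equivalent to the example above, letting apply_chat_template tokenize directly.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))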

Future Plan

  • We plan to further optimize the selection of datasets and their mixing proportions.

Citation

If you find this model useful, please cite the following work:

@misc{aquila-135m,
      title={Aquila-135M: A Bilingual Small Language Model in Chinese and English}, 
      author={BAAI},
      year={},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={}, 
}