metadata
language: ko
tags:
  - KakaoBrain
  - KoGPT
  - GPT
  - GPT3
license: cc-by-nc-4.0

KoGPT

KakaoBrain's Pre-Trained Language Models.

Model Descriptions

KoGPT6B-ryan1.5b

| Hyperparameter | Value |
|:---|---:|
| $n_{parameters}$ | 6,166,502,400 |
| $n_{layers}$ | 28 |
| $d_{model}$ | 4,096 |
| $d_{ff}$ | 16,384 |
| $n_{heads}$ | 16 |
| $d_{head}$ | 256 |
| $n_{ctx}$ | 2,048 |
| $n_{vocab}$ | 64,512 |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |
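
As a sanity check on the table above, the listed parameter count can be reproduced from the other hyperparameters. The sketch below assumes a GPT-J-style layout, which this document does not state: bias-free attention projections, a biased two-layer MLP, one LayerNorm per block, and an untied output head with a bias. The function name `count_parameters` is hypothetical.

```python
def count_parameters(n_layers: int, d_model: int, d_ff: int, n_vocab: int) -> int:
    """Count parameters for an ASSUMED GPT-J-style decoder layout."""
    attention = 4 * d_model * d_model          # q, k, v, out projections, no biases (assumed)
    mlp = 2 * d_model * d_ff + d_ff + d_model  # two linear layers, each with a bias
    layer_norm = 2 * d_model                   # scale and shift, one norm per block (assumed)
    per_layer = attention + mlp + layer_norm

    embedding = n_vocab * d_model              # input token embedding
    lm_head = n_vocab * d_model + n_vocab      # untied output projection with bias (assumed)
    final_norm = 2 * d_model                   # LayerNorm before the output head
    # RoPE adds no learned parameters.
    return n_layers * per_layer + embedding + lm_head + final_norm


total = count_parameters(n_layers=28, d_model=4096, d_ff=16384, n_vocab=64512)
print(f"{total:,}")  # 6,166,502,400
```

Under these assumptions the result equals the $n_{parameters}$ value in the table, and $n_{heads} \times d_{head} = 16 \times 256 = 4{,}096$ matches $d_{model}$.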

Hardware requirements

KoGPT6B-ryan1.5b

GPU

The recommended minimum GPU hardware for running this KoGPT model is as follows.

  • At least 32GB of GPU RAM is required

KoGPT6B-ryan1.5b-float16

GPU

The recommended minimum GPU hardware for running this KoGPT model is as follows.

  • Half precision requires an NVIDIA GPU based on the Volta, Turing, or Ampere architecture
  • At least 16GB of GPU RAM is required
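
The memory figures above are in line with a back-of-the-envelope estimate of the weight storage alone. The sketch below only multiplies the parameter count from the hyperparameter table by the bytes per parameter; activations, the attention cache, and framework overhead are not included, which is presumably why the recommended minimums are higher. The helper name `weight_memory_gb` is hypothetical.

```python
N_PARAMETERS = 6_166_502_400  # from the hyperparameter table


def weight_memory_gb(n_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights only, in decimal gigabytes."""
    return n_parameters * bytes_per_parameter / 1e9


fp32 = weight_memory_gb(N_PARAMETERS, 4)  # float32: 4 bytes per parameter
fp16 = weight_memory_gb(N_PARAMETERS, 2)  # float16: 2 bytes per parameter
print(f"float32: {fp32:.1f} GB, float16: {fp16:.1f} GB")
# float32: 24.7 GB, float16: 12.3 GB
```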

Usage

prompt

python -m kogpt --help
usage: KoGPT inference [-h] [--model MODEL] [--revision {KoGPT6B-ryan1.5b}]
                       [--device {cpu,cuda}] [-d]

KakaoBrain Korean(hangul) Generative Pre-Training Model

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         huggingface repo (default:kakaobrain/kogpt)
  --revision {KoGPT6B-ryan1.5b}
  --device {cpu,cuda}   (default:cuda)
  -d, --debug
python -m kogpt
prompt> 인간처럼 생각하고, 행동하는 '지능'을 통해 인류가 이제까지 풀지 못했던
temperature(0.8)> 
max_length(128)> 64
인간처럼 생각하고, 행동하는 '지능'을 통해 인류가 이제까지 풀지 못했던 문제의 해답을 찾을 수 있을 것이다. 과학기술이 고도로 발달한 21세기를 살아갈 우리 아이들에게 가장 필요한 것은 사고력 훈련이다. 사고력 훈련을 통해, 세상

prompt>  
...

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
  'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision='KoGPT6B-ryan1.5b'
  bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
)
model = AutoModelForCausalLM.from_pretrained(
  'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision='KoGPT6B-ryan1.5b'
  pad_token_id=tokenizer.eos_token_id,
  torch_dtype='auto', low_cpu_mem_usage=True
).to(device='cuda', non_blocking=True)
_ = model.eval()  # inference mode (disables dropout)

# Korean prompt, roughly: "Through 'intelligence' that thinks and acts like a human, [what] humanity has so far been unable to solve"
prompt = "인간처럼 생각하고, 행동하는 '지능'을 통해 인류가 이제까지 풀지 못했던"
with torch.no_grad():
  tokens = tokenizer.encode(prompt, return_tensors='pt').to(device='cuda', non_blocking=True)
  # max_length counts the prompt tokens as well as the newly generated ones
  gen_tokens = model.generate(tokens, do_sample=True, temperature=0.8, max_length=64)
  generated = tokenizer.batch_decode(gen_tokens)[0]

# Example output (sampling is enabled, so results vary between runs):
print(generated)  # 인간처럼 생각하고, 행동하는 '지능'을 통해 인류가 이제까지 풀지 못했던 문제의 해답을 찾을 수 있을 것이다. 과학기술이 고도로 발달한 21세기를 살아갈 우리 아이들에게 가장 필요한 것은 사고력 훈련이다. 사고력 훈련을 통해, 세상
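
The two revisions used above map onto the hardware guidance earlier in this README. As a minimal sketch, a small helper can pick a revision from the GPU memory and compute capability. The helper `pick_revision` and its selection policy (prefer float32 when it fits) are hypothetical and not part of the kogpt package; the thresholds come from the hardware section, and compute capability 7.0 corresponds to Volta, the oldest of the listed architectures.

```python
from typing import Optional, Tuple

FLOAT32_REVISION = "KoGPT6B-ryan1.5b"
FLOAT16_REVISION = "KoGPT6B-ryan1.5b-float16"


def pick_revision(gpu_mem_gb: float, compute_capability: Tuple[int, int]) -> Optional[str]:
    """Hypothetical helper: choose a revision that fits the given GPU.

    Returns None when neither documented minimum is met.
    """
    if gpu_mem_gb >= 32:
        return FLOAT32_REVISION  # full precision needs at least 32GB
    if gpu_mem_gb >= 16 and compute_capability >= (7, 0):
        return FLOAT16_REVISION  # half precision: Volta (7.0) or newer, at least 16GB
    return None


# With PyTorch available, the inputs could be read from the device, for example:
#   props = torch.cuda.get_device_properties(0)
#   revision = pick_revision(props.total_memory / 1e9, torch.cuda.get_device_capability(0))
print(pick_revision(16, (7, 5)))  # KoGPT6B-ryan1.5b-float16
```

The returned string can then be passed as the `revision` argument of both `from_pretrained` calls.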

Experiments

In-context Few-Shots

| Models | #params | NSMC (Acc.) | YNAT (F1) | KLUE-STS (F1) |
|:---|---:|---:|---:|---:|
| HyperCLOVA[1] | 1.3B | 83.9 | 58.7 | 60.9 |
| HyperCLOVA[1] | 6.9B | 83.8 | 67.5 | 59.3 |
| HyperCLOVA[1] | 13.0B | 87.9 | 67.9 | 60.0 |
| HyperCLOVA[1] | 39.0B | 88.0 | 71.4 | 61.6 |
| HyperCLOVA[1] | 82.0B | 88.2 | 72.7 | 65.1 |
| Ours | 6.0B | 87.8 | 78.0 | 64.3 |

Finetuning / P-Tuning

An issue (https://github.com/kakaobrain/kogpt/issues/17) has been reported with our downstream evaluation.

We have removed the previously published performance evaluation table. It could not be considered a fair comparison, because the algorithms being compared were different and the performance measurement method could not be confirmed.

Refer to the issue linked above for the previous performance evaluation table and the troubleshooting results.

Limitations

KakaoBrain KoGPT was trained on the ryan dataset, a dataset known to contain profanity, lewd content, politically charged content, and other harsh language, none of which was filtered out. KoGPT can therefore generate socially unacceptable text. As with all language models, it is difficult to predict in advance how KoGPT will respond to particular prompts, and it may produce offensive content without warning.

Primarily Korean: KoGPT is trained primarily on Korean text and is best suited to classifying, searching, summarizing, or generating such text. By default, KoGPT performs worse on inputs that differ from the data distribution it was trained on, including non-Korean text as well as dialects of Korean that are not well represented in the training data.

Citation

If you use this library or model in any project or research, please cite our code:

@misc{kakaobrain2021kogpt,
  title         = {KoGPT: KakaoBrain Korean(hangul) Generative Pre-trained Transformer},
  author        = {Ildoo Kim and Gunsoo Han and Jiyeon Ham and Woonhyuk Baek},
  year          = {2021},
  howpublished  = {\url{https://github.com/kakaobrain/kogpt}},
}

Contact

KoGPT is released as open source in the hope that it will be helpful to research institutes and startups for research purposes. We look forward to hearing from anyone who wishes to cooperate with us.

[email protected]

License

The source code of KakaoBrain KoGPT is licensed under the Apache 2.0 License.
The pretrained weights of KakaoBrain KoGPT are licensed under the CC-BY-NC-ND 4.0 License.

If you use the model, code, or pretrained weights, please comply with the license terms. The full license texts are available in the Apache 2.0 and LICENSE.cc-by-nc-nd-4.0 files.

References

[1] HyperCLOVA: Kim, Boseop, et al. "What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers." arXiv preprint arXiv:2109.04650 (2021).