Need help with KcBERT?
Click the โ€œchatโ€ button below for chat support from the developer who created it, or find similar developers for support.

About the developer

Beomi
297 Stars 26 Forks MIT License 28 Commits 1 Opened issues

Description

๐Ÿค— Pretrained BERT model & WordPiece tokenizer trained on Korean Comments ํ•œ๊ตญ์–ด ๋Œ“๊ธ€๋กœ ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹ํ•œ BERT ๋ชจ๋ธ

Services available

!
?

Need anything else?

Contributors list

# 67,131
vuejs2
HTML
Python
phantom...
26 commits

KcBERT: Korean comments BERT

** Updates on 2021.04.07 **

  • KcELECTRA๊ฐ€ ๋ฆด๋ฆฌ์ฆˆ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค!๐Ÿค—
  • KcELECTRA๋Š” ๋ณด๋‹ค ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ์…‹, ๊ทธ๋ฆฌ๊ณ  ๋” ํฐ General vocab์„ ํ†ตํ•ด KcBERT ๋Œ€๋น„ ๋ชจ๋“  ํƒœ์Šคํฌ์—์„œ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
  • ์•„๋ž˜ ๊นƒํ—™ ๋งํฌ์—์„œ ์ง์ ‘ ์‚ฌ์šฉํ•ด๋ณด์„ธ์š”!
  • https://github.com/Beomi/KcELECTRA

** Updates on 2021.03.14 **

  • KcBERT Paper ์ธ์šฉ ํ‘œ๊ธฐ๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค.(bibtex)
  • KcBERT-finetune Performance score๋ฅผ ๋ณธ๋ฌธ์— ์ถ”๊ฐ€ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

** Updates on 2020.12.04 **

Huggingface Transformers๊ฐ€ v4.0.0์œผ๋กœ ์—…๋ฐ์ดํŠธ๋จ์— ๋”ฐ๋ผ Tutorial์˜ ์ฝ”๋“œ๊ฐ€ ์ผ๋ถ€ ๋ณ€๊ฒฝ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์—…๋ฐ์ดํŠธ๋œ KcBERT-Large NSMC Finetuning Colab: Open In Colab

** Updates on 2020.09.11 **

KcBERT๋ฅผ Google Colab์—์„œ TPU๋ฅผ ํ†ตํ•ด ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ํŠœํ† ๋ฆฌ์–ผ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค! ์•„๋ž˜ ๋ฒ„ํŠผ์„ ๋ˆŒ๋Ÿฌ๋ณด์„ธ์š”.

Colab์—์„œ TPU๋กœ KcBERT Pretrain ํ•ด๋ณด๊ธฐ: Open In Colab

ํ…์ŠคํŠธ ๋ถ„๋Ÿ‰๋งŒ ์ „์ฒด 12G ํ…์ŠคํŠธ ์ค‘ ์ผ๋ถ€(144MB)๋กœ ์ค„์—ฌ ํ•™์Šต์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์…‹/์ฝ”ํผ์Šค๋ฅผ ์ข€๋” ์‰ฝ๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” Korpora ํŒจํ‚ค์ง€๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

** Updates on 2020.09.08 **

Github Release๋ฅผ ํ†ตํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์—…๋กœ๋“œํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๋‹ค๋งŒ ํ•œ ํŒŒ์ผ๋‹น 2GB ์ด๋‚ด์˜ ์ œ์•ฝ์œผ๋กœ ์ธํ•ด ๋ถ„ํ• ์••์ถ•๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ๋งํฌ๋ฅผ ํ†ตํ•ด ๋ฐ›์•„์ฃผ์„ธ์š”. (๊ฐ€์ž… ์—†์ด ๋ฐ›์„ ์ˆ˜ ์žˆ์–ด์š”. ๋ถ„ํ• ์••์ถ•)

๋งŒ์•ฝ ํ•œ ํŒŒ์ผ๋กœ ๋ฐ›๊ณ ์‹ถ์œผ์‹œ๊ฑฐ๋‚˜/Kaggle์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํŽด๋ณด๊ณ  ์‹ถ์œผ์‹œ๋‹ค๋ฉด ์•„๋ž˜์˜ ์บ๊ธ€ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•ด์ฃผ์„ธ์š”.

  • Github๋ฆด๋ฆฌ์ฆˆ: https://github.com/Beomi/KcBERT/releases/tag/TrainData_v1

** Updates on 2020.08.22 **

Pretrain Dataset ๊ณต๊ฐœ

  • ์บ๊ธ€: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments (ํ•œ ํŒŒ์ผ๋กœ ๋ฐ›์„ ์ˆ˜ ์žˆ์–ด์š”. ๋‹จ์ผํŒŒ์ผ)

Kaggle์— ํ•™์Šต์„ ์œ„ํ•ด ์ •์ œํ•œ(์•„๋ž˜

clean
์ฒ˜๋ฆฌ๋ฅผ ๊ฑฐ์นœ) Dataset์„ ๊ณต๊ฐœํ•˜์˜€์Šต๋‹ˆ๋‹ค!

์ง์ ‘ ๋‹ค์šด๋ฐ›์œผ์…”์„œ ๋‹ค์–‘ํ•œ Task์— ํ•™์Šต์„ ์ง„ํ–‰ํ•ด๋ณด์„ธ์š” :)


๊ณต๊ฐœ๋œ ํ•œ๊ตญ์–ด BERT๋Š” ๋Œ€๋ถ€๋ถ„ ํ•œ๊ตญ์–ด ์œ„ํ‚ค, ๋‰ด์Šค ๊ธฐ์‚ฌ, ์ฑ… ๋“ฑ ์ž˜ ์ •์ œ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ํ•œํŽธ, ์‹ค์ œ๋กœ NSMC์™€ ๊ฐ™์€ ๋Œ“๊ธ€ํ˜• ๋ฐ์ดํ„ฐ์…‹์€ ์ •์ œ๋˜์ง€ ์•Š์•˜๊ณ  ๊ตฌ์–ด์ฒด ํŠน์ง•์— ์‹ ์กฐ์–ด๊ฐ€ ๋งŽ์œผ๋ฉฐ, ์˜คํƒˆ์ž ๋“ฑ ๊ณต์‹์ ์ธ ๊ธ€์“ฐ๊ธฐ์—์„œ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š” ํ‘œํ˜„๋“ค์ด ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค.

KcBERT๋Š” ์œ„์™€ ๊ฐ™์€ ํŠน์„ฑ์˜ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด, ๋„ค์ด๋ฒ„ ๋‰ด์Šค์—์„œ ๋Œ“๊ธ€๊ณผ ๋Œ€๋Œ“๊ธ€์„ ์ˆ˜์ง‘ํ•ด, ํ† ํฌ๋‚˜์ด์ €์™€ BERT๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•œ Pretrained BERT ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

KcBERT๋Š” Huggingface์˜ Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ๊ฐ„ํŽธํžˆ ๋ถˆ๋Ÿฌ์™€ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (๋ณ„๋„์˜ ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.)

KcBERT Performance

  • Finetune ์ฝ”๋“œ๋Š” https://github.com/Beomi/KcBERT-finetune ์—์„œ ์ฐพ์•„๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

| | Size
(์šฉ๋Ÿ‰) | NSMC
(acc) | Naver NER
(F1) | PAWS
(acc) | KorNLI
(acc) | KorSTS
(spearman) | Question Pair
(acc) | KorQuaD (Dev)
(EM/F1) | | :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: | | KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 | | KcBERT-Large | 1.2G | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 | | KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 | | XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 | | HanBERT | 614M | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 | | KoELECTRA-Base | 423M | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 | | KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | 83.90 | 80.61 | 84.30 | 94.72 | 84.34 / 92.58 | | DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |

*HanBERT์˜ Size๋Š” Bert Model๊ณผ Tokenizer DB๋ฅผ ํ•ฉ์นœ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

*config์˜ ์„ธํŒ…์„ ๊ทธ๋Œ€๋กœ ํ•˜์—ฌ ๋Œ๋ฆฐ ๊ฒฐ๊ณผ์ด๋ฉฐ, hyperparameter tuning์„ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•  ์‹œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

How to use

Requirements

  • pytorch <= 1.8.0
  • transformers ~= 3.0.1
    • transformers ~= 4.0.0
      ๋„ ํ˜ธํ™˜๋ฉ๋‹ˆ๋‹ค.
  • emoji ~= 0.6.0
  • soynlp ~= 0.0.493
from transformers import AutoTokenizer, AutoModelWithLMHead

Base Model (108M)

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")

model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-base")

Large Model (334M)

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")

model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-large")

Pretrain & Finetune Colab ๋งํฌ ๋ชจ์Œ

Pretrain Data

Pretrain Code

Colab์—์„œ TPU๋กœ KcBERT Pretrain ํ•ด๋ณด๊ธฐ: Open In Colab

Finetune Samples

KcBERT-Base NSMC Finetuning with PyTorch-Lightning (Colab) Open In Colab

KcBERT-Large NSMC Finetuning with PyTorch-Lightning (Colab) Open In Colab

์œ„ ๋‘ ์ฝ”๋“œ๋Š” Pretrain ๋ชจ๋ธ(base, large)์™€ batch size๋งŒ ๋‹ค๋ฅผ ๋ฟ, ๋‚˜๋จธ์ง€ ์ฝ”๋“œ๋Š” ์™„์ „ํžˆ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

Train Data & Preprocessing

Raw Data

ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 2019.01.01 ~ 2020.06.15 ์‚ฌ์ด์— ์ž‘์„ฑ๋œ ๋Œ“๊ธ€ ๋งŽ์€ ๋‰ด์Šค ๊ธฐ์‚ฌ๋“ค์˜ ๋Œ“๊ธ€๊ณผ ๋Œ€๋Œ“๊ธ€์„ ๋ชจ๋‘ ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ํ…์ŠคํŠธ๋งŒ ์ถ”์ถœ์‹œ ์•ฝ 15.4GB์ด๋ฉฐ, 1์–ต1์ฒœ๋งŒ๊ฐœ ์ด์ƒ์˜ ๋ฌธ์žฅ์œผ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

Preprocessing

PLM ํ•™์Šต์„ ์œ„ํ•ด์„œ ์ „์ฒ˜๋ฆฌ๋ฅผ ์ง„ํ–‰ํ•œ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  1. ํ•œ๊ธ€ ๋ฐ ์˜์–ด, ํŠน์ˆ˜๋ฌธ์ž, ๊ทธ๋ฆฌ๊ณ  ์ด๋ชจ์ง€(๐Ÿฅณ)๊นŒ์ง€!

์ •๊ทœํ‘œํ˜„์‹์„ ํ†ตํ•ด ํ•œ๊ธ€, ์˜์–ด, ํŠน์ˆ˜๋ฌธ์ž๋ฅผ ํฌํ•จํ•ด Emoji๊นŒ์ง€ ํ•™์Šต ๋Œ€์ƒ์— ํฌํ•จํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•œํŽธ, ํ•œ๊ธ€ ๋ฒ”์œ„๋ฅผ

ใ„ฑ-ใ…Ž๊ฐ€-ํžฃ
์œผ๋กœ ์ง€์ •ํ•ด
ใ„ฑ-ํžฃ
๋‚ด์˜ ํ•œ์ž๋ฅผ ์ œ์™ธํ–ˆ์Šต๋‹ˆ๋‹ค.
  1. ๋Œ“๊ธ€ ๋‚ด ์ค‘๋ณต ๋ฌธ์ž์—ด ์ถ•์•ฝ

ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹
์™€ ๊ฐ™์ด ์ค‘๋ณต๋œ ๊ธ€์ž๋ฅผ
ใ…‹ใ…‹
์™€ ๊ฐ™์€ ๊ฒƒ์œผ๋กœ ํ•ฉ์ณค์Šต๋‹ˆ๋‹ค.
  1. Cased Model

KcBERT๋Š” ์˜๋ฌธ์— ๋Œ€ํ•ด์„œ๋Š” ๋Œ€์†Œ๋ฌธ์ž๋ฅผ ์œ ์ง€ํ•˜๋Š” Cased model์ž…๋‹ˆ๋‹ค.

  1. ๊ธ€์ž ๋‹จ์œ„ 10๊ธ€์ž ์ดํ•˜ ์ œ๊ฑฐ

10๊ธ€์ž ๋ฏธ๋งŒ์˜ ํ…์ŠคํŠธ๋Š” ๋‹จ์ผ ๋‹จ์–ด๋กœ ์ด๋ค„์ง„ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„ ํ•ด๋‹น ๋ถ€๋ถ„์„ ์ œ์™ธํ–ˆ์Šต๋‹ˆ๋‹ค.

  1. ์ค‘๋ณต ์ œ๊ฑฐ

์ค‘๋ณต์ ์œผ๋กœ ์“ฐ์ธ ๋Œ“๊ธ€์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ์ค‘๋ณต ๋Œ“๊ธ€์„ ํ•˜๋‚˜๋กœ ํ•ฉ์ณค์Šต๋‹ˆ๋‹ค.

์ด๋ฅผ ํ†ตํ•ด ๋งŒ๋“  ์ตœ์ข… ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 12.5GB, 8.9์ฒœ๋งŒ๊ฐœ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.

์•„๋ž˜ ๋ช…๋ น์–ด๋กœ pip๋กœ ์„ค์น˜ํ•œ ๋’ค, ์•„๋ž˜ cleanํ•จ์ˆ˜๋กœ ํด๋ฆฌ๋‹์„ ํ•˜๋ฉด Downstream task์—์„œ ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•„์ง‘๋‹ˆ๋‹ค. (

[UNK]
๊ฐ์†Œ)
pip install soynlp emoji

์•„๋ž˜

clean
ํ•จ์ˆ˜๋ฅผ Text data์— ์‚ฌ์šฉํ•ด์ฃผ์„ธ์š”.
import re
import emoji
from soynlp.normalizer import repeat_normalize

emojis = list({y for x in emoji.UNICODE_EMOJI.values() for y in x.keys()}) emojis = ''.join(emojis) pattern = re.compile(f'[^ .,?!/@$%๏ผ…ยทโˆผ()\x00-\x7Fใ„ฑ-ใ…ฃ๊ฐ€-ํžฃ{emojis}]+') url_pattern = re.compile( r'https?://(www.)?[[email protected]:%._+#=]{1,256}.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&//=]*)')

def clean(x): x = pattern.sub(' ', x) x = url_pattern.sub('', x) x = x.strip() x = repeat_normalize(x, num_repeats=2) return x

Cleaned Data (Released on Kaggle)

์›๋ณธ ๋ฐ์ดํ„ฐ๋ฅผ ์œ„

clean
ํ•จ์ˆ˜๋กœ ์ •์ œํ•œ 12GB๋ถ„๋Ÿ‰์˜ txt ํŒŒ์ผ์„ ์•„๋ž˜ Kaggle Dataset์—์„œ ๋‹ค์šด๋ฐ›์œผ์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค :)

https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments

Tokenizer Train

Tokenizer๋Š” Huggingface์˜ Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ†ตํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

๊ทธ ์ค‘

BertWordPieceTokenizer
๋ฅผ ์ด์šฉํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , Vocab Size๋Š”
30000
์œผ๋กœ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

Tokenizer๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์—๋Š”

1/10
๋กœ ์ƒ˜ํ”Œ๋งํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , ๋ณด๋‹ค ๊ณจ๊ณ ๋ฃจ ์ƒ˜ํ”Œ๋งํ•˜๊ธฐ ์œ„ํ•ด ์ผ์ž๋ณ„๋กœ stratify๋ฅผ ์ง€์ •ํ•œ ๋’ค ํ–‘์Šต์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

BERT Model Pretrain

  • KcBERT Base config
{
    "max_position_embeddings": 300,
    "hidden_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "initializer_range": 0.02,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30000,
    "hidden_size": 768,
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "architectures": [
        "BertForMaskedLM"
    ],
    "model_type": "bert"
}
  • KcBERT Large config
{
    "type_vocab_size": 2,
    "initializer_range": 0.02,
    "max_position_embeddings": 300,
    "vocab_size": 30000,
    "hidden_size": 1024,
    "hidden_dropout_prob": 0.1,
    "model_type": "bert",
    "directionality": "bidi",
    "pad_token_id": 0,
    "layer_norm_eps": 1e-12,
    "hidden_act": "gelu",
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "attention_probs_dropout_prob": 0.1,
    "intermediate_size": 4096,
    "architectures": [
        "BertForMaskedLM"
    ]
}

BERT Model Config๋Š” Base, Large ๊ธฐ๋ณธ ์„ธํŒ…๊ฐ’์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. (MLM 15% ๋“ฑ)

TPU

v3-8
์„ ์ด์šฉํ•ด ๊ฐ๊ฐ 3์ผ, N์ผ(Large๋Š” ํ•™์Šต ์ง„ํ–‰ ์ค‘)์„ ์ง„ํ–‰ํ–ˆ๊ณ , ํ˜„์žฌ Huggingface์— ๊ณต๊ฐœ๋œ ๋ชจ๋ธ์€ 1m(100๋งŒ) step์„ ํ•™์Šตํ•œ ckpt๊ฐ€ ์—…๋กœ๋“œ ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ ํ•™์Šต Loss๋Š” Step์— ๋”ฐ๋ผ ์ดˆ๊ธฐ 200k์— ๊ฐ€์žฅ ๋น ๋ฅด๊ฒŒ Loss๊ฐ€ ์ค„์–ด๋“ค๋‹ค 400k์ดํ›„๋กœ๋Š” ์กฐ๊ธˆ์”ฉ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Base Model Loss

KcBERT-Base Pretraining Loss

  • Large Model Loss

KcBERT-Large Pretraining Loss

ํ•™์Šต์€ GCP์˜ TPU v3-8์„ ์ด์šฉํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๊ณ , ํ•™์Šต ์‹œ๊ฐ„์€ Base Model ๊ธฐ์ค€ 2.5์ผ์ •๋„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. Large Model์€ ์•ฝ 5์ผ์ •๋„ ์ง„ํ–‰ํ•œ ๋’ค ๊ฐ€์žฅ ๋‚ฎ์€ loss๋ฅผ ๊ฐ€์ง„ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

Example

HuggingFace MASK LM

HuggingFace kcbert-base ๋ชจ๋ธ ์—์„œ ์•„๋ž˜์™€ ๊ฐ™์ด ํ…Œ์ŠคํŠธ ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์˜ค๋Š˜์€ ๋‚ ์”จ๊ฐ€ "์ข‹๋„ค์š”", KcBERT-Base

๋ฌผ๋ก  kcbert-large ๋ชจ๋ธ ์—์„œ๋„ ํ…Œ์ŠคํŠธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

image-20200806160624340

NSMC Binary Classification

๋„ค์ด๋ฒ„ ์˜ํ™”ํ‰ ์ฝ”ํผ์Šค ๋ฐ์ดํ„ฐ์…‹์„ ๋Œ€์ƒ์œผ๋กœ Fine Tuning์„ ์ง„ํ–‰ํ•ด ์„ฑ๋Šฅ์„ ๊ฐ„๋‹จํžˆ ํ…Œ์ŠคํŠธํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.

Base Model์„ Fine Tuneํ•˜๋Š” ์ฝ”๋“œ๋Š” Open In Colab ์—์„œ ์ง์ ‘ ์‹คํ–‰ํ•ด๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Large Model์„ Fine Tuneํ•˜๋Š” ์ฝ”๋“œ๋Š” Open In Colab ์—์„œ ์ง์ ‘ ์‹คํ–‰ํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • GPU๋Š” P100 x1๋Œ€ ๊ธฐ์ค€ 1epoch์— 2-3์‹œ๊ฐ„, TPU๋Š” 1epoch์— 1์‹œ๊ฐ„ ๋‚ด๋กœ ์†Œ์š”๋ฉ๋‹ˆ๋‹ค.
  • GPU RTX Titan x4๋Œ€ ๊ธฐ์ค€ 30๋ถ„/epoch ์†Œ์š”๋ฉ๋‹ˆ๋‹ค.
  • ์˜ˆ์‹œ ์ฝ”๋“œ๋Š” pytorch-lightning์œผ๋กœ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค.

์‹คํ—˜๊ฒฐ๊ณผ

  • KcBERT-Base Model ์‹คํ—˜๊ฒฐ๊ณผ: Val acc
    .8905

KcBERT Base finetune on NSMC

  • KcBERT-Large Model ์‹คํ—˜ ๊ฒฐ๊ณผ: Val acc
    .9089

image-20200806190242834

๋” ๋‹ค์–‘ํ•œ Downstream Task์— ๋Œ€ํ•ด ํ…Œ์ŠคํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๊ณ  ๊ณต๊ฐœํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

์ธ์šฉํ‘œ๊ธฐ/Citation

KcBERT๋ฅผ ์ธ์šฉํ•˜์‹ค ๋•Œ๋Š” ์•„๋ž˜ ์–‘์‹์„ ํ†ตํ•ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
  • ๋…ผ๋ฌธ์ง‘ ๋‹ค์šด๋กœ๋“œ ๋งํฌ: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (*ํ˜น์€ http://hclt.kr/symp/?lnb=conference )

Acknowledgement

KcBERT Model์„ ํ•™์Šตํ•˜๋Š” GCP/TPU ํ™˜๊ฒฝ์€ TFRC ํ”„๋กœ๊ทธ๋žจ์˜ ์ง€์›์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

๋ชจ๋ธ ํ•™์Šต ๊ณผ์ •์—์„œ ๋งŽ์€ ์กฐ์–ธ์„ ์ฃผ์‹  Monologg ๋‹˜ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค :)

Reference

Github Repos

Papers

Blogs

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.