Vocabulary builder for BERT

A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, making its output fit Google Research's open-sourced BERT project.


Although Google released pre-trained BERT models and training scripts, they did not open-source the code that generates a wordpiece (subword) vocabulary matching the vocab.txt files shipped with the released models. And, as they mention, the libraries they suggested are not compatible with BERT's tokenization.py. So I modified text_encoder_build_subword.py of the tensor2tensor library, one of the tools Google suggested, to generate a wordpiece vocabulary.

Modifications

  • The original SubwordTextEncoder appends "_" to subwords that appear at the first position of a word. I changed it to prepend "_" to subwords that follow other subwords instead, using the my_escape_token() function, and later substituted "_" with "##" (see the sketch after this list).

  • The generated vocabulary contains every character, both bare and prefixed with "##". For example, both a and ##a appear.
  • Added standard special characters (punctuation such as ~) and the special tokens BERT uses, e.g. [SEP], [CLS], [MASK], [UNK], to the vocabulary.
  • Removed irrelevant classes from text_encoder.py, commented out unused functions (some of which appear to exist for decoding), and removed the mlperf_log module, making this project independent of the tensor2tensor library.
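
Below is a minimal sketch of the two vocabulary conventions described above: tensor2tensor marks word-initial pieces with a trailing "_", while BERT marks continuation pieces with a leading "##", and the finished vocabulary keeps every character both bare and "##"-prefixed. This is illustrative only, not the repository's code; BERT_SPECIAL_TOKENS and t2t_to_bert_vocab are hypothetical names.

# Illustrative only: convert t2t-style markers to BERT-style "##" markers.
BERT_SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def t2t_to_bert_vocab(t2t_subwords):
    """Convert t2t-style subwords, e.g. ["the_", "re"], to BERT style."""
    vocab = list(BERT_SPECIAL_TOKENS)
    chars = set()
    for piece in t2t_subwords:
        if piece.endswith("_"):
            vocab.append(piece[:-1])      # word-initial piece: drop the marker
        else:
            vocab.append("##" + piece)    # continuation piece: prefix "##"
        chars.update(piece.rstrip("_"))
    # Keep every seen character, bare and "##"-prefixed, as a fallback.
    for ch in sorted(chars):
        for form in (ch, "##" + ch):
            if form not in vocab:
                vocab.append(form)
    return vocab

print(t2t_to_bert_vocab(["the_", "a_", "re"]))
# ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'the', 'a', '##re',
#  '##a', 'e', '##e', 'h', '##h', 'r', '##r', 't', '##t']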

Requirements

The environment I built this project in:

  • python 3.6
  • tensorflow 1.11

Basic usage

python subword_builder.py \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
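
Since the goal is compatibility with BERT's tokenization.py, a quick sanity check is to load the generated file with BERT's own FullTokenizer. A sketch, assuming tokenization.py from the google-research/bert repository is importable and the vocabulary was written to my_vocab.txt (a hypothetical file name):

import tokenization  # tokenization.py from https://github.com/google-research/bert

tokenizer = tokenization.FullTokenizer(vocab_file="my_vocab.txt",
                                       do_lower_case=True)

# Continuation pieces should come back with the "##" prefix.
print(tokenizer.tokenize("unaffable"))  # e.g. ['una', '##ffa', '##ble'], depending on the corpus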
