AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Please cite the following two papers if you are using our tools. Thanks!
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", accepted by IEEE Transactions on Knowledge and Data Engineering, Feb. 2018.
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed)
`Tokenizer.java`: previously, when the corpus contained characters like `/`, the results could be wrong or errors could occur.
For phrases found in the quality phrase files (e.g., `wiki_quality.txt`), the score is now set to 1.0. Previously, it was effectively unbounded.
Define `LARGE` at the beginning of `src/utils/parameters.h` before you run AutoPhrase on such a large corpus.
(compared to SegPhrase)
Linux or macOS with g++ and Java installed.
For Ubuntu:

```
$ sudo apt-get install g++-4.8
$ sudo apt-get install openjdk-8-jdk
$ sudo apt-get install curl
```

For macOS:

```
$ brew install gcc6
$ brew update; brew tap caskroom/cask; brew install Caskroom/cask/java
```
The default run will download an English corpus from the server of our data mining group and run AutoPhrase to get 3 ranked lists of phrases as well as 2 segmentation model files under the `models/DBLP` directory:

* `AutoPhrase.txt`: the unified ranked list for both single-word phrases and multi-word phrases.
* `AutoPhrase_multi-words.txt`: the sub-ranked list for multi-word phrases only.
* `AutoPhrase_single-word.txt`: the sub-ranked list for single-word phrases only.
* `segmentation.model`: AutoPhrase's segmentation model (saved for later use).
* `token_mapping.txt`: the token mapping file for the tokenizer (saved for later use).
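Assuming the repository's main script is `auto_phrase.sh` (the script referenced later in this README), the default run is a single command:

```shell
# Run AutoPhrase with all defaults: on first use this downloads the DBLP
# corpus, then writes the ranked lists and model files to models/DBLP/.
./auto_phrase.sh
```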
You can change `RAW_TRAIN` to point to your own corpus, and you may also want to change `MODEL` to a different name.
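For example (the corpus path and model name below are placeholders), one way to override both is on the command line, assuming the script honors environment overrides; otherwise, edit the corresponding assignments at the top of the script:

```shell
# Placeholder paths: substitute your own corpus and model name.
# If auto_phrase.sh does not honor environment overrides, edit the
# RAW_TRAIN and MODEL assignments at the top of the script instead.
RAW_TRAIN=data/EN/my_corpus.txt MODEL=models/MyCorpus ./auto_phrase.sh
```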
We also provide an auxiliary function to highlight the phrases in context based on our phrasal segmentation model. There are two thresholds you can tune at the top of the script. The model can also handle unknown tokens (i.e., tokens that did not occur in the phrase mining step's corpus).
In the beginning, you need to specify AutoPhrase's segmentation model, i.e., `MODEL`. The default value is set to be consistent with the default run above. The segmentation results will be put under the `MODEL` directory as well (i.e., `models/DBLP/segmentation.txt`). The highlighted phrases will be enclosed by phrase tags (e.g., `<phrase>data mining</phrase>`).
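As a sketch, the highlighting step might be invoked like this. The script name `phrasal_segmentation.sh` and the input file name are assumptions, not taken from this README; check the repository for the actual entry point. The `MODEL` and `TEXT_TO_SEG` variables follow the conventions used elsewhere in this document.

```shell
# Hypothetical invocation of the phrasal segmentation / highlighting step.
# Output would appear under the MODEL directory as segmentation.txt.
MODEL=models/DBLP TEXT_TO_SEG=data/EN/text_to_highlight.txt ./phrasal_segmentation.sh
```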
If domain-specific knowledge bases are available, such as MeSH terms, there are two ways to incorporate them:

* (recommended) Append your known quality phrases to the file `data/EN/wiki_quality.txt`.
* Replace the file `data/EN/wiki_quality.txt` with your known quality phrases.
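The recommended option is a one-liner; `my_mesh_terms.txt` below is a hypothetical file containing one quality phrase per line:

```shell
# Append domain-specific quality phrases (one per line) to the existing list.
cat my_mesh_terms.txt >> data/EN/wiki_quality.txt
```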
In fact, our tokenizer supports many different languages, including Arabic (AR), German (DE), English (EN), Spanish (ES), French (FR), Italian (IT), Japanese (JA), Portuguese (PT), Russian (RU), and Chinese (CN). If the language detection is wrong, you can also manually specify the language by modifying the `TOKENIZER` command in the bash script `auto_phrase.sh`, using the two-letter code for that language. For example, the following forces the language to be English.

```
TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l EN"
```
We also provide a default tokenizer together with a dummy POS tagger in `tools/tokenizer`. It uses the StandardTokenizer in Lucene, and always assigns the tag `UNKNOWN` to each token. To enable this feature, please add `-l OTHER` to the `TOKENIZER` command in the bash script `auto_phrase.sh`:

```
TOKENIZER="-cp .:tools/tokenizer/lib/*:tools/tokenizer/resources/:tools/tokenizer/build/ Tokenizer -l OTHER"
```
If you want to incorporate your own tokenizer and/or POS tagger, please create a new class extending SpecialTagger in the `tools/tokenizer` directory. You may refer to StandardTagger as an example.
You may try to search online or create your own list.
Meanwhile, you have to add two lists of quality phrases, `data/OTHER/wiki_quality.txt` and `data/OTHER/wiki_all.txt`. The phrases in `wiki_quality.txt` should be highly confident, while `wiki_all.txt`, as its superset, could be a little noisy. For more details, please refer to `tools/wiki_entities`.
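Concretely, the setup for an unsupported language might look like this (the source file names are placeholders; only the destination paths come from this README):

```shell
# wiki_quality.txt: highly confident quality phrases, one per line.
# wiki_all.txt: a (possibly noisy) superset of wiki_quality.txt.
mkdir -p data/OTHER
cp my_confident_phrases.txt data/OTHER/wiki_quality.txt
cp my_superset_phrases.txt  data/OTHER/wiki_all.txt
```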
```
sudo docker run -v $PWD/models:/autophrase/models -it \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    remenberl/autophrase
```
The results will be available in the `models` folder. Note that all of the environment variables above have their default values--leaving the assignments out here would produce exactly the same results. (However, in this case, using default values, the results of `phrasal_segmentation.txt` would be saved to the internal `default_models` directory--this is unavoidable, since the phrasal segmentation app reads from and writes to the same model directory.)
Assuming the path to the input file is `./data/input.txt`:

```
sudo docker run -v $PWD/data:/autophrase/data -v $PWD/models:/autophrase/models -it \
    -e RAW_TRAIN=data/input.txt \
    -e ENABLE_POS_TAGGING=1 \
    -e MIN_SUP=30 -e THREAD=10 \
    -e MODEL=models/MyModel \
    -e TEXT_TO_SEG=data/input.txt \
    remenberl/autophrase
```
`RAW_TRAIN` is the training corpus, and `TEXT_TO_SEG` is a corpus whose phrases are to be highlighted--typically, this is the same corpus, but training and phrasal segmentation use two different scripts. When the user wants to segment a new corpus with an existing model, only the latter script need be used (and setting `RAW_TRAIN` isn't necessary).
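For example, a segmentation-only run against an existing model could look like the following. This assumes the container skips training when `RAW_TRAIN` is unset; verify that behavior against the image's entry script. The model and corpus paths are placeholders.

```shell
# Segment a new corpus with an already-trained model; no RAW_TRAIN needed.
sudo docker run -v $PWD/data:/autophrase/data -v $PWD/models:/autophrase/models -it \
    -e MODEL=models/MyModel \
    -e TEXT_TO_SEG=data/new_corpus.txt \
    remenberl/autophrase
```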
Note that, in a Docker deployment, the default `data` and `models` directories are renamed (e.g., `models` becomes `default_models`) to avoid conflicts with mounted external directories with the same names. It should be noted as well that there's little point in saving a model to the default models directory, since all new files are erased when the container is exited (and if an external directory is mounted as `models`, and no value is specified for `MODEL`, the results will be saved in the `models/DBLP` subdirectory). The same wrinkle also means that there's little point in running a container with the `FIRST_RUN` variable set to 0.
Because the original `data` directory will have been renamed, it's perfectly fine for the user to mount an external directory called `data` and read the corpus from there--and in most cases, there's no need for a user to change the supplied files stored in the default data directory. If such a change is necessary, though, the environment variable that specifies the directory in question is `DATA_DIR`.
The `sudo` command won't work in a Windows bash shell, and in any case isn't needed in an elevated window--replace it with plain `docker`. In addition, the `PWD` variable works a little oddly in MinGW (the Git bash shell), appending ";C" to the end of the path. To prevent this, replace `$PWD` with the full path, written out explicitly.