Japanese
- Tokenize a single sentence:
% echo "MeCabで形態素解析を行うとこうなる." | /Users/admin/Documents/mecab/bin/mecab -Owakati
- Tokenize an entire file:
% /Users/admin/Documents/mecab/bin/mecab INPUT -o OUTPUT -O wakati
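If you prefer calling MeCab from Python instead of shelling out to the binary, here is a minimal sketch, assuming the mecab-python3 package (`pip install mecab-python3`) and a system dictionary such as ipadic are installed:

```python
# A minimal sketch, assuming mecab-python3 and a default dictionary
# (e.g. ipadic) are installed; not part of this repo's scripts.
import MeCab

# -Owakati makes MeCab emit surface forms separated by spaces,
# matching the command-line examples above.
tagger = MeCab.Tagger("-Owakati")

sentence = "MeCabで形態素解析を行うとこうなる."
print(tagger.parse(sentence).strip())
# With ipadic this prints something like:
# MeCab で 形態素 解析 を 行う と こう なる .
```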
Useful links:
- MeCab parameter configuration
- MeCab installation
- A great summary (in Japanese)
- MeCab configuration files
Chinese
Run Tokenization.py to perform segmentation with Jieba.
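For reference, a minimal sketch of the kind of Jieba call Tokenization.py makes (the actual script may differ; this assumes jieba is installed via `pip install jieba`):

```python
# A minimal sketch of Jieba-based segmentation; not the verbatim
# contents of Tokenization.py.
import jieba

sentence = "我来到北京清华大学"

# Accurate mode (default): the DAG plus dynamic programming pick the
# most probable segmentation, with an HMM fallback for unknown words.
print(" / ".join(jieba.cut(sentence)))
# e.g. 我 / 来到 / 北京 / 清华大学

# Full mode: enumerate every dictionary word found in the sentence.
print(" / ".join(jieba.cut(sentence, cut_all=True)))
# e.g. 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学
```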
Common segmentation methods:
| Method | Algorithm | Links | Reference |
|---|---|---|---|
| Jieba | Prefix-dictionary based: efficient word-graph scanning builds a directed acyclic graph (DAG) over all possible word combinations; dynamic programming finds the most probable combination from word frequencies; unknown words are handled by an HMM decoded with the Viterbi algorithm. | GitHub | Sun, J. "Jieba: Chinese word segmentation tool." (2012). |
| THULAC (THU Lexical Analyzer for Chinese) | Structured perceptron. | GitHub, paper (2009) | Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016. |
| Stanford Segmenter | Conditional random field (CRF). | GitHub, tutorials, paper (2005), paper (2008) | |
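THULAC also ships a Python binding; a minimal sketch, assuming the thulac package from PyPI (`pip install thulac`):

```python
# A minimal sketch using the thulac PyPI package; API per its README.
import thulac

# seg_only=True turns off POS tagging, leaving pure word segmentation.
thu = thulac.thulac(seg_only=True)

# text=True returns a space-separated string rather than token lists.
print(thu.cut("我爱北京天安门", text=True))
# e.g. 我 爱 北京 天安门
```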
Get the code from here.