https://web.stanford.edu/class/cs276/pa/pa2.pdf
Corpora:
lm corpus: 99,904 documents
query corpus: 819,722 (edit distance at most 1)
Levenshtein automaton
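Since the query corpus is limited to edit distance at most 1, candidate corrections can be enumerated directly. The following is a minimal sketch, not the assignment's reference solution. The name `edits1`, the lowercase-only `ALPHABET`, and the inclusion of transpositions (Damerau-style) are assumptions made for illustration.

```python
import string

# Assumed alphabet: real queries may also need digits, spaces, apostrophes, etc.
ALPHABET = string.ascii_lowercase


def edits1(word, alphabet=ALPHABET):
    """Return every string one edit away from `word`, excluding `word` itself.

    An edit is a deletion, an adjacent transposition, a substitution, or an
    insertion of one character from `alphabet`.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutes = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}


def known_candidates(word, vocabulary):
    """Keep only the one-edit candidates that occur in `vocabulary` (a set)."""
    return edits1(word) & vocabulary
```

For example, `known_candidates("teh", {"the", "ten", "tea"})` keeps all three words: "the" is one transposition away, and "ten" and "tea" are one substitution away. Applying `edits1` to each of its own results would reach edit distance 2, at the cost of a much larger candidate set.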
A fairly clear set of slides:
http://web.stanford.edu/class/cs276/handouts/spell_correction.pdf
Current approach:
image.png
Ways to improve:
image.png
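Extra credit items:
- Handle edit distances greater than 1
- Try corpora other than the one from the Stanford site
- When training the language model, consider other smoothing methods, e.g. Kneser-Ney smoothing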
- K-gram index
- Levenshtein Automata: uses a finite-state automaton for fuzzy matching of words
gist: https://gist.github.com/Arachnid/491973
blog: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
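To make the Levenshtein automaton idea concrete, here is a rough sketch that simulates the automaton on the fly instead of precompiling a DFA as the linked gist and blog post do. Each state is one row of the edit-distance table. The class and function names are hypothetical, and the sketch covers plain Levenshtein distance only (no transpositions).

```python
class LevenshteinAutomaton:
    """On-the-fly automaton accepting strings within `max_edits` of `word`."""

    def __init__(self, word, max_edits):
        self.word = word
        self.max_edits = max_edits
        self._cap = max_edits + 1  # values above max_edits are all equivalent

    def start(self):
        # Distance from the empty input to each prefix of `word`.
        return tuple(min(i, self._cap) for i in range(len(self.word) + 1))

    def step(self, state, ch):
        # Standard dynamic-programming row update after consuming `ch`.
        new_state = [state[0] + 1]
        for i, w in enumerate(self.word):
            cost = 0 if w == ch else 1
            new_state.append(
                min(new_state[i] + 1, state[i] + cost, state[i + 1] + 1)
            )
        # Capping keeps the number of distinct states finite.
        return tuple(min(v, self._cap) for v in new_state)

    def is_match(self, state):
        # The consumed input as a whole is within max_edits of `word`.
        return state[-1] <= self.max_edits

    def can_match(self, state):
        # Some extension of the consumed input could still be accepted.
        return min(state) <= self.max_edits


def fuzzy_matches(query, vocabulary, max_edits=1):
    """Return the words in `vocabulary` within `max_edits` edits of `query`."""
    automaton = LevenshteinAutomaton(query, max_edits)
    matches = []
    for candidate in vocabulary:
        state = automaton.start()
        for ch in candidate:
            state = automaton.step(state, ch)
            if not automaton.can_match(state):
                break  # dead state: no suffix can bring the distance back down
        else:
            if automaton.is_match(state):
                matches.append(candidate)
    return matches
```

The early exit through `can_match` pays off most when the vocabulary is stored in a trie, because a dead state prunes an entire subtree. The linear scan here is only for brevity.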
https://github.com/aitounejjar/pa2-Spell-Corrector
https://github.com/pangolulu/spelling-corrector
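As an illustration of the K-gram index item above, the sketch below builds a character bigram index over a vocabulary and ranks candidates by Jaccard overlap with the query's k-grams. The function names, the `$` boundary marker, and the `min_jaccard` threshold of 0.3 are illustrative assumptions, not values taken from the assignment or the linked repositories.

```python
from collections import defaultdict


def kgrams(word, k=2):
    """Return the set of character k-grams of `word`, padded with '$'."""
    padded = "$" + word + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary words that contain it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index


def candidates(query, index, k=2, min_jaccard=0.3):
    """Return words sharing k-grams with `query`, best Jaccard score first."""
    query_grams = kgrams(query, k)
    overlap = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            overlap[word] += 1  # number of k-grams shared with the query
    scored = []
    for word, shared in overlap.items():
        union = len(query_grams) + len(kgrams(word, k)) - shared
        score = shared / union
        if score >= min_jaccard:
            scored.append((score, word))
    # Sort by descending score, then alphabetically for stable output.
    return [word for score, word in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

A typical use is to call `build_kgram_index` once over the language-model vocabulary, then run `candidates` per query term and pass the survivors to an exact edit-distance check. The index only narrows the search and does not compute edit distance itself.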
Moore's paper: pronunciation + spelling model
- Toutanova K, Moore R C. Pronunciation modeling for improved spelling correction[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002: 144-151.
Moore's earlier paper: spelling model
- Brill E, Moore R C. An improved error model for noisy channel spelling correction[C]//Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2000: 286-293.
Papers citing Moore's work
- Martins B, Silva M J. Spelling correction for search engine queries[M]//Advances in Natural Language Processing. Springer, Berlin, Heidelberg, 2004: 372-383.
- Sun X, Gao J, Micol D, et al. Learning phrase-based spelling error models from clickthrough data[C]//Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010: 266-274.
- Multi-level feature extraction for spelling correction
- Wilcox-O’Hearn A, Hirst G, Budanitsky A. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model[C]//International conference on intelligent text processing and computational linguistics. Springer, Berlin, Heidelberg, 2008: 605-616.
- Gao J, Li X, Micol D, et al. A large scale ranker-based system for search query spelling correction[C]//Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010: 358-366.