Introduction to FuzzyWuzzy
FuzzyWuzzy is a simple, easy-to-use toolkit for fuzzy string matching. It uses the Levenshtein distance to measure the difference between two sequences.
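The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to turn one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.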
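The definition above can be sketched with the classic dynamic-programming approach. This is a minimal illustration, not FuzzyWuzzy's own code; the function name levenshtein is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The three edits here are k→s, e→i, and an inserted g. Identical strings give a distance of 0.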
Project page: https://github.com/seatgeek/fuzzywuzzy
Requirements
- Python 2.7 or higher
- difflib
- python-Levenshtein (optional; gives a 4-10x speedup in string matching, though it may produce different results in certain cases)
For testing
- pycodestyle
- hypothesis
- pytest
Installation
Using pip via PyPI
pip install fuzzywuzzy
Or, to install python-Levenshtein as well:
pip install fuzzywuzzy[speedup]
Using pip via GitHub
pip install git+https://github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
Or add the following line to your requirements.txt file (then run pip install -r requirements.txt):
git+ssh://git@github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
Manually via Git
git clone https://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install
Usage
>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process
Simple Ratio (basic matching)
>>> fuzz.ratio("this is a test", "this is a test!")
97
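To show where a score like 97 comes from, here is a rough sketch built only on the standard library's difflib, which FuzzyWuzzy falls back to when python-Levenshtein is not installed. The helper name simple_ratio_sketch is made up for this example, and the real fuzz.ratio may differ in details.

```python
from difflib import SequenceMatcher


def simple_ratio_sketch(s1: str, s2: str) -> int:
    """Similarity from 0 to 100, based on difflib's 2*M/T ratio."""
    # M is the number of matching characters, T the total length of both strings.
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# 14 matching characters out of 14 + 15 = 29 in total: 2 * 14 / 29 ≈ 0.966
print(simple_ratio_sketch("this is a test", "this is a test!"))  # 97
```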
Partial Ratio (substring matching)
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
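The idea behind a partial ratio can be approximated by sliding the shorter string across the longer one and keeping the best window score. This is a simplified stand-in, assuming a plain sliding window; the library picks candidate alignments differently, and partial_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any same-length window of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    width = len(shorter)
    best = 0.0
    # Compare the shorter string with every window of equal length.
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The first window of the longer string is exactly "this is a test".
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```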
Token Sort Ratio (order-insensitive matching)
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
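A token sort comparison can be sketched as: split each string into words, sort them, rejoin, then compare. The sketch below only lowercases and splits on whitespace; the library applies additional preprocessing that is omitted here, and token_sort_ratio_sketch is a name invented for illustration.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after sorting their whitespace-separated tokens."""
    # Sorting the tokens makes the comparison independent of word order.
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return int(round(100 * SequenceMatcher(None, sorted1, sorted2).ratio()))


# Both inputs become "a bear fuzzy was wuzzy" once sorted.
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```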
Token Set Ratio (deduplicated subset matching)
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
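A token set comparison goes one step further: duplicates are dropped, and the words shared by both strings are compared against each string's leftovers. The following is a rough sketch of that idea under simplified preprocessing (lowercasing and whitespace splitting only); the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    """0-100 similarity based on difflib."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the shared tokens of two strings against each side's remainder."""
    tokens1 = set(s1.lower().split())  # a set removes repeated words
    tokens2 = set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# The repeated "fuzzy" disappears, so both sides reduce to the same token set.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```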
Process
The process module returns the best-matching strings together with their similarity scores.
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)
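You can also pass additional parameters to extractOne to make it use a specific scorer. A typical use case is matching file paths (here songs is a list of file paths, not shown):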
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs)
('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
>>> process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)
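The ranking step with a pluggable scorer can be sketched with the standard library alone. This is an illustration of the idea, not the library's implementation: rank_choices and simple_scorer are hypothetical names, and the default scorer here is a plain difflib ratio rather than the one process.extract uses, so scores will generally differ from the outputs above.

```python
import heapq
from difflib import SequenceMatcher


def simple_scorer(s1: str, s2: str) -> int:
    """Case-insensitive 0-100 similarity based on difflib (an assumption for this sketch)."""
    return int(round(100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio()))


def rank_choices(query, choices, scorer=simple_scorer, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
best = rank_choices("new york jets", choices, limit=2)
print(best[0])  # ('New York Jets', 100)
```

Passing a different function as scorer changes how candidates are ranked, which mirrors the scorer argument shown for extractOne.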
Known ports
FuzzyWuzzy has been ported to other languages. The ports we know of are:
- Java: xpresso's fuzzywuzzy implementation
- Java: fuzzywuzzy (java port)
- Rust: fuzzyrusty (Rust port)
- JavaScript: fuzzball.js (JavaScript port)
- C++: Tmplt/fuzzywuzzy
- C#: fuzzysharp (.Net port)
- Go: go-fuzzywuzzy (Go port)