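String prefix overview
- u prefix
Unicode text (the unicode type in Python 2; identical to an unprefixed string in Python 3)
- b prefix
Byte string (bytes); the literal itself may contain only ASCII characters, with other byte values written as escapes
- No prefix
The interpreter's default string type: a byte string (str) in Python 2, Unicode text (str) in Python 3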
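The prefix rules above can be checked directly in an interpreter. The following is a minimal sketch for Python 3 only (under Python 2 the unprefixed literal would be a byte string instead); the variable names are made up for illustration.

```python
# Python 3: inspect what each literal prefix produces.
text_plain = "abc"   # no prefix: str (Unicode text)
text_u = u"abc"      # u prefix: also str, accepted for Python 2 compatibility
raw = b"abc"         # b prefix: bytes

print(type(text_plain).__name__, type(text_u).__name__, type(raw).__name__)
print(text_plain == text_u)               # same type, same content
print(raw == text_plain)                  # bytes and str never compare equal in Python 3
print(raw.decode("ascii") == text_plain)  # equal once decoded to text
```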
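Observed problem
- Intersecting two lists of strings is slow
The two lists hold the same strings, a little over 3k items each
Computing the intersection by scanning the lists takes nearly 5 s
- String format
The strings in one list carry the u prefix; those in the other do not
- Code sample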
a12 = [s for s in a1 if s in a2]
Initial fix
a12 = [s for s in a1 if s.encode("utf-8") in a2]
('diff|loop speed ', 4998.8330078125ms, '| encode loop speed ', 123.25ms)
Using the set intersection function
a13 = list(set(a1).intersection(set(a2)))
('diff|loop speed ', 4998.8330078125ms, '| encode loop speed ', 123.25ms, '|intersection speed : ', 3.626953125ms)
Preliminary conclusion
- To intersect two lists, prefer the set intersection function
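One caveat with set intersection is that the result loses duplicates and list order. If order matters, a sketch like the following keeps the list comprehension from the code sample but tests membership against a set. The list contents here are invented for illustration, and `lookup` is a hypothetical name.

```python
a1 = ["pear", "apple", "kiwi", "apple"]
a2 = ["apple", "kiwi", "plum"]

# Unordered, de-duplicated result.
a13 = list(set(a1).intersection(a2))

# Ordered result: build the set once, so each `in` test is a hash lookup
# instead of a scan over the whole second list.
lookup = set(a2)
a12 = [s for s in a1 if s in lookup]

print(sorted(a13))
print(a12)
```

The first form matches the `a13` line above; the second keeps the order and duplicates of `a1` while avoiding the repeated list scans.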
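Comparison of approaches
- What is compared
Inputs: lists with the same encoding, lists with different encodings. Methods: loop intersection, loop intersection after encoding, the set intersection function
- Test code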
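# Scan list b once for every element of a.
def loopComp(a, b):
    c = [s for s in a if s in b]
    print('loopComp ret size : ', len(c))

# Convert both lists to sets and intersect them.
def intersectionComp(a, b):
    c = list(set(a).intersection(set(b)))
    print('intersectionComp ret size : ', len(c))

# Encode every element of a to UTF-8 first, then intersect as sets.
def encodeIntersectionComp(a, b):
    a1 = [s.encode("utf-8") for s in a]
    c = list(set(a1).intersection(set(b)))
    print('encodeIntersectionComp ret size : ', len(c))

# Encode each element of a to UTF-8 before scanning list b.
def encodeloopComp(a, b):
    c = [s for s in a if s.encode("utf-8") in b]
    print('encodeloopComp ret size : ', len(c))
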
print('==========same encode list==========')
%time loopComp(a1,a2)
%time encodeloopComp(a1,a2)
%time intersectionComp(a1,a2)
%time encodeIntersectionComp(a1,a2)
print('==========diff encode list==========')
%time loopComp(a1,a3)
%time encodeloopComp(a1,a3)
%time intersectionComp(a1,a3)
%time encodeIntersectionComp(a1,a3)
==========same encode list==========
('loopComp ret size : ', 3559)
CPU times: user 172 ms, sys: 3.1 ms, total: 175 ms
Wall time: 167 ms
('encodeloopComp ret size : ', 3559)
CPU times: user 4.79 s, sys: 4.86 ms, total: 4.8 s
Wall time: 4.82 s
('intersectionComp ret size : ', 3559)
CPU times: user 920 µs, sys: 0 ns, total: 920 µs
Wall time: 851 µs
('encodeIntersectionComp ret size : ', 3559)
CPU times: user 4.97 ms, sys: 0 ns, total: 4.97 ms
Wall time: 4.88 ms
==========diff encode list==========
('loopComp ret size : ', 3559)
CPU times: user 4.81 s, sys: 7.46 ms, total: 4.82 s
Wall time: 4.83 s
('encodeloopComp ret size : ', 3559)
CPU times: user 125 ms, sys: 0 ns, total: 125 ms
Wall time: 126 ms
('intersectionComp ret size : ', 3559)
CPU times: user 3.53 ms, sys: 0 ns, total: 3.53 ms
Wall time: 3.54 ms
('encodeIntersectionComp ret size : ', 3559)
CPU times: user 2.34 ms, sys: 0 ns, total: 2.34 ms
Wall time: 2.32 ms
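The `%time` measurements above depend on IPython. A rough equivalent in plain Python 3 can be sketched with the standard `timeit` module. The list contents below are synthetic stand-ins for the 3k+ item lists, the snake_case function names are hypothetical counterparts of `loopComp` and `intersectionComp`, and the absolute numbers will differ from the figures recorded here.

```python
import timeit

def loop_comp(a, b):
    # O(len(a) * len(b)): every `in` test scans list b.
    return [s for s in a if s in b]

def intersection_comp(a, b):
    # Roughly O(len(a) + len(b)): hashing replaces the scan.
    return list(set(a).intersection(b))

# Synthetic stand-in data, same strings in both lists but in reverse order.
a = ["item-%d" % i for i in range(3000)]
b = list(reversed(a))

runs = 3
loop_s = timeit.timeit(lambda: loop_comp(a, b), number=runs) / runs
inter_s = timeit.timeit(lambda: intersection_comp(a, b), number=runs) / runs
print("loop: %.3f ms, intersection: %.3f ms" % (loop_s * 1e3, inter_s * 1e3))
```

This sketch cannot reproduce the slow mixed-type case, which is specific to Python 2. A plausible explanation for that case is that Python 2 implicitly decodes the byte string on every unicode/str comparison, and the loop performs millions of such comparisons.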
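Conclusion
- For lists with the same encoding, the set intersection function gives the best performance
- When the encoding of the lists is uncertain, always use the set intersection function: it stays in the millisecond range in both cases, whereas each loop variant takes nearly 5 s in one of them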
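When the element types really are unknown, one option is to normalize everything to text before intersecting. This matters in Python 3, where bytes and str never compare equal, so a mixed intersection would silently come back empty. The sketch below assumes UTF-8 data; `to_text` and `intersect_normalized` are hypothetical helpers, not part of the measurements above.

```python
def to_text(s, encoding="utf-8"):
    """Return s as str, decoding it first if it is bytes."""
    return s.decode(encoding) if isinstance(s, bytes) else s

def intersect_normalized(a, b, encoding="utf-8"):
    """Intersect two lists whose items may be a mix of str and bytes."""
    left = set(to_text(s, encoding) for s in a)
    right = set(to_text(s, encoding) for s in b)
    return left & right

a1 = ["alpha", "beta", "gamma"]
a2 = [b"beta", b"gamma", b"delta"]
print(sorted(intersect_normalized(a1, a2)))
```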
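Extension
- Python 3 environment
Lists of strings with and without the u prefix perform identically when intersected
The set intersection function still gives the best performance
The encode variants return 0 matches because encode() yields bytes, which never compare equal to str in Python 3
- Results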
==========same encode list==========
loopComp ret size : 3559
CPU times: user 129 ms, sys: 1.32 ms, total: 130 ms
Wall time: 129 ms
encodeloopComp ret size : 0
CPU times: user 253 ms, sys: 122 µs, total: 253 ms
Wall time: 253 ms
intersectionComp ret size : 3559
CPU times: user 605 µs, sys: 0 ns, total: 605 µs
Wall time: 706 µs
encodeIntersectionComp ret size : 0
CPU times: user 1.31 ms, sys: 0 ns, total: 1.31 ms
Wall time: 1.32 ms
==========diff encode list==========
loopComp ret size : 3559
CPU times: user 123 ms, sys: 0 ns, total: 123 ms
Wall time: 122 ms
encodeloopComp ret size : 0
CPU times: user 248 ms, sys: 0 ns, total: 248 ms
Wall time: 249 ms
intersectionComp ret size : 3559
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 689 µs
encodeIntersectionComp ret size : 0
CPU times: user 1.47 ms, sys: 0 ns, total: 1.47 ms
Wall time: 1.3 ms
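The zero-size results of the encode variants in the Python 3 run can be reproduced with a small sketch. The two lists below are invented placeholders, and the variable names are hypothetical.

```python
a = ["alpha", "beta"]
b = ["alpha", "beta"]

# encode() returns bytes, and bytes never equal str in Python 3,
# so no element of b ever matches.
encoded_hits = [s for s in a if s.encode("utf-8") in b]

# Comparing str with str matches as expected.
plain_hits = [s for s in a if s in b]

print(len(encoded_hits), len(plain_hits))
```

In other words, the encoding step that sped up the Python 2 loop becomes a correctness bug under Python 3 and should be dropped there.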