top 10的算法:我们只需要维护一个10个大小的数组,初始化放入10Query,按照每个Query的统计次数由大到小排序,然后遍历这300万条记录,每读一条记录list后进行从大到小排序。如果list长度为11,则pop()默认删除最后一个元素。
不难分析出,这样的算法的时间复杂度是N*K, 其中K是指top多少。
#!/usr/bin/python
"""
Your mapper function should print out 10 lines containing longest posts, sorted in
ascending order from shortest to longest.
Please do not use global variables and do not change the "main" function.
"""
import sys
import csv
def mapper():
reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
a= []
for line in reader:
a.append(line)# YOUR CODE HERE
a.sort(key=lambda x: len(x[4]),reverse=True)
if len(a) == 11:
a.pop()
for line in reversed(a[0:10]):
writer.writerow(line)
test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"333\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"88888888\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"11111111111\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1000000000\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"22\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"4444\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"666666\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"55555\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"999999999\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"7777777\"\t\"\"
"""
# This function allows you to test the mapper with the provided test string
def main():
import StringIO
sys.stdin = StringIO.StringIO(test_text)
mapper()
sys.stdin = sys.__stdin__
main()
这里top10代码的核心是:
def mapper():
reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
a = []
for line in reader:
a.append(line)# YOUR CODE HERE
a.sort(key=lambda x: len(x[4]),reverse=True)
if len(a) == 11:
a.pop()
for line in reversed(a[0:10]):
writer.writerow(line)
python中的lambda表达式可以减少代码量。参数为列表x,返回x[4]的长度作为排序的key,reverse=True表示降序排列。 a.sort(key=lambda x: len(x[4]),reverse=True)