Python Counter() 的实现

collections.Counter 源码实现<a id="sec-1" name="sec-1"></a>

Counter 的相关源码在lib下的collections.py里,本文所提及的源码是python2.7版本, 可参见github

__init__<a id="sec-1-1" name="sec-1-1"></a>

class Counter(dict):
    '''Dict subclass for counting hashable items.  Sometimes called a bag
    or multiset.  Elements are stored as dictionary keys and their counts
    are stored as dictionary values.
    '''
    def __init__(*args, **kwds):
        '''Create a new, empty Counter object.  And if given, count elements
        from an input iterable.  Or, initialize the count from another mapping
        of elements to their counts.

        >>> c = Counter()                           # a new, empty counter
        >>> c = Counter('gallahad')                 # a new counter from an iterable
        >>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
        >>> c = Counter(a=4, b=2)                   # a new counter from keyword args

        '''
        if not args:
            raise TypeError("descriptor '__init__' of 'Counter' object "
                            "needs an argument")
        self = args[0]
        args = args[1:]
        if len(args) > 1:
            raise TypeError('expected at most 1 arguments, got %d' % len(args))
        super(Counter, self).__init__()
        self.update(*args, **kwds)

Counter 继承字典类来实现,初始化中对参数进行有效性校验,其中 args 接受除了 self 外最多一个未知参数。校验完成后调用自身的 update 方法来具体创建数据结构。

update<a id="sec-1-2" name="sec-1-2"></a>

def update(*args, **kwds):
    '''Like dict.update() but add counts instead of replacing them.
    '''
    if not args:
        raise TypeError("descriptor 'update' of 'Counter' object "
                        "needs an argument")
    self = args[0]
    args = args[1:]
    if len(args) > 1:
        raise TypeError('expected at most 1 arguments, got %d' % len(args))
    iterable = args[0] if args else None
    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable) # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)

update 方法先检查参数,位置参数除了self外只允许有一个。然后对传入的参数进行判断,如果是以 Counter(a=1,b=2) 的方式调用的,这时候取出 kwds({'a':1,'b'=2}) 再调用自身,将关键字参数转化为位置参数处理。
如果传入的位置参数是一个mapping类型的,对应于 Counter({'a':1,'b':2}) 这样的方式调用,这种情况会判断self是否为空,在初始化状态下self总是空的,这边加上判断是因为update 方法不仅近在 __init__() 里调用,还可以这样调用:

x1 = collections.Counter({'a': 1, 'b': 2})
x2 = collections.Counter(a=1, b=2)
x1.update(x2)      # Counter()类型 isinstance(iterable, Mapping) 也返回 True

# 或者这样调用
x1 = collections.Counter({'a': 1, 'b': 2})
x1.update('aab')

如果传入的不是一个mapping类型,那么会迭代该参数的每一项作为key添加到Counter中

most_common<a id="sec-1-3" name="sec-1-3"></a>

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))

如果调用 most_common 不指定参数n则默认返回全部(key, value)组成的列表,按照value降序排列。

itemgetter<a id="sec-1-3-1" name="sec-1-3-1"></a>

这里用到了有趣的 itemgetter(代码里用了别名_itemgetter) , 它是来自 operator 模块中的方法,可以从下面的代码感受一下:

# 例子来源python文档
# 举例:
After f = itemgetter(1), the call f(r) returns r[1].
After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3]).
# 实现:
def itemgetter(*items):
    if len(items) == 1:
        item = items[0]
        def g(obj):
            return obj[item]
    else:
        def g(obj):
            return tuple(obj[item] for item in items)
    return g

# 常见用法:
>>> itemgetter(1)('ABCDEFG')
'B'
>>> itemgetter(1,3,5)('ABCDEFG')
('B', 'D', 'F')
>>> itemgetter(slice(2,None))('ABCDEFG')
'CDEFG'

>>> inventory = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
>>> getcount = itemgetter(1)
>>> map(getcount, inventory)
[3, 2, 5, 1]
>>> sorted(inventory, key=getcount)
[('orange', 1), ('banana', 2), ('apple', 3), ('pear', 5)]

heapquue<a id="sec-1-3-2" name="sec-1-3-2"></a>

heap queue是“queue algorithm”算法的python实现,调用 _heapq.nlargest() 返回了根据每个value排序前n个大的(key, value)元组组成的列表。具体heap queue使用参见文档

elements<a id="sec-1-4" name="sec-1-4"></a>

elements 方法实现了按照value的数值重复返回key。它的实现很精妙,只有一行:

def elements(self):
    '''Iterator over elements repeating each as many times as its count.

    >>> c = Counter('ABCABC')
    >>> sorted(c.elements())
    ['A', 'A', 'B', 'B', 'C', 'C']
    '''
    return _chain.from_iterable(_starmap(_repeat, self.iteritems()))

该实现里用到了 itertools 里的 repeat starmap chain 三个方法, 直接按照每项计数的次数重复返回每项内容,拼成一个列表。

repeat<a id="sec-1-4-1" name="sec-1-4-1"></a>

repeat生成一个迭代器,根据第二个参数不停滴返回接受的第一个参数。直接看实现,很好理解, 类似实现如下:

def repeat(object, times=None):
    # repeat(10, 3) --> 10 10 10
    if times is None:
        while True:
            yield object
    else:
        for i in xrange(times):
            yield object

starmap<a id="sec-1-4-2" name="sec-1-4-2"></a>

starmap接受的第一个参数是一个函数,生成一个迭代器,不停滴将该函数以第二个参数传来的每一项为参数进行调用(说得抽象,看例子好理解),类似实现如下:

def starmap(function, iterable):
    # starmap(pow, [(2,5), (3,2), (10,3)]) --> 32 9 1000
    for args in iterable:
        yield function(*args)

chain.from_iterable<a id="sec-1-4-3" name="sec-1-4-3"></a>

chain.from_iterable 接受一个可迭代对象,返回一个迭代器,不停滴返回可迭代对象的每一项,类似实现如下:

def from_iterable(iterables):
    # chain.from_iterable(['ABC', 'DEF']) --> A B C D E F
    for it in iterables:
        for element in it:
            yield element

substract<a id="sec-1-5" name="sec-1-5"></a>

substract的实现和update实现很像,不同之处在counter()相同的项的计数相加改成了相减。

def subtract(*args, **kwds):
    '''Like dict.update() but subtracts counts instead of replacing them.
    Counts can be reduced below zero.  Both the inputs and outputs are
    allowed to contain zero and negative counts.

    Source can be an iterable, a dictionary, or another Counter instance.

    >>> c = Counter('which')
    >>> c.subtract('witch')             # subtract elements from another iterable
    >>> c.subtract(Counter('watch'))    # subtract elements from another counter
    >>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
    0
    >>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
    -1

    '''
    if not args:
        raise TypeError("descriptor 'subtract' of 'Counter' object "
                        "needs an argument")
    self = args[0]
    args = args[1:]
    if len(args) > 1:
        raise TypeError('expected at most 1 arguments, got %d' % len(args))
    iterable = args[0] if args else None
    if iterable is not None:
        self_get = self.get
        if isinstance(iterable, Mapping):
            for elem, count in iterable.items():
                self[elem] = self_get(elem, 0) - count
        else:
            for elem in iterable:
                self[elem] = self_get(elem, 0) - 1
    if kwds:
        self.subtract(kwds)

**

+, -, &, |<a id="sec-1-6" name="sec-1-6"></a>

通过对 __add__, __sub__, __or__, __and__ 的定义,重写了 +, -, &, | ,实现了Counter间类似于集合的操作, 代码不难理解,值得注意的是,将非正的结果略去了:

def __add__(self, other):
        '''Add counts from two counters.

        >>> Counter('abbb') + Counter('bcc')
        Counter({'b': 4, 'c': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            newcount = count + other[elem]
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count > 0:
                result[elem] = count
        return result

    def __sub__(self, other):
        ''' Subtract count, but keep only results with positive counts.

        >>> Counter('abbbc') - Counter('bccd')
        Counter({'b': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            newcount = count - other[elem]
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count < 0:
                result[elem] = 0 - count
        return result

    def __or__(self, other):
        '''Union is the maximum of value in either of the input counters.

        >>> Counter('abbb') | Counter('bcc')
        Counter({'b': 3, 'c': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            other_count = other[elem]
            newcount = other_count if count < other_count else count
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count > 0:
                result[elem] = count
        return result

    def __and__(self, other):
        ''' Intersection is the minimum of corresponding counts.

        >>> Counter('abbb') & Counter('bcc')
        Counter({'b': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            other_count = other[elem]
            newcount = count if count < other_count else other_count
            if newcount > 0:
                result[elem] = newcount
        return result

其它<a id="sec-1-7" name="sec-1-7"></a>

# 当用Pickler序列化时,遇到不知道怎么序列化时,查找__reduce__方法
def __reduce__(self):
    return self.__class__, (dict(self),)


# 重写删除方法,当Counter有这个key再删除,避免KeyError
def __delitem__(self, elem):
    'Like dict.__delitem__() but does not raise KeyError for missing values.'
    if elem in self:
        super(Counter, self).__delitem__(elem)

# %s : String (converts any Python object using str()).
# %r : String (converts any Python object using repr()).
def __repr__(self):
    if not self:
        return '%s()' % self.__class__.__name__
    items = ', '.join(map('%r: %r'.__mod__, self.most_common()))
    return '%s({%s})' % (self.__class__.__name__, items)

@classmethod
def fromkeys(cls, iterable, v=None):
    # There is no equivalent method for counters because setting v=1
    # means that no element can have a count greater than one.
    raise NotImplementedError(
        'Counter.fromkeys() is undefined.  Use Counter(iterable) instead.')


 #  实现__missing__方法,当Couter['no_field'] => 0, 字典默认的__missing__ 方法不实现会报错(KeyError)
def __missing__(self, key):
    'The count of elements not in the Counter is zero.'
    # Needed so that self[missing_item] does not raise KeyError
    return 0

总结<a id="sec-2" name="sec-2"></a>

总体来说,Counter通过对内置字典类型的继承重写来的实现,比较简洁,逻辑也很清楚,从源码中可以学到很多标准库里提供的很多的不常见的方法的使用,可以使代码更加简洁,思路更加流畅。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,884评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,755评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,369评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,799评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,910评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,096评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,159评论 3 411
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,917评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,360评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,673评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,814评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,509评论 4 334
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,156评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,882评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,123评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,641评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,728评论 2 351

推荐阅读更多精彩内容