Python Tricks - Common Data Structures in Python（3）

Records, Structs, and Data Transfer Objects

Compared to arrays, record data structures provide a fixed number of fields, where each field can have a name and may also have a different type.

In this chapter, you’ll see how to implement records, structs, and “plain old data objects” in Python, using only built-in data types and classes from the standard library.

By the way, I’m using the definition of a record loosely here. For example, I’m also going to discuss types like Python’s built-in tuple that may or may not be considered records in a strict sense because they don’t provide named fields.

Python offers several data types you can use to implement records, structs, and data transfer objects. In this chapter, you’ll get a quick look at each implementation and its unique characteristics. At the end, you’ll find a summary and a decision-making guide that will help you make your own picks.

Alright, let’s get started!

dict – Simple Data Objects

Python dictionaries store an arbitrary number of objects, each identified by a unique key. Dictionaries are also often called maps or associative arrays and allow for the efficient lookup, insertion, and deletion of any object associated with a given key.

Using dictionaries as a record data type or data object in Python is possible. Dictionaries are easy to create in Python, as they have their own syntactic sugar built into the language in the form of dictionary literals. The dictionary syntax is concise and quite convenient to type.

Data objects created using dictionaries are mutable, and there’s little protection against misspelled field names, as fields can be added and removed freely at any time. Both of these properties can introduce surprising bugs, and there’s always a trade-off to be made between convenience and error resilience.

car1 = {
  'color': 'red',
  'mileage': 3812.4,
  'automatic': True,
}

car2 = {
  'color': 'blue',
  'mileage': 40231,
  'automatic': False,
}

# Dicts have a nice repr:
>>> car2
{'color': 'blue', 'mileage': 40231, 'automatic': False}
# Get mileage:
>>> car2['mileage']
40231

# Dicts are mutable:
>>> car2['mileage'] = 12
>>> car2['windshield'] = 'broken'
>>> car2
{'color': 'blue', 'mileage': 12,
'automatic': False, 'windshield': 'broken'}

# No protection against wrong field names,
# or missing/extra fields:
car3 = {
  'colr': 'green',
  'automatic': False,
  'windshield': 'broken',
}
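
If you do stick with dictionaries, one way to catch misspelled or missing field names early is to check each record against the set of expected keys. Here is a minimal sketch of that idea. The names REQUIRED_FIELDS and check_fields are made up for this illustration and are not part of any library:

```python
# Hypothetical schema: the field names every "car" record must have.
REQUIRED_FIELDS = {'color', 'mileage', 'automatic'}


def check_fields(record):
    """Raise ValueError if a dict-based record has missing or unexpected keys."""
    keys = set(record)
    missing = REQUIRED_FIELDS - keys
    extra = keys - REQUIRED_FIELDS
    if missing or extra:
        raise ValueError(
            f'missing fields: {sorted(missing)}, '
            f'unexpected fields: {sorted(extra)}'
        )
    return record


# A well-formed record passes through unchanged.
good = check_fields({'color': 'red', 'mileage': 3812.4, 'automatic': True})

# The misspelled 'colr' key and the stray 'windshield' key are both reported.
try:
    check_fields({'colr': 'green', 'automatic': False, 'windshield': 'broken'})
except ValueError as exc:
    error_message = str(exc)
```

This only helps at the points where you remember to call the check, which is part of why the other record types in this chapter exist.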

tuple – Immutable Groups of Objects

Python’s tuples are simple data structures for grouping arbitrary objects. Tuples are immutable—they cannot be modified once they’ve been created.

Performance-wise, tuples take up slightly less memory than lists in CPython, and they’re also faster to construct.

As you can see in the bytecode disassembly below, constructing a tuple constant takes a single LOAD_CONST opcode, while constructing a list object with the same contents requires several more operations (the exact offsets and opcodes vary between CPython versions):

>>> import dis
>>> dis.dis(compile("(23, 'a', 'b', 'c')", '', 'eval'))
      0 LOAD_CONST 4 ((23, 'a', 'b', 'c'))
      3 RETURN_VALUE

>>> dis.dis(compile("[23, 'a', 'b', 'c']", '', 'eval'))
      0 LOAD_CONST 0 (23)
      3 LOAD_CONST 1 ('a')
      6 LOAD_CONST 2 ('b')
      9 LOAD_CONST 3 ('c')
      12 BUILD_LIST 4
      15 RETURN_VALUE

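As the disassembly shows, the tuple is loaded in one step as a ready-made constant, while the list's elements are loaded one at a time and then assembled into a list.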

However, you shouldn’t place too much emphasis on these differences. In practice, the performance difference will often be negligible, and trying to squeeze extra performance out of a program by switching from lists to tuples will likely be the wrong approach.
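
If you'd like to see the size of the difference on your own machine, a rough measurement along these lines should do. This is only a sketch: the sample values are arbitrary, and the numbers it prints depend on your CPython version and hardware, so I'm not quoting any here.

```python
import sys
import timeit

fields = ('red', 3812.4, True)

# Shallow size of a tuple versus a list holding the same three items.
tuple_size = sys.getsizeof(tuple(fields))
list_size = sys.getsizeof(list(fields))

# Time how long it takes to evaluate each literal many times over.
tuple_time = timeit.timeit("('red', 3812.4, True)", number=100_000)
list_time = timeit.timeit("['red', 3812.4, True]", number=100_000)

print(f'tuple: {tuple_size} bytes, {tuple_time:.4f} s')
print(f'list:  {list_size} bytes, {list_time:.4f} s')
```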

A potential downside of plain tuples is that the data you store in them can only be pulled out by accessing it through integer indexes. You can’t give names to individual properties stored in a tuple. This can impact code readability.
Note: as covered earlier in this series, collections.namedtuple lets you access the same data by field name instead of by integer index.

Also, a tuple is always an ad-hoc structure: It’s difficult to ensure that two tuples have the same number of fields and the same properties stored on them.

This makes it easy to introduce “slip-of-the-mind” bugs, such as mixing up the field order. Therefore, I would recommend that you keep the number of fields stored in a tuple as low as possible.

# Fields: color, mileage, automatic
>>> car1 = ('red', 3812.4, True)
>>> car2 = ('blue', 40231.0, False)

# Tuple instances have a nice repr:
>>> car1
('red', 3812.4, True)
>>> car2
('blue', 40231.0, False)

# Get mileage:
>>> car2[1]
40231.0

# Tuples are immutable:
>>> car2[1] = 12
TypeError:
"'tuple' object does not support item assignment"

# No protection against missing/extra fields
# or a wrong order:
>>> car3 = (3431.5, 'green', True, 'silver')
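
One way to soften the readability problem of integer indexes is tuple unpacking, which gives each position a name at the point of use. The sketch below reuses the car tuples from above. Note that unpacking only catches a wrong number of fields. It cannot tell you that two fields were stored in the wrong order.

```python
# Fields: color, mileage, automatic
car1 = ('red', 3812.4, True)

# Unpacking gives each position a name right where it is used.
color, mileage, automatic = car1

# A tuple with the wrong number of fields fails at the unpacking step.
car3 = (3431.5, 'green', True, 'silver')
try:
    color3, mileage3, automatic3 = car3
except ValueError:
    unpack_failed = True
else:
    unpack_failed = False
```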

Writing a Custom Class – More Work, More Control

Classes allow you to define reusable “blueprints” for data objects to ensure each object provides the same set of fields.

Using regular Python classes as record data types is feasible, but it also takes manual work to get the convenience features of other implementations. For example, adding new fields to the __init__ constructor is verbose and takes time.

Also, the default string representation for objects instantiated from custom classes is not very helpful. To fix that you may have to add your own __repr__ method, which again is usually quite verbose and must be updated every time you add a new field.

Fields stored on classes are mutable, and new fields can be added freely, which you may or may not like. It’s possible to provide more access control and to create read-only fields using the @property decorator, but once again, this requires writing more glue code.

Writing a custom class is a great option whenever you’d like to add business logic and behavior to your record objects using methods. However, this means that these objects are technically no longer plain data objects.

class Car:
  def __init__(self, color, mileage, automatic):
    self.color = color
    self.mileage = mileage
    self.automatic = automatic

>>> car1 = Car('red', 3812.4, True)
>>> car2 = Car('blue', 40231.0, False)

# Get the mileage:
>>> car2.mileage
40231.0

# Classes are mutable:
>>> car2.mileage = 12
>>> car2.windshield = 'broken'

# String representation is not very useful
# (must add a manually written __repr__ method):
>>> car1
<__main__.Car object at 0x1081e69e8>
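
To give a feel for the extra glue code mentioned above, here is a sketch of the same Car class with a hand-written __repr__ and one read-only field. Making color (rather than some other field) read-only is an arbitrary choice for this illustration:

```python
class Car:
    def __init__(self, color, mileage, automatic):
        self._color = color  # stored privately, exposed read-only below
        self.mileage = mileage
        self.automatic = automatic

    @property
    def color(self):
        """Read-only: there is no setter, so assignment raises AttributeError."""
        return self._color

    def __repr__(self):
        # Must be kept in sync by hand whenever a field is added.
        return (f'{self.__class__.__name__}(color={self.color!r}, '
                f'mileage={self.mileage!r}, automatic={self.automatic!r})')


car1 = Car('red', 3812.4, True)
car1.mileage = 12          # regular attributes stay mutable

try:
    car1.color = 'blue'    # the property rejects assignment
except AttributeError:
    color_is_read_only = True
else:
    color_is_read_only = False
```

Even in this small example, every field name appears several times, which is the verbosity the text warns about.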

collections.namedtuple – Convenient Data Objects

The namedtuple class available in Python 2.6+ provides an extension of the built-in tuple data type. Similar to defining a custom class, using namedtuple allows you to define reusable “blueprints” for your records that ensure the correct field names are used.

Namedtuples are immutable, just like regular tuples. This means you cannot add new fields or modify existing fields after the namedtuple instance was created.

Besides that, namedtuples are, well… named tuples. Each object stored in them can be accessed through a unique identifier. This frees you from having to remember integer indexes, or resort to workarounds like defining integer constants as mnemonics for your indexes.
Here, “defining integer constants as mnemonics for your indexes” means giving each index position a named constant so you don't have to remember the raw number.

Namedtuple objects are implemented as regular Python classes internally. When it comes to memory usage, they are also “better” than regular classes and just as memory efficient as regular tuples:

>>> from collections import namedtuple
>>> from sys import getsizeof

>>> p1 = namedtuple('Point', 'x y z')(1, 2, 3)
>>> p2 = (1, 2, 3)

>>> getsizeof(p1)
72
>>> getsizeof(p2)
72

As the output above shows, a namedtuple instance takes up no more memory than a plain tuple.
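
The comparison with regular classes can be checked in a similar way. The sketch below is only a rough estimate: PointClass is a made-up class for this illustration, getsizeof() measures shallow sizes, and the exact byte counts vary between CPython versions, so none are quoted here.

```python
from collections import namedtuple
from sys import getsizeof

Point = namedtuple('Point', 'x y z')


class PointClass:
    """Hypothetical regular class with the same three fields."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


p_named = Point(1, 2, 3)
p_plain = (1, 2, 3)
p_object = PointClass(1, 2, 3)

named_size = getsizeof(p_named)
plain_size = getsizeof(p_plain)
# getsizeof() is shallow, so add the instance's attribute dictionary by hand.
object_size = getsizeof(p_object) + getsizeof(vars(p_object))

print(named_size, plain_size, object_size)
```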

Namedtuples can be an easy way to clean up your code and make it more readable by enforcing a better structure for your data.

I find that going from ad-hoc data types, like dictionaries with a fixed format, to namedtuples helps me express the intent of my code more clearly. Often when I apply this refactoring, I magically come up with a better solution for the problem I’m facing.

Using namedtuples over regular (unstructured) tuples and dicts can also make my coworkers’ lives easier: Namedtuples make the data that’s being passed around “self-documenting”, at least to a degree.

>>> from collections import namedtuple
>>> Car = namedtuple('Car', 'color mileage automatic')
>>> car1 = Car('red', 3812.4, True)

# Instances have a nice repr:
>>> car1
Car(color='red', mileage=3812.4, automatic=True)

# Accessing fields:
>>> car1.mileage
3812.4

# Fields are immutable:
>>> car1.mileage = 12
AttributeError: "can't set attribute"
>>> car1.windshield = 'broken'
AttributeError:
"'Car' object has no attribute 'windshield'"
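
Because namedtuples are immutable, "changing" a field means building a new instance. The sketch below shows the helpers namedtuple provides for that (_replace, _asdict, and _fields), plus one way to add a method by subclassing. The CarWithMethods class and its describe method are hypothetical names used only for illustration:

```python
from collections import namedtuple

Car = namedtuple('Car', 'color mileage automatic')
car1 = Car('red', 3812.4, True)

# _replace() returns a new instance; car1 itself is unchanged.
car2 = car1._replace(mileage=12)

# _asdict() converts a record to a dictionary, _fields lists the names.
as_dict = car1._asdict()
field_names = Car._fields


# Subclassing adds behavior; __slots__ = () keeps instances dict-free.
class CarWithMethods(Car):
    __slots__ = ()

    def describe(self):  # hypothetical helper, for illustration only
        gearbox = 'automatic' if self.automatic else 'manual'
        return f'{self.color} car, {self.mileage} miles, {gearbox}'


car3 = CarWithMethods('blue', 40231.0, False)
```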

typing.NamedTuple – Improved Namedtuples

This class is the younger sibling of the namedtuple class in the collections module; the class-based syntax used here was added in Python 3.6. It is very similar to namedtuple, the main difference being an updated syntax for defining new record types and added support for type hints.

Please note that type annotations are not enforced without a separate type-checking tool like mypy. But even without tool support, they can provide useful hints for other programmers (or be terribly confusing if the type hints become out of date).

>>> from typing import NamedTuple

class Car(NamedTuple):
  color: str
  mileage: float
  automatic: bool

>>> car1 = Car('red', 3812.4, True)

# Instances have a nice repr:
>>> car1
Car(color='red', mileage=3812.4, automatic=True)

# Accessing fields:
>>> car1.mileage
3812.4

# Fields are immutable:
>>> car1.mileage = 12
AttributeError: "can't set attribute"
>>> car1.windshield = 'broken'
AttributeError:
"'Car' object has no attribute 'windshield'"

# Type annotations are not enforced without
# a separate type checking tool like mypy:
>>> Car('red', 'NOT_A_FLOAT', 99)
Car(color='red', mileage='NOT_A_FLOAT', automatic=99)
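
If you want a runtime check anyway, you can write one yourself. The following is a sketch rather than a general solution: the validate method is a hypothetical helper, and it only works for plain classes like str, float, and bool, not for generic hints such as List[int]. It also shows that fields can carry default values.

```python
from typing import NamedTuple, get_type_hints


class Car(NamedTuple):
    color: str
    mileage: float = 0.0      # fields with defaults must come last
    automatic: bool = False

    def validate(self):  # hypothetical helper, not part of NamedTuple
        """Raise TypeError if a field's value does not match its annotation."""
        for name, expected in get_type_hints(type(self)).items():
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(f'{name} should be {expected.__name__}, '
                                f'got {type(value).__name__}')
        return self


# Omitted fields fall back to their defaults.
car1 = Car('red').validate()

# The wrong type is now caught, but only because validate() was called.
try:
    Car('red', 'NOT_A_FLOAT', True).validate()
except TypeError as exc:
    type_error = str(exc)
```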

struct.Struct – Serialized C Structs

The struct.Struct class converts between Python values and C structs serialized into Python bytes objects. For example, it can be used to handle binary data stored in files or coming in from network connections.

Structs are defined using a format string-like mini-language that allows you to define the arrangement of various C data types like char, int, and long, as well as their unsigned variants.

Serialized structs are seldom used to represent data objects meant to be handled purely inside Python code. They’re intended primarily as a data exchange format, rather than as a way of holding data in memory that’s only used by Python code.

In some cases, packing primitive data into structs may use less memory than keeping it in other data types. However, in most cases that would be quite an advanced (and probably unnecessary) optimization.

>>> from struct import Struct
>>> MyStruct = Struct('i?f')
>>> data = MyStruct.pack(23, False, 42.0)

# All you get is a blob of data:
>>> data
b'\x17\x00\x00\x00\x00\x00\x00\x00\x00\x00(B'

# Data blobs can be unpacked again:
>>> MyStruct.unpack(data)
(23, False, 42.0)
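
The blob above uses the machine's native byte order and alignment, which is why it contains padding bytes. When data is meant for a file or a network connection, you would usually pin the layout down with a byte-order prefix. Here is a sketch using '<' (little-endian, standard sizes, no padding). The CarRecord name and the sample values are made up for this illustration:

```python
from struct import Struct

# '<' selects little-endian byte order with standard sizes and no padding.
CarRecord = Struct('<i?f')   # hypothetical layout: int, bool, float

packed = CarRecord.pack(23, False, 42.0)
record_size = CarRecord.size   # bytes per record

# Several records can be stored back to back and read with iter_unpack().
blob = CarRecord.pack(23, False, 42.0) + CarRecord.pack(7, True, 1.5)
records = list(CarRecord.iter_unpack(blob))
```

With this format each record takes 4 + 1 + 4 = 9 bytes, regardless of the platform the code runs on.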

types.SimpleNamespace – Fancy Attribute Access

Here’s one more “esoteric” choice for implementing data objects in Python: types.SimpleNamespace. This class was added in Python 3.3 and it provides attribute access to its namespace.

This means SimpleNamespace instances expose all of their keys as instance attributes, so you can use obj.key “dotted” attribute access instead of the obj['key'] square-brackets indexing syntax that’s used by regular dicts. All instances also include a meaningful __repr__ by default.

As its name proclaims, SimpleNamespace is simple! It’s basically a glorified dictionary that allows attribute access and prints nicely. Attributes can be added, modified, and deleted freely.

>>> from types import SimpleNamespace
>>> car1 = SimpleNamespace(color='red',
...                       mileage=3812.4,
...                       automatic=True)

# The default repr:
>>> car1
namespace(color='red', mileage=3812.4, automatic=True)

# Instances support attribute access and are mutable:
>>> car1.mileage = 12
>>> car1.windshield = 'broken'
>>> del car1.automatic
>>> car1
namespace(color='red', mileage=12, windshield='broken')
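
Since SimpleNamespace is so close to a dictionary, converting between the two is straightforward. The sketch below assumes the dictionary keys are valid Python identifiers. The variable names are only for illustration:

```python
from types import SimpleNamespace

car_dict = {'color': 'red', 'mileage': 3812.4, 'automatic': True}

# Keyword-argument unpacking turns each key into an attribute.
car1 = SimpleNamespace(**car_dict)

# vars() exposes the underlying attribute dictionary; copy it to get a dict.
round_trip = dict(vars(car1))

# Two namespaces compare equal when their attributes match.
same = car1 == SimpleNamespace(color='red', mileage=3812.4, automatic=True)
```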

Key Takeaways

Now, which type should you use for data objects in Python? As you’ve seen, there are quite a few options for implementing records or data objects. Generally, your decision will depend on your use case:

You only have a few (2-3) fields: Using a plain tuple object may be okay if the field order is easy to remember or field names are superfluous. For example, think of an (x, y, z) point in 3D space.

You need immutable fields: In this case, plain tuples, collections.namedtuple, and typing.NamedTuple would all make good options for implementing this type of data object.

You need to lock down field names to avoid typos: collections.namedtuple and typing.NamedTuple are your friends here.

You want to keep things simple: A plain dictionary object might be a good choice due to the convenient syntax that closely resembles JSON.

You need full control over your data structure: It’s time to write a custom class with @property setters and getters.

You need to add behavior (methods) to the object: You should write a custom class, either from scratch or by extending collections.namedtuple or typing.NamedTuple.

You need to pack data tightly to serialize it to disk or to send it over the network: Time to read up on struct.Struct because this is a great use case for it.

If you’re looking for a safe default choice, my general recommendation for implementing a plain record, struct, or data object in Python would be to use collections.namedtuple in Python 2.x and its younger sibling, typing.NamedTuple in Python 3.
