引言
在使用开源的分布式任异步务调度框架Celery时, 每一个Celery task,都会返回一个128位的uuid, 那这个uuid是唯一的吗?
- 在同一台主机上,uuid是唯一的吗?(分同一个进程和不同进程)
- 在一个分布式系统中的不同的主机上,uuid是唯一的吗?
为了解决心中的疑问,我们来看看Celery中有关uuid生成的相关源代码。
- Github中Celery uuid的源代码:
celery/celery/utils/init.py
from kombu.utils.uuid import uuid # noqa
gen_unique_id = uuid
- celery使用到了kombu项目中utils提供的uuid:
kombu是celery使用的Python消息库, 也是整个celery项目中一个独立的模块。 我们再看看kombu的uuid.
"""UUID utilities."""
from __future__ import absolute_import, unicode_literals
from uuid import uuid4
def uuid(_uuid=uuid4):
"""Generate unique id in UUID4 format.
See Also:
For now this is provided by :func:`uuid.uuid4`.
"""
return str(_uuid())
发现,Celery最终使用到了python提供的自带模块uuid。并且是uuid4.
下面来看看python的uuid模块的实现。
python中uuid的实现
-
关于UUID
UUID 是 通用唯一识别码(Universally Unique Identifier)的缩写,是一种软件建构的标准,亦为开放软件基金会组织在分布式计算环境领域的一部分。其目的,是让分布式系统中的所有元素,都能有唯一的辨识信息,而不需要通过中央控制端来做辨识信息的指定。如此一来,每个人都可以创建不与其它人冲突的UUID。在这样的情况下,就不需考虑数据库创建时的名称重复问题。
UUID是128位的全局唯一标识符,通常由32字节的字符串表示。
它可以保证时间和空间的唯一性,也称为GUID,全称为:
UUID Universally unique IDentifier python 中叫UUID
GUID Globally Unique IDentifier C#中叫GUID
它是通过MAC地址、时间戳、命名空间、随机数、伪随机数来保证生成ID的唯一性。
UUID主要有5个算法,也就是5中方法来实现:
uuid1() --- 基于时间戳
由MAC地址、当前时间戳、随机数生成。可以保证全球范围内的唯一性,但MAC的使用同时带来了安全性问题,局域网中可以使用IP来代替MAC。
uuid2() -- 基于分布式计算环境DCE(python 中没有这个函数)
算法与uuid1相同,不同的是把时间戳的前4位置换为POSIX的UID。实际中很少使用
uuid3() --- 基于名字的MD5散列值
通过计算名字和命名空间的MD5散列值得到,保证了同一命名空间中不同名字的唯一性,和不同命名空间的唯一性,但同一命名空间的同一名字生成相同的uuid。
uuid4() --- 基于随机数
由伪随机数得到,有一定的重复概率,该概率可以计算出来。
uuid5() ---基于名字的SHA-1散列值
算法与uuid3相同,不同的是使用secure Hash Algorithm 1算法。
使用方面:
python 中没有基于DCE的,所有uuid2可以忽略;
uuid4存在概率性重复,由无映射性,最好不用;
在Global的分布式计算环境下,最好用uuid1;
有名字的唯一性要求,最好用uuid3或uuid5。
- 但是不知道为什么celery模块选择使用uuid4.
python3中uuid4的实现方法
先贴出python的源代码
class UUID(object):
"""Instances of the UUID class represent UUIDs as specified in RFC 4122.
UUID objects are immutable, hashable, and usable as dictionary keys.
Converting a UUID to a string with str() yields something in the form
'12345678-1234-1234-1234-123456789abc'. The UUID constructor accepts
five possible forms: a similar string of hexadecimal digits, or a tuple
of six integer fields (with 32-bit, 16-bit, 16-bit, 8-bit, 8-bit, and
48-bit values respectively) as an argument named 'fields', or a string
of 16 bytes (with all the integer fields in big-endian order) as an
argument named 'bytes', or a string of 16 bytes (with the first three
fields in little-endian order) as an argument named 'bytes_le', or a
single 128-bit integer as an argument named 'int'.
UUIDs have these read-only attributes:
bytes the UUID as a 16-byte string (containing the six
integer fields in big-endian byte order)
bytes_le the UUID as a 16-byte string (with time_low, time_mid,
and time_hi_version in little-endian byte order)
fields a tuple of the six integer fields of the UUID,
which are also available as six individual attributes
and two derived attributes:
time_low the first 32 bits of the UUID
time_mid the next 16 bits of the UUID
time_hi_version the next 16 bits of the UUID
clock_seq_hi_variant the next 8 bits of the UUID
clock_seq_low the next 8 bits of the UUID
node the last 48 bits of the UUID
time the 60-bit timestamp
clock_seq the 14-bit sequence number
hex the UUID as a 32-character i string
int the UUID as a 128-bit integer
urn the UUID as a URN as specified in RFC 4122
variant the UUID variant (one of the constants RESERVED_NCS,
RFC_4122, RESERVED_MICROSOFT, or RESERVED_FUTURE)
version the UUID version number (1 through 5, meaningful only
when the variant is RFC_4122)
"""
def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
int=None, version=None):
r"""Create a UUID from either a string of 32 hexadecimal digits,
a string of 16 bytes as the 'bytes' argument, a string of 16 bytes
in little-endian order as the 'bytes_le' argument, a tuple of six
integers (32-bit time_low, 16-bit time_mid, 16-bit time_hi_version,
8-bit clock_seq_hi_variant, 8-bit clock_seq_low, 48-bit node) as
the 'fields' argument, or a single 128-bit integer as the 'int'
argument. When a string of hex digits is given, curly braces,
hyphens, and a URN prefix are all optional. For example, these
expressions all yield the same UUID:
UUID('{12345678-1234-5678-1234-567812345678}')
UUID('12345678123456781234567812345678')
UUID('urn:uuid:12345678-1234-5678-1234-567812345678')
UUID(bytes='\x12\x34\x56\x78'*4)
UUID(bytes_le='\x78\x56\x34\x12\x34\x12\x78\x56' +
'\x12\x34\x56\x78\x12\x34\x56\x78')
UUID(fields=(0x12345678, 0x1234, 0x5678, 0x12, 0x34, 0x567812345678))
UUID(int=0x12345678123456781234567812345678)
Exactly one of 'hex', 'bytes', 'bytes_le', 'fields', or 'int' must
be given. The 'version' argument is optional; if given, the resulting
UUID will have its variant and version set according to RFC 4122,
overriding the given 'hex', 'bytes', 'bytes_le', 'fields', or 'int'.
"""
if [hex, bytes, bytes_le, fields, int].count(None) != 4:
raise TypeError('one of the hex, bytes, bytes_le, fields, '
'or int arguments must be given')
if hex is not None:
hex = hex.replace('urn:', '').replace('uuid:', '')
hex = hex.strip('{}').replace('-', '')
if len(hex) != 32:
raise ValueError('badly formed hexadecimal UUID string')
int = int_(hex, 16)
if bytes_le is not None:
if len(bytes_le) != 16:
raise ValueError('bytes_le is not a 16-char string')
bytes = (bytes_le[4-1::-1] + bytes_le[6-1:4-1:-1] +
bytes_le[8-1:6-1:-1] + bytes_le[8:])
if bytes is not None:
if len(bytes) != 16:
raise ValueError('bytes is not a 16-char string')
assert isinstance(bytes, bytes_), repr(bytes)
int = int_.from_bytes(bytes, byteorder='big')
if fields is not None:
if len(fields) != 6:
raise ValueError('fields is not a 6-tuple')
(time_low, time_mid, time_hi_version,
clock_seq_hi_variant, clock_seq_low, node) = fields
if not 0 <= time_low < 1<<32:
raise ValueError('field 1 out of range (need a 32-bit value)')
if not 0 <= time_mid < 1<<16:
raise ValueError('field 2 out of range (need a 16-bit value)')
if not 0 <= time_hi_version < 1<<16:
raise ValueError('field 3 out of range (need a 16-bit value)')
if not 0 <= clock_seq_hi_variant < 1<<8:
raise ValueError('field 4 out of range (need an 8-bit value)')
if not 0 <= clock_seq_low < 1<<8:
raise ValueError('field 5 out of range (need an 8-bit value)')
if not 0 <= node < 1<<48:
raise ValueError('field 6 out of range (need a 48-bit value)')
clock_seq = (clock_seq_hi_variant << 8) | clock_seq_low
int = ((time_low << 96) | (time_mid << 80) |
(time_hi_version << 64) | (clock_seq << 48) | node)
if int is not None:
if not 0 <= int < 1<<128:
raise ValueError('int is out of range (need a 128-bit value)')
if version is not None:
if not 1 <= version <= 5:
raise ValueError('illegal version number')
# Set the variant to RFC 4122.
int &= ~(0xc000 << 48)
int |= 0x8000 << 48
# Set the version number.
int &= ~(0xf000 << 64)
int |= version << 76
self.__dict__['int'] = int
def uuid4():
"""Generate a random UUID."""
return UUID(bytes=os.urandom(16), version=4)
由源代码可以看出,python3的uuid4,是由一个16 bytes的随机数生成的。 最重要的一句代码就是bytes=os.urandom(16)。
Python中os.urandom()的详细说明:
Random numbers
--------------
function:: urandom(size)
Return a string of *size* random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The
returned data should be unpredictable enough for cryptographic applications,
though its exact quality depends on the OS implementation.
On Linux, if the ``getrandom()`` syscall is available, it is used in
blocking mode: block until the system urandom entropy pool is initialized
(128 bits of entropy are collected by the kernel). See the :pep:`524` for
the rationale. On Linux, the :func:`getrandom` function can be used to get
random bytes in non-blocking mode (using the :data:`GRND_NONBLOCK` flag) or
to poll until the system urandom entropy pool is initialized.
On a Unix-like system, random bytes are read from the ``/dev/urandom``
device. If the ``/dev/urandom`` device is not available or not readable, the
:exc:`NotImplementedError` exception is raised.
- 从python的官方文档可以看出,在linux系统中,os.urandom随机数的产生,依赖于/dev/urandom文件。
/dev/random和/dev/urandom是Linux系统中提供的随机伪设备,这两个设备的任务,是提供永不为空的随机字节数据流。很多解密程序与安全应用程序(如SSH Keys,SSL Keys等)需要它们提供的随机数据流。
这两个设备的差异在于:/dev/random的random pool依赖于系统中断,因此在系统的中断数不足时,/dev/random设备会一直封锁,尝试读取的进程就会进入等待状态,直到系统的中断数充分够用, /dev/random设备可以保证数据的随机性。/dev/urandom不依赖系统的中断,也就不会造成进程忙等待,但是数据的随机性也不高。
- 现在可以回答上面提到的2个问题了:
- 在同一台主机上,uuid是唯一的吗?(分同一个进程和不同进程)
- 在一个分布式系统中的不同的主机上,uuid是唯一的吗?
- 答案是: celery使用的uuid4与主机,进程都没有关系,是纯粹的由随机数算法得到的,与平台,机器,时间,命名空间都无关。
暂且相信16bytes的随机数,出现重复的概率极小,但是,在一个大规模的分布式系统中,仍然由概率相同,希望在使用celery的过程中,不要碰到这个坑。