Speeding Up Python: Numba, since 2026-03-31

(2026.03.31 Tues)
Numba is a just-in-time (JIT) compiler for Python. It is particularly well suited to functions and loops that operate on NumPy arrays, that is, to compute-bound tasks.

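The most common way to use Numba is through its decorators, which mark a function for compilation by Numba. When a decorated function is called, Numba reads the function's Python bytecode and combines it with the types of the arguments passed in. After analysis and optimization, the LLVM compiler library generates machine code for the function, specialized to those argument types. This code bypasses the Python interpreter and runs at native machine speed. Later calls with the same argument types reuse that machine code.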

Example 1: Estimating pi

Estimate pi with the Monte Carlo method. Adding the @numba.jit decorator to the computation makes it run faster.
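
The reasoning behind the estimator can be written out as follows, assuming x and y are drawn independently and uniformly from [0, 1), which is what random.random() returns. The names acc and nsamples match the code in this example.

```latex
\Pr\left(x^2 + y^2 < 1\right)
  = \frac{\text{area of the quarter disc}}{\text{area of the unit square}}
  = \frac{\pi/4}{1}
  = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\mathrm{acc}}{\mathrm{nsamples}}
```

Here acc counts the samples that land inside the quarter disc, so acc / nsamples estimates that probability.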

import functools
import random
import time
from numba import jit

# Timer decorator: prints the wall-clock time of each call
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        starting_time = time.perf_counter()
        result = func(*args, **kwargs)
        ending_time = time.perf_counter()
        print('time consumption in seconds: ', ending_time - starting_time)
        return result
    return wrapper

@timer
@jit(nopython=True)
def mc_pi_jit(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

Run:

>> mc_pi_jit(10000000)
time consumption in seconds:  0.31119772199963336

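The @jit decorator is called with nopython=True (the default since Numba 0.59), which ensures that the decorated function is compiled to run entirely without the Python interpreter. If Numba cannot compile the function this way, it raises an error instead of falling back to slower object mode. This setting is what gives the large speedup.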

For comparison, here is the same code without the @jit decorator, together with its running time:

@timer
def mc_pi_no_jit(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

>> mc_pi_no_jit(10000000)
time consumption in seconds:  3.635518551000132
3.1409716

Timing comparison: 0.31 s with the @jit decorator versus 3.64 s without it, a speedup of roughly 11.7x.

Example 2: Squaring a matrix element-wise on the GPU

Use Numba's cuda module to run the element-wise squaring of a matrix on the GPU.

# cuda_test.py
import numpy as np
import time
from tqdm import trange
from numba import cuda
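cuda.select_device(1)  # use the GPU with index 1; change to 0 on a single-GPU machine

@cuda.jit
def CudaSquare(x):
    # Each thread squares one element in place
    i, j = cuda.grid(2)
    if i < x.shape[0] and j < x.shape[1]:  # guard against out-of-range threads
        x[i, j] *= x[i, j]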

if __name__ == '__main__':
    numpy_time = 0
    numba_time = 0
    test_length = 100  # matches the 100 loops reported in the output
    for i in trange(test_length):
        np.random.seed(i)
        array_length = 2**12
        random_array = np.random.rand(array_length, array_length)
        random_array_cuda = cuda.to_device(random_array)
        time0 = time.time()
        square_array = np.square(random_array)
        time1 = time.time()
        CudaSquare[(array_length,array_length),(1,1)](random_array_cuda)
        time2 = time.time()
        numpy_time += time1-time0
        numba_time += time2-time1
    print ('The time cost of numpy is {}s for {} loops'.format(numpy_time, test_length))
    print ('The time cost of numba is {}s for {} loops'.format(numba_time, test_length))

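Run the script and compare the two versions. The reported running times differ by more than 10x (about 15x here). Note that CUDA kernel launches are asynchronous and the script reads the clock without calling cuda.synchronize(), so the Numba figure may understate the time the GPU actually spends. The first launch also includes JIT compilation of the kernel.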

>> python cuda_test.py
The time cost of numpy is 4.878739595413208s for 100 loops
The time cost of numba is 0.3255774974822998s for 100 loops

Reference

Numba official website
cnblogs, DECHIN, 超过Numpy的速度有多难？试试Numba的GPU加速 (How hard is it to beat NumPy's speed? Try Numba's GPU acceleration)
