(2026.03.31 Tues)
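Numba is a just-in-time (JIT) compiler for Python. It is particularly well suited to functions and loops that operate on NumPy arrays, that is, to compute-bound tasks.
The most common way to use Numba is through its decorators, which mark a function for compilation by Numba. When the decorated function is called, Numba reads its Python bytecode and combines it with the types of the input arguments. After analysis and optimization, the LLVM compiler library generates machine code for the function, which bypasses the Python interpreter and runs at native machine speed. Later calls with the same argument types reuse that machine code.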
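The type-specialized compilation described above can be observed directly. The following is a minimal sketch, not taken from the original note: the function name `scaled_sum` is illustrative, and the `try`/`except` fallback is an assumption added so the snippet still runs where Numba is not installed. It relies on the `signatures` attribute of a Numba-compiled function, which lists the argument-type combinations compiled so far.

```python
try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    njit = None  # assumption: without Numba, only the plain function is run


def scaled_sum(a, b):
    # Illustrative function, not from the original note
    return 2 * a + b


if njit is not None:
    fast = njit(scaled_sum)           # nothing is compiled yet (lazy compilation)
    before = list(fast.signatures)    # empty list before the first call
    r_int = fast(1, 2)                # first call with integers compiles an integer version
    r_float = fast(1.0, 2.0)          # first call with floats compiles a second version
    after = list(fast.signatures)     # one entry per compiled type combination
    print("compiled before any call:", before)
    print("compiled after two calls:", after)
else:
    r_int, r_float = scaled_sum(1, 2), scaled_sum(1.0, 2.0)

print(r_int, r_float)
```

Each distinct combination of argument types triggers one compilation, and later calls with the same types go straight to the cached machine code.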
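Example 1: Estimating pi
Estimate pi with the Monte Carlo method: points drawn uniformly from the unit square fall inside the quarter circle x^2 + y^2 < 1 with probability pi/4, so pi is approximately 4 times the fraction of hits. Adding the @jit decorator to the computation makes it run faster.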
import functools
import random
import time

from numba import jit

# Timer decorator: prints the wall-clock time of each call
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        starting_time = time.perf_counter()
        result = func(*args, **kwargs)
        ending_time = time.perf_counter()
        print('time consumption in seconds: ', ending_time - starting_time)
        return result
    return wrapper

@timer
@jit(nopython=True)
def mc_pi_jit(nsamples):
    acc = 0  # number of points that fall inside the quarter circle
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
Run:
>> mc_pi_jit(10000000)
time consumption in seconds: 0.31119772199963336
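The @jit decorator is called with nopython=True, which is also the default in recent Numba releases. This ensures that the compiled function runs entirely without the Python interpreter, and it is the setting responsible for most of the speedup.
For comparison, here is the same code without the @jit decorator, together with its running time: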
@timer
def mc_pi_no_jit(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
>> mc_pi_no_jit(10000000)
time consumption in seconds: 3.635518551000132
3.1409716
Timing comparison: 0.31 s with the @jit decorator versus 3.64 s without it, a speedup of more than 10x (about 11.7x).
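Because Numba compiles a function on its first call, a single timed call to a jitted function also includes compilation time. The following is a minimal sketch of separating the two by timing a warm-up call and a second call. The helper `timed` and the function name `mc_pi` are illustrative, and the no-op fallback decorator is an assumption so that the snippet also runs without Numba installed (in which case both timings are plain-Python timings).

```python
import random
import time

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    def njit(func):
        # Assumed fallback: leave the function uncompiled if Numba is missing
        return func


@njit
def mc_pi(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if x * x + y * y < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


def timed(func, *args):
    # Return the result together with the elapsed wall-clock time in seconds
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


_, first_call = timed(mc_pi, 1_000_000)         # with Numba: compilation + execution
estimate, second_call = timed(mc_pi, 1_000_000)  # with Numba: execution only
print("first call (s): ", first_call)
print("second call (s):", second_call)
print("estimate of pi: ", estimate)
```

With Numba installed, the second timing is the one that reflects the steady-state cost of the compiled function.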
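Example 2: Element-wise matrix multiplication
Use the cuda module in numba to run an element-wise matrix product on the GPU. The kernel below multiplies each entry of the matrix by itself, which is the same operation as np.square.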
# cuda_test.py
import numpy as np
import time
from tqdm import trange
from numba import cuda

cuda.select_device(1)  # use the GPU with device index 1

@cuda.jit
def CudaSquare(x):
    # Each thread squares one element, addressed by its 2-D grid position
    i, j = cuda.grid(2)
    x[i, j] *= x[i, j]

if __name__ == '__main__':
    numpy_time = 0
    numba_time = 0
    test_length = 100  # the output reported below is for 100 loops
    for i in trange(test_length):
        np.random.seed(i)
        array_length = 2**12
        random_array = np.random.rand(array_length, array_length)
        random_array_cuda = cuda.to_device(random_array)
        time0 = time.time()
        square_array = np.square(random_array)
        time1 = time.time()
        # Launch one single-thread block per matrix element
        CudaSquare[(array_length, array_length), (1, 1)](random_array_cuda)
        # Kernel launches are asynchronous and nothing synchronizes here,
        # so this timestamp does not wait for the kernel to finish
        time2 = time.time()
        numpy_time += time1 - time0
        numba_time += time2 - time1
    print('The time cost of numpy is {}s for {} loops'.format(numpy_time, test_length))
    print('The time cost of numba is {}s for {} loops'.format(numba_time, test_length))
Run the script to compare the two versions. The measured running times differ by more than 10x (about 15x).
>> python cuda_test.py
The time cost of numpy is 4.878739595413208s for 100 loops
The time cost of numba is 0.3255774974822998s for 100 loops
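The script in this example launches one single-thread block per element and reads the clock without waiting for the GPU. The following is a minimal sketch of a more conventional launch, not taken from the referenced article: the kernel name `square_kernel`, the 16x16 block size, and the 1024x1024 test array are illustrative choices. It assumes a CUDA-capable GPU and falls back to the NumPy result when Numba or a GPU is unavailable, so that the snippet runs anywhere.

```python
import math

import numpy as np

try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except ImportError:
    cuda = None
    gpu_ok = False  # assumption: without Numba, only the NumPy path runs

a = np.random.rand(1024, 1024)
expected = np.square(a)

if gpu_ok:
    @cuda.jit
    def square_kernel(x):
        i, j = cuda.grid(2)
        # Guard against threads that fall outside the array
        if i < x.shape[0] and j < x.shape[1]:
            x[i, j] *= x[i, j]

    threads = (16, 16)  # 256 threads per block (illustrative choice)
    blocks = (math.ceil(a.shape[0] / threads[0]),
              math.ceil(a.shape[1] / threads[1]))

    d_a = cuda.to_device(a)              # copy the input to the GPU
    square_kernel[blocks, threads](d_a)  # asynchronous launch
    cuda.synchronize()                   # wait for the kernel before reading a clock
    result = d_a.copy_to_host()          # copy the squared values back
else:
    result = expected.copy()

print("matches np.square:", np.allclose(result, expected))
```

When timing a kernel, placing the second timestamp after `cuda.synchronize()` makes the interval cover the kernel's execution rather than only its launch.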
Reference
- Numba official website
- cnblogs, DECHIN, "How hard is it to beat NumPy's speed? Try Numba's GPU acceleration" (超过Numpy的速度有多难?试试Numba的GPU加速)