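Troubleshooting Model Persistence with joblib in scikit-learn
Training a machine learning model usually takes a long time, so we normally save (persist) the trained model and reuse it later for evaluation, prediction, and so on. This saves a great deal of time.
To persist a model we use the joblib.dump() method that ships with scikit-learn, but a number of problems can come up along the way. Suppose we use the following statement: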
joblib.dump(clf,'../../data/model/randomforest.pkl')
This statement produces a large number of model files, as shown in the figure below.
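To check how many files a particular joblib version writes, one option is to look at the list of file names that joblib.dump() returns. The sketch below assumes the standalone joblib package and a recent scikit-learn; the toy dataset, the temporary directory, and the variable names are made up for illustration. Recent joblib releases typically report a single file even without compress, while the older version bundled under sklearn.externals could write one extra file per NumPy array.

```python
import os
import tempfile

# Assumption: standalone joblib is installed. Older scikit-learn releases
# exposed the same module as sklearn.externals.joblib.
import joblib
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset, only here so there is a fitted model to save.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 1, 1, 0] * 5
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Write into an empty temporary directory so every file is easy to spot.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "randomforest.pkl")

# dump() returns the list of file names it wrote.
written = joblib.dump(clf, model_path)
files_on_disk = sorted(os.listdir(model_dir))

print(len(written), "file(s) reported by dump()")
print(len(files_on_disk), "file(s) found in", model_dir)
```

If the two counts are larger than one, all of those files have to stay together in the same directory for joblib.load() to find them.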
We then load the model with joblib.load('../../data/model/randomforest.pkl'), which fails with the following error:
Traceback (most recent call last):
  File "E:\workspace\forest\com\baihe\RandomForest_losing.py", line 65, in <module>
    clf = joblib.load('../../data/model/randomforest.pkl')
  File "D:\Program Files\python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 425, in load
    obj = unpickler.load()
  File "D:\Program Files\python27\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "D:\Program Files\python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 285, in load_build
    Unpickler.load_build(self)
  File "D:\Program Files\python27\lib\pickle.py", line 1217, in load_build
    setstate(state)
  File "_tree.pyx", line 2280, in sklearn.tree._tree.Tree.__setstate__ (sklearn\tree\_tree.c:18350)
ValueError: Did not recognise loaded array layout
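The correct way to use joblib here is to set the compress parameter of dump(). When it is set, the persisted model is compressed into a single file. The source code describes the compress parameter as follows: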
compress: integer for 0 to 9, optional
    Optional compression level for the data. 0 is no compression.
    Higher means more compression, but also slower read and
    write times. Using a value of 3 is often a good compromise.
    See the notes for more details.
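The trade-off described in that docstring can be observed directly. The following is a rough sketch, assuming the standalone joblib package and a recent scikit-learn; the dataset, file names, and the chosen levels (0, 3, 9) are illustrative only. It saves the same model at several compression levels and records how many files and bytes each one produces.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib rather than sklearn.externals.joblib
from sklearn.ensemble import RandomForestClassifier

# Small made-up dataset so the forest has some structure to store.
X = [[i % 7, i % 3] for i in range(200)]
y = [(a + b) % 2 for a, b in X]
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

out_dir = tempfile.mkdtemp()
sizes = {}
for level in (0, 3, 9):
    path = os.path.join(out_dir, "rf_compress_%d.pkl" % level)
    written = joblib.dump(clf, path, compress=level)
    # Sum every file dump() reports, in case a version writes more than one.
    sizes[level] = sum(os.path.getsize(f) for f in written)
    print("compress=%d -> %d file(s), %d bytes" % (level, len(written), sizes[level]))
```

The exact byte counts depend on the model and the library versions, so treat the printed numbers as a way to pick a level for your own model rather than as a benchmark.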
The following statements persist and reload the model correctly:
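# Import path used in the traceback above; newer releases use "import joblib" instead
from sklearn.externals import joblib
# Save the model as a single compressed file
joblib.dump(clf, '../../data/model/randomforest.pkl', compress=3)
# Load the model back into clf
clf = joblib.load('../../data/model/randomforest.pkl')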
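A quick way to gain confidence in the saved file is to reload it and compare predictions with the in-memory model. This is a minimal sketch under a few assumptions: the standalone joblib package, a recent scikit-learn, NumPy, and a synthetic dataset and temporary path that stand in for the real training data and model directory.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib rather than sklearn.externals.joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have a fitted classifier to round-trip.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save with compression, then load into a separate variable.
model_path = os.path.join(tempfile.mkdtemp(), "randomforest.pkl")
joblib.dump(clf, model_path, compress=3)
restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
same_predictions = bool(np.array_equal(clf.predict(X), restored.predict(X)))
print("predictions match after reload:", same_predictions)
```

Loading into a separate name such as restored keeps the original clf around for the comparison; in production code you would normally load straight into clf as shown above.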