PYTHON

Installing packages not bundled with Anaconda

Run on the command line: conda install xxx
Installing pydot (this produces errors, best avoided): conda install -c https://conda.binstar.org/sstromberg pydot

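Getting the default encoding:

import sys
sys.getdefaultencoding()  # returns the current default encoding
# Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes at startup
reload(sys)
sys.setdefaultencoding("utf-8")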
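
The reload(sys) approach only works in Python 2; in Python 3, sys.setdefaultencoding no longer exists and the default is fixed at utf-8. A minimal sketch of the usual alternative, which is to encode and decode explicitly. The sample string is made up for illustration.

```python
import sys

# In Python 3 this reports 'utf-8' and cannot be changed at runtime.
default_encoding = sys.getdefaultencoding()

# Hypothetical sample text, used only to show an explicit round trip.
text = u"学习笔记"

# Encode to bytes explicitly instead of relying on the default encoding.
data = text.encode("utf-8")

# Decode back to text with the same, explicitly named encoding.
restored = data.decode("utf-8")
```

When reading or writing files, passing encoding="utf-8" to open() has the same effect of not depending on the default.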

pandas

See: 10 minutes to pandas
For examples, see the official cookbook
Filtering rows:

new_df = df[(df["x1"]>2) | (df["x2"]=="abc")]
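
A minimal, self-contained sketch of the row filter, assuming a small made-up DataFrame. The column names x1 and x2 come from the note; the values are hypothetical. Each condition needs its own parentheses because | and & bind more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "x1": [1, 3, 0, 5],
    "x2": ["abc", "def", "xyz", "abc"],
})

# Keep rows where x1 > 2 OR x2 == "abc".
new_df = df[(df["x1"] > 2) | (df["x2"] == "abc")]

# Use & for AND: keep rows where both conditions hold.
both = df[(df["x1"] > 2) & (df["x2"] == "abc")]
```

The filtered frames keep the original index labels; call reset_index(drop=True) if a fresh 0..n-1 index is needed.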

HIVE

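Hive supports syntax such as "b" in ("b","a","c")
Every parenthesized select statement needs an alias after the closing parenthesis, because Hive treats each parenthesized query as a table
null <> 1 returns null
union all cannot combine parenthesized queries; in other words, it cannot directly combine two tables
"04151234">="0415" returns true, but "04151234"<="0415" returns false, because strings are compared lexicographically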

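The two comparison notes above can be sanity-checked outside Hive. This is an illustrative sketch in Python, not Hive code: Python compares these ASCII strings lexicographically as well, so "0415", being a proper prefix of "04151234", sorts before it. The helper names sql_not_equal and has_prefix are hypothetical, and None only loosely stands in for SQL null.

```python
date_key = "04151234"
prefix = "0415"

# Lexicographic comparison: a proper prefix sorts before the longer string.
ge = date_key >= prefix   # True
le = date_key <= prefix   # False


def has_prefix(value, start):
    """Return True when value begins with start (hypothetical helper)."""
    return value.startswith(start)


def sql_not_equal(a, b):
    """Mimic SQL three-valued logic for <>: any null operand yields null."""
    if a is None or b is None:
        return None
    return a != b


null_result = sql_not_equal(None, 1)   # None, mirroring null <> 1
plain_result = sql_not_equal(2, 1)     # True
```

Because a null result is not true, a where clause on such a comparison drops the row; an explicit null check is needed if those rows should be kept.
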
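Modeling

Variable selection

Whether or not an L1 penalty is used, some variables must be removed to get good results, including:
- char-type variables with a very large number of distinct values, for example the second-level category types in the 挖财 bookkeeping app; many of these are manually defined and carry no statistical meaning
- Under an L1 penalty, the significance values of the variable coefficients look very strange: many are close to 1, so they are not a useful reference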
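
One way to apply the first point is to drop non-numeric columns whose number of distinct values exceeds a cutoff. The sketch below assumes pandas; the function name drop_high_cardinality, the max_levels cutoff, and the sample data are all hypothetical, and a sensible cutoff depends on the dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def drop_high_cardinality(df, max_levels=50):
    """Drop non-numeric columns with more than max_levels distinct values.

    Returns the reduced frame and the list of dropped column names.
    """
    too_many = [
        col for col in df.columns
        if not is_numeric_dtype(df[col]) and df[col].nunique() > max_levels
    ]
    return df.drop(columns=too_many), too_many


# Hypothetical data: sub_category has one distinct value per row.
df = pd.DataFrame({
    "sub_category": ["a", "b", "c", "d", "e", "f"],
    "channel": ["web", "app", "web", "app", "web", "app"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

reduced, dropped = drop_high_cardinality(df, max_levels=3)
```

Returning the dropped names makes it easy to apply the same column removal to the test set.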

logistic regression

python statsmodels:

import statsmodels.api as sm
x = ins_features
# add an intercept column as the last column
x = sm.add_constant(x, prepend=False)
y = ins_target
# logistic regression with an L1 penalty
LR_model = sm.Logit(y, x).fit_regularized(method='l1', alpha=20)
print(LR_model.params)
print(LR_model.summary())
#score (test_X needs the same columns as x, including the constant)
y_predicted = LR_model.predict(test_X)
#save and load model
LR_model.save("abc.txt")
LR_model = sm.load("abc.txt")

python sklearn

from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression()
y = train_df["target_train"]
X = train_df[...]
LR_model.fit(X,y)
#pickle LR_model
#test: column 1 of predict_proba is the probability of the positive class
y_predicted = LR_model.predict_proba(test_dataframe)[:, 1]
#save and load model: using python pickle
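
The pickle step mentioned in the last comment could look like the sketch below. It fits a model on a tiny made-up dataset so that it runs on its own; the file name lr_model.pkl and the data are assumptions, not part of the original notes.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, only so the example is runnable.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Save the fitted model to disk.
path = os.path.join(tempfile.mkdtemp(), "lr_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and score with the restored copy.
with open(path, "rb") as f:
    restored = pickle.load(f)

proba = restored.predict_proba(X)[:, 1]
```

Only unpickle files from a trusted source, and load them with the same scikit-learn version that wrote them where possible.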

decision tree

sklearn cannot use categorical variables directly; they need to be converted with DictVectorizer
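
A minimal sketch of that conversion, assuming each row is available as a dict. DictVectorizer one-hot encodes string values and passes numeric values through. The field names city and age and all values are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: one categorical and one numeric field.
rows = [
    {"city": "hangzhou", "age": 25},
    {"city": "beijing", "age": 40},
    {"city": "hangzhou", "age": 31},
    {"city": "shanghai", "age": 52},
]
labels = [0, 1, 0, 1]

# sparse=False returns a dense array, which is easier to inspect.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
# vec.feature_names_ is
# ['age', 'city=beijing', 'city=hangzhou', 'city=shanghai']

tree = DecisionTreeClassifier(random_state=0).fit(X, labels)

# New rows must go through the same fitted vectorizer.
pred = tree.predict(vec.transform([{"city": "hangzhou", "age": 28}]))
```

For a DataFrame, df.to_dict("records") produces the list of dicts that DictVectorizer expects.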

Xingxing's study notes
