05-multi-category logistic regression

1. Read the data and get the unique values of a column

  • pd.get_dummies()
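  • Converts a single column into several binary columns, one for each distinct value in that column.
    For example, cars["year"].unique() = [1980, 1981, 1982, 1983], i.e. four values,
    so pd.get_dummies(cars["year"], prefix="year") returns 4 columns named year_1980, year_1981, year_1982, year_1983 (with year added as the prefix), and every value in these columns is 0 or 1.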
import pandas as pd
cars = pd.read_csv("auto.csv")
unique_regions = cars["origin"].unique()
print(unique_regions)

dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop("year", axis=1)
cars = cars.drop("cylinders", axis=1)
print(cars.head())
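
To see what get_dummies produces without needing auto.csv, here is a minimal sketch on a made-up frame. The name toy and its values are assumptions for illustration only. dtype=int is passed so the result is literally 0/1; pandas 2.x returns True/False by default.

```python
import pandas as pd

# Hypothetical toy data: five rows, four distinct years
toy = pd.DataFrame({"year": [1980, 1981, 1982, 1983, 1980]})

# One binary column per distinct year, each named with the "year_" prefix
dummies = pd.get_dummies(toy["year"], prefix="year", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Each row contains exactly one 1, in the column that matches that row's original year.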

2. Shuffle the index randomly and split the rows into train and test

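import numpy as np

# np.random.permutation returns the index labels in random order,
# so select the rows by label with .loc
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.loc[shuffled_rows]
# Use 70% of the rows as training data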
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]
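
The same shuffle-and-split idea can be checked on a small frame. This is a sketch under assumptions: the frame toy, its index starting at 100, and the seed 0 are all made up for illustration. Because the permutation holds index labels rather than positions, the rows are selected by label with .loc.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame whose index labels are 100..109, not 0..9
toy = pd.DataFrame({"value": range(10)}, index=range(100, 110))

# Seeded generator so the shuffle is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
shuffled_labels = rng.permutation(toy.index.to_numpy())

# The permutation contains labels, so select by label
shuffled = toy.loc[shuffled_labels]

# First 70% of the shuffled rows for training, the rest for testing
cut = int(toy.shape[0] * 0.70)
train_toy = shuffled.iloc[:cut]
test_toy = shuffled.iloc[cut:]

print(len(train_toy), len(test_toy))
```

Every row ends up in exactly one of the two sets, which is the property the split relies on.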

3. For each origin class (1, 2, 3), train a separate model: one that predicts origin=1, one for origin=2, and one for origin=3

  • Ways to get the column names of a DataFrame
    df.columns.tolist(), df.columns.values.tolist(), or list(df). To keep only the columns with a given prefix: [c for c in df.columns if c.startswith("prefix")]
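
A short sketch of these column-name idioms on an empty frame. The column names below are hypothetical, chosen only to resemble the dummy columns built earlier. Passing a tuple to str.startswith is an alternative to chaining several startswith calls with or.

```python
import pandas as pd

# Hypothetical frame with no rows; only the column names matter here
df = pd.DataFrame(columns=["origin", "cyl_4", "cyl_8", "year_1980", "mpg"])

# Two equivalent ways to get all column names as a plain list
all_names = df.columns.tolist()
same_names = list(df)

# Keep only the columns whose name starts with one of the prefixes
selected = [c for c in df.columns if c.startswith(("cyl", "year"))]

print(all_names)
print(selected)
```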
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()

models = {}
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]

for origin in unique_origins:
    model = LogisticRegression()
    
    X_train = train[features]
    y_train = train["origin"] == origin

    model.fit(X_train, y_train)
    models[origin] = model

Next, compute the predicted probabilities on the test set

testing_probs = pd.DataFrame(columns=unique_origins)  

for origin in unique_origins:
    # Select testing features.
    X_test = test[features]   
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
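
The train-then-score loop above is a one-vs-all scheme: each model answers "is this row class k or not?". Below is a self-contained sketch of the same pattern on synthetic data, assuming scikit-learn is installed. The names toy, f1, f2, label, and the cluster centres are invented for illustration and do not come from auto.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: three classes, 30 points each, around different centres
rng = np.random.default_rng(0)
centers = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (0.0, 5.0)}
X = np.vstack([rng.normal(loc=centers[k], scale=0.5, size=(30, 2)) for k in (1, 2, 3)])
toy = pd.DataFrame(X, columns=["f1", "f2"])
toy["label"] = np.repeat([1, 2, 3], 30)

toy_features = ["f1", "f2"]
toy_models = {}
for k in (1, 2, 3):
    # Binary target: True where the row belongs to class k, False otherwise
    clf = LogisticRegression()
    clf.fit(toy[toy_features], toy["label"] == k)
    toy_models[k] = clf

# Column 1 of predict_proba is the probability of the True class
toy_probs = pd.DataFrame(
    {k: toy_models[k].predict_proba(toy[toy_features])[:, 1] for k in (1, 2, 3)}
)
print(toy_probs.head())
```

Because the three models are fitted independently, the probabilities in a row are not guaranteed to sum to 1; only their relative size is used when picking a class.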

Across the three columns, pick the one with the highest probability as the predicted origin

  • Method: df.idxmax(axis=1): for each row, finds the first column that holds the row's maximum value and returns that column's name
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)
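
A small sketch of how idxmax(axis=1) behaves, including a tie. The probability values below are made up for illustration.

```python
import pandas as pd

# Hypothetical per-class probabilities; columns are the class labels 1, 2, 3
probs = pd.DataFrame({
    1: [0.7, 0.2, 0.4],
    2: [0.2, 0.5, 0.4],
    3: [0.1, 0.3, 0.2],
})

# For each row, return the label of the column with the largest value
predicted = probs.idxmax(axis=1)
print(predicted.tolist())  # [1, 2, 1]
```

In the last row, columns 1 and 2 tie at 0.4, and idxmax returns the first of them, so the column order decides ties.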