1. Read the data and get the unique values of a column
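- pd.get_dummies()
- Converts one column into several binary columns, one per distinct value in that column.
For example, cars["year"].unique() = [1980, 1981, 1982, 1983], i.e. four values.
pd.get_dummies(cars["year"], prefix="year") then returns 4 columns named year_1980, year_1981, year_1982, year_1983 (with year added as the prefix). The values in these columns are 0 or 1 (True/False in pandas 2.0 and later unless dtype=int is passed).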
import pandas as pd
cars = pd.read_csv("auto.csv")
# Distinct values of the "origin" column.
unique_regions = cars["origin"].unique()
print(unique_regions)
# One-hot encode "cylinders" and append the dummy columns.
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)
# One-hot encode "year" and append the dummy columns.
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
# Drop the original columns now that they are encoded.
cars = cars.drop("year", axis=1)
cars = cars.drop("cylinders", axis=1)
print(cars.head())
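As a small illustration of what pd.get_dummies() produces, here is a sketch on a toy frame. The `toy` DataFrame and its values are made up for this example and are not taken from auto.csv. `dtype=int` is passed so that the columns hold 0/1 regardless of the installed pandas version.

```python
import pandas as pd

# Toy data, made up for illustration (not from auto.csv).
toy = pd.DataFrame({"year": [1980, 1981, 1982, 1983, 1980]})

# dtype=int forces 0/1 values; newer pandas versions default to True/False.
dummies = pd.get_dummies(toy["year"], prefix="year", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one 1, in the column that matches its original year value.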
2. Randomly shuffle the index, then split into train and test
import numpy as np
# Shuffle the index labels, then select rows by label with .loc.
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.loc[shuffled_rows]
# Use 70% of the rows as training data.
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]
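The same 70/30 split can be sketched on a toy frame. In this version row positions (rather than index labels) are shuffled, so `iloc` works whatever the index looks like. The fixed seed and the names `toy`, `train_toy`, and `test_toy` are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, made up for illustration.
toy = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

rng = np.random.default_rng(0)         # fixed seed (assumption) for repeatable shuffles
positions = rng.permutation(len(toy))  # shuffled row positions 0..9
shuffled = toy.iloc[positions]

cut = int(len(toy) * 0.70)             # first 70% of the shuffled rows go to training
train_toy = shuffled.iloc[:cut]
test_toy = shuffled.iloc[cut:]

print(len(train_toy), len(test_toy))
```

The two pieces never share a row, and together they cover the whole frame.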
3. Treat origin (1, 2, 3) as the class label and train one model per value: a model for origin=1, a model for origin=2, and a model for origin=3
- Ways to get the column names of a DataFrame:
df.columns.tolist() or df.columns.values.tolist()
list(df)
[c for c in df.columns if c.startswith("prefix1") or c.startswith("prefix2")]
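A quick sketch of these column-name idioms on an empty toy frame. The column names are made up for illustration. Passing a tuple to `str.startswith` is a standard Python shortcut for testing several prefixes at once, shown here as an alternative to chaining `or`.

```python
import pandas as pd

# Toy frame with made-up column names, for illustration only.
df = pd.DataFrame(columns=["mpg", "cyl_4", "cyl_8", "year_70", "origin"])

print(df.columns.tolist())  # all column names as a plain list
print(list(df))             # iterating a DataFrame yields its column names

# str.startswith accepts a tuple, so several prefixes fit in one call.
selected = [c for c in df.columns if c.startswith(("cyl", "year"))]
print(selected)
```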
from sklearn.linear_model import LogisticRegression
unique_origins = cars["origin"].unique()
unique_origins.sort()
models = {}
# Use only the dummy columns as features.
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]
for origin in unique_origins:
    model = LogisticRegression()
    X_train = train[features]
    # Binary target: True where the row belongs to this origin.
    y_train = train["origin"] == origin
    model.fit(X_train, y_train)
    models[origin] = model
Next, compute the predicted probabilities on the test set
testing_probs = pd.DataFrame(columns=unique_origins)
for origin in unique_origins:
    # Select testing features.
    X_test = test[features]
    # Compute probability of observation being in the origin
    # (column 1 of predict_proba is the True class).
    testing_probs[origin] = models[origin].predict_proba(X_test)[:, 1]
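Across the three columns, take the one with the highest probability as the predicted origin
- Method: df.idxmax(axis=1) returns, for each row, the name of the column that holds the maximum value (the first such column if there is a tie)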
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)
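To see what idxmax(axis=1) does, here is a sketch with made-up probabilities. The `probs` frame is hypothetical and stands in for testing_probs. The third row contains a tie, which shows that the first matching column wins.

```python
import pandas as pd

# Made-up probabilities for three rows; the columns are the origin labels 1, 2, 3.
probs = pd.DataFrame({
    1: [0.7, 0.2, 0.4],
    2: [0.2, 0.5, 0.4],
    3: [0.1, 0.3, 0.2],
})

# For each row, the label of the column holding the largest value.
predicted = probs.idxmax(axis=1)
print(predicted.tolist())
```

Because the three models are fitted independently, the probabilities in a row need not sum to 1. Only their relative order matters for the prediction.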