Common pandas syntax

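I recently worked through the introductory pandas course on Kaggle and found I was still shaky on a few things, so I wrote up these notes for future reference. In the snippets below, reviews is the wine reviews DataFrame from the course exercises and pd is pandas.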

Indexing and selecting

Select the 'description' column

reviews['description']
reviews.description

Select the first element of the 'description' column
reviews.loc[0, 'description']

Select the first row of the dataframe

reviews.loc[0, :]
reviews.iloc[0, :]

Select the first 10 elements of the 'description' column

reviews['description'].iloc[0:10]
reviews.loc[0:9, 'description']
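
A detail that is easy to trip over here: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch on a made-up frame (toy and its contents are hypothetical stand-ins, not the wine dataset):

```python
import pandas as pd

# Hypothetical toy data with a default 0..19 integer index
toy = pd.DataFrame({"description": [f"wine {i}" for i in range(20)]})

by_position = toy["description"].iloc[0:10]   # positions 0..9, end excluded
by_label = toy.loc[0:9, "description"]        # labels 0..9, end included
one_too_many = toy.loc[0:10, "description"]   # labels 0..10, so 11 rows

print(len(by_position), len(by_label), len(one_too_many))
```

So with a default integer index, "the first N rows" is iloc[0:N] but loc[0:N-1].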

Select rows 1, 2, 3, 5, and 8
reviews.iloc[[1, 2, 3, 5, 8], :]

Select 'country' and 'variety' of the first 100 records
reviews.loc[0:99, ['country', 'variety']]

Select wines made in Italy
reviews[reviews['country'] == 'Italy']

Select entries whose 'region2' is not empty
reviews[reviews.region2.notnull()]

Select the last 1000 entries of the 'points' column
reviews['points'].iloc[-1000:]

Select points for wines made in Italy
reviews[reviews.country == 'Italy'].points

Who produces more above-averagely good wines, France or Italy? Select the country column, but only when said country is one of those two options, and the points column is greater than or equal to 90.
reviews[reviews.country.isin(['France', 'Italy']) & (reviews.points >= 90)].country
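
In Python, & binds more tightly than comparison operators such as >=, so each comparison in a combined mask needs its own parentheses. A small sketch with invented data (the toy frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["France", "Italy", "US", "Italy"],
    "points": [92, 88, 95, 90],
})

# Parenthesize the comparison so it is evaluated before the &
mask = toy.country.isin(["France", "Italy"]) & (toy.points >= 90)
good_countries = toy[mask].country

print(good_countries.value_counts())
```

Calling value_counts() on the result then answers the "who produces more" question directly.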

Summary and maps

What is the median of the points column?
reviews.points.median()

What countries are represented in the dataset?
reviews.country.unique()

What countries appear in the dataset most often?
reviews.country.value_counts()

Remap the price column by subtracting the median price. Use the Series.map method.

m_val = reviews.price.median()
reviews.price.map(lambda x: x - m_val)

Remap the price column by subtracting the median price. Use the apply method (Series.apply here, since only one column is involved).

def f(x):
    return x - m_val
reviews.price.apply(f)
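
For comparison, here is a rough sketch of how Series.map, DataFrame.apply with axis='columns', and plain arithmetic handle the same subtraction. The toy frame and the helper name center_price are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({"price": [10.0, 20.0, 60.0], "points": [85, 90, 95]})
median_price = toy.price.median()

# Series.map: the function receives one value at a time
via_map = toy.price.map(lambda p: p - median_price)

# DataFrame.apply with axis='columns': the function receives one row at a time
def center_price(row):
    row = row.copy()  # avoid modifying the row object passed in
    row["price"] = row["price"] - median_price
    return row

via_apply = toy.apply(center_price, axis="columns")

# Plain arithmetic is vectorized and is usually the simplest option
via_arithmetic = toy.price - median_price
```

All three give the same centered prices; map and apply are mainly useful when the transformation cannot be written as a simple column expression.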

I'm an economical wine buyer. Which wine is the "best bargain", i.e., which wine has the highest points-to-price ratio in the dataset?
reviews.loc[(reviews.points / reviews.price).idxmax()].title
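
The method to remember is idxmax, which returns the index label of the largest value, so it pairs naturally with loc. A minimal sketch on invented data (toy and its titles are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "points": [90, 85, 88],
    "price": [30.0, 10.0, 44.0],
})

ratio = toy.points / toy.price      # points per unit of price
best_label = ratio.idxmax()         # index label of the highest ratio
best_title = toy.loc[best_label, "title"]

print(best_title)
```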

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series counting how many times each of these two words appears in the description column in the dataset.

c_tropical = reviews.description.map(lambda r:'tropical' in r).value_counts()
c_fruity = reviews.description.map(lambda r:'fruity' in r).value_counts()
pd.Series([c_tropical[True], c_fruity[True]], index = ['tropical', 'fruity'])
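
An alternative sketch, assuming the same goal of counting how many descriptions mention each word: Series.str.contains gives a boolean Series, and summing it counts the True values. The sample descriptions below are invented:

```python
import pandas as pd

# Hypothetical descriptions, not taken from the wine dataset
descriptions = pd.Series([
    "tropical fruit aromas",
    "fruity and light",
    "a fruity, tropical finish",
    "dry and earthy",
    "fruity nose",
])

# regex=False treats the pattern as a plain substring
n_tropical = descriptions.str.contains("tropical", regex=False).sum()
n_fruity = descriptions.str.contains("fruity", regex=False).sum()

counts = pd.Series([n_tropical, n_fruity], index=["tropical", "fruity"])
print(counts)
```

Like the version above, this counts descriptions that contain the word at least once, not the total number of occurrences.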

Which combinations of country and variety are most common?
Create a Series whose index consists of strings of the form "<Country> - <Wine Variety>". For example, a pinot noir produced in the US should map to "US - Pinot Noir". The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete country or variety data.

df1 = reviews[(reviews.country.notna()&reviews.variety.notna())]
df = df1.apply(lambda s: s.country+ " - "+s.variety, axis = 'columns')
df.value_counts()
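
The same result can probably be reached without a row-wise apply, since string columns can be concatenated directly. A sketch on invented data (the toy frame is hypothetical), using dropna with subset to discard incomplete rows:

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset
toy = pd.DataFrame({
    "country": ["US", "US", "Italy", None, "France"],
    "variety": ["Pinot Noir", "Pinot Noir", "Nebbiolo", "Merlot", None],
})

# Keep only rows where both columns are present
complete = toy.dropna(subset=["country", "variety"])

# Element-wise string concatenation, then count each label
labels = complete.country + " - " + complete.variety
counts = labels.value_counts()

print(counts)
```

On a large frame the vectorized concatenation tends to be noticeably faster than calling a lambda once per row.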
