Pandas 100道通关题(二)

pandas

Series and DatetimeIndex

Exercises for creating and manipulating Series with datetime data

Difficulty: easy/medium

pandas is fantastic for working with dates and times. These puzzles explore some of this functionality.

33. Create a DatetimeIndex that contains each business day of 2015 and use it to index a Series of random numbers. Let's call this Series s.

dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B') 
s = pd.Series(np.random.rand(len(dti)), index=dti)
s

34. Find the sum of the values in s for every Wednesday.

s[s.index.weekday == 2].sum() 

35. For each calendar month in s, find the mean of values.

s.resample('M').mean()

36. For each group of four consecutive calendar months in s, find the date on which the highest value occurred.

s.groupby(pd.Grouper(freq='4M')).idxmax()

37. Create a DateTimeIndex consisting of the third Thursday in each month for the years 2015 and 2016.

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

Cleaning Data

Making a DataFrame easier to work with

Difficulty: easy/medium

It happens all the time: someone gives you data containing malformed strings, Python, lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12\. Air France', '"Swiss Air"']})

38. Some values in the the FlightNumber column are missing (they are NaN). These numbers are meant to increase by 10 with each row so 10055 and 10075 need to be put in place. Modify df to fill in these missing numbers and make the column an integer column (instead of a float column).

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
​
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

39. The From_To column would be better as two separate columns! Split each string on the underscore delimiter _ to give a new temporary DataFrame called 'temp' with the correct values. Assign the correct column names 'From' and 'To' to this temporary DataFrame.

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp

40. Notice how the capitalisation of the city names is all mixed up in this temporary DataFrame 'temp'. Standardise the strings so that only the first letter is uppercase (e.g. "londON" should become "London".)

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp

41. Delete the From_To column from 41. Delete the From_To column from df and attach the temporary DataFrame 'temp' from the previous questions.df and attach the temporary DataFrame from the previous questions.

df = df.drop('From_To', axis=1)
df = df.join(temp)
df

42. In the Airline column, you can see some extra puctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. '(British Airways. )' should become 'British Airways'.


df['Airline'] = df['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip()
# note: using .strip() gets rid of any leading/trailing spaces
df

43. In the RecentDelays column, the values have been entered into the DataFrame as a list. We would like each first value in its own column, each second value in its own column, and so on. If there isn't an Nth value, the value should be NaN.

Expand the Series of lists into a new DataFrame named 'delays', rename the columns 'delay_1', 'delay_2', etc. and replace the unwanted RecentDelays column in df with 'delays'.

# there are several ways to do this, but the following approach is possibly the simplest
​
delays = df['RecentDelays'].apply(pd.Series)
​
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]
​
df = df.drop('RecentDelays', axis=1).join(delays)
​
df

Using MultiIndexes

Go beyond flat DataFrames with additional index levels

Difficulty: medium

Previous exercises have seen us analysing data from DataFrames equipped with a single index level. However, pandas also gives you the possibilty of indexing your data using multiple levels. This is very much like adding new dimensions to a Series or a DataFrame. For example, a Series is 1D, but by using a MultiIndex with 2 levels we gain of much the same functionality as a 2D DataFrame.

The set of puzzles below explores how you might use multiple index levels to enhance data analysis.

To warm up, we'll look make a Series with two index levels.

44. Given the lists letters = ['A', 'B', 'C'] and numbers = list(range(10)), construct a MultiIndex object from the product of the two lists. Use it to index a Series of random numbers. Call this Series s.


letters = ['A', 'B', 'C']
numbers = list(range(10))
​
mi = pd.MultiIndex.from_product([letters, numbers])
s = pd.Series(np.random.rand(30), index=mi)
s

45. Check the index of s is lexicographically sorted (this is a necessary proprty for indexing to work correctly with a MultiIndex).

s.index.is_lexsorted()
​
# or more verbosely...
s.index.lexsort_depth == s.index.nlevels

46. Select the labels 1, 3 and 6 from the second level of the MultiIndexed Series.

s.loc[:, [1, 3, 6]]

47. Slice the Series s; slice up to label 'B' for the first level and from label 5 onwards for the second level.

s.loc[pd.IndexSlice[:'B', 5:]]
​
# or equivalently without IndexSlice...
s.loc[slice(None, 'B'), slice(5, None)]

48. Sum the values in s for each label in the first level (you should have Series giving you a total for labels A, B and C).

s.sum(level=0)

49. Suppose that sum() (and other methods) did not accept a level keyword argument. How else could you perform the equivalent of s.sum(level=1)?

# One way is to use .unstack()... 
# This method should convince you that s is essentially just a regular DataFrame in disguise!
s.unstack().sum(axis=0)

50. Exchange the levels of the MultiIndex so we have an index of the form (letters, numbers). Is this new Series properly lexsorted? If not, sort it.

new_s = s.swaplevel(0, 1)
​
if not new_s.index.is_lexsorted():
    new_s = new_s.sort_index()
​
new_s

https://github.com/ajcr/100-pandas-puzzles

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容

  • pyspark.sql模块 模块上下文 Spark SQL和DataFrames的重要类: pyspark.sql...
    mpro阅读 9,505评论 0 13
  • 这次假期回来和很久未见的朋友见面,是格外高兴。上次见面还是过年,昨天约好去本地的一所大学送物品,然后聚聚...
    南山默少阅读 147评论 0 0
  • 当人们发现金矿时,对金子有两种界定的方法,一是谁发现了金矿,整个金矿就归谁;第二是谁挖到金子,金子就归谁。这...
    luckysusan1991阅读 967评论 2 2
  • 这个冬天太温暖, 感受了生命初衷的自然情绪, 了然孕育四季传奇的冬日守护。 历练时光后的深刻友爱, 惊奇在一个毫无...
    心羽自心阅读 67评论 0 0
  • 清晨自问(此部分可不写,心中自问即可):我今天的目标是什么? 我的终极人生目标是什么? √做一个静雅的女子,安守自...
    悦读的清风月影阅读 168评论 0 0