Pivoting Fully Explained Using Python and MySQL

When we need to reshape dataframe such that the rearranged dataframe can satisfy the needs for information dissemination or for data analytics, there are two direction of pivoting table:

  • Pivoting from rows in a datasetset into columns
  • Pivoting from columns in a dataset into rows

The former rearrangment of dataset is referred to Pivoting Long to Wide Format, and the latter one is referred to Pivoting Wide to Long Format.

So what is a wide table and what is a long table?

Long Table and Wide Table

Long Table

Long table is an unprocessed, original dataset, so it is the best candidate for data analytics.

Scores

student school class grade
Andy Z english 90
Bernie Y english 80
Cindy Z english 70
Deb Y english 65
Andy Z math 60
Bernie Y math 55
Cindy Z math 100
Deb Y math 98
Andy Z physics 87
Bernie Y physics 100
Cindy Z physics 45
Deb Y physics 99

Products

product_id store price
0 store1 95
0 store3 105
0 store2 100
1 store1 70
1 store3 80

Wide Table

Wide table is used for disseminating information because the data is usually processed in a way that can increase people's knowledge of something. Also, wide table is a particular form of dataframe containing one column by one dimension and indexed by another dimension in the date column. For example, there are two typical wide tables:

Scores

product_id English Math Physics
Andy 90 60 87
Bernie 80 55 100
Cindy 70 100 45
Deb 65 98 99

The table shows scores per subject indexed by each student.

Products

product_id store1 store2 store3
0 95 100 105
1 70 null 80

The Products table contains each column per store indexed by per product. Looking at the table, we can easily know what are the prices of each product across all stores.

Converting between Long Table and Wide Table

Let's say we have the two unprocessed dataframes mentioned above, products and scores. Now we can convert them into wide table at will and perform the reverse operation, coverting the wide table to long format.

Converting to Wide Format

First, Let's take the products table as an example to demonstrate how the transformation works in Python and MySQL.

MySQL

products

product_id store price
0 store1 95
0 store3 105
0 store2 100
1 store1 70
1 store3 80

Now we need to write an SQL query to find the price of each product in each store.

Frist let's quickly review on the essence of wide table, which shows information per column and indexed by a particular dimension. So our goal is to group the price data by product_id and create 3 new fileds for store1, store2, and store3.

So how to create fields that can automatically identify which record is store1 or store2?
The answer is write a control flow. In MySQL, we need to use case when / if to create the required fields. Either way works in this case.

First we group by the dataset by product_id so that we can index each record by product_id.

select
  xx
from products
group by product_id

Then, based on the code chunk above, we need to write control fllow to create new fields for store1, store2, and store3. Here I use case when as the demonstration:

select
  product_id,
  case when store="store1" then price else null end as `store1`,
  case when store="store2" then price else null end as `store2`,
  case when store="store3" then price else null end as `store3`,
from products
group by product_id

Are we done? Absolutely not, we must use aggregation function after we group by the dataset; otherwise, it only returns one record. Ant it does not matter what kinds of aggregate function we use, we can either write max or min.

select
  product_id,
  max(case when store="store1" then price else null end) as `store1`,
  max(case when store="store2" then price else null end) as `store2`,
  max(case when store="store3" then price else null end) as `store3`,
from products
group by product_id

The output is as follows:

product_id store1 store2 store3
0 95 100 105
1 70 null 80

Python

In python, the question becomes extremely simple because pandas has its built-in function especially for pivoting dataframe. We can either use the top level function pd.pivot_table or the instance function df.pivot_table. Either way works.

Here is the syntax for pivotting long to wide format.
pd.pivot_table(df,index,columns,values,margins,aggfunc,fill_value)

  • df is the dataframe needed to be transformed, that is a long table
  • index means what field we use to index the dataframe? It is equivalent to group by of MySQL
  • columns means which column we want to display the data by column. It is equivalent to the new fields we create by control flow in MySQL.
  • values means which columns of the dataframe we want to aggregate?
  • aggfunc specifies the aggregate function we want to use, such as max,min,first,mean.
  • fill_value specifies whether we want to use certain value to fill with NaN.
  • margins specifies whether we want to add the subtotal.

Now let's pivot the products table to a wide format using this function

pd.pivot(Products,"product_id","store","price")

[Output]

store store1 store2 store3
product_id
0 95.0 100.0 105.0
1 70.0 NaN 80.0

Converting to Long Format

Now the long table has been transformed to the wide format, we can also recover it if we want.

MySQL

Write an SQL query to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table.

In MySQL, the job is rather boring. We only need to use union to concatenate all possible store tables. I directly show how it works:

Based on the wide table above, first, we filter out the product price of products sold in store1 and name it as price, and we create store as a new field.


select
    product_id,
    "store1" as store,
    store1 as price
from products
where store1 is not null

Then we write all possible situations and concatenate them


select
    product_id,
    "store1" as store,
    store1 as price
from products
where store1 is not null
union all
select
    product_id,
    "store2" as store,
    store2 as price
from products
where store2 is not null
union all
select
    product_id,
    "store3" as store,
    store3 as price
from products
where store3 is not null;

Python

Python has a built-in function, pd.melt(), to convert a wide format table to long table.

Here's the syntax
pd.melt(df,id_vars,value_vars,var_name,value_name,ignore_index)

  • df is the wide table needed to be melted
  • id_vars specifies the row of the new dataframe
  • value_vars specifies which columns need to be unpivoted. If not specified, all other columns other than id_vars will be unpivoted.
  • value_name specifies the name of new value column
  • var_name specifies the name of new variable column
  • ignore_index specifies whether we ignore the orginal index. Default is True

Now Let's unpivot the dataframe

melted_df = pd.melt(Products,id_vars=["product_id"],
                    value_vars=["store1","store2","store3"],
                    var_name="store",value_name="price",
                   ignore_index=True)

Then we filter out any null values

melted_df = pd.melt(Products,id_vars=["product_id"],
                    value_vars=["store1","store2","store3"],
                    var_name="store",value_name="price",
                   ignore_index=True)
result = melted_df[melted_df["price"].notnull()].sort_values(by="product_id")
result

[OUTPUT]

product_id store price
0 store1 95.0
0 store2 100.0
0 store3 105.0
1 store1 70.0
1 store3 80.0

We can do the same transformation for the score table using the same way, which is rather tedious. Now, let's try much more complicated question involving advanced use of pandas and mysql.

Advanced Use of Pandas and MySQL in Pivotting

Let's try some difficult questions in Leetcode, and I will use Python and MySQL to solve these questions to show how MySQL and Pandas work for pivotting and unpivotting.

Question1: Students Report By Geography

MySQL

A U.S graduate school has students from Asia, Europe and America. The students' location information are stored in table student as below.
student

name continent
Jack America
Pascal Europe
Xi Asia
Jane America

We need to pivot the continent column in this table so that each name is sorted alphabetically and displayed underneath its corresponding continent. The output headers should be America, Asia and Europe respectively. It is guaranteed that the student number from America is no less than either Asia or Europe.

The result should be like this:

America Asia Europe
Jack Xi Pascal
Jane

Explanation
We can easily find that this is a typical question of converting long to wide format. So the method of solving this quesiont is absolutely group by and control flow.
But the problem is that the original table only has two columns, one used for displaying data by columns and the other one for aggregating. So we do not have a column that we can use to group the dataset. In this case, we need to construct such column ourselves.

So we need to first rank the name by continent in an ascending order, so we will have the result:
Jack is ranked as 1, while Jane is ranked as 2;
Pascal is ranked as 1 because Europe has only one student displayed;
Xi is ranked as 1 because Asia has only one student displayed.

Then, we can group the dataset by the rank column and use exactly same way we have practiced to solve this question.

Code

To get a rank field, we can either use window function or variables.

Window Function

select
  *,
  row_number() over (partition by continent order by name asc) as "rank"
from student

We can also define variables to add rank field

select
  *,
  case when @cont=continent then @rk:=@rk+1
          else @rk:=1
  end as "rank",
  @cont:=continent
from student,(select @rk:=0,@cont:=null) t
order by continent, name asc

Either way, we can have the following table:

name continent Rank
Jack America 1
Jane America 2
Pascal Europe 1
Xi Asia 1

Now the left job is just to reproduce the code I have written in demonstration part.

window function version

# build one temp table to store the table containing rank
with temp as (
select
*,
  row_number() over (partition by continent order by name asc) as "rank"
from student)

select
  max(case when continent="America" then name else null end) as America,
  max(case when continent="Asia" then name else null end) as Asia,
  max(case when continent="Europe" then name else null end) as Europe
from temp
group by rank

variable version

with temp as (select
                *,
                case when @continent=continent then @rk:=@rk+1 else @rk:=1 end as rk,
                @continent:=continent
            from student,(select @continent:=null,@rk:=0) t
            order by continent asc,name asc)
select
    max(case when continent="America" then name else null end) as America,
    max(case when continent="Asia" then name else null end) as Asia,
    max(case when continent="Europe" then name else null end) as Europe
from temp
group by rk;

Note: DO NOT FORGET TO ADD AGGREGATE FUNCTION

Python

Now, let me use python to solve this question, which can be simpler than solution of MySQL.

Same way. We need to use pd.pivot_table to pivot the dataframe.

# first we prepare the dataframe
Student = pd.DataFrame({"name":["Jack","Pascal","Xi","Jane"],
                        "continent":["America","Europe","Asia","America"]})

Now we copy the dataframe to avoid any manipulation affecting the original dataste

student = Student.copy()

Then, we also need create one column used to group the data. Here's one powerful function rank

# set ascending=True to ensure that we order name alphabetically
student["rank"] = student["name"].groupby(student["continent"]).\
rank(method="first",ascending=True) 

Next, we call pd.pivot_table to perform pivotting.

result = student.pivot_table(
    index="rank",
    columns="continent",
    values="name",
    aggfunc="max",
    fill_value="")

[Ouput]

continent America Asia Europe
rank
1.0 Jack Xi Pascal
2.0 Jane

Tweak the result a little bit

res = result.reset_index().drop("rank",axis=1)
res.columns.names = ""
res

[Output]

America Asia Europe
0 Jack Xi Pascal
1 Jane

Question 2:Sales by Day of the Week

Orders

order_id customer_id order_date item_id quantity
1 1 2020-06-01 1 10
2 1 2020-06-08 2 10
3 2 2020-06-02 1 5
4 3 2020-06-03 3 5
5 4 2020-06-04 4 1
6 4 2020-06-05 5 5
7 5 2020-06-05 1 10
8 5 2020-06-14 4 5
9 5 2020-06-21 3 5

items

item_id item_name item_category
1 LC Alg. Book Book
2 LC DB. Book Book
3 LC SmarthPhone Phone
4 LC Phone 2020 Phone
5 LC SmartGlass Glasses
6 LC T-Shirt XL T-Shirt

You are the business owner and would like to obtain a sales report for category items and day of the week. Write an SQL query to report how many units in each category have been ordered on each day of the week. Return the result table ordered by category. The query result format is in the following example:

Solution
The result table requires us to display the sales indexed by each category of items per day of a week. So this is a typical question of pivoting long table to wide table. So

  • First Setp: Group by the data by category**
select
  xx
from orders o 
right join items i 
on o.item_id = i.item_id
group by i.item_category
order by category
  • Second Step: Basd on the joined table, write a control flow to automatically calculate the sum of quantity for each category
select 
        i.item_category category,
        sum(if(weekday(o.order_date)=0,o.quantity,0)) 'Monday',
        sum(if(weekday(o.order_date)=1,o.quantity,0)) 'Tuesday',
        sum(if(weekday(o.order_date)=2,o.quantity,0)) 'Wednesday',
        sum(if(weekday(o.order_date)=3,o.quantity,0)) 'Thursday',
        sum(if(weekday(o.order_date)=4,o.quantity,0)) 'Friday',
        sum(if(weekday(o.order_date)=5,o.quantity,0)) 'Saturday',
        sum(if(weekday(o.order_date)=6,o.quantity,0)) 'Sunday'
from orders o 
right join items i 
on o.item_id = i.item_id
group by item_category
order by category
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 217,406评论 6 503
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,732评论 3 393
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 163,711评论 0 353
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,380评论 1 293
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,432评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,301评论 1 301
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,145评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,008评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,443评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,649评论 3 334
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,795评论 1 347
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,501评论 5 345
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,119评论 3 328
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,731评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,865评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,899评论 2 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,724评论 2 354

推荐阅读更多精彩内容