When we need to reshape dataframe such that the rearranged dataframe can satisfy the needs for information dissemination or for data analytics, there are two direction of pivoting table:
- Pivoting from rows in a datasetset into columns
- Pivoting from columns in a dataset into rows
The former rearrangment of dataset is referred to Pivoting Long to Wide Format, and the latter one is referred to Pivoting Wide to Long Format.
So what is a wide table and what is a long table?
Long Table and Wide Table
Long Table
Long table is an unprocessed, original dataset, so it is the best candidate for data analytics.
Scores
student | school | class | grade |
---|---|---|---|
Andy | Z | english | 90 |
Bernie | Y | english | 80 |
Cindy | Z | english | 70 |
Deb | Y | english | 65 |
Andy | Z | math | 60 |
Bernie | Y | math | 55 |
Cindy | Z | math | 100 |
Deb | Y | math | 98 |
Andy | Z | physics | 87 |
Bernie | Y | physics | 100 |
Cindy | Z | physics | 45 |
Deb | Y | physics | 99 |
Products
product_id | store | price |
---|---|---|
0 | store1 | 95 |
0 | store3 | 105 |
0 | store2 | 100 |
1 | store1 | 70 |
1 | store3 | 80 |
Wide Table
Wide table is used for disseminating information because the data is usually processed in a way that can increase people's knowledge of something. Also, wide table is a particular form of dataframe containing one column by one dimension and indexed by another dimension in the date column. For example, there are two typical wide tables:
Scores
product_id | English | Math | Physics |
---|---|---|---|
Andy | 90 | 60 | 87 |
Bernie | 80 | 55 | 100 |
Cindy | 70 | 100 | 45 |
Deb | 65 | 98 | 99 |
The table shows scores per subject indexed by each student.
Products
product_id | store1 | store2 | store3 |
---|---|---|---|
0 | 95 | 100 | 105 |
1 | 70 | null | 80 |
The Products table contains each column per store indexed by per product. Looking at the table, we can easily know what are the prices of each product across all stores.
Converting between Long Table and Wide Table
Let's say we have the two unprocessed dataframes mentioned above, products
and scores
. Now we can convert them into wide table at will and perform the reverse operation, coverting the wide table to long format.
Converting to Wide Format
First, Let's take the products
table as an example to demonstrate how the transformation works in Python and MySQL.
MySQL
products
product_id | store | price |
---|---|---|
0 | store1 | 95 |
0 | store3 | 105 |
0 | store2 | 100 |
1 | store1 | 70 |
1 | store3 | 80 |
Now we need to write an SQL query to find the price of each product in each store.
Frist let's quickly review on the essence of wide table, which shows information per column and indexed by a particular dimension. So our goal is to group the price data by product_id and create 3 new fileds for store1, store2, and store3.
So how to create fields that can automatically identify which record is store1 or store2?
The answer is write a control flow. In MySQL, we need to use case when / if
to create the required fields. Either way works in this case.
First we group by the dataset by product_id
so that we can index each record by product_id.
select
xx
from products
group by product_id
Then, based on the code chunk above, we need to write control fllow to create new fields for store1, store2, and store3. Here I use case when
as the demonstration:
select
product_id,
case when store="store1" then price else null end as `store1`,
case when store="store2" then price else null end as `store2`,
case when store="store3" then price else null end as `store3`,
from products
group by product_id
Are we done? Absolutely not, we must use aggregation function after we group by the dataset; otherwise, it only returns one record. Ant it does not matter what kinds of aggregate function
we use, we can either write max
or min
.
select
product_id,
max(case when store="store1" then price else null end) as `store1`,
max(case when store="store2" then price else null end) as `store2`,
max(case when store="store3" then price else null end) as `store3`,
from products
group by product_id
The output is as follows:
product_id | store1 | store2 | store3 |
---|---|---|---|
0 | 95 | 100 | 105 |
1 | 70 | null | 80 |
Python
In python, the question becomes extremely simple because pandas has its built-in function especially for pivoting dataframe. We can either use the top level function pd.pivot_table
or the instance function df.pivot_table
. Either way works.
Here is the syntax for pivotting long to wide format.
pd.pivot_table(df,index,columns,values,margins,aggfunc,fill_value)
-
df
is the dataframe needed to be transformed, that is a long table -
index
means what field we use to index the dataframe? It is equivalent togroup by
of MySQL -
columns
means which column we want to display the data by column. It is equivalent to the new fields we create by control flow in MySQL. -
values
means which columns of the dataframe we want to aggregate? -
aggfunc
specifies the aggregate function we want to use, such asmax
,min
,first
,mean
. -
fill_value
specifies whether we want to use certain value to fill with NaN. -
margins
specifies whether we want to add the subtotal.
Now let's pivot the products table to a wide format using this function
pd.pivot(Products,"product_id","store","price")
[Output]
store | store1 | store2 | store3 |
---|---|---|---|
product_id | |||
0 | 95.0 | 100.0 | 105.0 |
1 | 70.0 | NaN | 80.0 |
Converting to Long Format
Now the long table has been transformed to the wide format, we can also recover it if we want.
MySQL
Write an SQL query to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table.
In MySQL, the job is rather boring. We only need to use union
to concatenate all possible store tables. I directly show how it works:
Based on the wide table above, first, we filter out the product price of products sold in store1 and name it as price
, and we create store
as a new field.
select
product_id,
"store1" as store,
store1 as price
from products
where store1 is not null
Then we write all possible situations and concatenate them
select
product_id,
"store1" as store,
store1 as price
from products
where store1 is not null
union all
select
product_id,
"store2" as store,
store2 as price
from products
where store2 is not null
union all
select
product_id,
"store3" as store,
store3 as price
from products
where store3 is not null;
Python
Python has a built-in function, pd.melt()
, to convert a wide format table to long table.
Here's the syntax
pd.melt(df,id_vars,value_vars,var_name,value_name,ignore_index)
-
df
is the wide table needed to be melted -
id_vars
specifies the row of the new dataframe -
value_vars
specifies which columns need to be unpivoted. If not specified, all other columns other than id_vars will be unpivoted. -
value_name
specifies the name of new value column -
var_name
specifies the name of new variable column -
ignore_index
specifies whether we ignore the orginal index. Default isTrue
Now Let's unpivot the dataframe
melted_df = pd.melt(Products,id_vars=["product_id"],
value_vars=["store1","store2","store3"],
var_name="store",value_name="price",
ignore_index=True)
Then we filter out any null values
melted_df = pd.melt(Products,id_vars=["product_id"],
value_vars=["store1","store2","store3"],
var_name="store",value_name="price",
ignore_index=True)
result = melted_df[melted_df["price"].notnull()].sort_values(by="product_id")
result
[OUTPUT]
product_id | store | price |
---|---|---|
0 | store1 | 95.0 |
0 | store2 | 100.0 |
0 | store3 | 105.0 |
1 | store1 | 70.0 |
1 | store3 | 80.0 |
We can do the same transformation for the score
table using the same way, which is rather tedious. Now, let's try much more complicated question involving advanced use of pandas and mysql.
Advanced Use of Pandas and MySQL in Pivotting
Let's try some difficult questions in Leetcode, and I will use Python and MySQL to solve these questions to show how MySQL and Pandas work for pivotting and unpivotting.
Question1: Students Report By Geography
MySQL
A U.S graduate school has students from Asia, Europe and America. The students' location information are stored in table student as below.
student
name | continent |
---|---|
Jack | America |
Pascal | Europe |
Xi | Asia |
Jane | America |
We need to pivot the continent column in this table so that each name is sorted alphabetically and displayed underneath its corresponding continent. The output headers should be America, Asia and Europe respectively. It is guaranteed that the student number from America is no less than either Asia or Europe.
The result should be like this:
America | Asia | Europe |
---|---|---|
Jack | Xi | Pascal |
Jane |
Explanation
We can easily find that this is a typical question of converting long to wide format. So the method of solving this quesiont is absolutely group by
and control flow
.
But the problem is that the original table only has two columns, one used for displaying data by columns and the other one for aggregating. So we do not have a column that we can use to group the dataset. In this case, we need to construct such column ourselves.
So we need to first rank the name by continent in an ascending order, so we will have the result:
Jack is ranked as 1, while Jane is ranked as 2;
Pascal is ranked as 1 because Europe has only one student displayed
;
Xi is ranked as 1 because Asia has only one student displayed.
Then, we can group the dataset by the rank column and use exactly same way we have practiced to solve this question.
Code
To get a rank field, we can either use window function or variables.
Window Function
select
*,
row_number() over (partition by continent order by name asc) as "rank"
from student
We can also define variables to add rank field
select
*,
case when @cont=continent then @rk:=@rk+1
else @rk:=1
end as "rank",
@cont:=continent
from student,(select @rk:=0,@cont:=null) t
order by continent, name asc
Either way, we can have the following table:
name | continent | Rank |
---|---|---|
Jack | America | 1 |
Jane | America | 2 |
Pascal | Europe | 1 |
Xi | Asia | 1 |
Now the left job is just to reproduce the code I have written in demonstration part.
window function version
# build one temp table to store the table containing rank
with temp as (
select
*,
row_number() over (partition by continent order by name asc) as "rank"
from student)
select
max(case when continent="America" then name else null end) as America,
max(case when continent="Asia" then name else null end) as Asia,
max(case when continent="Europe" then name else null end) as Europe
from temp
group by rank
variable version
with temp as (select
*,
case when @continent=continent then @rk:=@rk+1 else @rk:=1 end as rk,
@continent:=continent
from student,(select @continent:=null,@rk:=0) t
order by continent asc,name asc)
select
max(case when continent="America" then name else null end) as America,
max(case when continent="Asia" then name else null end) as Asia,
max(case when continent="Europe" then name else null end) as Europe
from temp
group by rk;
Note: DO NOT FORGET TO ADD AGGREGATE FUNCTION
Python
Now, let me use python to solve this question, which can be simpler than solution of MySQL.
Same way. We need to use pd.pivot_table
to pivot the dataframe.
# first we prepare the dataframe
Student = pd.DataFrame({"name":["Jack","Pascal","Xi","Jane"],
"continent":["America","Europe","Asia","America"]})
Now we copy the dataframe to avoid any manipulation affecting the original dataste
student = Student.copy()
Then, we also need create one column used to group the data. Here's one powerful function rank
# set ascending=True to ensure that we order name alphabetically
student["rank"] = student["name"].groupby(student["continent"]).\
rank(method="first",ascending=True)
Next, we call pd.pivot_table
to perform pivotting.
result = student.pivot_table(
index="rank",
columns="continent",
values="name",
aggfunc="max",
fill_value="")
[Ouput]
continent | America | Asia | Europe |
---|---|---|---|
rank | |||
1.0 | Jack | Xi | Pascal |
2.0 | Jane |
Tweak the result a little bit
res = result.reset_index().drop("rank",axis=1)
res.columns.names = ""
res
[Output]
America | Asia | Europe | |
---|---|---|---|
0 | Jack | Xi | Pascal |
1 | Jane |
Question 2:Sales by Day of the Week
Orders
order_id | customer_id | order_date | item_id | quantity |
---|---|---|---|---|
1 | 1 | 2020-06-01 | 1 | 10 |
2 | 1 | 2020-06-08 | 2 | 10 |
3 | 2 | 2020-06-02 | 1 | 5 |
4 | 3 | 2020-06-03 | 3 | 5 |
5 | 4 | 2020-06-04 | 4 | 1 |
6 | 4 | 2020-06-05 | 5 | 5 |
7 | 5 | 2020-06-05 | 1 | 10 |
8 | 5 | 2020-06-14 | 4 | 5 |
9 | 5 | 2020-06-21 | 3 | 5 |
items
item_id | item_name | item_category |
---|---|---|
1 | LC Alg. Book | Book |
2 | LC DB. Book | Book |
3 | LC SmarthPhone | Phone |
4 | LC Phone 2020 | Phone |
5 | LC SmartGlass | Glasses |
6 | LC T-Shirt XL | T-Shirt |
You are the business owner and would like to obtain a sales report for category items and day of the week. Write an SQL query to report how many units in each category have been ordered on each day of the week. Return the result table ordered by category. The query result format is in the following example:
Solution
The result table requires us to display the sales indexed by each category of items per day of a week. So this is a typical question of pivoting long table to wide table. So
- First Setp: Group by the data by category**
select
xx
from orders o
right join items i
on o.item_id = i.item_id
group by i.item_category
order by category
- Second Step: Basd on the joined table, write a control flow to automatically calculate the sum of quantity for each category
select
i.item_category category,
sum(if(weekday(o.order_date)=0,o.quantity,0)) 'Monday',
sum(if(weekday(o.order_date)=1,o.quantity,0)) 'Tuesday',
sum(if(weekday(o.order_date)=2,o.quantity,0)) 'Wednesday',
sum(if(weekday(o.order_date)=3,o.quantity,0)) 'Thursday',
sum(if(weekday(o.order_date)=4,o.quantity,0)) 'Friday',
sum(if(weekday(o.order_date)=5,o.quantity,0)) 'Saturday',
sum(if(weekday(o.order_date)=6,o.quantity,0)) 'Sunday'
from orders o
right join items i
on o.item_id = i.item_id
group by item_category
order by category