Pivoting Fully Explained Using Python and MySQL

When we need to reshape dataframe such that the rearranged dataframe can satisfy the needs for information dissemination or for data analytics, there are two direction of pivoting table:

  • Pivoting from rows in a datasetset into columns
  • Pivoting from columns in a dataset into rows

The former rearrangment of dataset is referred to Pivoting Long to Wide Format, and the latter one is referred to Pivoting Wide to Long Format.

So what is a wide table and what is a long table?

Long Table and Wide Table

Long Table

Long table is an unprocessed, original dataset, so it is the best candidate for data analytics.

Scores

student school class grade
Andy Z english 90
Bernie Y english 80
Cindy Z english 70
Deb Y english 65
Andy Z math 60
Bernie Y math 55
Cindy Z math 100
Deb Y math 98
Andy Z physics 87
Bernie Y physics 100
Cindy Z physics 45
Deb Y physics 99

Products

product_id store price
0 store1 95
0 store3 105
0 store2 100
1 store1 70
1 store3 80

Wide Table

Wide table is used for disseminating information because the data is usually processed in a way that can increase people's knowledge of something. Also, wide table is a particular form of dataframe containing one column by one dimension and indexed by another dimension in the date column. For example, there are two typical wide tables:

Scores

product_id English Math Physics
Andy 90 60 87
Bernie 80 55 100
Cindy 70 100 45
Deb 65 98 99

The table shows scores per subject indexed by each student.

Products

product_id store1 store2 store3
0 95 100 105
1 70 null 80

The Products table contains each column per store indexed by per product. Looking at the table, we can easily know what are the prices of each product across all stores.

Converting between Long Table and Wide Table

Let's say we have the two unprocessed dataframes mentioned above, products and scores. Now we can convert them into wide table at will and perform the reverse operation, coverting the wide table to long format.

Converting to Wide Format

First, Let's take the products table as an example to demonstrate how the transformation works in Python and MySQL.

MySQL

products

product_id store price
0 store1 95
0 store3 105
0 store2 100
1 store1 70
1 store3 80

Now we need to write an SQL query to find the price of each product in each store.

Frist let's quickly review on the essence of wide table, which shows information per column and indexed by a particular dimension. So our goal is to group the price data by product_id and create 3 new fileds for store1, store2, and store3.

So how to create fields that can automatically identify which record is store1 or store2?
The answer is write a control flow. In MySQL, we need to use case when / if to create the required fields. Either way works in this case.

First we group by the dataset by product_id so that we can index each record by product_id.

select
  xx
from products
group by product_id

Then, based on the code chunk above, we need to write control fllow to create new fields for store1, store2, and store3. Here I use case when as the demonstration:

select
  product_id,
  case when store="store1" then price else null end as `store1`,
  case when store="store2" then price else null end as `store2`,
  case when store="store3" then price else null end as `store3`,
from products
group by product_id

Are we done? Absolutely not, we must use aggregation function after we group by the dataset; otherwise, it only returns one record. Ant it does not matter what kinds of aggregate function we use, we can either write max or min.

select
  product_id,
  max(case when store="store1" then price else null end) as `store1`,
  max(case when store="store2" then price else null end) as `store2`,
  max(case when store="store3" then price else null end) as `store3`,
from products
group by product_id

The output is as follows:

product_id store1 store2 store3
0 95 100 105
1 70 null 80

Python

In python, the question becomes extremely simple because pandas has its built-in function especially for pivoting dataframe. We can either use the top level function pd.pivot_table or the instance function df.pivot_table. Either way works.

Here is the syntax for pivotting long to wide format.
pd.pivot_table(df,index,columns,values,margins,aggfunc,fill_value)

  • df is the dataframe needed to be transformed, that is a long table
  • index means what field we use to index the dataframe? It is equivalent to group by of MySQL
  • columns means which column we want to display the data by column. It is equivalent to the new fields we create by control flow in MySQL.
  • values means which columns of the dataframe we want to aggregate?
  • aggfunc specifies the aggregate function we want to use, such as max,min,first,mean.
  • fill_value specifies whether we want to use certain value to fill with NaN.
  • margins specifies whether we want to add the subtotal.

Now let's pivot the products table to a wide format using this function

pd.pivot(Products,"product_id","store","price")

[Output]

store store1 store2 store3
product_id
0 95.0 100.0 105.0
1 70.0 NaN 80.0

Converting to Long Format

Now the long table has been transformed to the wide format, we can also recover it if we want.

MySQL

Write an SQL query to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table.

In MySQL, the job is rather boring. We only need to use union to concatenate all possible store tables. I directly show how it works:

Based on the wide table above, first, we filter out the product price of products sold in store1 and name it as price, and we create store as a new field.


select
    product_id,
    "store1" as store,
    store1 as price
from products
where store1 is not null

Then we write all possible situations and concatenate them


select
    product_id,
    "store1" as store,
    store1 as price
from products
where store1 is not null
union all
select
    product_id,
    "store2" as store,
    store2 as price
from products
where store2 is not null
union all
select
    product_id,
    "store3" as store,
    store3 as price
from products
where store3 is not null;

Python

Python has a built-in function, pd.melt(), to convert a wide format table to long table.

Here's the syntax
pd.melt(df,id_vars,value_vars,var_name,value_name,ignore_index)

  • df is the wide table needed to be melted
  • id_vars specifies the row of the new dataframe
  • value_vars specifies which columns need to be unpivoted. If not specified, all other columns other than id_vars will be unpivoted.
  • value_name specifies the name of new value column
  • var_name specifies the name of new variable column
  • ignore_index specifies whether we ignore the orginal index. Default is True

Now Let's unpivot the dataframe

melted_df = pd.melt(Products,id_vars=["product_id"],
                    value_vars=["store1","store2","store3"],
                    var_name="store",value_name="price",
                   ignore_index=True)

Then we filter out any null values

melted_df = pd.melt(Products,id_vars=["product_id"],
                    value_vars=["store1","store2","store3"],
                    var_name="store",value_name="price",
                   ignore_index=True)
result = melted_df[melted_df["price"].notnull()].sort_values(by="product_id")
result

[OUTPUT]

product_id store price
0 store1 95.0
0 store2 100.0
0 store3 105.0
1 store1 70.0
1 store3 80.0

We can do the same transformation for the score table using the same way, which is rather tedious. Now, let's try much more complicated question involving advanced use of pandas and mysql.

Advanced Use of Pandas and MySQL in Pivotting

Let's try some difficult questions in Leetcode, and I will use Python and MySQL to solve these questions to show how MySQL and Pandas work for pivotting and unpivotting.

Question1: Students Report By Geography

MySQL

A U.S graduate school has students from Asia, Europe and America. The students' location information are stored in table student as below.
student

name continent
Jack America
Pascal Europe
Xi Asia
Jane America

We need to pivot the continent column in this table so that each name is sorted alphabetically and displayed underneath its corresponding continent. The output headers should be America, Asia and Europe respectively. It is guaranteed that the student number from America is no less than either Asia or Europe.

The result should be like this:

America Asia Europe
Jack Xi Pascal
Jane

Explanation
We can easily find that this is a typical question of converting long to wide format. So the method of solving this quesiont is absolutely group by and control flow.
But the problem is that the original table only has two columns, one used for displaying data by columns and the other one for aggregating. So we do not have a column that we can use to group the dataset. In this case, we need to construct such column ourselves.

So we need to first rank the name by continent in an ascending order, so we will have the result:
Jack is ranked as 1, while Jane is ranked as 2;
Pascal is ranked as 1 because Europe has only one student displayed;
Xi is ranked as 1 because Asia has only one student displayed.

Then, we can group the dataset by the rank column and use exactly same way we have practiced to solve this question.

Code

To get a rank field, we can either use window function or variables.

Window Function

select
  *,
  row_number() over (partition by continent order by name asc) as "rank"
from student

We can also define variables to add rank field

select
  *,
  case when @cont=continent then @rk:=@rk+1
          else @rk:=1
  end as "rank",
  @cont:=continent
from student,(select @rk:=0,@cont:=null) t
order by continent, name asc

Either way, we can have the following table:

name continent Rank
Jack America 1
Jane America 2
Pascal Europe 1
Xi Asia 1

Now the left job is just to reproduce the code I have written in demonstration part.

window function version

# build one temp table to store the table containing rank
with temp as (
select
*,
  row_number() over (partition by continent order by name asc) as "rank"
from student)

select
  max(case when continent="America" then name else null end) as America,
  max(case when continent="Asia" then name else null end) as Asia,
  max(case when continent="Europe" then name else null end) as Europe
from temp
group by rank

variable version

with temp as (select
                *,
                case when @continent=continent then @rk:=@rk+1 else @rk:=1 end as rk,
                @continent:=continent
            from student,(select @continent:=null,@rk:=0) t
            order by continent asc,name asc)
select
    max(case when continent="America" then name else null end) as America,
    max(case when continent="Asia" then name else null end) as Asia,
    max(case when continent="Europe" then name else null end) as Europe
from temp
group by rk;

Note: DO NOT FORGET TO ADD AGGREGATE FUNCTION

Python

Now, let me use python to solve this question, which can be simpler than solution of MySQL.

Same way. We need to use pd.pivot_table to pivot the dataframe.

# first we prepare the dataframe
Student = pd.DataFrame({"name":["Jack","Pascal","Xi","Jane"],
                        "continent":["America","Europe","Asia","America"]})

Now we copy the dataframe to avoid any manipulation affecting the original dataste

student = Student.copy()

Then, we also need create one column used to group the data. Here's one powerful function rank

# set ascending=True to ensure that we order name alphabetically
student["rank"] = student["name"].groupby(student["continent"]).\
rank(method="first",ascending=True) 

Next, we call pd.pivot_table to perform pivotting.

result = student.pivot_table(
    index="rank",
    columns="continent",
    values="name",
    aggfunc="max",
    fill_value="")

[Ouput]

continent America Asia Europe
rank
1.0 Jack Xi Pascal
2.0 Jane

Tweak the result a little bit

res = result.reset_index().drop("rank",axis=1)
res.columns.names = ""
res

[Output]

America Asia Europe
0 Jack Xi Pascal
1 Jane

Question 2:Sales by Day of the Week

Orders

order_id customer_id order_date item_id quantity
1 1 2020-06-01 1 10
2 1 2020-06-08 2 10
3 2 2020-06-02 1 5
4 3 2020-06-03 3 5
5 4 2020-06-04 4 1
6 4 2020-06-05 5 5
7 5 2020-06-05 1 10
8 5 2020-06-14 4 5
9 5 2020-06-21 3 5

items

item_id item_name item_category
1 LC Alg. Book Book
2 LC DB. Book Book
3 LC SmarthPhone Phone
4 LC Phone 2020 Phone
5 LC SmartGlass Glasses
6 LC T-Shirt XL T-Shirt

You are the business owner and would like to obtain a sales report for category items and day of the week. Write an SQL query to report how many units in each category have been ordered on each day of the week. Return the result table ordered by category. The query result format is in the following example:

Solution
The result table requires us to display the sales indexed by each category of items per day of a week. So this is a typical question of pivoting long table to wide table. So

  • First Setp: Group by the data by category**
select
  xx
from orders o 
right join items i 
on o.item_id = i.item_id
group by i.item_category
order by category
  • Second Step: Basd on the joined table, write a control flow to automatically calculate the sum of quantity for each category
select 
        i.item_category category,
        sum(if(weekday(o.order_date)=0,o.quantity,0)) 'Monday',
        sum(if(weekday(o.order_date)=1,o.quantity,0)) 'Tuesday',
        sum(if(weekday(o.order_date)=2,o.quantity,0)) 'Wednesday',
        sum(if(weekday(o.order_date)=3,o.quantity,0)) 'Thursday',
        sum(if(weekday(o.order_date)=4,o.quantity,0)) 'Friday',
        sum(if(weekday(o.order_date)=5,o.quantity,0)) 'Saturday',
        sum(if(weekday(o.order_date)=6,o.quantity,0)) 'Sunday'
from orders o 
right join items i 
on o.item_id = i.item_id
group by item_category
order by category
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容