How to process csv file

CSV is a common, simple file format used to store tabular data, such as a spreadsheet or database. We have a need to process CSV file, as to discover, extract, and analyze the information we are interested in. Pandas are competent.

This wiki is intended as a guide to getting started with pandas, including examples for

  • Loading file into dataframe;
  • Picking sub dataframe of interest;
  • Displaying data;
  • Saving a dataframe to a CSV file;
  • Creating a new dataframe by hand.

Let's start!

1. Loading dataframe from an existing CSV file

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Loading the CSV(example airport data) file by using:

import pandas as pd
in_file = 'airports.csv'
data = pd.read_csv(in_file)

<table>
<tr>
<td ><center>[图片上传失败...(image-ec4b2f-1550908860557)] Load into dataframe</center></td>
<td ><center>[图片上传失败...(image-209f06-1550908860557)] Open with excel</center></td>
</tr>
</table>

For dataframe, just think of it like a spreadsheet.

2. Picking sub dataframe of interest

By typing data.shape, we know that the dataframe has 54872 rows, and 18 columns.

Let's first see what types of airports are included:

types = data['type'] # only the 'type' column is selected
types = set(types)

Above, data['type'] is actually a Series(a one-dimensional labeled array capable of holding any data type), by using set(), it is converted into set type. Result is

types = {'balloonport',
 'closed',
 'heliport',
 'large_airport',
 'medium_airport',
 'seaplane_base',
 'small_airport''}

Similarly, by typing countris = set(data['iso_country']), we know 'ZN'(China) is included.

So, next, suppose we are just interested in the names and locations of large airports in China, how to do that?

a. select several columns of interest:

# a sub dataframe with only our interested columns
sub_data = data.loc[:,['type','name','latitude_deg','longitude_deg','iso_country']]

b. more precisely, select rows of interest (only 'large_airport' and 'CN'):

cn_large_airport = sub_data[(sub_data['iso_country']=='CN') & (sub_data['type']=='large_airport')]

3. Displaying data

After that, we get a dataframe with only 33 rows and 5 columns. Let's display it:

# pick 'lon/lat' columns and covert them into list.
lons = cn_large_airport["longitude_deg"].tolist() 
lats = cn_large_airport["latitude_deg"].tolist()
# display
import matplotlib.pyplot as plt
plt.scatter(lons,lats, s=5)

By using .tolist() function, we can convert Series type into list type (for this display example, it's also ok with no convertion).

Also display all the airports in China, results are:
[图片上传失败...(image-652589-1550908860557)]

4. Saving a dataframe to a CSV file

For saving a dataframe, use .to_csv() function, like:

cn_large_airport.to_csv('cn_large_airport.csv')

5. Creating a new dataframe by hand

©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容