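pandas is a third-party library that you will almost certainly use when doing data analysis in Python, so you need to know it. This post walks through the introduction section of the official pandas documentation: each passage of the original is quoted, followed by my own rendering of it with a few notes. When learning to program, reading and using the official documentation is the most direct and most reliable way to solve problems. If you really want to get somewhere in programming, read the official docs; otherwise you will always be eating what others have already chewed.
Official pandas documentation: http://pandas.pydata.org/pandas-docs/stable/
Note: my English is limited, so there are bound to be mistakes. If you find one, please leave a comment and correct me.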
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
pandas is a third-party Python library. Its data structures are fast, flexible, and expressive, and they make relational and labeled data easier and more intuitive to handle. It wants to be the basic high-level building block for analyzing real-world data in Python. Beyond that, it has a more ambitious goal: to become the most powerful and most flexible open source tool for data processing and analysis in any programming language. It is actively moving toward that goal.
pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
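pandas suits many different kinds of data:
- Tabular data whose columns have different types (for example, some columns hold strings, some hold numbers, and some hold dates and times), such as an SQL table or an Excel spreadsheet;
- Ordered and unordered time series data, which does not have to be at a fixed frequency;
- Arbitrary matrix data with row and column labels (homogeneous if every column has the same data type, heterogeneous otherwise);
- Any other form of observational or statistical data set. The data does not need to be labeled to be stored in a pandas data structure.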
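To make the first two kinds of data concrete, here is a minimal sketch. The column names and values are made up for illustration; only the pandas calls themselves come from the library.

```python
import pandas as pd

# Hypothetical sample data: three columns, each with a different type.
stamps = pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-10"])
table = pd.DataFrame({
    "name": ["a", "b", "c"],       # text
    "price": [1.5, 2.0, 3.25],     # floating point
    "when": stamps,                # timestamps
})
print(table.dtypes)  # one dtype per column

# A time series whose timestamps are not evenly spaced.
ts = pd.Series([10, 20, 30], index=stamps)
print(ts.index.freq)  # no fixed frequency is attached to this index
```

Each column keeps its own dtype, which is what "heterogeneously typed" means in practice.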
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
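The two most important data structures in pandas are Series (1-dimensional) and DataFrame (2-dimensional). Between them they cover most data processing needs in finance, statistics, social science, and engineering. For R users, DataFrame provides everything that R's data.frame provides, plus features that data.frame does not have. pandas is built on top of NumPy and is designed to work well alongside the many other third-party libraries in a scientific computing environment.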
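A small sketch of the two structures and their relationship to NumPy may help here. The labels and values are arbitrary examples.

```python
import numpy as np
import pandas as pd

# A Series is a 1-dimensional labeled array.
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A DataFrame is a 2-dimensional table of labeled columns.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])

print(s.ndim, df.ndim)

# The underlying values can be handed to NumPy as an ndarray.
arr = s.to_numpy()
print(type(arr))

# NumPy functions also accept pandas objects and keep the labels.
roots = np.sqrt(s)
print(roots)
```

Because the labels survive the NumPy call, the result can be looked up by label (for example `roots["c"]`) instead of by position.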
Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
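Here are some of the things pandas is good at:
- Handling missing values (represented as NaN) easily, in floating point and non-floating point data alike;
- Size mutability: columns can be added to or deleted from a DataFrame or a higher-dimensional data structure;
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or you can simply ignore the labels and let Series and DataFrame align the data for you during computation;
- Powerful, flexible group by functionality for split-apply-combine operations on data sets, used both to aggregate and to transform data;
- Making it easy to convert ragged, differently-indexed data held in other Python and NumPy data structures into DataFrame objects;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Intuitive merging and joining of data sets;
- Flexible reshaping and pivoting of data sets;
- Hierarchical labeling of axes (each tick can have multiple labels);
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving and loading data in the ultrafast HDF5 format;
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, and so on.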
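Several items in the list above can be sketched in a few lines. The tables below (`sales`, `managers`) and their values are hypothetical, and the sketch covers only missing data, alignment, group by, merging, pivoting, label-based selection, and basic time series operations.

```python
import numpy as np
import pandas as pd

# Missing data: NaN marks the gap, and reductions skip it by default.
s = pd.Series([1.0, np.nan, 3.0])
total = s.sum()
filled = s.fillna(0.0)

# Automatic alignment: arithmetic matches on labels, not positions.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned = a + b  # labels found on only one side become NaN

# Group by (split-apply-combine) on a hypothetical sales table.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "amount": [100, 80, 120, 90],
})
totals = sales.groupby("region")["amount"].sum()

# Merging: attach a manager to every sales row.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")

# Pivoting: one row per region, one column per quarter.
wide = sales.pivot(index="region", columns="quarter", values="amount")

# Label-based selection on the reshaped table.
north_q2 = wide.loc["north", "Q2"]

# Time series: date range, moving window, lagging, frequency conversion.
days = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series(range(6), index=days)
rolling_mean = ts.rolling(window=3).mean()
lagged = ts.shift(1)
every_two_days = ts.resample("2D").sum()

print(aligned)
print(totals)
print(wide)
```

Note that `a + b` produces NaN for `x` and `w`, because each of those labels exists in only one of the two Series.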
Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.
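Many of the pandas features listed above exist to address shortcomings that are common in other languages and scientific research environments. For data scientists, data analysis work usually falls into several stages: cleaning and tidying the data, analyzing and modeling it, then organizing the results into a form suitable for plotting or tabular display. pandas is an ideal tool for all of this. (In other words, pandas is used from the data cleaning stage all the way through to presenting the final results.)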
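The three stages can be sketched on a tiny, made-up data set. The column names, the values, and the choice of a per-city mean as the "analysis" are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw input: one unparseable reading and one missing city.
raw = pd.DataFrame({
    "city": ["A", "B", "A", "B", None],
    "temp": ["20.5", "bad", "22.0", "18.0", "19.0"],
})

# Stage 1: clean. Drop rows without a city, parse numbers, drop bad readings.
clean = raw.dropna(subset=["city"]).copy()
clean["temp"] = pd.to_numeric(clean["temp"], errors="coerce")
clean = clean.dropna(subset=["temp"])

# Stage 2: analyze. Here, simply a mean and a count per city.
summary = clean.groupby("city")["temp"].agg(["mean", "count"])

# Stage 3: organize for display. Round and sort, then print as a table.
report = summary.round(2).sort_values("mean", ascending=False)
print(report.to_string())
```

In a real project the middle stage would likely hand the cleaned data to a modeling library, but the cleaning and reporting steps on either side would look much the same.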
Some other notes
- pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
- pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
- pandas has been used extensively in production in financial applications.
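Some other notes:
- pandas is fast. Many of the low-level algorithms have been written and tuned in Cython. However, as with any general-purpose tool, generality costs some performance, so if your application focuses on a single feature you may be able to build a faster specialized tool. (That is, pandas aims for broad functionality so that it can handle general data processing work rather than one particular pipeline, and this limits how fast it can be. If your data processing needs are very fixed, you could consider writing your own tool in C or C++ that is faster but only suits that specific scenario.)
- pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python;
- pandas has been used extensively in production in financial applications.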
Note: This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.
Note: the pandas documentation assumes that you are already familiar with NumPy. If you have not used NumPy, spend some time learning it first.