Apache Parquet: How to Be a Hero with the open source columnar data format on the Google and Amazon clouds

(Original source by Thomas Spicer)

Get All the Benefits of Apache Parquet File Format for Google Cloud, Amazon Athena and Redshift Spectrum

You have read about Google Cloud (BigQuery, Dataproc…), Amazon Redshift Spectrum and AWS Athena, and you are looking to take advantage of one or two of them. However, before you jump into the deep end, you will want to familiarize yourself with the opportunities of leveraging the Apache Parquet file format instead of regular text, CSV or TSV files. Parquet is a columnar storage format that allows systems like Amazon Athena to query information as columnar data rather than as a flat file like CSV.

If you are not thinking about how to optimize for these new query service models, you could be throwing money out the window.

What Is Apache Parquet?

Apache Parquet is a columnar storage format with the following characteristics:

Apache Parquet is column-oriented and designed to provide efficient columnar storage compared to row-based file types such as CSV.

Apache Parquet is built from the ground up with complex nested data structures in mind.

Apache Parquet is built to support very efficient compression and encoding schemes.

Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum and Google Dataproc.

Apache Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. This results in a file that is optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes. The great thing is that it is an open source Apache Software Foundation project, available to any project.
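
To make "self-describing" concrete, here is a minimal sketch of converting a CSV file to Parquet with the pyarrow library. The file names and the snappy codec are illustrative assumptions, not details from the original article:

    # Minimal sketch: convert CSV to Parquet with pyarrow.
    # File names and the snappy codec are illustrative assumptions.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Read the CSV; pyarrow infers column names and types.
    table = pv.read_csv("events.csv")

    # Write Parquet. The inferred schema is embedded in the file itself,
    # and each column is encoded and compressed independently.
    pq.write_table(table, "events.parquet", compression="snappy")

    # Because the file is self-describing, the schema can be read back
    # without scanning any of the data pages.
    print(pq.read_schema("events.parquet"))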

Parquet and The Rise of Cloud Warehouses & Interactive Query Services

The rise of interactive query services like AWS Athena and Amazon Redshift Spectrum makes it easy to analyze data in storage systems like Amazon S3 using standard SQL. In addition, data warehouses like Google BigQuery and platforms like Google Dataproc can ingest data in a range of formats.

However, the data format you select can have significant implications for performance and cost, especially if you are looking at machine learning, AI or other complex operations. We will walk you through a few examples of those considerations.

Parquet vs CSV

CSV is simple and ubiquitous. Many tools, like Excel, Google Sheets and a host of others, can generate CSV files; you can even create them with your favorite text editor. We all love CSV files, but everything has a cost, even your love of CSV, especially when it is the default format for your data processing pipelines.

AWS Athena and Amazon Redshift Spectrum charge you by the amount of data scanned per query. (Many other services also charge based on data scanned, so this is not unique to AWS.)

Google and Amazon charge you for the amount of data stored in Google Cloud Storage and Amazon S3.

Google Dataproc charges are time-based.

Defaulting to CSV will have both technical and financial consequences (and not in a good way). You will learn to love Apache Parquet just as much as your trusty CSV.

Example: A 1 TB CSV File

The following demonstrates the efficiency and effectiveness of using a Parquet file vs CSV.

By converting your CSV data to Parquet’s columnar format, then compressing and partitioning it, you save money and reap the rewards of better performance (a partitioning sketch follows the cost discussion below). The following table compares the savings created by converting a 1 TB dataset into Parquet vs leaving it as CSV:

                              Uncompressed CSV    Parquet (compressed, partitioned)
    Data size on Amazon S3    1 TB                130 GB
    Query run time            236 seconds         6.78 seconds
    Data scanned              1.15 TB             2.51 GB
    Cost per query            $5.75               $0.01
Think about this: if over the course of a year you ran that query daily against the uncompressed 1 TB CSV files, your query costs would be about $2000 USD. Using Parquet files, your total cost would be $3.65 USD. I know you love your CSV files, but do you love them THAT much?
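
Much of that saving comes from compression and column pruning, but partitioning is what lets query engines skip whole directories of data. Here is a minimal pyarrow sketch of a Hive-style partitioned write; the input file and the event_date partition column are hypothetical, not from the original article:

    # Sketch: write Hive-style partitioned Parquet that Athena or Spectrum
    # can prune. The input file and event_date column are hypothetical.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv("events.csv")   # assumed to contain an event_date column
    pq.write_to_dataset(
        table,
        root_path="events_parquet",     # e.g. later synced to s3://bucket/events/
        partition_cols=["event_date"],  # one subdirectory per distinct date
    )
    # Result: events_parquet/event_date=2018-01-01/<part>.parquet, and so on.
    # A query that filters on event_date scans only the matching directories.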

Also, if time is money, your analysts could be spending close to 5 minutes waiting for each query to complete simply because you use raw CSV. If you are paying someone $150 an hour and they run this query once a day for a year, they will spend about 30 hours simply waiting for queries to complete. That is roughly $4500 in unproductive “wait” time. Total wait time for the Apache Parquet user? About 42 minutes, or roughly $100.
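
If you want to check that arithmetic yourself, here is a tiny back-of-the-envelope sketch. It assumes the service's published price of $5 per TB scanned plus the per-query figures from the table above; the $150 hourly rate is the assumption from the paragraph above:

    # Back-of-the-envelope math behind the numbers above.
    # Assumes Athena/Spectrum-style pricing of $5 per TB scanned.
    DAYS = 365
    RATE = 150.0  # assumed analyst cost, USD per hour

    # Query spend: per-query cost times one query per day for a year.
    csv_query = 5.75      # 1.15 TB scanned at $5/TB
    parquet_query = 0.01  # 2.51 GB scanned, rounded up by the per-query minimum
    print(f"CSV query spend/yr:     ${csv_query * DAYS:,.2f}")      # ~$2,099
    print(f"Parquet query spend/yr: ${parquet_query * DAYS:,.2f}")  # $3.65

    # Wait-time cost: seconds spent waiting per day, as billable hours.
    for label, secs in [("CSV", 300), ("Parquet", 6.78)]:  # ~5 min vs ~7 s
        hours = secs * DAYS / 3600
        print(f"{label} wait: {hours:.1f} hours, about ${hours * RATE:,.0f}")
    # -> CSV: 30.4 hours (~$4,562); Parquet: 0.7 hours (~$103)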

Example 2: Parquet, CSV and Your Redshift Data Warehouse

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data in Amazon S3. This can be an effective strategy for teams that want to partition their data, keeping some of it resident within Redshift while the rest resides on S3. For example, let’s assume you have about 4 TB of data in a historical_purchase table in Redshift. Since it is not accessed frequently, offloading it to S3 makes sense. This will free up space in Redshift while still giving your team access via Spectrum. Now, the big question becomes: what format are you storing that 4 TB historical_purchase table in? CSV? How about Parquet?
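
To make the scenario concrete, here is a hedged sketch of how the offloaded table might be exposed to Redshift Spectrum once the Parquet files are on S3. The column names, IAM role, bucket path and connection details are all hypothetical placeholders (the article does not specify them); it uses the psycopg2 driver to run standard Spectrum DDL:

    # Sketch: register an offloaded Parquet table with Redshift Spectrum.
    # Column names, IAM role ARN, S3 path and credentials are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    conn.autocommit = True  # external DDL cannot run inside a transaction

    with conn.cursor() as cur:
        # One-time: an external schema backed by the AWS Glue Data Catalog.
        cur.execute("""
            CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
            FROM DATA CATALOG DATABASE 'spectrumdb'
            IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole'
            CREATE EXTERNAL DATABASE IF NOT EXISTS
        """)
        # The 4-column historical_purchase table, stored as Parquet on S3.
        cur.execute("""
            CREATE EXTERNAL TABLE spectrum.historical_purchase (
                order_id     BIGINT,
                customer_id  BIGINT,
                sku          VARCHAR(64),
                amount       DOUBLE PRECISION
            )
            STORED AS PARQUET
            LOCATION 's3://my-bucket/historical_purchase/'
        """)

Queries can then join spectrum.historical_purchase against tables that are still resident in Redshift.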

Our historical_purchase table has 4 equally sized columns, stored in Amazon S3 as three files: uncompressed CSV, GZIP-compressed CSV and Parquet.


Uncompressed CSV File

The uncompressed CSV file has a total size of 4 TB. Running a query to get data from a single column of the table still requires Redshift Spectrum to scan the entire 4 TB file. At $5 per TB scanned, this query would cost $20.

GZIP CSV File

If you compress your CSV file using GZIP, the file size is reduced to 1 TB. Great savings! However, Redshift Spectrum still has to scan the entire file. The good news is that your CSV file is four times smaller than the uncompressed one, so you pay one-fourth of what you did before. This query would cost $5.

Parquet File

If you compress your file and convert it to Apache Parquet, you also end up with 1 TB of data in S3. However, because Parquet is columnar, Redshift Spectrum can read only the column that is relevant for the query being run, so it scans just one-fourth of the data, about 0.25 TB. This query would cost only $1.25.
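
You can see the same column pruning outside of Spectrum. A minimal pyarrow sketch, using a hypothetical local copy of the file and an assumed amount column:

    # Sketch: read a single column from a Parquet file. File and column
    # names are hypothetical. Only that column's data pages are read,
    # which is why Spectrum scans just 1/4 of the 4-column table.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("historical_purchase.parquet")
    print(pf.metadata.num_columns)   # 4, matching our example table

    # Reading one column avoids touching the other three entirely.
    amount_only = pq.read_table("historical_purchase.parquet",
                                columns=["amount"])
    print(amount_only.schema)        # a one-column table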

If you run this query once a day for a year, the uncompressed CSV files would cost $7300. Even the compressed CSV queries would cost over $1800. Using the Apache Parquet file format, however, it would cost about $460. Still in love with your CSV file?

Summary

The trend toward “serverless”, interactive query services and pre-built data processing suites is progressing rapidly, and it is giving teams new opportunities to move faster with lower investment. Athena and Spectrum make it easy to analyze data in Amazon S3 using standard SQL. Google likewise supports loading Parquet files into BigQuery and Dataproc.
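
As a rough illustration of that last point, loading Parquet from Cloud Storage into BigQuery looks like the following with the google-cloud-bigquery Python client; the project, dataset, table and bucket URI are placeholder assumptions:

    # Sketch: load Parquet files from Google Cloud Storage into BigQuery.
    # The table ID and GCS URI are hypothetical placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,  # schema comes from the file
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/events/*.parquet",
        "my_project.analytics.events",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete
    print(client.get_table("my_project.analytics.events").num_rows)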

When you pay only for the queries you run, or for resources like CPU and storage, it is important to optimize the data those systems rely on.

By the way, we have launched a zero-administration data processing framework for Amazon Redshift Spectrum and Amazon Athena that includes automated database/table creation, Parquet file conversion, partitioning and more. See the announcements for details:

Amazon Redshift Spectrum Automated — 60 Second Setup, Zero Administration And Automatic…

Announcing fully-managed support of our zero-administration Amazon Redshift Spectrum data pipeline service. (blog.openbridge.com)

AWS Athena Automated — 60 Second Setup, Zero Administration And Automatic Optimization

We are excited to announce the release of our zero-administration AWS Athena data pipeline service. (blog.openbridge.com)

Also, take a look at our post about AWS Redshift Spectrum and AWS Athena. Using Apache Parquet can benefit both!

How is AWS Redshift Spectrum different than AWS Athena?

This question has come up a few times, and most of the discussion is centered on the technical differences. Rather… (blog.openbridge.com)

Did we miss anything? Do you have any questions about how to transform your CSV to Apache Parquet? If you want help streamlining your data to Google Cloud, AWS Athena, AWS Redshift Spectrum or other data technologies, feel free to leave a comment or contact us at hello@openbridge.com. You can also visit us at https://www.openbridge.com to learn how we are helping other companies with their data efforts.
