Google Cloud Data Engineer Exam - Dataproc Review Notes

Dataproc Summary

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.

How to load data?

Dataproc connects to BigQuery

Option 1:


BigQuery does not natively know how to work with a Hadoop file system.

Cloud Storage can act as an intermediary between BigQuery and Dataproc.

You export the data from BigQuery into Cloud Storage as sharded files.

The worker nodes in the Dataproc cluster then read the sharded data.

Symmetrically, if the Dataproc job produces output, it can be written to Cloud Storage in a format that BigQuery can import.

This approach is appropriate for periodic or infrequent transfers.
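As a sketch, the export and re-import steps can be done with the `bq` CLI (the bucket, dataset, and table names below are hypothetical):

```shell
# Export a BigQuery table to Cloud Storage as sharded newline-delimited JSON.
# The wildcard (*) tells BigQuery to shard the output across multiple files,
# which the Dataproc worker nodes can then read in parallel.
bq extract \
    --destination_format=NEWLINE_DELIMITED_JSON \
    'my_dataset.my_table' \
    'gs://my-bucket/exports/shard-*.json'

# After the Dataproc job writes its results back to Cloud Storage,
# load them into BigQuery.
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    'my_dataset.job_output' \
    'gs://my-bucket/output/part-*.json'
```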

Option 2:

Another option is to set up the BigQuery connector on the Dataproc cluster. The connector is a Java library that enables read/write access from Spark and Hadoop directly into BigQuery.

Note that the connector reads tables, so you need to save the BigQuery query result as a table first.
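A minimal PySpark sketch of reading through the connector might look like this (the project, dataset, and table names are hypothetical, and it assumes the cluster has the BigQuery connector available):

```python
# Sketch: read a BigQuery table directly from Spark via the BigQuery connector.
# Assumes this runs on a Dataproc cluster with the connector installed;
# the table name below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-example").getOrCreate()

# The connector reads tables, not ad-hoc query results, which is why the
# query result must be materialized as a table first.
df = spark.read.format("bigquery") \
    .option("table", "my_project.my_dataset.my_table") \
    .load()

df.printSchema()
```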

![Screen Shot 2018-07-15 at 12.48.01 am.png](https://upload-images.jianshu.io/upload_images/9976001-6fcaa78c38c1d404.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) ![Screen Shot 2018-07-15 at 12.50.02 am.png](https://upload-images.jianshu.io/upload_images/9976001-9a1b2c9c68b70469.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Option 3:

When you want to process data in memory for speed, use a Pandas DataFrame.

In-memory processing is fast, but limited by the memory available on the machine.
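A minimal illustration of the in-memory trade-off with pandas (the data here is made up):

```python
import pandas as pd

# Small, made-up dataset: everything fits in memory, so operations are fast.
df = pd.DataFrame({
    "user": ["alice", "bob", "alice", "bob"],
    "clicks": [3, 5, 2, 1],
})

# Aggregations run entirely in memory, with no cluster round-trips.
totals = df.groupby("user")["clicks"].sum()
print(totals["alice"])  # 5

# The limit is RAM: pandas can report its in-memory footprint in bytes.
print(df.memory_usage(deep=True).sum())
```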

Creating a Dataproc cluster

Ways:
Deployment Manager template (an infrastructure-automation service in Google Cloud)
gcloud CLI commands
Google Cloud Console

Keys:

1 Create a cluster specifically for one job

2 Match your data location to the compute location
-> better performance
-> also lets you shut down the cluster when it is not processing jobs

3 Use Cloud Storage instead of HDFS, and shut down the cluster when it is not actually processing data
-> this reduces the complexity of disk provisioning and enables you to shut down your cluster when it is not processing a job

4 Use custom machine types to closely manage the resources that the job requires

5 On non-critical jobs requiring huge clusters, use preemptible VMs to hasten results and cut costs at the same time
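Putting several of these keys together, a job-specific cluster with a custom machine type and preemptible workers might be created like this (the cluster name, region, and sizes are hypothetical, and flag names may differ across gcloud versions):

```shell
# Sketch: create a job-specific Dataproc cluster with a custom worker
# machine type (6 vCPUs, 23040 MB RAM) and preemptible workers.
gcloud dataproc clusters create my-job-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=custom-6-23040 \
    --num-workers=2 \
    --num-preemptible-workers=8

# Delete the cluster as soon as the job finishes; the data lives in
# Cloud Storage, not in HDFS, so nothing is lost.
gcloud dataproc clusters delete my-job-cluster --region=us-central1
```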
