谷歌云数据工程师考试 - BigQuery复习笔记

这周一直在准备谷歌云数据工程师的考试 (Google Cloud Data Engineer Certification),就把一篇学习笔记贴上来。

时间仓促暂时没法翻译成中文,希望看官们谅解。

BigQuery Summary

What is?
Google's fully managed, petabyte scale, low cost analytics data warehouse.

1.png

Serverless?
Yes

Benefits

Allows near real-time analysis
-> not totally real-time
-> if need milliseconds delay in transactional database, BigQuery is not the answer; the answer will be Cloud SQL or Spanner
-> but it is a good choice for ad-hoc analysis of very large datasets
-> data warehousing and business intelligence

Completely no-ops
-> no need to create cluster before querying
-> pay for amount of data processed

Pricing
-> pay for use
OR
-> flat-rate pricing

Durable
-> if you put data in BigQuery, it is just like GCS

Immutable audit logs
-> audit logs cannot be tampered with
-> you know if someone actually uses the dataset

A way to share data beyond silos of your company’s structure
-> cluster less, so can take BigQuery and share it with anybody under your company’s domain
-> anybody that you allow to access your table is able to access your table without any ops that is typically required to dealt with
-> no need to worry that “how do I log in the person’s cluster so I can use their data” it is cluster free!
-> BigQuery can become the form by which you get data collaboration across your company -> way of sharing analysis, and that is live!

How to use?

How to load data?

You can load data:

Not loaded but can query external data sources
BigQuery offers support for querying data directly from:

BigQuery supports loading data from Cloud Storage and readable data sources in the following formats:

You can load data into a new table or partition, you can append data to an existing table or partition, or you can overwrite a table or partition. For more information on working with partitions, see Managing Partitioned Tables.

When you load data into BigQuery, you can supply the table or partition schema, or for supported data formats, you can use schema auto-detection.

How to load schema?

  • Manually specify the schema:
  • Using the BigQuery web UI.
  • Inline using the CLI.
  • Create a schema file in JSON format.
  • Call the jobs.insert method and configure the configuration.load.schema property.
  • Call the tables.insert method and configure the schema in the table resource using theschema property.

Indexing

BigQuery does not use or support indexes.

Cost Reduction best practice

  • Avoid SELECT *. Query only the columns that you need.
  • Don't run queries to explore or preview table data. Use preview options.
  • Before running queries, estimate costs.
  • Use the maximum bytes billed setting to limit query costs.
  • Do not use a LIMIT clause as a method of cost control.
  • Create a dashboard to view your billing data so you can make adjustments to your BigQuery usage. Also consider streaming your audit logs to BigQuery so you can analyze usage patterns.
  • Partition your tables by date.
  • If possible, materialize your query results in stages.
  • If you are writing large query results to a destination table, use the default table expiration time to remove the data when it's no longer needed.
  • Use streaming inserts only if your data must be immediately available.

Performance Improve best practice

  • Avoid self-joins. Use a window function instead.
  • If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
  • Avoid unbalanced joins
  • Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.
  • Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.

Billing

3.png

Free
-> loading
-> exporting
-> queries on metadata
-> cached queries: if you run a query and you run exactly the same query in that project, it’s free
-> per user cache
-> if two users on the same project, they don’t share the cache
-> query with errors

Storage
-> You get charged based on the amount of data in the table
-> Streaming data
-> get charged based on the ingest rate of that streaming data
-> automatically get a discount for older data

Processing
-> on-demand: charge for queries based on the amount of data that’s being processed, 1TB / month free
-> flat rate plan: talk with sales team
-> opt-in to run high-compute queries: JS UDF

总结就是:
data input: streaming insert收钱,其他不收
data storage: 收钱
data processing: 按量收

IAM / Permission

BigQuery uses Identity and Access Management (IAM) to manage access to resources. The three types of resources available in BigQuery are organizations, projects, and datasets. In the IAM policy hierarchy, datasets are child resources of projects. Tables and views are child resources of datasets — they inherit permissions from their parent dataset.

To grant access to a resource, assign one or more roles to a user, group, or service account. Organization and project roles affect the ability to run jobs or manage the project's resources, whereas dataset roles affect the ability to access or modify the data inside of a particular dataset.

IAM provides two types of roles: predefined and primitive roles. When you assign both predefined and primitive roles to a user, the permissions granted are a union of each role's permissions.

View Access Management

You cannot assign access controls directly to views. You can control view access by configuring access controls at the dataset level or at the project level.

Dataset-level access controls specify the operations users, groups, and service accounts are allowed to perform on views in that specific dataset. If you assign only dataset-level permissions, you must also assign a primitive or predefined, project-level role that provides access to the project, for example, bigquery.user.

Instead of granting access to individual datasets, you can assign predefined, project-level IAM roles that grant permissions to all views in all datasets in a project.

You can also create IAM custom roles. If you create a custom role, the permissions you grant depend on the view operations you want the user, group, or service account to be able to perform.

For more information on roles and permissions, see:

Encryption

By default, BigQuery encrypts customer content stored at rest. BigQuery handles and manages this default encryption for you without any additional actions on your part. First, data in a BigQuery table is encrypted using a data encryption key. Then, those data encryption keys are encrypted with key encryption keys, which is known as envelope encryption. Key encryption keys do not directly encrypt your data but are used to encrypt the data encryption keys that Google uses to encrypt your data.

If you want to control encryption yourself, you can use customer-managed encryption keys (CMEK) for BigQuery. Instead of Google managing the key encryption keys that protect your data, you control and manage key encryption keys in Cloud KMS.

BigQuery API

To use the Google BigQuery API, you must first authenticate to verify your client's identity. BigQuery authorizes access to resources based on the verified identity.

A service account is a Google account that is associated with your GCP project. Use a service account to access the BigQuery API if your application can run jobs associated with service credentials rather than an end-user's credentials, such as a batch processing pipeline.

Use user credentials to ensure that your application has access only to BigQuery tables that are available to the end user.

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,657评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,662评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,143评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,732评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,837评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,036评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,126评论 3 410
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,868评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,315评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,641评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,773评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,470评论 4 333
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,126评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,859评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,095评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,584评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,676评论 2 351

推荐阅读更多精彩内容