Prepare Your Data (1)

Concept Summary: Prepare the Data

Recipes in DSS

Note
Recipes in DSS contain the transformation steps, or processing logic, that act upon datasets.

In the Flow, blue squares represent datasets. The yellow, orange, and red circles that connect datasets to one another represent recipes.

Keeping processing logic separate from datasets has a number of benefits:

  • One is that data storage technologies change rapidly. When they do, the underlying storage infrastructure of a dataset can change (for example, switching cloud providers) without impacting the processing logic found in the recipes of a Flow.
  • Another is a clear sense of data lineage in a project. By looking at the Flow, you can see all actions that have been applied to the data recorded in recipes – from the raw imported data to the final output dataset.

A circle in the Flow represents a recipe, but its color represents the category of recipe. DSS recipes can be divided into visual, code, or plugin recipes.

Visual recipes (in yellow) accomplish the most common data transformation operations, such as cleaning, grouping, and filtering, through a pre-defined graphical user interface.

Instead of a pre-defined visual recipe, you are free to define your own processing logic in a code recipe (in orange), using a language such as Python, R, or SQL.

The third category of recipe is the plugin recipe (typically in red). A full discussion of plugins within DSS is outside the scope of this section, but know that they are a way for coders to extend the native capabilities of DSS.

If code recipes give you complete freedom to perform any data processing task, and visual recipes can be used and understood by everyone in your team, a plugin recipe combines these benefits by wrapping a visual interface on top of a code recipe.

Prepare Recipe

The Prepare recipe is a visual recipe in DSS that allows you to create data cleansing, normalization, and enrichment scripts in an interactive way.

This is achieved by assembling a series of transformation steps from a library of more than 90 processors. Most processors are designed to handle one specific task, such as filtering rows, rounding numbers, extracting regular expressions, concatenating or splitting columns, and much more.

In addition to directly adding steps from the processor library, you can add steps to the script in a number of other ways.

In the column context menu, DSS will suggest steps to add based on the column’s meaning. For example, DSS may suggest removing rows whose values are invalid for that meaning.

Another method to add steps to the script is through the Analyze window. Within a Prepare recipe, the Analyze window can guide data preparation, for example merging categorical values.

You can also directly drag columns to adjust their order, or switch from the Table view to the Columns view to apply certain steps to more than one column at a time.

When adding new steps to the script, you’ll notice how the step output is immediately visible. This is possible because the step is being applied to the same sample of the dataset found in the Explore tab. The quick feedback allows you to work incrementally, quickly modifying your transformation steps.

Notice that steps in the script constitute a list of instructions. These instructions are not immediately applied to the dataset itself. For example, adding a “Delete Column” step removes that column from the step preview, but it does not actually delete the column in the dataset, as it would in a spreadsheet. Only when you choose to actually run the recipe will DSS execute the instructions on the full input dataset, and thereby produce a new output dataset.
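The difference between a script and its execution can be illustrated outside DSS. The following is a minimal Python sketch, not DSS code: the helper names (delete_column, round_column, apply_script) and the sample rows are hypothetical and only mimic the idea of a list of instructions that is previewed on a sample and later run on the full input.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


def delete_column(name: str) -> Step:
    """Return a step that drops one column from every row."""
    def step(rows: List[Row]) -> List[Row]:
        return [{k: v for k, v in row.items() if k != name} for row in rows]
    return step


def round_column(name: str, digits: int) -> Step:
    """Return a step that rounds a numeric column."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**row, name: round(row[name], digits)} for row in rows]
    return step


def apply_script(script: List[Step], rows: List[Row]) -> List[Row]:
    """Apply each step in order and return new rows; the input is left untouched."""
    for step in script:
        rows = step(rows)
    return rows


# Hypothetical input dataset and a two-row sample of it.
full_dataset = [
    {"id": 1, "price": 10.254, "note": "a"},
    {"id": 2, "price": 3.14159, "note": "b"},
    {"id": 3, "price": 7.5, "note": "c"},
]
sample = full_dataset[:2]

# The script is only a list of instructions; building it changes nothing.
script = [delete_column("note"), round_column("price", 2)]

preview = apply_script(script, sample)        # analogous to the step preview
output = apply_script(script, full_dataset)   # analogous to running the recipe

print(preview)
print("note" in full_dataset[0])  # the input still has its column
```

In this sketch the input rows keep their "note" column after both calls, which mirrors how a “Delete Column” step affects the preview and the output dataset but not the input dataset.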

If a script starts to grow in complexity, a number of features can help you manage it.

  • You can disable steps.
  • You can organize individual steps into groups of steps.
  • You can add colors and comments to steps in order to send reminders to yourself and colleagues.
  • You can even copy and paste steps within the same recipe or to another recipe, even if that recipe is in another project or another DSS instance.

Date Handling in DSS

Working with dates poses a number of data cleaning challenges.

There are many date formats, different time zones, and components like “day of the week” which can be difficult to extract. A human reading month-first might recognize that “1/5/19”, “2019-01-05”, and “January 5, 2019” are all the same date. However, to a computer, these are just three different strings.

Strings representing dates need to be parsed, so that the computer can recognize the true, unambiguous meaning of the date. The DSS answer to this problem can be found in the Prepare recipe.
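To see why parsing needs an explicit format, consider the following Python sketch. It uses the standard library rather than DSS, and the format strings are assumptions about how each value was written. The same string yields two different dates depending on whether it is read month-first or day-first.

```python
from datetime import date, datetime


def parse_date(text: str, fmt: str) -> date:
    """Parse a date string using an explicitly stated format."""
    return datetime.strptime(text, fmt).date()


ambiguous = "1/5/19"

# Assumption: month/day/two-digit year (common in the US).
month_first = parse_date(ambiguous, "%m/%d/%y")

# Assumption: day/month/two-digit year (common in Europe).
day_first = parse_date(ambiguous, "%d/%m/%y")

# An ISO 8601 calendar date has only one reading.
iso = parse_date("2019-01-05", "%Y-%m-%d")

print(month_first.isoformat())  # 2019-01-05
print(day_first.isoformat())    # 2019-05-01
print(iso == month_first)       # True
```

Once the format is stated, the result is an unambiguous date value rather than a string, which is what allows later steps to compute with it.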

When a column appears to contain dates, DSS is able to recognize it as such. In the example below, the meaning of the first column is an unparsed date.

You could open the processor library, filter for Dates, and search for a step to help in whatever situation you may find yourself. Here, we find the Parse date processor.

You could also take advantage of how DSS suggests transformation steps based on a column’s meaning. Because DSS has identified this column as an unparsed date, it suggests adding the Parse date processor to the script. Both methods achieve the same result.

Once you have chosen the correct processor, it takes just a few more clicks to select the correct settings, such as the date format and the time zone.

Once you have a properly parsed date, you’re on your way! DSS will suggest new steps, such as “Compute time since”, “Extract date components”, and “Filter date range”.
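As a rough illustration of what these follow-up steps compute, here is a short Python sketch using only the standard library. It is not how DSS implements the processors; the reference date and the range bounds are hypothetical values chosen for the example.

```python
from datetime import date

parsed = date(2019, 1, 5)          # a date that has already been parsed
reference = date(2019, 3, 1)       # hypothetical "now" for the computation

# Analogue of "Compute time since": elapsed days between two dates.
days_since = (reference - parsed).days

# Analogue of "Extract date components".
components = {
    "year": parsed.year,
    "month": parsed.month,
    "day": parsed.day,
    "day_of_week": parsed.strftime("%A"),
}

# Analogue of "Filter date range": keep the row only if it falls in January 2019.
in_range = date(2019, 1, 1) <= parsed <= date(2019, 1, 31)

print(days_since)                  # 55
print(components["day_of_week"])   # Saturday
print(in_range)                    # True
```

None of these operations would be reliable on the raw strings, which is why parsing comes first.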

Formulas in DSS

Often in a Prepare recipe, you will want to create new columns based on those already present in your dataset. In the world of machine learning, this is called feature generation.

Similar to what you might find in a spreadsheet tool like Excel, DSS has its own Formula language.

It is a powerful expression language to perform calculations, manipulate strings, and much more.

From the processor library, you can add a Formula step and provide the name of the output column.

You could write simple formulas directly in the Expression box. Clicking the Edit button, however, adds a few support measures. The first is code completion. As soon as you start typing, DSS starts suggesting columns from the dataset or functions to apply. The Editor will also alert you if the formula is invalid.

The Formula language allows you to craft expressions of considerable complexity. For example, you can use:

  • common mathematical functions, such as round, sum and max
  • comparison operators, such as >, <, >=, <=
  • logical operators, such as AND and OR
  • tests for missing values, such as isBlank() or isNull()
  • string operations with functions like contains(), length(), and startsWith()
  • conditional if-then statements
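To make these categories concrete, the sketch below combines them in plain Python. It is deliberately not written in the DSS Formula language, whose exact syntax is not shown here; the column names (name, price, quantity), the threshold, and the helper is_blank are hypothetical. It only illustrates the kind of feature generation such an expression performs on each row.

```python
from typing import Any, Dict, Optional


def is_blank(value: Optional[str]) -> bool:
    """Test for a missing value: None, empty, or whitespace only."""
    return value is None or value.strip() == ""


def add_features(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with two derived columns."""
    # Mathematical function: rounded product of two existing columns.
    total = round(row["price"] * row["quantity"], 2)

    # Conditional logic with a missing-value test, a comparison,
    # a logical operator, and a string operation.
    if is_blank(row["name"]):
        tier = "unknown"
    elif total >= 100 and row["name"].startswith("Pro"):
        tier = "premium"
    else:
        tier = "standard"

    return {**row, "total": total, "tier": tier}


rows = [
    {"name": "Pro Widget", "price": 25.5, "quantity": 4},
    {"name": "Basic Widget", "price": 12.25, "quantity": 2},
    {"name": "  ", "price": 5.0, "quantity": 1},
]

for enriched in map(add_features, rows):
    print(enriched["total"], enriched["tier"])
```

In a Prepare recipe, a Formula step plays the role of add_features: it evaluates one expression per row and writes the result to the output column you named.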

What’s next?

This summary reviewed the concept of recipes in DSS, and more specifically, the Prepare recipe.

Get more practice using the Prepare recipe in the following hands-on section.
