Efficient ETL Testing

以下文章来源于公众号Bright 技术人生Efficient ETL Testing,作者Bright Liao.

Efficient ETL Testing

It's always been a pain point to do ETL testing. But it more and more becomes a must after data being so widely used these days.

An ETL with more than 100 lines of code is common. The filter conditions, data transformation rules, join conditions and other logic there could be very complicated.

In these cases, we should do testing early to avoid possible production issues. Testing gives us confidence about what we coded and helps team with quality assurance.

But there are a lot of challenges about ETL testing there, and we see a lot of teams struggling.

ETL testing challenges

A common way to do ETL testing requires the steps below:

•Create a production-like environment.•Copy the database definition and table schema to the environment.•For tables used in the ETL, we prepare testing data and insert data to tables.•We run the ETL and it generates a new table with data as a result.•We compare the generated data and the expected data to find if there are any issues.

There is no easy thing for the above steps.

For step 1, a production-like environment not only costs, but also requires heavy ops work. Cloud services may ease the ops work but you may be tightly bounded to some cloud.

For step 2, we may need to write scripts to sync database and table schema. We also need to develop a strategy to store the existing data in test environment. The drawback of it is that it breaks the network separation from test to production environment.

For step 3, it's always been hard work to prepare testing data since some tables the ETL used may contain hundreds of columns and we have to pay attention to columns that are not used in the ETL. We also need to be careful about the column types and how the data is generated. And we need a script to insert data as well.

For step 4, we may need to maintain a separate configuration for test environment.

For step 5, comparing data manually is tedious work and it's easy to make mistakes.

Some team relies on the statistics of the output table to identify issues of ETLs. It is good practice. But when the logic becomes more and more complicated, it's not enough to just rely on statistics, since there might be cases that are not covered even by the real data.

Testing ETL in Easy SQL

Easy SQL provides a very light-weight way to do ETL testing. It removes most of the blockers mentioned above.

To prepare a test in Easy SQL is easy. The first thing to do is to create a spreadsheet from the provided template.

The template looks like below:

image.png

There are two concepts which are popular in testing domain. Easy SQL also adopted them:

•Test case: A single test used to test your code in some specific scenario.•Test suit: A bundle of a few unit test cases. Could be used to run them together.

Test suit

In the screenshot above, we see two test suits, named 'Suit 1' and 'Suit 2'. They are put in different sheets. In Easy SQL, if there is any sheet with a name starting with word 'Suit', the sheet is considered to be a test suit.

Test case

In test suit 'Suit 1', we can see two test cases. One case is 'A test for ETL abc.sql', and the other is 'Another test for ETL abc.sql'.

Test case is recognized by an uppercase keyword CASE in column 'A'. There should be a name of the test case in column 'B', and be next to the CASE keyword.

To describe a test case, we usually specify the variables that should be used to run the ETL, the data of all input tables, the data of the output tables. They are recognized by keywords VARS INPUT OUTPUT in column 'A' and values followed starting from column 'B'.

The data of output tables is used to test if output of the ETL after execution is exactly the same as the data specified in the test case.

Test case element format

The values of the mentioned elements in a test should be of formats below.

VARS: A table with header and exactly one row of data.•INPUT: A name of the input table specified at column 'B'; A table with header and number of rows of data starting from column 'C' of the same row; Mandatory descriptions of each row of data at column 'B' starting from the next row.•OUTPUT: The same format with 'INPUT', except that the descriptions of each row of data is optional.

You may ask why the descriptions of each row of data in 'INPUT' table is mandatory. This is a design on purpose. It is designed to improve test readability. The test case designer could record how the data is generated to explain the business rules behind the data and what is the scenario that the data is designed to cover.

For input tables and output tables, we may need to specify the type of each column. If so, we need to add type to the column names in format '{COLUMN_NAME}:{TYPE}'. If there is any column of a table with type specified, the type of other columns should be specified as well. If the type of any other column is not specified, it will be default to 'string' type.

Column types

The type of column varies for different backends.

For Spark, it should be int bigint boolean string and so on. The full list of types with built-in support are: int tinyint bigint double float string decimal boolean date timestamp array<string> array<int> array<tinyint> array<bigint> array<double> array<float> array<boolean> array<date> array<timestamp>.

For Postgres, it should be int bigint boolean text and so on. The full list of types could be found here[1]. The default type is text.

For Clickhouse, it should be Int8 Boolean String and so on. The full list of types could be found here[2].

For the other backends, please refer to the database data types related document of it.

Mock includes

If we have used include command in our ETL and we'd like to mock the body of the included file. We can add a INCLUDES section in the test case.

Then provide the mocked body of the ETL follow the rules below:

1.Column 'B' at the same row of the INCLUDES keyword should be filled with the file path of the include command in ETL.2.Column 'C' at the same row of the INCLUDES keyword should be filled with the mocked body of the included file.3.Add another row to specify a second INCLUDE to mock, with column 'B' and 'C' filled with file path and the mocked file body.

Usually, the included ETL file returns some temporary table. In this case, we can mock the content of the included file as below:

-- target=temp.THE_RETURNED_TEMP_TABLE

After this, we need to add an input table and provide the mocked data. The way to achieve this is the same as to define a normal input table above.

Test file name

We recommend creating one test file for one ETL. It means all the test cases in one spreadsheet file should be testing the same ETL.

In this case, the file name of the test file and the testing ETL could follow some convention so that we can find the ETL file given the test file.

Easy SQL provides a way to find ETL file from the test file automatically, which follows a simple convention that the base name of the ETL file and that of the test file should be the same.

E.g. when the ETL file is named some_etl.sql, then the test file should be named some_etl.xlsx.

We also recommend there is only one OUTPUT table in one ETL. In this case, the name of the ETL could be the full table name of the output table.

E.g. when an ETL output a table named some_db.some_table, the file name of the ETL should be some_db.some_table.sql and the test file name of the ETL should be some_db.some_table.xlsx.

Add test files to version control system

The test file mentioned above is a spreadsheet file. It is in binary format and not so easy to be added to version control system.

Easy SQL provides a way to dump a test in spreadsheet format to JSON format. After this, we can add the JSON file to version control system. In this way, we can easily compare the changes of each version of this case.

The JSON file is also optimized to let us compare data changes easily.

To convert a test file in spreadsheet format to JSON format. Run the command below:

(Before you run the command below, you will need to install two additional packages by pip3 install click==6.7 pymongo==3.10.1 xlrd==1.2.0)

python3 -m easy_sql.sql_test convert-json -f {YOUR_XLSX_FILE_PATH}

After the command finishes, there will be a JSON file with the same name but a .json suffix of the spreadsheet file generated. The directory of the JSON file is the same as the spreadsheet file.

Run test

Easy SQL provides a command line module to help to run ETL tests.

To run the ETL test, execute the command below:

(Before you run the command below, you will need to install two additional packages by pip3 install click==6.7 pymongo==3.10.1 xlrd==1.2.0)

python3 -m easy_sql.sql_test run-test -f {YOUR_XLSX_FILE_PATH} -b {BACKEND}

The test file could be a JSON test file as well. And the backend could be one of the supported backend.

For details of the command line usage, please run python3 -m easy_sql.sql_test --help.

Run test programmatically

Easy SQL also provides an interface to run ETL programmatically. This way, you can easily integrate tests in Easy SQL with your favorite testing framework.

To run a test in your code, write code below:

import os

For a concrete example, please refer to code here[3].

Summary

In this post, we talked about the necessity of ETL testing and challenges to do ETL testing.

In order to be efficient to create automated test cases, we have to spend some time to create some tools.

Easy SQL provides some built-in tools to help with ETL testing. With the help of Easy SQL, a lot of blockers have been removed. We only need to provide the main information about the test data. There is no more need to care about unrelated columns, data types, data comparing and so on.

Easy SQL embraces the most commonly used tool -- spreadsheet -- to create test cases. We can get a lot of benefits from it, such as a friendly and readable layout, the ability to use formula to prepare data, an intuitive way to record data and mock included code snippets etc.

In one word, with Easy SQL, we can do ETL testing more efficiently and save large amounts of time.

Previous posts about Easy SQL

A new ETL language -- Easy SQLA guide to write elegant ETLNeat syntax design of an ETL language (part 1)Neat syntax design of an ETL language (part 2)

References

[1] PostgreSQL types: https://www.postgresql.org/docs/current/datatype.html
[2] Clickhouse types: https://clickhouse.com/docs/en/sql-reference/data-types/
[3] Sample code to run test programmatically: https://github.com/easysql/easy_sql/blob/main/easy_sql/sql_test.py

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,029评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,395评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,570评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,535评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,650评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,850评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,006评论 3 408
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,747评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,207评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,536评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,683评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,342评论 4 330
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,964评论 3 315
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,772评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,004评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,401评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,566评论 2 349

推荐阅读更多精彩内容