Two Approaches to Processing Irregular TXT Files with Power Query: A Brief Comparison

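In Chapter 7 of "M Is for Data Monkey", the authors walk through an example of using Power Query to process an irregular TXT file (the example files are available for download). Working on my own, I got a first result with the following query: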

let
源 = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Users\wangh\Documents\power pivot教程\M is for Data monkey Examples\Ch07 Examples\GL Jan-Mar.TXT"), null, null, 936)}),
删除的顶端行 = Table.Skip(源,10),
去除的文本 = Table.TransformColumns(删除的顶端行,{},Text.Trim),
用位置分列 = Table.SplitColumn(去除的文本,"Column1",Splitter.SplitTextByRepeatedLengths(15),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7", "Column1.8", "Column1.9"}),
更改的类型 = Table.TransformColumnTypes(用位置分列,{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type text}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}, {"Column1.8", type text}, {"Column1.9", type text}}),
合并的列 = Table.CombineColumns(更改的类型,{"Column1.5", "Column1.6"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"已合并"),
合并的列1 = Table.CombineColumns(合并的列,{"Column1.7", "Column1.8"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"已合并.1"),
删除的列 = Table.RemoveColumns(合并的列1,{"Column1.3", "Column1.9"}),
提升的标题 = Table.PromoteHeaders(删除的列),
筛选的行 = Table.SelectRows(提升的标题, each ([#"Tran Date      "] <> null and [#"Tran Date      "] <> "123-03         " and [#"Tran Date      "] <> "===== End of Re" and [#"Tran Date      "] <> "===============" and [#"Tran Date      "] <> "Account        " and [#"Tran Date      "] <> "Balance        " and [#"Tran Date      "] <> "Dept xxx - Rest" and [#"Tran Date      "] <> "Detailed Genera" and [#"Tran Date      "] <> "Feb 2006" and [#"Tran Date      "] <> "Mar 2006" and [#"Tran Date      "] <> "March 20,2009  " and [#"Tran Date      "] <> "No.            " and [#"Tran Date      "] <> "Tran Date      " and [#"Tran Date      "] <> "XYZ Company Ltd"))
in
筛选的行

The book, however, handles it differently:

let
源 = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Users\wangh\Documents\power pivot教程\M is for Data monkey Examples\Ch07 Examples\GL Jan-Mar.TXT"), null, null, 936)}),
删除的顶端行 = Table.Skip(源,10),
去除的文本 = Table.TransformColumns(删除的顶端行,{},Text.Trim),
用位置分列 = Table.SplitColumn(去除的文本,"Column1",Splitter.SplitTextByRepeatedLengths(15),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7", "Column1.8", "Column1.9"}),
更改的类型 = Table.TransformColumnTypes(用位置分列,{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type text}, {"Column1.4", type text}, {"Column1.5", type text}, {"Column1.6", type text}, {"Column1.7", type text}, {"Column1.8", type text}, {"Column1.9", type text}}),
合并的列 = Table.CombineColumns(更改的类型,{"Column1.5", "Column1.6"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"已合并"),
合并的列1 = Table.CombineColumns(合并的列,{"Column1.7", "Column1.8"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"已合并.1"),
删除的列 = Table.RemoveColumns(合并的列1,{"Column1.3", "Column1.9"}),
提升的标题 = Table.PromoteHeaders(删除的列),
更改的类型1 = Table.TransformColumnTypes(提升的标题,{{"Tran Date      ", type date}}),
删除的错误 = Table.RemoveRowsWithErrors(更改的类型1, {"Tran Date      "}),
更改的类型2 = Table.TransformColumnTypes(删除的错误,{{"Tran Amount    ", type number}}),
删除的错误1 = Table.RemoveRowsWithErrors(更改的类型2, {"Tran Amount    "}),
筛选的行 = Table.SelectRows(删除的错误1, each ([#"Tran Amount    "] <> null))
in
筛选的行

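The two queries diverge at line 11. I used a filter, whereas the book changes the data types of the "Tran Date" and "Tran Amount" columns and then removes the resulting errors and the null rows, which discards the unwanted rows.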
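
The mechanism behind the book's steps can be seen on a small in-memory table. This is only a sketch: the sample rows, the trimmed column names, and the "en-US" culture are assumptions made for illustration and are not taken from the GL file.

let
    // Hypothetical sample rows standing in for the split text file
    Sample = Table.FromRecords({
        [#"Tran Date" = "1/5/2006", #"Tran Amount" = "100.50"],
        [#"Tran Date" = "===============", #"Tran Amount" = null],
        [#"Tran Date" = "Account", #"Tran Amount" = "Balance"],
        [#"Tran Date" = "2/7/2006", #"Tran Amount" = "-20.00"]
    }),
    // Text that is not a date becomes an error cell; the query itself does not fail
    Typed = Table.TransformColumnTypes(Sample, {{"Tran Date", type date}}, "en-US"),
    // Removing rows with an error in that column drops the noise rows
    Cleaned = Table.RemoveRowsWithErrors(Typed, {"Tran Date"})
in
    Cleaned

Only the rows whose "Tran Date" converts to a date should survive. A null cell converts to null rather than to an error, which is presumably why the book's query ends with a separate null filter.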

At first glance my approach looks simpler: a single filter step strips out the noise rows, a job that takes the book's query five steps.

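In reality, though, my filter amounts to hard-coding and works only for this one specific file. The book's approach is smarter and applies to any TXT file in a similar format. The book soon introduces a second file, GL Apr-Jun.TXT. With the book's approach you only need to refresh the query; with my filter you would have to add new conditions to the filter step.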
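
For readers who still prefer a single filter step, the hard-coded list can be replaced by a rule. The sketch below is not the book's method; it is one possible variant that keeps a row only when its date text parses. The helper name IsDate, the sample rows, and the "en-US" culture are all hypothetical.

let
    // Hypothetical sample rows standing in for the split text file
    Sample = Table.FromRecords({
        [#"Tran Date" = "1/5/2006", #"Tran Amount" = "100.50"],
        [#"Tran Date" = "===============", #"Tran Amount" = null],
        [#"Tran Date" = null, #"Tran Amount" = null],
        [#"Tran Date" = "2/7/2006", #"Tran Amount" = "-20.00"]
    }),
    // Hypothetical helper: true only when the text parses as a date;
    // a parse error or a null input both count as "not a date"
    IsDate = (t as nullable text) as logical =>
        (try Date.FromText(t, "en-US") otherwise null) <> null,
    // One rule-based filter instead of a list of literal noise values
    Kept = Table.SelectRows(Sample, each IsDate([#"Tran Date"]))
in
    Kept

Because the rule tests what a valid row looks like instead of listing known noise values, new kinds of noise in a later file should not require editing the step.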

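[Reflection] Power Query is a programmatic way of processing data, so hard-coding should be avoided wherever possible. Only then can the steps be abstracted to fit a whole class of scenarios rather than a single specific file.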
