Nextflow Processes (Part 4 of the Series)

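A process is the basic unit of a Nextflow pipeline. I think "step" (or perhaps "procedure") captures the meaning best; in these notes "step" and "process" refer to the same thing.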

A process starts with the process keyword and contains all the code for that step:

process sayHello {

    """
    echo 'Hello world!' > file
    """

}

Overall, a process body is made up of five parts (directives, input, output, when, and the script itself):

process < name > {

   [ directives ]

   input:
    < process inputs >

   output:
    < process outputs >

   when:
    < condition >

   [script|shell|exec]:
   < user script to be executed >

}

The script part

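This part holds the commands the process executes. By default they are interpreted as Bash.
If the process declares inputs or outputs, the script part must come last.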

process doMoreThings {

  """
  blastp -db $db -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n 10 | cut -f 2 > top_hits
  blastdbcmd -db $db -entry_batch top_hits > sequences
  """

}

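Triple double quotes allow multi-line strings with variable interpolation.
To use a system (shell environment) variable instead, use triple single quotes, which disable interpolation: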

process printPath {

   '''
   echo The path is: $PATH
   '''

}

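Another option is to escape the dollar sign with a backslash. Below, \$DB is left for the shell to expand, while $MAX is a Nextflow variable: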

process doOtherThings {

  """
  blastp -db \$DB -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n $MAX | cut -f 2 > top_hits
  blastdbcmd -db \$DB -entry_batch top_hits > sequences
  """

}
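
The difference between the two kinds of dollar sign can be illustrated outside Nextflow. The sketch below is only a rough Python analogy, not how Nextflow works internally: it uses string.Template, where "$$" stands in for Nextflow's "\$" escape and "$MAX" for a pipeline variable. The render function and the value 10 are made up for illustration.

```python
from string import Template

# Hypothetical stand-in for a Nextflow script string: "$$" plays the role of
# Nextflow's "\$" escape, and "$MAX" is a pipeline variable to interpolate.
script = Template(
    "blastp -db $$DB -query query.fa -outfmt 6 > blast_result\n"
    "cat blast_result | head -n $MAX | cut -f 2 > top_hits"
)


def render(max_hits):
    """Fill in the pipeline variable and leave the shell variable for Bash."""
    return script.substitute(MAX=max_hits)


if __name__ == "__main__":
    print(render(10))
```

After rendering, $DB is still present for the shell to resolve at run time, while $MAX has already been replaced by the workflow engine.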

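Scripts in other languages, and mixing languages

The script part uses Bash by default, but you can start it with a shebang line to use another scripting language such as Perl, Python, Ruby, or R: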

process perlStuff {

    """
    #!/usr/bin/perl

    print 'Hi there!' . '\n';
    """

}

process pyStuff {

    """
    #!/usr/bin/python

    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """

}

Conditional scripts

seq_to_align = ...
mode = 'tcoffee'

process align {
    input:
    file seq_to_aln from sequences

    script:
    if( mode == 'tcoffee' )
        """
        t_coffee -in $seq_to_aln > out_file
        """

    else if( mode == 'mafft' )
        """
        mafft --anysymbol --parttree --quiet $seq_to_aln > out_file
        """

    else if( mode == 'clustalo' )
        """
        clustalo -i $seq_to_aln -o out_file
        """

    else
        error "Invalid alignment mode: ${mode}"

}
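
The branching above simply picks one command string per mode and fails on anything else. As a plain-Python sketch of that selection logic (build_align_command is a hypothetical helper, not part of Nextflow; the command lines are copied from the process above):

```python
def build_align_command(mode, seq_file):
    """Return the command for the chosen mode, mirroring the if/else chain."""
    commands = {
        "tcoffee": f"t_coffee -in {seq_file} > out_file",
        "mafft": f"mafft --anysymbol --parttree --quiet {seq_file} > out_file",
        "clustalo": f"clustalo -i {seq_file} -o out_file",
    }
    if mode not in commands:
        # Same message as the error call in the process definition.
        raise ValueError(f"Invalid alignment mode: {mode}")
    return commands[mode]


if __name__ == "__main__":
    print(build_align_command("mafft", "input.fa"))
```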

Templates

In other words, you can write script templates and reuse them across processes:

process template_example {

    input:
    val STR from 'this', 'that'

    script:
    template 'my_script.sh'

}

The templates directory contains a file my_script.sh with the following content:

#!/bin/bash
echo "process started at `date`"
echo $STR
:
echo "process completed"

One way to test a template is to run it directly from a shell terminal, supplying the variable yourself:

STR='foo' bash templates/my_script.sh

The shell block

This forces a shell context: $ belongs to the shell, and Nextflow variables are written with ! instead:

process myTask {

    input:
    val str from 'Hello', 'Hola', 'Bonjour'

    shell:
    '''
    echo User $USER says !{str}
    '''

}

Here $USER is a shell variable, while !{str} is a Nextflow variable.
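
To see why the two markers never collide, here is a minimal Python sketch of the substitution idea. It is an assumption-laden model rather than Nextflow's implementation: render_shell_block is a hypothetical helper, and it only handles simple names inside !{...}.

```python
import re


def render_shell_block(script, variables):
    """Replace !{name} placeholders; leave $NAME for the shell to expand."""
    def lookup(match):
        return str(variables[match.group(1)])
    return re.sub(r"!\{(\w+)\}", lookup, script)


if __name__ == "__main__":
    for greeting in ["Hello", "Hola", "Bonjour"]:
        print(render_shell_block("echo User $USER says !{str}", {"str": greeting}))
```

Each rendered line still contains $USER untouched, which is exactly what gets handed to the shell.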

Native execution (built-in syntax)

Nextflow is itself an extension of Groovy, so Groovy statements can be used directly in an exec block:

x = Channel.from( 'a', 'b', 'c')

process simpleSum {
    input:
    val x

    exec:
    println "Hello Mr. $x"
}

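Inputs

Nextflow processes are isolated from one another and communicate through channels. The input block defines where a process gets its data. Each process can have only one input block, but that block can contain several input declarations (that is, multiple inputs are allowed).
The general syntax is: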

input:
  <input qualifier> <input name> [from <source channel>] [attributes]

Value inputs

num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val x from num

  "echo process job $x"

}

This prints the following (the order can vary, because the jobs run in parallel):

process job 3
process job 1
process job 2

When the input variable has the same name as the channel, the from clause can be omitted:

num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val num

  "echo process job $num"

}

File inputs

proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file query_file from proteins

  "blastp -query ${query_file} -db nr"

}

When the input name is the same as the channel name, from can be omitted:

proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file proteins

  "blastp -query $proteins -db nr"

}

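You can also give the input file a fixed name in the task's working directory. The script can then refer to that literal name without any $ variable, which keeps external commands working as expected: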

input:
    file query_file name 'query.fa' from proteins

Or, more briefly:

input:
    file 'query.fa' from proteins

In use:

proteins = Channel.fromPath( '/some/path/*.fa' )

process blastThemAll {
  input:
  file 'query.fa' from proteins

  "blastp -query query.fa -db nr"

}

Multiple input files

fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)

process blastThemAll {
    input:
    file 'seq' from fasta

    "echo seq*"

}

This prints:

seq1 seq2 seq3
seq1 seq2 seq3
...

The staged names can also be controlled with a wildcard: in 'seq?.fa' the ? is replaced by the index of each file.

fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)

process blastThemAll {
    input:
    file 'seq?.fa' from fasta

    "cat seq1.fa seq2.fa seq3.fa"

}
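
The naming in the two examples above can be modelled with a few lines of Python. This is a simplified sketch covering only the two cases shown in these notes (a plain name, and a name containing ?) when several files arrive together. staged_names is a hypothetical helper, and the zero-padding for repeated ? is my reading of the documented behaviour rather than something tested here against Nextflow.

```python
import re


def staged_names(pattern, count):
    """Names given to `count` files staged under `pattern` (simplified model)."""
    names = []
    for index in range(1, count + 1):
        if "?" in pattern:
            # Replace the run of ? with the index, zero-padded to its width.
            name = re.sub(
                r"\?+",
                lambda m: str(index).zfill(len(m.group(0))),
                pattern,
                count=1,
            )
        else:
            # No wildcard: append the index to the given name.
            name = f"{pattern}{index}"
        names.append(name)
    return names


if __name__ == "__main__":
    print(staged_names("seq", 3))
    print(staged_names("seq?.fa", 3))
```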

Dynamic input file names

process simpleCount {
  input:
  val x from species
  file "${x}.fa" from genomes

  """
  cat ${x}.fa | grep '>'
  """
}

Input of type 'stdin'

str = Channel.from('hello', 'hola', 'bonjour', 'ciao').map { it+'\n' }

process printAll {
   input:
   stdin str

   """
   cat -
   """

}

This prints:

hola
bonjour
ciao
hello

Input of type 'env'

str = Channel.from('hello', 'hola', 'bonjour', 'ciao')

process printEnv {

    input:
    env HELLO from str

    '''
    echo $HELLO world!
    '''

}

This prints:

hello world!
ciao world!
bonjour world!
hola world!

Input of type 'set'

The set qualifier lets you group multiple parameters in a single parameter definition. It is useful when a process receives tuples of values as input and each element needs to be handled separately. Each element in the tuple is associated with the corresponding element of the set definition. For example:

tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )

process setExample {
    input:
    set val(x), file('latin.txt') from tuple

    """
    echo Processing $x
    cat - latin.txt > copy
    """

}

In the above example the set parameter is used to define the value x and the file latin.txt, which will receive a value from the same channel.

In the set declaration items can be defined by using the following qualifiers: val, env, file and stdin.

A shorter notation can be used by applying the following substitution rules:

long	short
val(x)	x
file(x)	(not supported)
file('name')	'name'
file(x:'name')	x:'name'
stdin	'-'
env(x)	(not supported)

Thus the previous example could be rewritten as follows:

process setExample {
    input:
    set x, 'latin.txt' from tuple

    """
    echo Processing $x
    cat - latin.txt > copy
    """

}

File names can be defined in dynamic manner as explained in the Dynamic input file names section.

Input repeaters (a highlight!)

The each qualifier is an efficient way to repeat a process for every item in a collection, for example:

sequences = Channel.fromPath('*.fa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
  input:
  file seq from sequences
  each mode from methods

  """
  t_coffee -in $seq -mode $mode > result
  """
}

The process above runs the alignment three times for each sequence file, once per mode.
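
In effect, each produces the Cartesian product of the regular inputs and the repeated collection. A short Python sketch of that bookkeeping (each_combinations and the file names a.fa and b.fa are made up for illustration):

```python
from itertools import product


def each_combinations(sequences, methods):
    """Pair every sequence file with every mode, as `each` does."""
    return list(product(sequences, methods))


if __name__ == "__main__":
    methods = ["regular", "expresso", "psicoffee"]
    for seq, mode in each_combinations(["a.fa", "b.fa"], methods):
        print(f"t_coffee -in {seq} -mode {mode} > result")
```

Two input files and three modes therefore give six tasks.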

Understanding how multiple input channels work

process foo {
  echo true
  input:
  val x from Channel.from(1,2)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

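This prints the following. The process stops as soon as the shorter channel runs out, so 'c' is never used: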

1 and a
2 and b

By contrast,

process bar {
  echo true
  input:
  val x from Channel.value(1)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

reuses 1 automatically, because a value channel can be read any number of times:

1 and a
1 and b
1 and c
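
The two behaviours can be mimicked in Python to make the rule easier to remember. This is only a model of the pairing, with lists standing in for channels; pair_inputs and pair_with_value are hypothetical names.

```python
from itertools import repeat


def pair_inputs(x_channel, y_channel):
    """Queue channels are consumed in lockstep; the shorter one ends the run."""
    return list(zip(x_channel, y_channel))


def pair_with_value(value, y_channel):
    """A value channel can be read any number of times, so it never runs out."""
    return list(zip(repeat(value), y_channel))


if __name__ == "__main__":
    for x, y in pair_inputs([1, 2], ["a", "b", "c"]):
        print(f"{x} and {y}")
    for x, y in pair_with_value(1, ["a", "b", "c"]):
        print(f"{x} and {y}")
```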

And there is more...

Outputs

methods = ['prot','dna', 'rna']

process foo {
  input:
  val x from methods

  output:
  val x into receiver

  """
  echo $x > file
  """

}

receiver.println { "Received: $it" }

process align {
  input:
  val x from species
  file seq from sequences

  output:
  file "${x}.aln" into genomes

  """
  t_coffee -in $seq > ${x}.aln
  """
}

When

Directive

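Closing remarks

After these two or three hours of reading the documentation, I think I have a workable grasp of Nextflow. There is no need to keep reading the docs for now.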
