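A process is the basic unit of a Nextflow pipeline. To my mind the most natural way to think of it is as a "step" (or perhaps a "stage"). To keep things simple, the rest of these notes use "step".
A step starts with the keyword process and contains all of the code for that step: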
process sayHello {
    """
    echo 'Hello world!' > file
    """
}
Overall, a process definition has five parts (directives, input, output, when, and the script itself):
process < name > {
   [ directives ]
   input:
    < process inputs >
   output:
    < process outputs >
   when:
    < condition >
   [script|shell|exec]:
   < user script to be executed >
}
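The script section
This section holds the commands the process executes. By default they are interpreted as Bash.
If the process declares inputs or outputs, the script section must come last.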
process doMoreThings {
  """
  blastp -db $db -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n 10 | cut -f 2 > top_hits
  blastdbcmd -db $db -entry_batch top_hits > sequences
  """
}
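Triple double quotes allow multi-line strings with variable interpolation.
To use system (environment) variables instead, use triple single quotes, which turn interpolation off: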
process printPath {
   '''
   echo The path is: $PATH
   '''
}
Another option is to keep the double quotes and escape the shell variables with a backslash:
process doOtherThings {
  """
  blastp -db \$DB -query query.fa -outfmt 6 > blast_result
  cat blast_result | head -n $MAX | cut -f 2 > top_hits
  blastdbcmd -db \$DB -entry_batch top_hits > sequences
  """
}
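Scripts in other languages, and mixing them
The script section runs Bash by default, but you can point it at another scripting language such as Perl, Python, Ruby, or R by starting the script with a shebang line: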
process perlStuff {
    """
    #!/usr/bin/perl
    print 'Hi there!' . '\n';
    """
}
process pyStuff {
    """
    #!/usr/bin/python
    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x,y))
    """
}
Conditional scripts
seq_to_align = ...
mode = 'tcoffee'
process align {
    input:
    file seq_to_aln from sequences
    script:
    if( mode == 'tcoffee' )
        """
        t_coffee -in $seq_to_aln > out_file
        """
    else if( mode == 'mafft' )
        """
        mafft --anysymbol --parttree --quiet $seq_to_aln > out_file
        """
    else if( mode == 'clustalo' )
        """
        clustalo -i $seq_to_aln -o out_file
        """
    else
        error "Invalid alignment mode: ${mode}"
}
Templates
In other words, you can write a script once as a template and reuse it from several processes:
process template_example {
    input:
    val STR from 'this', 'that'
    script:
    template 'my_script.sh'
}
The templates directory (next to the pipeline script) contains a file my_script.sh with this content:
#!/bin/bash
echo "process started at `date`"
echo $STR
:
echo "process completed"
One way to test a template is to run it directly from a shell terminal, supplying the variable yourself:
STR='foo' bash templates/my_script.sh
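The check above can be reproduced end to end in a scratch directory. This is only a sketch: the temporary directory and the `out` variable are illustrative and not part of the original notes, and the template body is copied from the listing above.

```shell
# Hypothetical scratch setup: recreate the template in a throwaway directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/templates"

# The quoted 'EOF' keeps $STR and the backticks literal while writing the file.
cat > "$workdir/templates/my_script.sh" <<'EOF'
#!/bin/bash
echo "process started at `date`"
echo $STR
:
echo "process completed"
EOF

# Supply STR through the environment, exactly as the one-liner above does.
out=$(cd "$workdir" && STR='foo' bash templates/my_script.sh)
echo "$out"

rm -rf "$workdir"
```

The second line of the output should be the value passed in as STR, which confirms that the template only depends on that one variable.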
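The shell block
The shell block forces a shell context: $ variables are left to the shell, and Nextflow variables are written with a ! prefix instead, as in !{str}.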
process myTask {
    input:
    val str from 'Hello', 'Hola', 'Bonjour'
    shell:
    '''
    echo User $USER says !{str}
    '''
}
Here $USER is a shell variable, while !{str} is a Nextflow variable.
Native execution (built-in syntax)
Nextflow is itself an extension of Groovy, so a process can run Groovy code directly in an exec block:
x = Channel.from( 'a', 'b', 'c')
process simpleSum {
    input:
    val x
    exec:
    println "Hello Mr. $x"
}
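Inputs
Nextflow processes are isolated from one another and communicate through channels. The input block defines where a process gets its data. Each process can have only one input block, but that block can contain several input declarations (that is, several inputs are allowed).
The general syntax is: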
input:
  <input qualifier> <input name> [from <source channel>] [attributes]

Inputs of generic values
num = Channel.from( 1, 2, 3 )
process basicExample {
  input:
  val x from num
  "echo process job $x"
}
This prints the following (the order can vary, because the tasks run in parallel):
process job 3
process job 1
process job 2
When the input has the same name as the channel it reads from, the from clause can be omitted:
num = Channel.from( 1, 2, 3 )
process basicExample {
  input:
  val num
  "echo process job $num"
}
Inputs of files
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
  input:
  file query_file from proteins
  "blastp -query ${query_file} -db nr"
}
When the input file has the same name as the channel, the from clause can again be omitted:
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
  input:
  file proteins
  "blastp -query $proteins -db nr"
}
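You can also give the input file a fixed name in the task's working directory, so the script can refer to it by that literal name rather than through a $ variable. This is useful when an external command expects a specific file name.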
input:
    file query_file name 'query.fa' from proteins
or, more briefly:
input:
    file 'query.fa' from proteins
In use:
proteins = Channel.fromPath( '/some/path/*.fa' )
process blastThemAll {
  input:
  file 'query.fa' from proteins
  "blastp -query query.fa -db nr"
}
Multiple input files
fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)
process blastThemAll {
    input:
    file 'seq' from fasta
    "echo seq*"
}
This prints:
seq1 seq2 seq3
seq1 seq2 seq3
...
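The output above comes from two things working together: when one name is given for a multi-file input, Nextflow stages the files under that name plus a numeric suffix, and the shell then expands the glob. The sketch below shows the shell side only, with empty files standing in for the staged inputs. The scratch directory and the `listing` variable are illustrative assumptions.

```shell
# Hypothetical stand-ins for three files staged by Nextflow as seq1, seq2, seq3.
workdir=$(mktemp -d)
touch "$workdir/seq1" "$workdir/seq2" "$workdir/seq3"

# The shell expands seq* to every matching file name, in sorted order.
listing=$(cd "$workdir" && echo seq*)
echo "$listing"

rm -rf "$workdir"
```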

fasta = Channel.fromPath( "/some/path/*.fa" ).buffer(size:3)
process blastThemAll {
    input:
    file 'seq?.fa' from fasta
    "cat seq1.fa seq2.fa seq3.fa"
}
Dynamic input file names
process simpleCount {
  input:
  val x from species
  file "${x}.fa" from genomes
  """
  cat ${x}.fa | grep '>'
  """
}
Inputs of type stdin
str = Channel.from('hello', 'hola', 'bonjour', 'ciao').map { it+'\n' }
process printAll {
   input:
   stdin str
   """
   cat -
   """
}
This prints:
hola
bonjour
ciao
hello
Inputs of type env
str = Channel.from('hello', 'hola', 'bonjour', 'ciao')
process printEnv {
    input:
    env HELLO from str
    '''
    echo $HELLO world!
    '''
}
This prints:
hello world!
ciao world!
bonjour world!
hola world!
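At the shell level, an env input amounts to running the script with that variable already set in its environment. The sketch below imitates the effect, not Nextflow's actual mechanism. The loop and the `env_out` variable are illustrative, and unlike the parallel tasks above, the loop always prints in the order given.

```shell
# Each iteration stands in for one task: HELLO is set only for that command.
env_out=$(
  for greeting in hello hola bonjour ciao; do
    # Single quotes keep $HELLO literal so the child shell expands it.
    HELLO="$greeting" bash -c 'echo $HELLO world!'
  done
)
echo "$env_out"
```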
Input of type 'set'
The set qualifier allows you to group multiple parameters in a single parameter definition. It can be useful when a process receives, in input, tuples of values that need to be handled separately. Each element in the tuple is associated to a corresponding element with the set definition. For example:
tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )
process setExample {
    input:
    set val(x), file('latin.txt')  from tuple
    """
    echo Processing $x
    cat - latin.txt > copy
    """
}
In the above example the set parameter is used to define the value x and the file latin.txt, which will receive a value from the same channel.
In the set declaration items can be defined by using the following qualifiers: val, env, file and stdin.
A shorter notation can be used by applying the following substitution rules:
| long | short | 
|---|---|
| val(x) | x | 
| file(x) | (not supported) | 
| file('name') | 'name' | 
| file(x:'name') | x:'name' | 
| stdin | '-' | 
| env(x) | (not supported) | 
Thus the previous example could be rewritten as follows:
tuple = Channel.from( [1, 'alpha'], [2, 'beta'], [3, 'delta'] )
process setExample {
    input:
    set x, 'latin.txt' from tuple
    """
    echo Processing $x
    cat - latin.txt > copy
    """
}
File names can be defined in dynamic manner as explained in the Dynamic input file names section.
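Input repeaters (a highlight!)
The each qualifier repeats the process for every item in a collection, which is an efficient way to generate repeated steps. For example: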
sequences = Channel.fromPath('*.fa')
methods = ['regular', 'expresso', 'psicoffee']
process alignSequences {
  input:
  file seq from sequences
  each mode from methods
  """
  t_coffee -in $seq -mode $mode > result
  """
}
The process above runs the alignment in all three modes for each sequence file.
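One way to picture what each does is as a nested loop: one task per (file, mode) pair. The sketch below only prints the commands that would be run. The two file names are hypothetical stand-ins for whatever *.fa matches, and the `combos` variable is illustrative.

```shell
# Hypothetical inputs: two sequence files and the three modes from the example.
combos=$(
  for seq in a.fa b.fa; do
    for mode in regular expresso psicoffee; do
      # One line per task that the each qualifier would generate.
      echo "t_coffee -in $seq -mode $mode"
    done
  done
)
echo "$combos"
```

With two files and three modes this lists six commands, so the number of tasks grows as files times modes.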
How multiple input channels work together
process foo {
  echo true
  input:
  val x from Channel.from(1,2)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}
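This prints the following (the process stops once the shorter channel is exhausted, so c is never consumed):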
1 and a
2 and b
whereas
process bar {
  echo true
  input:
  val x from Channel.value(1)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}
reuses the value 1 for every item, because a value channel can be read any number of times:
1 and a
1 and b
1 and c
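The difference between the two cases can be imitated in plain shell. This is a sketch of the pairing rule only, not of Nextflow's scheduling. The space-separated lists stand in for the channels, and the `paired` and `repeated` variables are illustrative.

```shell
xs="1 2"      # stands in for Channel.from(1,2)
ys="a b c"    # stands in for Channel.from('a','b','c')

# Two queue channels: take one item from each per task, stop when either runs out.
paired=$(
  rest=$ys
  for x in $xs; do
    [ -n "$rest" ] || break
    y=${rest%% *}                 # first remaining item of ys
    echo "$x and $y"
    case $rest in
      *" "*) rest=${rest#* } ;;   # drop the item just used
      *)     rest="" ;;
    esac
  done
)

# A value channel paired with a queue channel: the single value is reused every time.
repeated=$(for y in $ys; do echo "1 and $y"; done)

echo "$paired"
echo "$repeated"
```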
There is more ....
Outputs
methods = ['prot','dna', 'rna']
process foo {
  input:
  val x from methods
  output:
  val x into receiver
  """
  echo $x > file
  """
}
receiver.println { "Received: $it" }
process align {
  input:
  val x from species
  file seq from sequences
  output:
  file "${x}.aln" into genomes
  """
  t_coffee -in $seq > ${x}.aln
  """
}
When
Directive
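Closing remarks
After these two or three hours of reading the documentation, I feel I should have a working grasp of Nextflow. There is no need to keep reading the docs for now.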