Advanced Training - September 2023

2 minute read

Youtube Link

Training material and info

Slack channel: #sept23-advanced-training

Operators:

The map operator

The map operator:

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map { it * it }
    | view
}

Each element of the channel is being multiplied by itself. By default, each element of the list in the channel is named “it”. You can use “->” to name the elements.

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map { num -> num * num }
    | view
}

Here each of the integers in the list will have the name “num” for each of the repeated actions performed on the elements.

I also notice that they use “ ” as a pipe symbol. Does it work also with “.”? Seems like it, but have to use view() instead of simply view

Maps in groovy are represented by key:value pairs inside square brackets.

The view operator

View the outputs of a pipeline, or channel. It’s possible to customize the output. E.g., : view { "Found: $it"}.

The splitCsv operator

When using the splitCsv() operator, with header: true, the column names are mapped to the different elements, or values, of the columns/cells in the csv. With the header: true argument we can use the header names to access the csv elements.

samplesheet.csv:

id,repeat,type,fastq1,fastq2
sampleA,1,normal,data/reads/sampleA_rep1_normal_R1.fastq.gz,data/reads/sampleA_rep1_normal_R2.fastq.gz
sampleA,1,tumor,data/reads/sampleA_rep1_tumor_R1.fastq.gz,data/reads/sampleA_rep1_tumor_R2.fastq.gz
sampleA,2,normal,data/reads/sampleA_rep2_normal_R1.fastq.gz,data/reads/sampleA_rep2_normal_R2.fastq.gz
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv ( header: true )
    | view
}

operators -> nextflow run main.nf 
N E X T F L O W  ~  version 23.04.3
Launching `main.nf` [small_boltzmann] DSL2 - revision: b156d7f712
[id:sampleA, repeat:1, type:normal, fastq1:data/reads/sampleA_rep1_normal_R1.fastq.gz, fastq2:data/reads/sampleA_rep1_normal_R2.fastq.gz]
[id:sampleA, repeat:1, type:tumor, fastq1:data/reads/sampleA_rep1_tumor_R1.fastq.gz, fastq2:data/reads/sampleA_rep1_tumor_R2.fastq.gz]
[id:sampleA, repeat:2, type:normal, fastq1:data/reads/sampleA_rep2_normal_R1.fastq.gz, fastq2:data/reads/sampleA_rep2_normal_R2.fastq.gz]
[id:sampleA, repeat:2, type:tumor, fastq1:data/reads/sampleA_rep2_tumor_R1.fastq.gz, fastq2:data/reads/sampleA_rep2_tumor_R2.fastq.gz]

Notice the key:value pairs inside square brackets.

To extract three specific columns into a tuple (notice the use of the file() method to make sure that these are paths to files):

workflow {

    reads = Channel.fromPath("data/samplesheet.csv")
    .splitCsv ( header: true )
    .map{ row -> tuple(row.id, file(row.fastq1), file(row.fastq2))}
    .view()
}

operators -> nextflow run main.nf 
N E X T F L O W  ~  version 23.04.3
Launching `main.nf` [elegant_lagrange] DSL2 - revision: 2c9851a553
[sampleA, [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_normal_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_normal_R2.fastq.gz]]
[sampleA, [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_tumor_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_tumor_R2.fastq.gz]]
[sampleA, [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_normal_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_normal_R2.fastq.gz]]
[sampleA, [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_tumor_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_tumor_R2.fastq.gz]]

This would be the same. Using the default “it” name:

workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { [it.id, [file(it.fastq1), file(it.fastq2)]] }
    | view
}

What happens here is that we are iterating over each row in the csv file.

It can be a good idea to keep as much metadata as possible through the workflow. Little cost of doing that.

Map objects are equivalent to python dictionaries.

workflow {

    reads = Channel.fromPath("data/samplesheet.csv")
    .splitCsv ( header: true )
    .map { row -> 
        meta = [id: row.id, type: row.type, repeat: row.repeat] 
        [meta, tuple(file(row.fastq1), file(row.fastq2))]
    }
    .view()
}

N E X T F L O W  ~  version 23.04.3
Launching `main.nf` [evil_hopper] DSL2 - revision: 9c91ffa343
[[id:sampleA, type:normal, repeat:1], [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_normal_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_normal_R2.fastq.gz]]
[[id:sampleA, type:tumor, repeat:1], [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_tumor_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep1_tumor_R2.fastq.gz]]
[[id:sampleA, type:normal, repeat:2], [/workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_normal_R1.fastq.gz, /workspace/gitpod/nf-training/advanced/operators/data/reads/sampleA_rep2_normal_R2.fastq.gz]]

Here we gather the id, type and repeat columns into a map object. And then we gather that map object (meta. I think often referred to as the metamap) in a tuple with the two fastq file path columns. We can access the different elements in meta later in the processes by e.g., meta.id.

Leave a Comment