
Commit 247d794

Restructure workflow manuals
1 parent b3dfb2a commit 247d794

16 files changed (+388, -405 lines)

posts/IMAM-01-main-page-index.qmd

Lines changed: 26 additions & 4 deletions
@@ -8,8 +8,30 @@ sidebar: id-imam
  # Introduction {.unnumbered}

- Welcome to the Illumina metagenomic data analysis manual. This manual contains a step-by-step guide for performing quality control, filtering host sequences, assembling reads into contigs, annotating the contigs, and then extracting viral contigs and their corresponding reads. In the final chapter of the manual we will show how to automate all of these steps into a single pipeline for speed and convenience.
+ Welcome to the Illumina metagenomic data analysis manual. This documentation covers the **IMAM** workflow: an Illumina metagenomic pipeline designed for quality control, filtering host sequences, assembling reads into contigs, annotating the contigs, and then extracting viral contigs and their corresponding reads.

- ::: callout-tip
- If you are just interested in running the automated workflow, then you only have to check out the chapters 'Preparation' and 'Automating data analysis'.
- :::
+ This manual is divided into two parts:
+
+ 1. **Quick Start:** How to set up and run the automated pipeline immediately.
+ 2. **Manual Execution:** A detailed breakdown of the underlying bioinformatic steps (under the hood).
+
+ ## Workflow Summary
+
+ The pipeline performs the following steps:
+
+ 1. **Quality control**
+    * **Merging and decompressing**: Merge and decompress the raw read files with zcat.
+    * **Deduplication**: Remove duplicate reads from the uncompressed FASTQ files with [cd-hit-dup](https://github.com/weizhongli/cdhit/blob/master/doc/cdhit-user-guide.wiki).
+    * **Quality trimming**: Perform quality and sequence adapter trimming with [fastp](https://github.com/OpenGene/fastp).
+    * **Host filtering**: Remove reads that map to a host genome (e.g., human) with [bwa](https://github.com/lh3/bwa).
+ 2. **De novo assembly**
+    * Perform de novo assembly of the host-filtered reads to create contigs with [SPAdes](https://github.com/ablab/spades).
+ 3. **Taxonomic classification**
+    * Annotate the aggregated contigs by assigning taxonomic classifications to them, based on sequence similarity to known proteins in a database, using [diamond blastx](https://github.com/bbuchfink/diamond).
+ 4. **Extracting Viral Sequences and Analyzing Mapped Reads**
+    * Split the contigs into annotated, unannotated and viral sets.
+    * Map the quality-filtered and host-filtered reads back to the assembled contigs to quantify the abundance of different contigs in each sample.
+    * Extract and count the mapped reads for each annotation file.
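
For orientation, the summarized steps correspond to commands along the following lines. This is a hedged sketch with placeholder filenames, not the workflow's exact invocations (those live in the Snakefile); `samtools` is assumed here for extracting non-host read pairs, which the workflow may do differently:

``` bash
# Placeholder filenames throughout; R1/R2 are the paired-end read files.
zcat sample_L001_R1_001.fastq.gz sample_L002_R1_001.fastq.gz > sample_R1.fastq         # merge lanes + decompress (repeat for R2)
cd-hit-dup -i sample_R1.fastq -i2 sample_R2.fastq -o dedup_R1.fastq -o2 dedup_R2.fastq # deduplicate read pairs
fastp -i dedup_R1.fastq -I dedup_R2.fastq -o trim_R1.fastq -O trim_R2.fastq            # quality + adapter trimming
bwa index host.fasta                                                                   # one-time host genome index
bwa mem host.fasta trim_R1.fastq trim_R2.fastq > aligned.sam                           # map reads against the host
samtools fastq -f 12 -1 filt_R1.fastq -2 filt_R2.fastq aligned.sam                     # keep pairs where neither mate maps
spades.py --meta -1 filt_R1.fastq -2 filt_R2.fastq -o assembly/                        # de novo assembly into contigs
diamond blastx --db proteins.dmnd --query assembly/contigs.fasta --out annotations.tsv # classify contigs by protein similarity
```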

posts/IMAM-02-preparation.qmd

Lines changed: 0 additions & 48 deletions
This file was deleted.

posts/IMAM-02-quick_start.qmd

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# 1. Quick Start {.unnumbered}

This chapter guides you through setting up the environment and executing the automated pipeline.

::: callout-warning
# Important!

In the following sections, whenever a **"parameter"** in curly braces `{}` is shown, you are meant to fill in your own filename or value. Each parameter is explained in detail in its section.
:::

::: callout-tip
Notice the small *"Copy to Clipboard"* button on the right-hand side of each code chunk; it can be used to copy the code.
:::

## 1.1 Singularity Setup {.unnumbered}

The workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts.

**Prerequisites:**
[Singularity/Apptainer](https://docs.sylabs.io/guides/latest/user-guide/){target="_blank"} version 3.x or later must be installed on your system. If you are working on a High Performance Computing (HPC) system, it is most likely already installed. Run `singularity --help` in a terminal connected to the HPC system to see whether the command is recognized.
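
A quick way to confirm the installation (and see which version you have):

``` bash
# Print the installed Singularity/Apptainer version
singularity --version
```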

## 1.2 Download the Image {.unnumbered}

Download the required workflow image file (`imam_workflow.sif`) directly through the terminal:

``` bash
wget https://github.com/LucvZon/illumina-metagenomic-analysis-manual/releases/latest/download/imam_workflow.sif
```

Or go to the [GitHub releases page](https://github.com/LucvZon/illumina-metagenomic-analysis-manual/releases/latest){target="_blank"}, download it manually, and then transfer it to your HPC system.
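
If you downloaded the image to your local machine, the transfer could look like this (hypothetical hostname and destination path):

``` bash
# Copy the image from your local machine to your HPC home directory
scp imam_workflow.sif user@hpc.example.org:~/
```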

## 1.3 Verify Container {.unnumbered}

You can verify the download by checking the container version or starting an interactive shell.

``` bash
# Check version
singularity run imam_workflow.sif --version

# Start interactive shell
singularity shell imam_workflow.sif
```

`singularity shell imam_workflow.sif` will drop you into a shell running inside the container. The conda environment needed for this workflow is activated automatically when the interactive shell starts, so all of its tools are ready to use.

Note that you do not have to run `conda activate {environment}` to activate the environment – everything is inside imam_workflow.sif. If you are curious about the conda environment we are using, you can check it out [here](https://github.com/LucvZon/illumina-metagenomic-analysis-manual/blob/main/envs/environment.yml){target="_blank"}.
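
Inside the interactive shell you can spot-check that the bundled tools resolve. A small sketch (tool names assumed from the workflow's environment file):

``` bash
# Run inside the container shell; each command should print a version string
fastp --version
snakemake --version
exit  # leave the container shell
```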

## 1.4 Project Setup {.unnumbered}

We use a tool called [Snakemake](https://snakemake.readthedocs.io/en/stable/){target="_blank"} to automate the analysis. To simplify the creation of the required project directory, the container includes a helper script called `prepare_project.py`. This script automates the creation of the project directory, the sample configuration file (`sample.tsv`), and the general settings configuration file (`config.yaml`), guiding you through each step with clear prompts and error checking.

### Required files

Before running the setup script, ensure you have the following two files ready:

1. `Diamond DB`: A DIAMOND database that will be used to annotate assembled contigs.
2. `Reference genome`: An indexed fasta file that will be used as a host filter for your reads.
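
If you still need to build these two files, a minimal sketch (placeholder filenames; assuming you index with bwa and build the database with DIAMOND, the tools the workflow itself uses):

``` bash
# Index a host reference genome for bwa (writes .amb/.ann/.bwt/.pac/.sa files alongside it)
bwa index host_genome.fasta

# Build a DIAMOND protein database from a protein FASTA (writes proteins.dmnd)
diamond makedb --in proteins.faa --db proteins
```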
### Initializing the Project Directory

Use `singularity exec` to run the setup script. This will create your project folder and generate the **sample.tsv**, **config.yaml** and **Snakefile** required for the pipeline.

``` bash
singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  python /prepare_project.py \
    -p {project.folder} \
    -n {name} \
    -r {reads} \
    -t {threads}
```

- `{project.folder}`: The project folder where you run your workflow and store results.
- `{name}`: The name of your study; no spaces allowed.
- `{reads}`: The folder that contains your raw .fastq.gz files. Raw read files must adhere to the naming scheme described [here](https://help.basespace.illumina.com/files-used-by-basespace/fastq-files#naming){target="_blank"}.
- `{threads}`: The default number of threads the workflow steps may use (stored in **config.yaml**).

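A filled-in invocation might look like this (hypothetical study name and paths):

``` bash
singularity exec \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  python /prepare_project.py \
    -p ~/projects/resp_virome \
    -n resp_virome \
    -r ~/data/run01_fastq \
    -t 8
```
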
::: callout-important
**The `--bind` arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container.** The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.

By default, Singularity often automatically binds your home directory (`$HOME`) and the current directory (`$PWD`). We also explicitly bind `/mnt/viro0002-data` in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add specific `--bind /host/path:/container/path` options for those locations, otherwise the container will not be able to find them.
:::
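
For example, to make an additional (hypothetical) data location visible inside the container, add a matching bind flag to every `singularity` command you run:

``` bash
# /scratch/seqdata is a placeholder host path; bind it to the same path inside the container
singularity exec \
  --bind /scratch/seqdata:/scratch/seqdata \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  ls /scratch/seqdata
```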

::: callout-note
When **prepare_project.py** prompts for the **reference genome** and **diamond database** paths, you must enter the absolute host paths, and these paths must be accessible via one of the bind mounts.

It will also ask whether you want to create a raw_data/ folder with softlinks to your raw fastq.gz files. This is not required for running the workflow, but it can be convenient to have softlinks to your raw data available in your project directory.
:::

After running the prepare_project.py helper script, you should have the following files in your project directory:

- The **sample.tsv** should have 3 columns: sample (sample name), fq1 and fq2 (paths to raw read files). Note that samples sequenced on Illumina machines can be run across different lanes. In such cases, the Illumina software generates multiple lane-specific fastq files per sample (e.g. L001 = lane 1, etc.). You may therefore end up with a sample.tsv that contains entries like `1_S1_L001` and `1_S1_L002`, even though these are the same sample, just sequenced across different lanes. The Snakemake workflow recognizes this and merges the files accordingly; a hypothetical example is shown after this list.

- The **config.yaml** contains more general information, such as the indexed reference and database you supplied, as well as the default number of threads to use.

- The **Snakefile** is the "recipe" for the workflow, describing all the steps that the manual chapters walk through by hand. It is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).
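
A hypothetical `sample.tsv` for one sample sequenced across two lanes (tab-separated; paths are placeholders):

```
sample	fq1	fq2
1_S1_L001	/data/run01/1_S1_L001_R1_001.fastq.gz	/data/run01/1_S1_L001_R2_001.fastq.gz
1_S1_L002	/data/run01/1_S1_L002_R1_001.fastq.gz	/data/run01/1_S1_L002_R2_001.fastq.gz
```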

## 1.5 Executing the Pipeline {.unnumbered}

Once the project directory is initialized, navigate into it and run the workflow.

1. Navigate to the project:

``` bash
cd {project.folder}
```

This folder should contain your `Snakefile`, `sample.tsv` and `config.yaml` files, which were generated during **step 1.4**.

2. Dry run (optional but recommended): check for errors without executing any commands.

``` bash
singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  snakemake --snakefile Snakefile --cores 1 --dryrun
```

3. Run the workflow: remove `--dryrun` and set the number of threads.

``` bash
singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  snakemake --snakefile Snakefile --cores {threads}
```
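
For long runs on a remote system, you may want the workflow to survive a dropped SSH connection. A minimal sketch using plain `nohup` (your HPC may instead provide a job scheduler such as SLURM, which would be preferable):

``` bash
nohup singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind $HOME:$HOME \
  --bind $PWD:$PWD \
  imam_workflow.sif \
  snakemake --snakefile Snakefile --cores {threads} \
  > snakemake.log 2>&1 &
```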

posts/IMAM-03-quality_control.qmd

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,6 @@
- # 2. Quality control {.unnumbered}
+ # 2. Manual: Quality Control {.unnumbered}
+
+ The following chapters showcase how the workflow works and guide users through manually running each step of the workflow.

  ::: callout-warning
  # Important!

posts/IMAM-04-assembly.qmd

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
- # 3. De novo assembly {.unnumbered}
+ # 3. Manual: De novo assembly {.unnumbered}

  ## 3.1 metaSPAdes {.unnumbered}

posts/IMAM-05-annotation.qmd

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
- # 4. Taxonomic classification {.unnumbered}
+ # 4. Manual: Taxonomic classification {.unnumbered}

  ## 4.1 Diamond {.unnumbered}

posts/IMAM-06-parse_annotation.qmd

Lines changed: 2 additions & 6 deletions
@@ -1,4 +1,4 @@
- # 5. Extracting Viral Sequences and Analyzing Mapped Reads {.unnumbered}
+ # 5. Manual: Extracting Viral Sequences and Analyzing Mapped Reads {.unnumbered}

  We will conclude the pipeline with various steps that create valuable data files for further downstream analysis.

@@ -83,8 +83,4 @@ seqkit grep -f <(cut -f1 {input.viral}) {input.contigs} > {output.viral}

  - `{input.annotated}`, `{input.unannotated}` and `{input.viral}` are .tsv annotation files from **steps 4.3 and 5.1**.
  - `{input.contigs}` is the .fasta file containing all contigs for a sample from **step 3.1**.
- - `{output.annotated}`, `{output.unannotated}` and `{output.viral}` are .fasta files containing the contigs.
-
- ::: {.callout-note}
- You can now move to the final chapter to automate all of the steps we've previously discussed.
- :::
+ - `{output.annotated}`, `{output.unannotated}` and `{output.viral}` are .fasta files containing the contigs.
