| 1 | +# 1. Quick Start {.unnumbered} |
| 2 | + |
| 3 | +This chapter guides you through setting up the environment and executing the automated pipeline. |
| 4 | + |
| 5 | +::: callout-warning |
| 6 | +# Important! |
| 7 | + |
| 8 | +In the following sections, whenever a **"parameter"** appears in braces `{}`, replace it with your own filename or value. Each parameter is explained in detail in the section where it appears. |
| 9 | +::: |
| 10 | + |
| 11 | +::: callout-tip |
| 12 | +Notice the small *"Copy to Clipboard"* button on the right-hand side of each code chunk; you can use it to copy the code. |
| 13 | +::: |
| 14 | + |
| 15 | +## 1.1 Singularity Setup {.unnumbered} |
| 16 | + |
| 17 | +The workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts. |
| 18 | + |
| 19 | +**Prerequisites:** |
| 20 | +[Singularity/Apptainer](https://docs.sylabs.io/guides/latest/user-guide/){target="_blank"} version 3.x or later must be installed on your system. If you are working with a High Performance Computing (HPC) system, it is likely already installed. Run `singularity --help` in a terminal connected to the HPC system to check whether the command is recognized. |
| 21 | + |
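The check described above can be scripted. The following is a minimal sketch, not part of the workflow itself: the function name `find_container_runtime` is made up for this example, and it assumes that the runtime is installed under the command name `singularity` or `apptainer`.

```bash
# Hypothetical helper: print the first container runtime found on PATH.
find_container_runtime() {
    local candidate
    for candidate in singularity apptainer; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Neither command was found.
    return 1
}

if runtime=$(find_container_runtime); then
    echo "Found container runtime: $runtime"
    "$runtime" --version
else
    echo "Neither singularity nor apptainer was found on PATH." >&2
fi
```

If only `apptainer` is reported, the `singularity` commands in this chapter may still work, since Apptainer installations commonly provide a `singularity` alias, but this depends on how your system was set up.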
| 22 | +## 1.2 Download the Image {.unnumbered} |
| 23 | + |
| 24 | +Download the required workflow image file (`imam_workflow.sif`) directly through the terminal: |
| 25 | + |
| 26 | +``` bash |
| 27 | +wget https://github.com/LucvZon/illumina-metagenomic-analysis-manual/releases/latest/download/imam_workflow.sif |
| 28 | +``` |
| 29 | + |
| 30 | +Alternatively, go to the [GitHub releases page](https://github.com/LucvZon/illumina-metagenomic-analysis-manual/releases/latest){target="_blank"}, download the file manually, and then transfer it to your HPC system. |
| 31 | + |
| 32 | +## 1.3 Verify Container {.unnumbered} |
| 33 | + |
| 34 | +You can verify the download by checking the container version or starting an interactive shell. |
| 35 | + |
| 36 | +``` bash |
| 37 | +# Check version |
| 38 | +singularity run imam_workflow.sif --version |
| 39 | + |
| 40 | +# Start interactive shell |
| 41 | +singularity shell imam_workflow.sif |
| 42 | +``` |
| 43 | + |
| 44 | +`singularity shell imam_workflow.sif` will drop you into a shell running inside the container. The conda environment needed for this workflow is automatically active on start-up of the interactive shell. All the tools of the conda environment will therefore be ready to use. |
| 45 | + |
| 46 | +Note that you do not need to run `conda activate {environment}` to activate the environment; everything is inside `imam_workflow.sif`. If you're curious about the conda environment we're using, you can check it out [here](https://github.com/LucvZon/illumina-metagenomic-analysis-manual/blob/main/envs/environment.yml){target="_blank"}. |
| 47 | + |
| 48 | +## 1.4 Project Setup |
| 49 | + |
| 50 | +We use a tool called [Snakemake](https://snakemake.readthedocs.io/en/stable/){target="_blank"} to automate the analysis. To simplify the creation of the required project directory, the container includes a helper script called `prepare_project.py`. This script automates the creation of the project directory, the sample configuration file (`sample.tsv`), and the general settings configuration file (`config.yaml`), guiding you through each step with clear prompts and error checking. |
| 51 | + |
| 52 | +### Required files |
| 53 | + |
| 54 | +Before running the setup script, ensure you have the following two files ready: |
| 55 | + |
| 56 | +1. `Diamond DB`: A DIAMOND database that will be used to annotate assembled contigs. |
| 57 | +2. `Reference genome`: An indexed FASTA file that will be used as a host filter for your reads. |
| 58 | + |
| 59 | +### Initializing the Project Directory |
| 60 | + |
| 61 | +Use `singularity exec` to run the setup script. This will create your project folder and generate the **sample.tsv**, **config.yaml** and **Snakefile** required for the pipeline. |
| 62 | + |
| 63 | +``` bash |
| 64 | +singularity exec \ |
| 65 | + --bind /mnt/viro0002-data:/mnt/viro0002-data \ |
| 66 | + --bind $HOME:$HOME \ |
| 67 | + --bind $PWD:$PWD \ |
| 68 | + imam_workflow.sif \ |
| 69 | + python /prepare_project.py \ |
| 70 | + -p {project.folder} \ |
| 71 | + -n {name} \ |
| 72 | + -r {reads} \ |
| 73 | + -t {threads} |
| 74 | +``` |
| 75 | + |
| 76 | +- `{name}`: The name of your study, no spaces allowed. |
| 77 | +- `{project.folder}`: The project folder where you run your workflow and store results. |
| 78 | +- `{reads}`: The folder that contains your raw .fastq.gz files. Raw read files must adhere to the naming scheme described [here](https://help.basespace.illumina.com/files-used-by-basespace/fastq-files#naming){target="_blank"}. |
|  | +- `{threads}`: The default number of threads for the workflow to use; this value is stored in `config.yaml`. |
| 79 | + |
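Before running the setup script, it can help to check that your read files follow the Illumina naming scheme (`{SampleName}_S1_L001_R1_001.fastq.gz`). The sketch below is an illustration only: the function name `is_illumina_fastq_name` and the path in `reads_dir` are hypothetical, and the pattern is a simplified reading of the naming scheme, not the check that `prepare_project.py` itself performs.

```bash
# Hypothetical helper: succeed if a file name looks like
# {SampleName}_S<n>_L<lane>_R<read>_001.fastq.gz
is_illumina_fastq_name() {
    printf '%s\n' "$1" |
        grep -Eq '^.+_S[0-9]+_L[0-9]{3}_[RI][12]_001\.fastq\.gz$'
}

# Placeholder path: replace with your own {reads} folder.
reads_dir="/path/to/reads"

for f in "$reads_dir"/*.fastq.gz; do
    # Skip the unexpanded pattern when the folder has no matching files.
    [ -e "$f" ] || continue
    if ! is_illumina_fastq_name "$(basename "$f")"; then
        echo "Unexpected file name: $f"
    fi
done
```

Any file reported here would need to be renamed before you run the setup script.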
| 80 | +::: callout-important |
| 81 | +**The `--bind` arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container.** The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted. |
| 82 | + |
| 83 | +By default, Singularity usually binds your home directory (`$HOME`) and the current directory (`$PWD`) automatically. We also explicitly bind `/mnt/viro0002-data` in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add a `--bind /host/path:/container/path` option for each of those locations, otherwise the container won’t be able to find them. |
| 84 | +::: |
| 85 | + |
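If you need the same bind mounts for every command, Singularity can also read them from the `SINGULARITY_BIND` environment variable (Apptainer uses `APPTAINER_BIND`) as a comma-separated list. The sketch below shows one way to build that list. The helper `build_bind_list` and the path `/path/to/databases` are made up for this example. An entry without a colon is mounted at the same path inside the container.

```bash
# Hypothetical helper: join all arguments with commas.
build_bind_list() {
    local IFS=,
    printf '%s\n' "$*"
}

# /path/to/databases is a placeholder for an extra location you need.
SINGULARITY_BIND=$(build_bind_list /mnt/viro0002-data "$HOME" "$PWD" /path/to/databases)
export SINGULARITY_BIND

echo "Bind paths: $SINGULARITY_BIND"
```

With the variable exported in your shell session, the explicit `--bind` options can be left out of later `singularity exec` calls. Note that `$PWD` is expanded when the variable is set, so set it again after changing directories.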
| 86 | +::: callout-note |
| 87 | +When **prepare_project.py** prompts for the **reference genome** and **diamond database** paths, you must enter the absolute host paths, and these paths must be accessible via one of the bind mounts. |
| 88 | + |
| 89 | +The script also asks whether you want to create a raw_data/ folder with softlinks to your raw fastq.gz files. This is not required for running the workflow, but it can be convenient to have softlinks to your raw data available in your project directory. |
| 90 | +::: |
| 91 | + |
| 92 | +After running the prepare_project.py helper script, you should have the following files in your project directory: |
| 93 | + |
| 94 | +- The **sample.tsv** should have 3 columns: sample (sample name), fq1 and fq2 (paths to raw read files). Note that a sample sequenced on an Illumina machine can be run across several lanes. In that case, the Illumina software generates separate, lane-specific fastq files for the sample (e.g. L001 = Lane 1). Your sample.tsv may therefore contain entries like `1_S1_L001` and `1_S1_L002`, even though both belong to the same sample. The Snakemake workflow recognizes this and merges these files accordingly. |
| 95 | + |
| 96 | +- The **config.yaml** contains more general information, such as the indexed reference and database you supplied and the default number of threads to use. |
| 97 | + |
| 98 | +- The **Snakefile** is the "recipe" for the workflow, describing all the steps we have done by hand, and it is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look). |
| 99 | + |
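To see why entries such as `1_S1_L001` and `1_S1_L002` end up as one sample, look at what remains once the lane tag is removed. The sketch below only illustrates that idea; the function `strip_lane` is hypothetical and is not the logic used in the Snakefile.

```bash
# Hypothetical helper: remove a trailing lane tag (_L followed by three digits).
strip_lane() {
    printf '%s\n' "$1" | sed -E 's/_L[0-9]{3}$//'
}

for name in 1_S1_L001 1_S1_L002; do
    echo "$name -> $(strip_lane "$name")"
done
```

Both entries reduce to `1_S1`, which is the sense in which they are the same sample.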
| 100 | +## 1.5 Executing the Pipeline |
| 101 | + |
| 102 | +Once the project directory is initialized, navigate into it and run the workflow. |
| 103 | + |
| 104 | +1. Navigate to the project: |
| 105 | +```bash |
| 106 | +cd {project.folder} |
| 107 | +``` |
| 108 | + |
| 109 | +This folder should contain your `Snakefile`, `sample.tsv` and `config.yaml` files, which were generated during **step 1.4**. |
| 110 | + |
| 111 | +2. Dry Run (Optional but Recommended): |
| 112 | +Check for errors without executing commands. |
| 113 | +```bash |
| 114 | +singularity exec \ |
| 115 | + --bind /mnt/viro0002-data:/mnt/viro0002-data \ |
| 116 | + --bind $HOME:$HOME \ |
| 117 | + --bind $PWD:$PWD \ |
| 118 | + imam_workflow.sif \ |
| 119 | + snakemake --snakefile Snakefile --cores 1 --dryrun |
| 120 | +``` |
| 121 | + |
| 122 | +3. Run the workflow: |
| 123 | +Remove `--dryrun` and set the number of cores with `--cores {threads}`. |
| 124 | +```bash |
| 125 | +singularity exec \ |
| 126 | + --bind /mnt/viro0002-data:/mnt/viro0002-data \ |
| 127 | + --bind $HOME:$HOME \ |
| 128 | + --bind $PWD:$PWD \ |
| 129 | + imam_workflow.sif \ |
| 130 | + snakemake --snakefile Snakefile --cores {threads} |
| 131 | +``` |