Commit d0218ac: Merge pull request #73 from marinak-ebi/slurm-instructions

Update instructions for running on SLURM

2 parents: c2cf5ce + 0eef272
2 files changed: 72 additions, 65 deletions

Late adults stats pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/R/sideFunctions.R
(2 additions, 13 deletions)

```diff
@@ -4556,7 +4556,7 @@ waitTillCommandFinish = function(checkcommand = 'squeue --format="%A %.30j"',

 submit_limit_jobs = function(bch_file,
                              job_id_logfile,
-                             max_jobs=400) {
+                             max_jobs=600) {
   message0("Start submit_limit_jobs")
   system(paste("echo > ", job_id_logfile))
   file <- file(bch_file, "r")
@@ -4566,7 +4566,7 @@ submit_limit_jobs = function(bch_file,
     if(num_running <= max_jobs) {
       break
     }
-    Sys.sleep(1)
+    Sys.sleep(0.5)
   }
   job_id <- system(command, wait=TRUE, intern = TRUE)
   system(paste("echo '", job_id, "' >> ", job_id_logfile))
@@ -4748,7 +4748,6 @@ StatsPipeline = function(path = getwd(),
                          containWhat = 'Exit'))
     stop('An error occured in step 2. Parquet2Rdata conversion')

-  system(command = "mkdir ../compressed_logs", wait = TRUE)
   system(command = "find ../ -type f -name '*.log' -exec zip -m ../compressed_logs/step2_logs.zip {} +", wait = TRUE)
   system(command = "find ../ -type f -name '*.err' -exec zip -m ../compressed_logs/step2_logs.zip {} +", wait = TRUE)

@@ -4815,20 +4814,10 @@ StatsPipeline = function(path = getwd(),
   ## Compress logs
   message0('End of packaging data. ')
   message0('Phase II. Compressing the log files and house cleaning ... ')
-  system(command = 'mv *.R DataGeneratingLog/', wait = TRUE)
   system(command = 'mv *.bch DataGeneratingLog/', wait = TRUE)
   system(command = 'zip -rm phase2_logs.zip DataGeneratingLog/', wait = TRUE)
   system(command = 'mv phase2_logs.zip ../compressed_logs/', wait = TRUE)

-  ## remove logs
-  message0('Removing the log files prior to the run of the statistical anlyses ...')
-  system(command = 'find ./*/*_RawData/ClusterErr/ -name *ClusterErr -type f |xargs rm',
-         ignore.stdout = TRUE,
-         wait = TRUE)
-  system(command = 'find ./*/*_RawData/ClusterOut/ -name *ClusterOut -type f |xargs rm',
-         ignore.stdout = TRUE,
-         wait = TRUE)
-
   ## Add all single jobs into one single job
   message0('Appending all procedure based jobs into one single file ...')
   if (!dir.exists('jobs'))
```
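The change to `submit_limit_jobs` raises the concurrency cap from 400 to 600 jobs and halves the polling sleep. The throttling pattern itself is simple: before each submission, poll the queue and wait until the number of queued jobs drops to the cap. Below is a minimal standalone sketch of that pattern in Python; `submit` and `count_running` are hypothetical stand-ins for `sbatch` and `squeue` (the real implementation is the R function shown in the diff above):

```python
import time


def submit_limit_jobs(commands, submit, count_running, max_jobs=600,
                      poll_interval=0.5, sleep=time.sleep):
    """Submit each command, but never keep more than max_jobs queued.

    Mirrors the throttling loop in submit_limit_jobs: poll the queue,
    sleep poll_interval seconds while it is over the cap, then submit.
    """
    job_ids = []
    for command in commands:
        # Throttle: wait until the queue drains to max_jobs or fewer.
        while count_running() > max_jobs:
            sleep(poll_interval)
        # Submit and record the returned job id (the R code appends
        # the id to a logfile instead of collecting it in memory).
        job_ids.append(submit(command))
    return job_ids
```

Injecting `submit`, `count_running`, and `sleep` as callables keeps the sketch testable without a live SLURM queue; in production they would shell out to `sbatch` and `squeue`.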

README.md
(70 additions, 52 deletions)

````diff
@@ -78,91 +78,109 @@ flowchart TB
     classDef title font-size:30px
     class stats_pipeline title
 ```
-# How to run IMPC Statistical Pipeline
-Instructions are made for release 20.2.
+# How to Run IMPC Statistical Pipeline
+These instructions are tailored for Release 21.0.

 ## Step 1. Data Preprocessing and Analysis
 ### Preparation
-0. We work under mi_stats virtual user:
-`become mi_stats`
-
-1. Create working directory.
-```console
-mkdir --mode=775 ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
+0. Start screen
+```
+screen -S stats-pipeline
 ```

-2. Copy the input parquet files (±80*10^6 data points) and mp_chooser_json.
+1. Switch to the mi_stats virtual user:
 ```console
-cp ${KOMP_PATH}/data-releases/latest-input/dr20.0/output/flatten_observations_parquet/*.parquet ./
-cp ${KOMP_PATH}/data-releases/latest-input/dr20.0/output/mp_chooser_json/part-*.txt ./
+become mi_stats
 ```
-We copied files from 20.0 release, but next time the location of the input files could differ.<br>
-According to [Observations Output Schema](https://github.com/mpi2/impc-etl/wiki/Observations-Output-Schema) some fields hava array data type. However in current dataset those fields, instead of being array, are comma-separated lists.

-3. Convert JSON mp_chooser file to Rdata.
+2. Set necessary variables:
 ```console
-R -e "a = jsonlite::fromJSON('part-00000-b2483dca-4c84-4c90-a79b-e97df8c95091-c000.txt');save(a,file='mp_chooser_20230411.json.Rdata')"
+export VERSION="21.0"
+export REMOTE="mpi2"
+export BRANCH="master"
+export KOMP_PATH="<absolute_path_to_directory>"
 ```
-**Note:** we kept the name of the mp_chooser file exactly as mp_chooser_20230411.json.Rdata, because it is used on the code.

-4. Clone impc_stats_pipeline repository.
+3. Create a working directory:
 ```console
-cd /tmp
-git clone https://github.com/mpi2/impc_stats_pipeline.git
+mkdir --mode=775 ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
 ```

-5. Update mp_chooser file in several directories.
+4. Copy the input parquet files (±80*10^6 data points) and `mp_chooser_json`:
 ```console
-cp ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/mp_chooser_20230411.json.Rdata /tmp/impc_stats_pipeline/Late\ adults\ stats\ pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/annotation/
-cp ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/mp_chooser_20230411.json.Rdata /tmp/impc_stats_pipeline/Late\ adults\ stats\ pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/jobs/Postgres
+cp ${KOMP_PATH}/data-releases/latest-input/dr${VERSION}/output/flatten_observations_parquet/*.parquet ./
+cp ${KOMP_PATH}/data-releases/latest-input/dr${VERSION}/output/mp_chooser_json/part-*.txt ./mp_chooser.json
 ```
+**Note:** Be cautious, the location of the input files may vary.<br>
+Refer to the [Observations Output Schema](https://github.com/mpi2/impc-etl/wiki/Observations-Output-Schema). In the current dataset, some fields that should be arrays are presented as comma-separated lists.

-6. Update master branch of the repository on GitHub with the new version of mp_chooser.
+5. Convert the mp_chooser JSON file to Rdata:
 ```console
-git add mp_chooser_20230411.json.Rdata
-git commit -m "Replace an mp_chooser"
-git push origin master
+R -e "a = jsonlite::fromJSON('mp_chooser.json');save(a,file='mp_chooser.json.Rdata')"
+export MP_CHOOSER_FILE=$(echo -n '"'; realpath mp_chooser.json.Rdata | tr -d '\n'; echo -n '"')
 ```
-**Note:** Login with [credentials](https://www.ebi.ac.uk/seqdb/confluence/display/MouseInformatics/GitHub+Machine+User) using personal access token for impc-stats-pipeline repo.

-7. Update packages to the latest version.
+6. Update packages to the latest version:
 ```console
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
-wget https://raw.githubusercontent.com/mpi2/impc_stats_pipeline/dev/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
-Rscript UpdatePackagesFromGithub.R mpi2 master
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+wget https://raw.githubusercontent.com/${REMOTE}/impc_stats_pipeline/${BRANCH}/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
+Rscript UpdatePackagesFromGithub.R ${REMOTE} ${BRANCH}
 rm UpdatePackagesFromGithub.R
 ```

 ### Run Statistical Pipeline
-8. Start screen.
+7. Execute the `StatsPipeline` function on SLURM:
 ```console
-cd ~
-screen -S stats-pipeline
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+sbatch \
+  --time=30-00:00:00 \
+  --mem=8G \
+  -o ../stats_pipeline_logs/stats_pipeline_${VERSION}.log \
+  -e ../stats_pipeline_logs/stats_pipeline_${VERSION}.err \
+  --wrap="R -e 'DRrequiredAgeing:::StatsPipeline(DRversion=${VERSION})'"
 ```
+**Note:** Remember to note down the job ID number that will appear after submitting the job.

-9. Run statistical pipeline.
-```console
-alias bsub-1gb='bsub -q long -R "select[type==X86_64 && mem > 1000] rusage[mem=1000]" -M1000'
-bsub-1gb -o ../stats_pipeline_logs/stats_pipeline_20.2.log -e ../stats_pipeline_logs/stats_pipeline_20.2.err R -e 'DRrequiredAgeing:::StatsPipeline(DRversion=20.2)'
-```
-- To leave screen press combination `Ctrl + A + D`.
+- To leave screen, press combination `Ctrl + A + D`.
 - Don't forget to write down the number that will appear after leaving the screen, for example, 3507472, and number of cluster node.
+- Also make sure to remember which login node you started the screen session on.

-10. Check progress with this commands as an example.
-- To log in on specific node:
-`ssh codon-login-01`
-- Activate screen to check progress:
-`screen -r 3507472.stats-pipeline`
+8. Monitor progress using the following commands:
+- Activate screen to check progress: `screen -r 3507472.stats-pipeline`
+- Use `squeue` to check job status.
+- Review the log files:
+```console
+less ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_logs/stats_pipeline_${VERSION}.log
+less ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_logs/stats_pipeline_${VERSION}.err
+```

 ## Step 2. Run Annotation Pipeline
-The `IMPC_HadoopLoad` command uses the power of LSF cluster to assign the annotations to the StatPackets and transfers the files to the Hadoop cluster. The files will be transferred to Hadoop:/hadoop/user/mi_stats/impc/statpackets/DRXX.
+The `IMPC_HadoopLoad` command uses the power of cluster to assign the annotations to the StatPackets and transfers the files to the Hadoop cluster. The files will be transferred to Hadoop:/hadoop/user/mi_stats/impc/statpackets/DRXX.
+1. Reconnect to screen session
+Make sure to connect to the same login node you used to start the screen session.
 ```console
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/SP/jobs/Results_IMPC_SP_Windowed
-alias bsub-1gb='bsub -q long -R "select[type==X86_64 && mem > 1000] rusage[mem=1000]" -M1000'
-bsub-1gb -o ../stats_pipeline_logs/annotation_pipeline_20.2.log -e ../stats_pipeline_logs/annotation_pipeline_20.2.err R -e 'DRrequiredAgeing:::IMPC_HadoopLoad(prefix="DR20.2",transfer=FALSE)'
+screen -r 3507472.stats-pipeline
 ```
+
+2. Update packages to the latest version:
+```console
+wget https://raw.githubusercontent.com/${REMOTE}/impc_stats_pipeline/${BRANCH}/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
+Rscript UpdatePackagesFromGithub.R ${REMOTE} ${BRANCH}
+rm UpdatePackagesFromGithub.R
+```
+
+3. Run annotation pipeline without exporting it to Hadoop: `transfer=FALSE`
+```console
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}/SP/jobs/Results_IMPC_SP_Windowed
+sbatch \
+  --time=3-00:00:00 \
+  --mem=8G \
+  -o ../../../../stats_pipeline_logs/annotation_pipeline_${VERSION}.log \
+  -e ../../../../stats_pipeline_logs/annotation_pipeline_${VERSION}.err \
+  --wrap="R -e 'DRrequiredAgeing:::IMPC_HadoopLoad(prefix=${VERSION},transfer=FALSE,mp_chooser_file=${MP_CHOOSER_FILE})'"
+```
+
 - The most complex part of this process is that some files will fail to transfer and you need to use scp command to transfer files to the Hadoop cluster manually.
 - When you are sure that all files are there, you can share the path with Federico.
 **Note**: in the slides transfer=TRUE, which means we haven't transfered files this time.
````
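The new `sbatch` invocations interpolate shell variables directly into the R expression passed to `--wrap`, which is why the `MP_CHOOSER_FILE` export wraps the absolute path in literal double quotes: R must receive the path as a quoted string. The sketch below mimics that assembly in Python; `quoted_r_path` and `hadoop_load_sbatch` are hypothetical helper names, not part of the pipeline, and only illustrate the quoting under the flags shown in the README diff:

```python
import shlex


def quoted_r_path(path: str) -> str:
    """Wrap an absolute path in literal double quotes, as the
    MP_CHOOSER_FILE export does, so R receives it as a string."""
    return '"' + path + '"'


def hadoop_load_sbatch(version: str, mp_chooser_file: str,
                       log_dir: str = "../../../../stats_pipeline_logs"):
    """Assemble the sbatch argument list for the annotation step."""
    # The R expression that ends up inside --wrap="R -e '...'".
    r_expr = (
        "DRrequiredAgeing:::IMPC_HadoopLoad("
        f"prefix={version},transfer=FALSE,"
        f"mp_chooser_file={quoted_r_path(mp_chooser_file)})"
    )
    return [
        "sbatch",
        "--time=3-00:00:00",
        "--mem=8G",
        "-o", f"{log_dir}/annotation_pipeline_{version}.log",
        "-e", f"{log_dir}/annotation_pipeline_{version}.err",
        # shlex.quote single-quotes the expression, matching the
        # inner quoting of the --wrap argument in the README.
        f"--wrap=R -e {shlex.quote(r_expr)}",
    ]
```

Keeping the path double-quoted inside the single-quoted R expression is the whole trick: without the literal quotes, R would parse the bare path as code rather than a string.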
````diff
@@ -175,7 +193,7 @@ This process generates statistical reports typically utilized by the IMPC workin
 3. The commands below will generate two CSV files in the `${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_drXX.y/SP/jobs/Results_IMPC_SP_Windowed` directory for the unidimentional and categorical results. The files can be gzip and moved to the FTP directory. You can decorate and format the files by using one of the formatted files in the previous data releases.
 ```console
 R
-DRrequiredAgeing:::IMPC_statspipelinePostProcess()
+DRrequiredAgeing:::IMPC_statspipelinePostProcess(mp_chooser_file=${MP_CHOOSER_FILE})
 DRrequiredAgeing:::ClearReportsAfterCreation()
 ```
````