Commit d0218ac: Merge pull request #73 from marinak-ebi/slurm-instructions

Update instructions for running on SLURM

2 parents: c2cf5ce + 0eef272
2 files changed: 72 additions, 65 deletions

Late adults stats pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/R/sideFunctions.R
(2 additions, 13 deletions)

```diff
@@ -4556,7 +4556,7 @@ waitTillCommandFinish = function(checkcommand = 'squeue --format="%A %.30j"',

 submit_limit_jobs = function(bch_file,
                              job_id_logfile,
-                             max_jobs=400) {
+                             max_jobs=600) {
   message0("Start submit_limit_jobs")
   system(paste("echo > ", job_id_logfile))
   file <- file(bch_file, "r")
@@ -4566,7 +4566,7 @@ submit_limit_jobs = function(bch_file,
     if(num_running <= max_jobs) {
       break
     }
-    Sys.sleep(1)
+    Sys.sleep(0.5)
   }
   job_id <- system(command, wait=TRUE, intern = TRUE)
   system(paste("echo '", job_id, "' >> ", job_id_logfile))
@@ -4748,7 +4748,6 @@ StatsPipeline = function(path = getwd(),
                          containWhat = 'Exit'))
     stop('An error occured in step 2. Parquet2Rdata conversion')

-  system(command = "mkdir ../compressed_logs", wait = TRUE)
   system(command = "find ../ -type f -name '*.log' -exec zip -m ../compressed_logs/step2_logs.zip {} +", wait = TRUE)
   system(command = "find ../ -type f -name '*.err' -exec zip -m ../compressed_logs/step2_logs.zip {} +", wait = TRUE)

@@ -4815,20 +4814,10 @@ StatsPipeline = function(path = getwd(),
   ## Compress logs
   message0('End of packaging data. ')
   message0('Phase II. Compressing the log files and house cleaning ... ')
-  system(command = 'mv *.R DataGeneratingLog/', wait = TRUE)
   system(command = 'mv *.bch DataGeneratingLog/', wait = TRUE)
   system(command = 'zip -rm phase2_logs.zip DataGeneratingLog/', wait = TRUE)
   system(command = 'mv phase2_logs.zip ../compressed_logs/', wait = TRUE)

-  ## remove logs
-  message0('Removing the log files prior to the run of the statistical anlyses ...')
-  system(command = 'find ./*/*_RawData/ClusterErr/ -name *ClusterErr -type f |xargs rm',
-         ignore.stdout = TRUE,
-         wait = TRUE)
-  system(command = 'find ./*/*_RawData/ClusterOut/ -name *ClusterOut -type f |xargs rm',
-         ignore.stdout = TRUE,
-         wait = TRUE)
-
   ## Add all single jobs into one single job
   message0('Appending all procedure based jobs into one single file ...')
   if (!dir.exists('jobs'))
```
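The change to `submit_limit_jobs` raises the concurrency cap from 400 to 600 jobs and halves the polling sleep. The throttling pattern itself is simple: before each submission, poll the queue and wait until the number of queued jobs drops to the cap. Below is a minimal standalone sketch of that pattern in Python; `submit` and `count_running` are hypothetical stand-ins for `sbatch` and `squeue` (the real implementation is the R function shown in the diff above):

```python
import time


def submit_limit_jobs(commands, submit, count_running, max_jobs=600,
                      poll_interval=0.5, sleep=time.sleep):
    """Submit each command, but never keep more than max_jobs queued.

    Mirrors the throttling loop in submit_limit_jobs: poll the queue,
    sleep poll_interval seconds while it is over the cap, then submit.
    """
    job_ids = []
    for command in commands:
        # Throttle: wait until the queue drains to max_jobs or fewer.
        while count_running() > max_jobs:
            sleep(poll_interval)
        # Submit and record the returned job id (the R code appends
        # the id to a logfile instead of collecting it in memory).
        job_ids.append(submit(command))
    return job_ids
```

Injecting `submit`, `count_running`, and `sleep` as callables keeps the sketch testable without a live SLURM queue; in production they would shell out to `sbatch` and `squeue`.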

README.md
(70 additions, 52 deletions)

````diff
@@ -78,91 +78,109 @@ flowchart TB
     classDef title font-size:30px
     class stats_pipeline title
 ```
-# How to run IMPC Statistical Pipeline
-Instructions are made for release 20.2.
+# How to Run IMPC Statistical Pipeline
+These instructions are tailored for Release 21.0.

 ## Step 1. Data Preprocessing and Analysis
 ### Preparation
-0. We work under mi_stats virtual user:
-`become mi_stats`
-
-1. Create working directory.
-```console
-mkdir --mode=775 ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
+0. Start screen
+```
+screen -S stats-pipeline
 ```

-2. Copy the input parquet files (±80*10^6 data points) and mp_chooser_json.
+1. Switch to the mi_stats virtual user:
 ```console
-cp ${KOMP_PATH}/data-releases/latest-input/dr20.0/output/flatten_observations_parquet/*.parquet ./
-cp ${KOMP_PATH}/data-releases/latest-input/dr20.0/output/mp_chooser_json/part-*.txt ./
+become mi_stats
 ```
-We copied files from 20.0 release, but next time the location of the input files could differ.<br>
-According to [Observations Output Schema](https://github.com/mpi2/impc-etl/wiki/Observations-Output-Schema) some fields hava array data type. However in current dataset those fields, instead of being array, are comma-separated lists.

-3. Convert JSON mp_chooser file to Rdata.
+2. Set necessary variables:
 ```console
-R -e "a = jsonlite::fromJSON('part-00000-b2483dca-4c84-4c90-a79b-e97df8c95091-c000.txt');save(a,file='mp_chooser_20230411.json.Rdata')"
+export VERSION="21.0"
+export REMOTE="mpi2"
+export BRANCH="master"
+export KOMP_PATH="<absolute_path_to_directory>"
 ```
-**Note:** we kept the name of the mp_chooser file exactly as mp_chooser_20230411.json.Rdata, because it is used on the code.

-4. Clone impc_stats_pipeline repository.
+3. Create a working directory:
 ```console
-cd /tmp
-git clone https://github.com/mpi2/impc_stats_pipeline.git
+mkdir --mode=775 ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
 ```

-5. Update mp_chooser file in several directories.
+4. Copy the input parquet files (±80*10^6 data points) and `mp_chooser_json`:
 ```console
-cp ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/mp_chooser_20230411.json.Rdata /tmp/impc_stats_pipeline/Late\ adults\ stats\ pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/annotation/
-cp ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/mp_chooser_20230411.json.Rdata /tmp/impc_stats_pipeline/Late\ adults\ stats\ pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/jobs/Postgres
+cp ${KOMP_PATH}/data-releases/latest-input/dr${VERSION}/output/flatten_observations_parquet/*.parquet ./
+cp ${KOMP_PATH}/data-releases/latest-input/dr${VERSION}/output/mp_chooser_json/part-*.txt ./mp_chooser.json
 ```
+**Note:** Be cautious, the location of the input files may vary.<br>
+Refer to the [Observations Output Schema](https://github.com/mpi2/impc-etl/wiki/Observations-Output-Schema). In the current dataset, some fields that should be arrays are presented as comma-separated lists.

-6. Update master branch of the repository on GitHub with the new version of mp_chooser.
+5. Convert the mp_chooser JSON file to Rdata:
 ```console
-git add mp_chooser_20230411.json.Rdata
-git commit -m "Replace an mp_chooser"
-git push origin master
+R -e "a = jsonlite::fromJSON('mp_chooser.json');save(a,file='mp_chooser.json.Rdata')"
+export MP_CHOOSER_FILE=$(echo -n '"'; realpath mp_chooser.json.Rdata | tr -d '\n'; echo -n '"')
 ```
-**Note:** Login with [credentials](https://www.ebi.ac.uk/seqdb/confluence/display/MouseInformatics/GitHub+Machine+User) using personal access token for impc-stats-pipeline repo.

-7. Update packages to the latest version.
+6. Update packages to the latest version:
 ```console
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
-wget https://raw.githubusercontent.com/mpi2/impc_stats_pipeline/dev/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
-Rscript UpdatePackagesFromGithub.R mpi2 master
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+wget https://raw.githubusercontent.com/${REMOTE}/impc_stats_pipeline/${BRANCH}/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
+Rscript UpdatePackagesFromGithub.R ${REMOTE} ${BRANCH}
 rm UpdatePackagesFromGithub.R
 ```

 ### Run Statistical Pipeline
-8. Start screen.
+7. Execute the `StatsPipeline` function on SLURM:
 ```console
-cd ~
-screen -S stats-pipeline
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}
+sbatch \
+  --time=30-00:00:00 \
+  --mem=8G \
+  -o ../stats_pipeline_logs/stats_pipeline_${VERSION}.log \
+  -e ../stats_pipeline_logs/stats_pipeline_${VERSION}.err \
+  --wrap="R -e 'DRrequiredAgeing:::StatsPipeline(DRversion=${VERSION})'"
 ```
+**Note:** Remember to note down the job ID number that will appear after submitting the job.

-9. Run statistical pipeline.
-```console
-alias bsub-1gb='bsub -q long -R "select[type==X86_64 && mem > 1000] rusage[mem=1000]" -M1000'
-bsub-1gb -o ../stats_pipeline_logs/stats_pipeline_20.2.log -e ../stats_pipeline_logs/stats_pipeline_20.2.err R -e 'DRrequiredAgeing:::StatsPipeline(DRversion=20.2)'
-```
-- To leave screen press combination `Ctrl + A + D`.
+- To leave screen, press combination `Ctrl + A + D`.
 - Don't forget to write down the number that will appear after leaving the screen, for example, 3507472, and number of cluster node.
+- Also make sure to remember which login node you started the screen session on.

-10. Check progress with this commands as an example.
-- To log in on specific node:
-`ssh codon-login-01`
-- Activate screen to check progress:
-`screen -r 3507472.stats-pipeline`
+8. Monitor progress using the following commands:
+- Activate screen to check progress: `screen -r 3507472.stats-pipeline`
+- Use `squeue` to check job status.
+- Review the log files:
+```console
+less ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_logs/stats_pipeline_${VERSION}.log
+less ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_logs/stats_pipeline_${VERSION}.err
+```

 ## Step 2. Run Annotation Pipeline
-The `IMPC_HadoopLoad` command uses the power of LSF cluster to assign the annotations to the StatPackets and transfers the files to the Hadoop cluster. The files will be transferred to Hadoop:/hadoop/user/mi_stats/impc/statpackets/DRXX.
+The `IMPC_HadoopLoad` command uses the power of cluster to assign the annotations to the StatPackets and transfers the files to the Hadoop cluster. The files will be transferred to Hadoop:/hadoop/user/mi_stats/impc/statpackets/DRXX.
+1. Reconnect to screen session
+Make sure to connect to the same login node you used to start the screen session.
 ```console
-cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr20.2/SP/jobs/Results_IMPC_SP_Windowed
-alias bsub-1gb='bsub -q long -R "select[type==X86_64 && mem > 1000] rusage[mem=1000]" -M1000'
-bsub-1gb -o ../stats_pipeline_logs/annotation_pipeline_20.2.log -e ../stats_pipeline_logs/annotation_pipeline_20.2.err R -e 'DRrequiredAgeing:::IMPC_HadoopLoad(prefix="DR20.2",transfer=FALSE)'
+screen -r 3507472.stats-pipeline
 ```
+
+2. Update packages to the latest version:
+```console
+wget https://raw.githubusercontent.com/${REMOTE}/impc_stats_pipeline/${BRANCH}/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata/StatsPipeline/UpdatePackagesFromGithub.R
+Rscript UpdatePackagesFromGithub.R ${REMOTE} ${BRANCH}
+rm UpdatePackagesFromGithub.R
+```
+
+3. Run annotation pipeline without exporting it to Hadoop: `transfer=FALSE`
+```console
+cd ${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_dr${VERSION}/SP/jobs/Results_IMPC_SP_Windowed
+sbatch \
+  --time=3-00:00:00 \
+  --mem=8G \
+  -o ../../../../stats_pipeline_logs/annotation_pipeline_${VERSION}.log \
+  -e ../../../../stats_pipeline_logs/annotation_pipeline_${VERSION}.err \
+  --wrap="R -e 'DRrequiredAgeing:::IMPC_HadoopLoad(prefix=${VERSION},transfer=FALSE,mp_chooser_file=${MP_CHOOSER_FILE})'"
+```
+
 - The most complex part of this process is that some files will fail to transfer and you need to use scp command to transfer files to the Hadoop cluster manually.
 - When you are sure that all files are there, you can share the path with Federico.
 **Note**: in the slides transfer=TRUE, which means we haven't transfered files this time.
````
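The new `sbatch` invocations interpolate shell variables directly into the R expression passed to `--wrap`, which is why the `MP_CHOOSER_FILE` export wraps the absolute path in literal double quotes: R must receive the path as a quoted string. The sketch below mimics that assembly in Python; `quoted_r_path` and `hadoop_load_sbatch` are hypothetical helper names, not part of the pipeline, and only illustrate the quoting under the flags shown in the README diff:

```python
import shlex


def quoted_r_path(path: str) -> str:
    """Wrap an absolute path in literal double quotes, as the
    MP_CHOOSER_FILE export does, so R receives it as a string."""
    return '"' + path + '"'


def hadoop_load_sbatch(version: str, mp_chooser_file: str,
                       log_dir: str = "../../../../stats_pipeline_logs"):
    """Assemble the sbatch argument list for the annotation step."""
    # The R expression that ends up inside --wrap="R -e '...'".
    r_expr = (
        "DRrequiredAgeing:::IMPC_HadoopLoad("
        f"prefix={version},transfer=FALSE,"
        f"mp_chooser_file={quoted_r_path(mp_chooser_file)})"
    )
    return [
        "sbatch",
        "--time=3-00:00:00",
        "--mem=8G",
        "-o", f"{log_dir}/annotation_pipeline_{version}.log",
        "-e", f"{log_dir}/annotation_pipeline_{version}.err",
        # shlex.quote single-quotes the expression, matching the
        # inner quoting of the --wrap argument in the README.
        f"--wrap=R -e {shlex.quote(r_expr)}",
    ]
```

Keeping the path double-quoted inside the single-quoted R expression is the whole trick: without the literal quotes, R would parse the bare path as code rather than a string.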
````diff
@@ -175,7 +193,7 @@ This process generates statistical reports typically utilized by the IMPC workin
 3. The commands below will generate two CSV files in the `${KOMP_PATH}/impc_statistical_pipeline/IMPC_DRs/stats_pipeline_input_drXX.y/SP/jobs/Results_IMPC_SP_Windowed` directory for the unidimentional and categorical results. The files can be gzip and moved to the FTP directory. You can decorate and format the files by using one of the formatted files in the previous data releases.
 ```console
 R
-DRrequiredAgeing:::IMPC_statspipelinePostProcess()
+DRrequiredAgeing:::IMPC_statspipelinePostProcess(mp_chooser_file=${MP_CHOOSER_FILE})
 DRrequiredAgeing:::ClearReportsAfterCreation()
 ```
````