Set up the GPF development environment and production data

Install conda

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
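
After the installer finishes, restart your shell (or source its rc file) so that conda and mamba are on your PATH, then verify the installation:

conda --version
mamba --version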

Install Node.js via NVM

Install NVM:

wget -qO- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

Afterwards, restart your terminal and install the latest version of Node.js:

nvm install node
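
Verify that Node.js and npm are available:

node --version
npm --version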

Clone repositories

Clone the gpf and gpfjs repositories:

git clone git@github.com:iossifovlab/gpf.git
git clone git@github.com:iossifovlab/gpfjs.git

Install the dependencies for gpfjs:

cd gpfjs
npm install
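
As a quick sanity check, you can ask the locally installed Angular CLI for its version (gpfjs is served with ng later in this guide, so the Angular CLI should be among the installed dependencies; npx resolves the project-local binary):

npx ng version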

Create GPF conda environment

From inside of the gpf directory run:

mamba env create --name gpf --file ./environment.yml
mamba env update --name gpf --file ./dev-environment.yml
conda activate gpf
for d in dae wdae; do (cd $d; pip install -e .); done
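
To confirm that both packages were installed in editable mode, you can list the editable installs (the exact distribution names are whatever the dae and wdae checkouts declare):

pip list --editable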

Configure your default genomics resources repository

At runtime, GPF needs various genomic resources such as a reference genome, gene models, and gene properties.

By default, GPF fetches these resources from the default genomic resources repository without caching them, which can be slow.

For development, it is recommended to use a caching repository. To this end, create a file named .grr_definition.yaml in your home directory with the following content:

type: group
children:
- id: "seqpipe"
  type: "url"
  url: "https://grr.seqpipe.org"
  cache_dir: "<path to cache dir>"
- id: "default"
  type: "url"
  url: "https://www.iossifovlab.com/distribution/public/genomic-resources-repository"
  cache_dir: "<path to cache dir>"

Replace <path to cache dir> with a directory on your local filesystem that is suitable for caching the large genomic resource data.
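
For example, you could dedicate a directory under your home to the cache; the path below is purely illustrative:

mkdir -p ~/grr-cache
# then use cache_dir: "/home/<your user>/grr-cache" in both entries above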

Install dvc

You will need a separate conda environment with dvc and dvc-ssh installed:

mamba create -n dvc -c conda-forge dvc dvc-ssh
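
Verify the new environment:

conda activate dvc
dvc --version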

Setup production data

Note

To use the data-hg38-production instance configuration, you will need access to Seqpipe’s internal network (either by working on an office computer or by using a VPN).

  1. Clone the production data git repositories:

    git clone git@github.com:iossifovlab/data-hg38-production.git
    git clone git@github.com:seqpipe/data-phenodb-production.git
    
  2. Pull data via dvc:

    conda activate dvc
    cd data-hg38-production
    dvc pull -r nemo
    cd ../data-phenodb-production
    dvc pull -r nemo
    
  3. Extract the phenotype data:

    cd data-phenodb-production
    ./extract_phenodbs.sh
    
  4. Set up a setenv.sh script with the following contents (a filled-in example is shown after this list):

    conda activate gpf
    export DAE_DB_DIR=<path to data-hg38-production>
    export DAE_PHENODB_DIR=<path to data-phenodb-production>
    export GPF_PREFIX=gpfjs
    

    You can place this script wherever you want. Afterwards, source it:

    source setenv.sh
    
  5. Navigate to data-hg38-production and run the following script to adjust your instance configuration:

    ./scripts/adjust_seqpipe_minimal.sh
    

    Other adjustment scripts are available inside the scripts directory; each configures a different subset of the data.
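
As a concrete example for step 4, here is a setenv.sh that assumes both data repositories were cloned directly into your home directory; adjust the paths to your layout:

conda activate gpf
export DAE_DB_DIR="$HOME/data-hg38-production"
export DAE_PHENODB_DIR="$HOME/data-phenodb-production"
export GPF_PREFIX=gpfjs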

Running the GPF instance

  1. Run gpfjs:

    cd gpfjs
    ng serve
    
  2. In a new terminal, run gpf:

    Source your environment file:

    source setenv.sh
    

    Run the following script to initialize everything needed to run the wdae server. This script only needs to be run once:

    cd gpf/wdae/wdae
    reset_dev.sh
    

    Finally, run the server:

    ./wdaemanage.py runserver
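
    With both processes running, ng serve listens on Angular's default port 4200, and runserver (a standard Django management command) on Django's default port 8000. Assuming the defaults were not overridden, a quick reachability check:

    curl -I http://localhost:4200
    curl -I http://localhost:8000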
    

Running dae unit tests in parallel

cd gpf/dae
unset DAE_DB_DIR
export PYTHONHASHSEED=0
pytest -n 10 dae tests
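
The -n flag comes from the pytest-xdist plugin (assumed here to be installed by the dev environment); it can also size the worker pool automatically:

pytest -n auto dae tests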