Set up the GPF development environment and production data

Install conda

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
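
After the installer finishes, restart your shell (or source its rc file) so that conda and mamba are on your PATH, then verify the installation:

conda --version
mamba --version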

Install Node.js via NVM

Install NVM:

wget -qO- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

Afterwards, restart your terminal and install the latest version of Node.js:

nvm install node
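
Verify that Node.js and npm are available:

node --version
npm --version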

Clone repositories

Clone the gpf and gpfjs repositories:

git clone git@github.com:iossifovlab/gpf.git
git clone git@github.com:iossifovlab/gpfjs.git

Install the dependencies for gpfjs:

cd gpfjs
npm install
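
As a quick sanity check, you can ask the locally installed Angular CLI for its version (gpfjs is served with ng later in this guide, so the Angular CLI should be among the installed dependencies; npx resolves the project-local binary):

npx ng version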

Create GPF conda environment

From inside of the gpf directory run:

mamba env create --name gpf --file ./environment.yml
mamba env update --name gpf --file ./dev-environment.yml
conda activate gpf
for d in dae wdae; do (cd $d; pip install -e .); done
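
To confirm that both packages were installed in editable mode, you can list the editable installs (the exact distribution names are whatever the dae and wdae checkouts declare):

pip list --editable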

Configure your default genomics resources repository

At runtime, GPF needs various genomic resources such as a reference genome, gene models, and gene properties.

By default, GPF fetches these resources from the default genomic resources repository without caching them, which can be slow.

For development, it is recommended to use a caching repository. To this end, create a file named .grr_definition.yaml in your home directory with the following content:

type: group
children:
- id: "seqpipe"
  type: "url"
  url: "https://grr.seqpipe.org"
  cache_dir: "<path to cache dir>"
- id: "default"
  type: "url"
  url: "https://www.iossifovlab.com/distribution/public/genomic-resources-repository"
  cache_dir: "<path to cache dir>"

Replace <path to cache dir> with a directory on your local filesystem that is suitable for caching the large genomic resource data.
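
For example, you could dedicate a directory under your home to the cache; the path below is purely illustrative:

mkdir -p ~/grr-cache
# then use cache_dir: "/home/<your user>/grr-cache" in both entries above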

Install dvc

You will need a separate conda environment with dvc and dvc-ssh installed:

mamba create -n dvc -c conda-forge dvc dvc-ssh
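
Verify the new environment:

conda activate dvc
dvc --version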

Setup production data

Note

To use the data-hg38-production instance configuration, you will need access to Seqpipe’s internal network (either by working on an office computer or by using a VPN).

  1. Clone the production data git repositories:

    git clone git@github.com:iossifovlab/data-hg38-production.git
    git clone git@github.com:seqpipe/data-phenodb-production.git
    
  2. Pull data via dvc:

    conda activate dvc
    cd data-hg38-production
    dvc pull -r nemo
    cd ../data-phenodb-production
    dvc pull -r nemo
    
  3. Extract the phenotype data:

    cd data-phenodb-production
    ./extract_phenodbs.sh
    
  4. Set up a setenv.sh script with the following contents (a filled-in example is shown after this list):

    conda activate gpf
    export DAE_DB_DIR=<path to data-hg38-production>
    export DAE_PHENODB_DIR=<path to data-phenodb-production>
    export GPF_PREFIX=gpfjs
    

    You can place this script wherever you want. Afterwards, source it:

    source setenv.sh
    
  5. Navigate to data-hg38-production and run the following script to adjust your instance configuration:

    ./scripts/adjust_seqpipe_minimal.sh
    

    Other adjustment scripts are available inside the scripts directory; each configures a different subset of the data.
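
As a concrete example for step 4, here is a setenv.sh that assumes both data repositories were cloned directly into your home directory; adjust the paths to your layout:

conda activate gpf
export DAE_DB_DIR="$HOME/data-hg38-production"
export DAE_PHENODB_DIR="$HOME/data-phenodb-production"
export GPF_PREFIX=gpfjs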

Running the GPF instance

  1. Run gpfjs:

    cd gpfjs
    ng serve
    
  2. In a new terminal, run gpf:

    Source your environment file:

    source setenv.sh
    

    Run the following script to initialize everything needed to run the wdae server. This script only needs to be run once:

    cd gpf/wdae/wdae
    reset_dev.sh
    

    Finally, run the server:

    ./wdaemanage.py runserver
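
    With both processes running, ng serve listens on Angular's default port 4200, and runserver (a standard Django management command) on Django's default port 8000. Assuming the defaults were not overridden, a quick reachability check:

    curl -I http://localhost:4200
    curl -I http://localhost:8000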
    

Running dae unit tests in parallel

cd gpf/dae
unset DAE_DB_DIR
export PYTHONHASHSEED=0
pytest -n 10 dae tests
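
The -n flag comes from the pytest-xdist plugin (assumed here to be installed by the dev environment); it can also size the worker pool automatically:

pytest -n auto dae tests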