Set up project dependent settings¶
All commands in Scrimer scripts and this manual suppose that you will set some environment
variables that define your project and that you add the required tools into your PATH
.
Directory layout¶
Here we present the layout that we use to organize all the data needed to run the pipeline. The inputs together with intermediate (and final) results total to hundreds of files. Having those files organized can help prevent mistakes.
Note
The method of organizing your data presented here is just our suggestion. Python scripts doing most of the work are not dependent on any particular directory structure.
Genomes directory¶
We assume that genome data is shared among different projects and different people on the same machine. Thus we place it in a location that is different from project specific data. This is where the reference genome should be placed.
Project directory¶
A directory containing files specific for one input dataset. Various steps can be run with various settings in the same project directory. We organize our files in a waterfall structure of directories, where each directory name is prefixed with a two digit number. The directory name is some short meaningful description of the step, the first digit in the prefix corresponds to part of the process (read mapping, variant calling etc.), and the second digit distinguishes substeps or runs with different settings.
To start a new project, create a new directory. To use Scrimer you have to convert your data
to the fastq
format. Put your .fastq
data in a subdirecotry called 00-raw
.
project.sh¶
Create a file called project.sh
in your project directory. It will consist of KEY=VALUE
lines that will define your project specific settings, and each time you want to use Scrimer
you’ll start by:
cd my/project/directory
. project.sh
Example project.sh
:
# number of cores you want to use for parallel calculations
CPUS=8
# location of genome data in your system
# you need write access to add a new reference genome to that location
GENOMES=/data/genomes
# reference genome used
GENOME=taeGut1
GENOMEDIR=$GENOMES/$GENOME
GENOMEFA=$GENOMEDIR/$GENOME.fa
# genome in blat format
GENOME2BIT=$GENOMEDIR/$GENOME.2bit
# gmap index location
GMAP_IDX_DIR=$GENOMEDIR
GMAP_IDX=gmap_${GENOME}
# smalt index
SMALT_IDX=$GENOMEDIR/smalt/${GENOME}k13s4
# primers used to synthetize cDNA
# (sequences were found in .pdf report from the company that did the normalization)
PRIMER1=AAGCAGTGGTATCAACGCAGAGTTTTTGTTTTTTTCTTTTTTTTTTNN
PRIMER2=AAGCAGTGGTATCAACGCAGAGTACGCGGG
PRIMER3=AAGCAGTGGTATCAACGCAGAGT
Adding the tools to your PATH¶
For each scrimer session, all the executables that are used have to be in one of
the directories mentioned in PATH
.
You can set up your PATH
easily by using the file you created during installation:
export PATH=$( cat ~/scrimer-env/paths | tr "\n" ":" ):$PATH
Such line can be at the end of your project.sh
file, so everythig is set up at once.
Alternatively you can copy all the tool executables into your virtual environment
bin directory (~/scrimer-env/bin
).