Scrimer installation

You need a default installation of Python 2.7 [1] with virtualenv [2].

# create and activate new python virtual environment for scrimer
# in home directory of current user
virtualenv ~/scrimer-env
. ~/scrimer-env/bin/activate

# install cython in advance because of pybedtools
# and distribute because of pyvcf
pip install cython distribute

# now install scrimer from pypi
# with its additional dependencies (pyvcf, pysam, pybedtools)
pip install scrimer

Scrimer depends on several Python modules, which should be installed automatically by the procedure above.

  • pysam [3] is used to manipulate the indexed fasta and bam files
  • pybedtools [4] is used to read and write the annotations
  • PyVCF [5] is used to access variants data
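
To confirm that the modules listed above were actually installed, each one can be imported in turn. The following is a minimal sketch, not part of Scrimer itself: the helper name missing_modules is hypothetical, and it assumes the virtual environment is active so that python refers to the interpreter in ~/scrimer-env. Note that PyVCF is imported under the module name vcf.

```shell
# Print the name of every Python module that cannot be imported.
# Usage: missing_modules PYTHON MODULE...
# (missing_modules is a hypothetical helper, not shipped with Scrimer)
missing_modules() {
    py=$1
    shift
    for mod in "$@"; do
        # the interpreter exits with a non-zero status when the import fails
        "$py" -c "import $mod" 2>/dev/null || echo "$mod"
    done
}

# PyVCF is imported as "vcf"; no output means all three modules are present
missing_modules python pysam pybedtools vcf
```

The interpreter is passed as the first argument so that the check can also be pointed at ~/scrimer-env/bin/python without activating the environment.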

Special cases

If you’re in an environment where you’re not able to install virtualenv system-wide, we recommend using the technique described at http://eli.thegreenplace.net/2013/04/20/bootstrapping-virtualenv/.

If you’re in a grid environment, making the virtual environment relocatable can help when paths differ between nodes:

virtualenv --relocatable ~/scrimer-env

Non-Python dependencies

Apart from the Python modules, the Scrimer pipeline relies on other tools that should be available in your PATH. Follow the installation instructions for each package.

For reference, we recorded the commands used to install these dependencies in the Scrimer VirtualBox image. If your system is Debian 7, the commands may work verbatim.

  • bedtools [8] is a dependency of pybedtools, used for manipulating gff and bed files
  • samtools [12] is used for manipulating short read alignments, and for calling variants
  • LASTZ [7] is used to find the longest isotigs
  • tabix [9] creates compressed and indexed versions of annotation files
  • GMAP [11] produces a spliced mapping of your contigs to the reference genome
  • smalt [13] maps short reads to consensus contigs to discover variants
  • GNU parallel [14] is used throughout the pipeline to speed up some lengthy calculations [34]
  • blat and isPcr [15] are used to check the designed primers
  • Primer3 [16] is used to find the optimal primer sequences
  • cutadapt [6] is used to remove cDNA synthesis primers
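
Whether the tools listed above are reachable from your PATH can be checked with the shell built-in command -v. The following is a minimal sketch with a hypothetical helper named missing_tools. The executable names passed to it are the usual ones for these packages (for example primer3_core for Primer3 and parallel for GNU parallel), but they are an assumption here and may differ depending on how each tool was built or packaged.

```shell
# Print the name of every command that is not found on the PATH.
# Usage: missing_tools COMMAND...
# (missing_tools is a hypothetical helper, not shipped with Scrimer)
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# assumed executable names; no output means everything was found
missing_tools bedtools samtools lastz tabix gmap smalt parallel \
    blat isPcr primer3_core cutadapt
```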

Additional tools can be installed to provide some more options.

  • FastQC [20] can be used to check the tag cleaning process
  • agrep [21] and tre-agrep [22] can be used to check the tag cleaning process
  • sort-alt [10] provides alphanumeric sorting of chromosome names; rename sort to sort-alt after compiling
  • IGV [23] is great for visualizing the data when checking the results
  • newbler [24] is the best option for assembling 454 mRNA data [32] [33]
  • MIRA [25] does well with 454 transcriptome assembly as well [32] [33]
  • sim4db [26] can be used as an alternative spliced mapper; it is part of the kmer suite, and our patch [27] must be applied to get standard-conformant output
  • Pipe Viewer [28] can be used to display the progress of longer operations
  • BioPython [18] and NumPy [19] are required for running 5prime_stats.py
  • mawk [29] is a drop-in replacement for awk, which is used often in the pipeline; mawk is usually an order of magnitude faster
  • vcflib [31] has a nice interface for working with vcf files (but recent versions of bcftools are good as well)

Add installed tools to your PATH

To easily manage the locations of the tools that you’re using with the Scrimer pipeline, create a text file containing the paths to the directories where the binaries of your tools are located. The format is one path per line, for example:

/opt/bedtools/bin
/opt/samtools-0.1.18
/opt/lastz/bin
/opt/tabix
/opt/gmap/bin
/data/samba/liborm/sw_testbed/smalt-0.7.4
/data/samba/liborm/sw_testbed/FastQC
/data/samba/liborm/sw_testbed/kmer/sim4db

Put this file in your virtual environment directory, e.g. ~/scrimer-env/paths. You can run the following snippet when starting your work session:

export PATH=$( cat ~/scrimer-env/paths | tr "\n" ":" ):$PATH
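
The one-liner above turns every newline into a colon, including the final one, so a paths file that ends with a newline produces an empty entry in PATH, which the shell treats as the current directory. A slightly more defensive variant is sketched below. The helper name join_paths is hypothetical; it skips empty lines and adds no trailing colon.

```shell
# Join the non-empty lines of a paths file with ":".
# Usage: join_paths FILE
# (join_paths is a hypothetical helper, not shipped with Scrimer)
join_paths() {
    joined=""
    # "|| [ -n "$dir" ]" also picks up a last line without a trailing newline
    while IFS= read -r dir || [ -n "$dir" ]; do
        [ -n "$dir" ] || continue            # skip empty lines
        joined="${joined:+$joined:}$dir"     # add ":" only between entries
    done < "$1"
    printf '%s\n' "$joined"
}

paths_file=~/scrimer-env/paths
if [ -f "$paths_file" ]; then
    extra=$(join_paths "$paths_file")
    # prepend only when the file actually listed something
    if [ -n "$extra" ]; then
        export PATH="$extra:$PATH"
    fi
fi
```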

References

Papers

[32] Mundry,M. et al. (2012) Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach. PLoS ONE, 7, e31410. DOI: http://dx.doi.org/10.1371/journal.pone.0031410
[33] Kumar,S. and Blaxter,M.L. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics, 11, 571. DOI: http://dx.doi.org/10.1186/1471-2164-11-571
[34] Tange,O. (2011) GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, 36, 42-47.