Scrimer installation

You need a default installation of Python 2.7 [1] with virtualenv [2].

# create and activate new python virtual environment for scrimer
# in home directory of current user
virtualenv ~/scrimer-env
. ~/scrimer-env/bin/activate

# install cython in advance because of pybedtools
# and distribute because of pyvcf
pip install cython distribute

# now install scrimer from pypi
# with its additional dependencies (pyvcf, pysam, pybedtools)
pip install scrimer

Scrimer depends on several Python modules, which should be installed automatically by the procedure above:

pysam [3] is used to manipulate the indexed fasta and bam files
pybedtools [4] is used to read and write the annotations
PyVCF [5] is used to access variant data
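
To confirm that the modules were actually pulled in, a quick import check can be run inside the activated environment. The following is a minimal sketch and not part of Scrimer: the function name `check_py_modules` and the `PYTHON` override variable are hypothetical, and it assumes that PyVCF is imported under the module name `vcf`.

```shell
# Report which Python modules can be imported by the interpreter
# of the currently active environment.
# PYTHON is a hypothetical override; it defaults to `python`.
check_py_modules() {
    for mod in "$@"; do
        if "${PYTHON:-python}" -c "import $mod" >/dev/null 2>&1; then
            echo "ok $mod"
        else
            echo "MISSING $mod"
        fi
    done
}
```

With the environment activated, running `check_py_modules pysam pybedtools vcf` lists each module as `ok` or `MISSING`.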

Special cases

If you’re in an environment where you can’t install virtualenv system-wide, we recommend using the technique described at http://eli.thegreenplace.net/2013/04/20/bootstrapping-virtualenv/.

If you’re in a grid environment where paths differ between nodes, making the virtual environment relocatable can help:

virtualenv --relocatable ~/scrimer-env

Non-Python dependencies

Apart from the Python modules, the Scrimer pipeline relies on other tools that should be installed in your PATH. Follow the installation instructions in each package.

For reference, we recorded the commands used to install those dependencies in the Scrimer VirtualBox image. If your system is Debian 7, the commands may work verbatim.

bedtools [8] is a dependency of pybedtools, used for manipulating gff and bed files
samtools [12] is used for manipulating short read alignments, and for calling variants
LASTZ [7] is used to find the longest isotigs
tabix [9] creates compressed and indexed versions of annotation files
GMAP [11] produces a spliced mapping of your contigs to the reference genome
smalt [13] maps short reads to consensus contigs to discover variants
GNU parallel [14] is used throughout the pipeline to speed up some lengthy calculations [34]
blat and isPcr [15] are used to check the designed primers
Primer3 [16] is used to find the optimal primer sequences
cutadapt [6] is used to remove cDNA synthesis primers
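
Before running the pipeline, it can be useful to verify that all of these tools are reachable through PATH. The following is a minimal sketch, not part of Scrimer: the function name `check_tools` is hypothetical, and the executable names passed to it in the usage note are assumptions about how each package names its binary on your system.

```shell
# Print every command that cannot be found in PATH.
# Returns 0 when all commands are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not in PATH: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_tools bedtools samtools lastz tabix gmap smalt parallel blat isPcr primer3_core cutadapt` prints one line per tool that still needs to be installed or added to PATH.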

Additional tools can be installed to provide some more options.

FastQC [20] can be used to check the tag cleaning process
agrep [21] and tre-agrep [22] can be used to check the tag cleaning process
sort-alt [10] provides alphanumeric sorting of chromosome names; rename sort to sort-alt after compiling
IGV [23] is great for visualizing the data when checking the results
newbler [24] is the best option for assembling 454 mRNA data [32] [33]
MIRA [25] does well with 454 transcriptome assembly as well [32] [33]
sim4db [26], part of the kmer suite, can be used as an alternative spliced mapper; apply our patch [27] to get standard-conformant output
Pipe Viewer [28] can be used to display the progress of longer operations
BioPython [18] and NumPy [19] are required for running 5prime_stats.py
mawk [29] is usually an order of magnitude faster than awk, which is used often in the pipeline
vcflib [31] has a nice interface for working with vcf files (but recent versions of bcftools are good as well)
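
Because mawk only affects speed, your own scripts can fall back to plain awk when mawk is not installed. This is a minimal sketch that assumes nothing about how Scrimer itself invokes awk; the function `pick_awk` and the variable `AWK` are hypothetical names used for illustration only.

```shell
# Prefer mawk when it is installed, otherwise fall back to awk.
pick_awk() {
    if command -v mawk >/dev/null 2>&1; then
        echo mawk
    else
        echo awk
    fi
}

# AWK is a hypothetical variable; use "$AWK" in place of a hard-coded awk call.
AWK=$(pick_awk)
```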

Add installed tools to your PATH

To easily manage the locations of the tools that you’re using with the Scrimer pipeline, create a text file listing the directories where the binaries of your tools are located. The format is one path per line, for example:

/opt/bedtools/bin
/opt/samtools-0.1.18
/opt/lastz/bin
/opt/tabix
/opt/gmap/bin
/data/samba/liborm/sw_testbed/smalt-0.7.4
/data/samba/liborm/sw_testbed/FastQC
/data/samba/liborm/sw_testbed/kmer/sim4db

Put this file in your virtual environment directory, e.g. ~/scrimer-env/paths. You can run the following snippet when starting your work session:

# join the listed directories with ':' and prepend them to PATH
export PATH=$( paste -s -d: ~/scrimer-env/paths ):$PATH
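
A slightly more defensive variant of the snippet above is sketched here. It is an illustration, not part of Scrimer: the function name `path_prefix_from_file` is hypothetical, and skipping blank lines, `#` comment lines and non-existent directories is an added assumption about how you may want to maintain the paths file (for example when the same file is shared between machines).

```shell
# Build a colon-separated PATH prefix from a file with one directory per line.
# Blank lines and lines starting with '#' are skipped, as are directories
# that do not exist on this machine.
path_prefix_from_file() {
    prefix=""
    while IFS= read -r dir || [ -n "$dir" ]; do
        case "$dir" in
            ""|"#"*) continue ;;
        esac
        [ -d "$dir" ] || continue
        prefix="${prefix:+$prefix:}$dir"
    done < "$1"
    echo "$prefix"
}

# Prepend the collected directories only when the file exists
# and at least one directory was found.
paths_file=~/scrimer-env/paths
if [ -f "$paths_file" ]; then
    prefix=$(path_prefix_from_file "$paths_file")
    [ -z "$prefix" ] || export PATH="$prefix:$PATH"
fi
```

Building the prefix without a trailing colon avoids an empty PATH entry, which the shell would interpret as the current directory.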

References

Python packages

[1]	Python http://www.python.org/

[2]	virtualenv http://www.virtualenv.org/en/latest/

[3]	pysam http://code.google.com/p/pysam/

[4]	pybedtools http://pythonhosted.org/pybedtools/

[5]	PyVCF https://github.com/jamescasbon/PyVCF

[6]	cutadapt https://code.google.com/p/cutadapt/

Other software

[7]	lastz http://www.bx.psu.edu/~rsharris/lastz/

[8]	bedtools https://github.com/arq5x/bedtools2

[9]	tabix http://www.htslib.org/, http://samtools.sourceforge.net/tabix.shtml

[10]	sort-alt https://github.com/lh3/foreign/tree/master/sort

[11]	gmap http://research-pub.gene.com/gmap/

[12]	samtools http://www.htslib.org/, http://sourceforge.net/projects/samtools/files/

[13]	smalt http://www.sanger.ac.uk/resources/software/smalt/, we used 0.7.0.1, because the latest version (0.7.3) crashes

[14]	GNU parallel http://www.gnu.org/software/parallel/

[15]	blat and isPcr http://users.soe.ucsc.edu/~kent/src/, get `blatSrc35.zip` and `isPcr33.zip`, before `make` do `export MACHTYPE` and `export BINDIR=<dir>`

[16]	Primer3 http://primer3.sourceforge.net/

[17]	ea-utils https://code.google.com/p/ea-utils/

Optional software

[18]	BioPython http://biopython.org/

[19]	numpy http://www.numpy.org/

[20]	FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

[21]	agrep https://github.com/Wikinaut/agrep

[22]	tre-agrep http://laurikari.net/tre/

[23]	IGV http://www.broadinstitute.org/igv/

[24]	newbler http://454.com/products/analysis-software/index.asp

[25]	MIRA http://www.chevreux.org/projects_mira.html

[26]	sim4db http://sourceforge.net/apps/mediawiki/kmer/index.php?title=Main_Page

[27]	patch for sim4db gff output http://sourceforge.net/p/kmer/patches/2/

[28]	Pipe Viewer http://www.ivarch.com/programs/pv.shtml

[29]	mawk http://invisible-island.net/mawk/

[30]	yEd http://www.yworks.com/en/products_yed_about.html

[31]	vcflib https://github.com/ekg/vcflib

Papers

[32]	Mundry,M. et al. (2012) Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach. PLoS ONE, 7, e31410. DOI: http://dx.doi.org/10.1371/journal.pone.0031410

[33]	Kumar,S. and Blaxter,M.L. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics, 11, 571. DOI: http://dx.doi.org/10.1186/1471-2164-11-571

[34]	Tange,O. (2011) GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, 36, 42-47.