Components of the Scrimer pipeline

Components bound to the pipeline

Transfer annotation from the reference genome to the contigs

File: scripts/liftover.py

Input

  • gff file produced by the exon predictor software
  • set of bed files containing the features to be transfered to the cDNA

Output

  • ‘inverted’ gff file annotating exons in the transcripts and containing also all the features from the other input files, if they were overlapping with the predicted exons

Author: Libor Morkovsky, 2012

Create a contig scaffold according to contig hits in the reference genome

File: scripts/scaffold.py

Input

  • fasta file with the original sequences
  • set of gff files with exon features having the Target attribute (product of liftover.py)
  • other gff/bed files to remap to the genome

Output

  • fasta with the input sequences pasted with 100 Ns
    • unambiguos sequences assigned to ‘chromosomes’ in the order of the template genome (that was used to generate the Target exon mappings)
    • ambiguos (having more than one possible chromosome) assigned to chrAmb
    • unmapped sequences assigned to chrUnmapped (to distinguish from chrUn in target genome)
  • gff file with the locations of the input sequences (gene) and remapped contents of the input gff files

Algorithm

  • go through the input gff files, construct a dictionary {read_name -> {chr -> location}}
  • add the lowest coordinate found on a given chromosome (overwriting previous values)
  • sort the list with single candidate locations with ‘chromocompare’

Author: Libor Morkovsky, 2012

Design primers using sequences, annotations and selected variants

File: scripts/design_primers.py

Input

  • reference sequence (fasta with samtools fai index)
  • annotations (gff3, has to contain exon entries)
  • filtered variants (vcf, primers are designed for variants with PASS)

Output

  • PCR and genotyping primers selected using primer3 (gff3)

Algorithm

  • there is only a few selected variants, so the least amount of work will be to do the work only for variants
  • for each of the selected variants
    • request exons
    • apply the technical constraints (minimal primer length of 20 from the edge of an exon)
    • patch exon sequence to mark positions of known variants
    • find suitable genotyping primers
    • design PCR primers to flank the (usable) genotyping primers

Author: Libor Morkovsky, 2012, 2014

Export primers from gff3 to the FASTA, tabular or isPcr format

File: scripts/extract_primers.py

Input

  • gff file containing primers
  • optional output format

Output

  • fa/isPcr file with the primer sequences

Algorithm

  • for each line in gff
  • fa output: if current line has a SEQUENCE attribute, output it
  • ispcr output: only pcr primers are of interest, and in equivalent pairs on encountering gt-xx put it to dict keyed by Parent if there are both entries output and remove from dict

Author: Libor Morkovsky, 2012

Tools for FASTA/FASTQ manipulation

Given pairs of matching sequences, create clusters and find the longest representative

File: scripts/find_redundant_sequences.py

Given pairs of ‘almost identical’ sequences create clusters of sequences. From each cluster select the longest sequence. Output names of all other sequences (for example to remove them from the data afterwards).

Uses disjoint sets forest to store the clusters so it should scale to millions of sequences.

Input

  • custom formated (--format=general:name1,size1,start1,name2,size2,start2,strand2,identity,coverage) output from lastz on standard input

Output

  • names of sequences that were selected as reduntant

Filter sequences in a FASTQ file based on their position in the file

File: scripts/fastq_kill_lines.py

Input

  • FASTQ file
  • list of indices (0 based)

Output

  • FASTQ file without the given sequences

Author: Libor Morkovsky, 2012

Filter sequences in FASTQ, FASTA based on their identifier

File: scripts/seq_filter_by_id.py

Taken from BioPython. FASTA and FASTQ readers are pasted in the file, so the program is standalone.

Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.

Takes six command line options, tabular filename, ID column numbers (comma separated list using one based counting), input filename, input type (e.g. FASTA or SFF) and two output filenames (for records with and without the given IDs, same format as input sequence file).

If either output filename is just a minus sign, that file is not created. This is intended to allow output for just the matched (or just the non-matched) records.

When filtering an SFF file, any Roche XML manifest in the input file is preserved in both output files.

Note in the default NCBI BLAST+ tabular output, the query sequence ID is in column one, and the ID of the match from the database is in column two. Here sensible values for the column numbers would therefore be “1” or “2”.

This tool is a short Python script which requires Biopython 1.54 or later for SFF file support. If you use this tool in scientific work leading to a publication, please cite the Biopython application note:

Cock et al 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved. See accompanying text file for licence details (MIT/BSD style).

This is version 0.0.1 of the script.

Break sequences in FASTA file into fragments

File: scripts/fasta_fragments.py

Taken from lastz. Break sequences in a fasta file into fragments, so some kind of short read aligner can be used for further processing.

Break a fasta file into fragments.

$$$ todo: spread out the fragment starts so that the last fragment ends at the $$$ .. end of a sequence, if possible

$$$ todo: find runs of N and reset the fragment start position to skip past $$$ .. such runs

$$$ liborm(2012): fixed to read only the seq name, not the full line

General purpose tools

Primer3 wrapper

File: scrimer/primer3_connector.py

A Python interface to the primer3_core executable.

TODO: it is not possible to keep a persistent primer3 process
using subprocess module - communicate() terminates the input stream and waits for the process to finish

Author: Libor Morkovsky 2012

class primer3_connector.BoulderIO

Provides Python interface for BoulderIO format used by Primer3.

classmethod deparse(records)

Accepts a dict or a list of dicts, produces a BoulderIO string (KEY=VAL\n) with records separated by '=\n'.

classmethod parse(string)

Parse a BoulderIO string (KEY=VAL\n) return a list of records, where each record is a dictionary end of the string implies a single '=\n' (record separator).

class primer3_connector.Primer3(p3path='primer3_core', **kwargs)

Wraps Primer3 executable. kwargs are converted to strings and used as default parameters for each call of primer3 binary.

call(records)

Merge each of the records with default_params, the record taking precedence, call the primer3 binary, parse the output and return a list of dictionaries, {RIGHT:[], LEFT:[], PAIR:[], INTERNAL:[]} for each input record uppercase keys (in the result) are the original names from BoulderIO format, lowercase keys have no direct equivalent in primer3 output (position, other-keys)

Convert CIGAR matches to sim4db ‘script’

File: scripts/cigar_to_sim4db_scr.py

Script that parses CIGAR file produced by aligning fragments of contigs to a genome (tested with smalt output) and outputs a ‘script’ for limiting the exon model regions of sim4db.

Input

  • output of some read mapper in CIGAR format
  • all the fragments must be reported by the aligner
  • the fragment names have to be in the same order as the master sequences
  • the fragments must be named like readname_number (like fasta_fragments.py does)
  • all the hits from one read must follow each other

Output

  • sim4db ‘script’

Algorithm

  • load chromosome definition file
  • parse the hits: - extract read name - check if hit is in known chromosome (report error otherwise) - for each hit create +-50 KB region clipped to chromosome ends - when a read name different from the previous one is encountered, merge all the regions - output each contiguos region as a script line

Author: Libor Morkovsky, 2012

Count different bases in 5’ end of reads in FASTQ

File: scripts/5prime_stats.py

Find the most common letter in first n bases of reads in FASTQ file. Useful for finding and recognizing primer sequences in the reads.

Variant filters for PyVCF vcf_filter.py

File: scrimer/pyvcf_filters.py

Implementation of vcf filters for pyvcf vcf_filter.py.

Author: Libor Morkovsky 2012

class pyvcf_filters.DistinguishingVariants(args)

Given a group of samples, choose variants that are not shared with the rest of the samples

classmethod customize_parser(parser)
filter_name()
name = 'contrast-samples'