Components of the Scrimer pipeline¶
Components bound to the pipeline¶
Transfer annotation from the reference genome to the contigs¶
File: scripts/liftover.py
Input
- gff file produced by the exon predictor software
- set of bed files containing the features to be transfered to the cDNA
Output
- ‘inverted’ gff file annotating exons in the transcripts and containing also all the features from the other input files, if they were overlapping with the predicted exons
Author: Libor Morkovsky, 2012
Create a contig scaffold according to contig hits in the reference genome¶
File: scripts/scaffold.py
Input
- fasta file with the original sequences
- set of gff files with exon features having the Target attribute
(product of
liftover.py
) - other gff/bed files to remap to the genome
Output
- fasta with the input sequences pasted with 100 Ns
- unambiguos sequences assigned to ‘chromosomes’ in the order of the template genome (that was used to generate the Target exon mappings)
- ambiguos (having more than one possible chromosome) assigned to chrAmb
- unmapped sequences assigned to chrUnmapped (to distinguish from chrUn in target genome)
- gff file with the locations of the input sequences (gene) and remapped contents of the input gff files
Algorithm
- go through the input gff files, construct a dictionary {read_name -> {chr -> location}}
- add the lowest coordinate found on a given chromosome (overwriting previous values)
- sort the list with single candidate locations with ‘chromocompare’
Author: Libor Morkovsky, 2012
Design primers using sequences, annotations and selected variants¶
File: scripts/design_primers.py
Input
- reference sequence (fasta with samtools fai index)
- annotations (gff3, has to contain exon entries)
- filtered variants (vcf, primers are designed for variants with PASS)
Output
- PCR and genotyping primers selected using primer3 (gff3)
Algorithm
- there is only a few selected variants, so the least amount of work will be to do the work only for variants
- for each of the selected variants
- request exons
- apply the technical constraints (minimal primer length of 20 from the edge of an exon)
- patch exon sequence to mark positions of known variants
- find suitable genotyping primers
- design PCR primers to flank the (usable) genotyping primers
Author: Libor Morkovsky, 2012, 2014
Export primers from gff3 to the FASTA, tabular or isPcr format¶
File: scripts/extract_primers.py
Input
- gff file containing primers
- optional output format
Output
- fa/isPcr file with the primer sequences
Algorithm
- for each line in gff
- fa output: if current line has a
SEQUENCE
attribute, output it- ispcr output: only pcr primers are of interest, and in equivalent pairs on encountering gt-xx put it to dict keyed by
Parent
if there are both entries output and remove from dict
Author: Libor Morkovsky, 2012
Tools for FASTA/FASTQ manipulation¶
Given pairs of matching sequences, create clusters and find the longest representative¶
File: scripts/find_redundant_sequences.py
Given pairs of ‘almost identical’ sequences create clusters of sequences. From each cluster select the longest sequence. Output names of all other sequences (for example to remove them from the data afterwards).
Uses disjoint sets forest to store the clusters so it should scale to millions of sequences.
Input
- custom formated (
--format=general:name1,size1,start1,name2,size2,start2,strand2,identity,coverage
) output fromlastz
on standard input
Output
- names of sequences that were selected as reduntant
Filter sequences in a FASTQ file based on their position in the file¶
File: scripts/fastq_kill_lines.py
Input
- FASTQ file
- list of indices (0 based)
Output
- FASTQ file without the given sequences
Author: Libor Morkovsky, 2012
Filter sequences in FASTQ, FASTA based on their identifier¶
File: scripts/seq_filter_by_id.py
Taken from BioPython. FASTA and FASTQ readers are pasted in the file, so the program is standalone.
Filter a FASTA, FASTQ or SSF file with IDs from a tabular file.
Takes six command line options, tabular filename, ID column numbers (comma separated list using one based counting), input filename, input type (e.g. FASTA or SFF) and two output filenames (for records with and without the given IDs, same format as input sequence file).
If either output filename is just a minus sign, that file is not created. This is intended to allow output for just the matched (or just the non-matched) records.
When filtering an SFF file, any Roche XML manifest in the input file is preserved in both output files.
Note in the default NCBI BLAST+ tabular output, the query sequence ID is in column one, and the ID of the match from the database is in column two. Here sensible values for the column numbers would therefore be “1” or “2”.
This tool is a short Python script which requires Biopython 1.54 or later for SFF file support. If you use this tool in scientific work leading to a publication, please cite the Biopython application note:
Cock et al 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved. See accompanying text file for licence details (MIT/BSD style).
This is version 0.0.1 of the script.
Break sequences in FASTA file into fragments¶
File: scripts/fasta_fragments.py
Taken from lastz. Break sequences in a fasta file into fragments, so some kind of short read aligner can be used for further processing.
Break a fasta file into fragments.
$$$ todo: spread out the fragment starts so that the last fragment ends at the $$$ .. end of a sequence, if possible
$$$ todo: find runs of N and reset the fragment start position to skip past $$$ .. such runs
$$$ liborm(2012): fixed to read only the seq name, not the full line
General purpose tools¶
Primer3 wrapper¶
File: scrimer/primer3_connector.py
A Python interface to the primer3_core executable.
- TODO: it is not possible to keep a persistent primer3 process
- using subprocess module - communicate() terminates the input stream and waits for the process to finish
Author: Libor Morkovsky 2012
-
class
primer3_connector.
BoulderIO
¶ Provides Python interface for
BoulderIO
format used by Primer3.-
classmethod
deparse
(records)¶ Accepts a dict or a list of dicts, produces a BoulderIO string
(KEY=VAL\n)
with records separated by'=\n'
.
-
classmethod
parse
(string)¶ Parse a BoulderIO string
(KEY=VAL\n)
return a list of records, where each record is a dictionary end of the string implies a single'=\n'
(record separator).
-
classmethod
-
class
primer3_connector.
Primer3
(p3path='primer3_core', **kwargs)¶ Wraps Primer3 executable. kwargs are converted to strings and used as default parameters for each call of primer3 binary.
-
call
(records)¶ Merge each of the records with default_params, the record taking precedence, call the
primer3
binary, parse the output and return a list of dictionaries,{RIGHT:[], LEFT:[], PAIR:[], INTERNAL:[]}
for each input record uppercase keys (in the result) are the original names from BoulderIO format, lowercase keys have no direct equivalent in primer3 output (position
,other-keys
)
-
Convert CIGAR matches to sim4db ‘script’¶
File: scripts/cigar_to_sim4db_scr.py
Script that parses CIGAR
file
produced by aligning fragments of contigs to a genome (tested with smalt
output)
and outputs a ‘script’ for limiting the exon model regions of sim4db
.
Input
- output of some read mapper in
CIGAR
format
- all the fragments must be reported by the aligner
- the fragment names have to be in the same order as the master sequences
- the fragments must be named like readname_number (like
fasta_fragments.py
does)- all the hits from one read must follow each other
Output
- sim4db ‘script’
Algorithm
- load chromosome definition file
- parse the hits: - extract read name - check if hit is in known chromosome (report error otherwise) - for each hit create +-50 KB region clipped to chromosome ends - when a read name different from the previous one is encountered, merge all the regions - output each contiguos region as a script line
Author: Libor Morkovsky, 2012
Count different bases in 5’ end of reads in FASTQ¶
File: scripts/5prime_stats.py
Find the most common letter in first n bases of reads in FASTQ file. Useful for finding and recognizing primer sequences in the reads.