Map reads to the scaffold
Map all the reads using smalt
Set up variables:
# data from previous steps
SCAFFOLD=33-scaffold/sc-demo.fasta
INFILES=10-cutadapt/*.fastq
OUT=40-map-smalt
SMALT_IDX=${SCAFFOLD%/*}/smalt/${SCAFFOLD##*/}-k13s4
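The SMALT_IDX assignment uses shell parameter expansion to place the index in a smalt/ subdirectory next to the scaffold; with the values above it expands like this (just an illustration):
# ${SCAFFOLD%/*}  keeps the directory part -> 33-scaffold
# ${SCAFFOLD##*/} keeps the file name      -> sc-demo.fasta
echo $SMALT_IDX
# 33-scaffold/smalt/sc-demo.fasta-k13s4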
Create an index for the scaffold and map the reads. Mapping 3 GB of reads (fastq format) takes about 5 hours with 8 threads on an Intel Xeon E5620, using roughly 0.5 GB of memory per mapping job. This step would benefit from parallelization, even across multiple machines (not implemented here; a possible approach is sketched after the code below).
# create smalt index
mkdir -p ${SMALT_IDX%/*}
smalt index -s 4 $SMALT_IDX $SCAFFOLD
# map each file, smalt is multithreaded so feed the files sequentially
mkdir -p $OUT
parallel -j 1 "smalt map -n $CPUS -p -f sam -o $OUT/{/.}.sam $SMALT_IDX {}" ::: $INFILES
# Illumina reads can be mapped e.g. with bwa
bwa index $SCAFFOLD
parallel -j 1 "bwa mem -t $CPUS $SCAFFOLD {} > $OUT/{/.}.sam" ::: $INFILES
Merge mapping output into a single file
Create a fasta index for the scaffold:
samtools faidx $SCAFFOLD
Create readgroups.txt
According to your sample wet lab details, create a readgroups.txt file. Because samtools merge -r attaches a read group to each alignment (line) in the input according to the original file name, the format is as follows ($BASENAME is the fastq file name without the suffix, $SAMPLE is your biological sample, ${BASENAME%%.*} is the DNA library name, all fields tab separated):
@RG ID:$BASENAME SM:$SAMPLE LB:${BASENAME%%.*} PL:LS454 DS:$SPECIES
The library name (LB) is important for rmdup; the description (DS) is used here to identify the species.
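As an illustration with hypothetical names only, a read group line for a file called G3UKN3Q01.RL1.fastq, sample sample1 and species Genus species would look like this (fields separated by single tabs):
@RG	ID:G3UKN3Q01.RL1	SM:sample1	LB:G3UKN3Q01	PL:LS454	DS:Genus species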
Note
The order of the rows matters for the vcf output: the order of the sample columns is probably the order of first appearance in the @RG lines.
The following code generates most of the readgroups.txt file; you have to reorder the lines and fill in the places marked with ‘??’:
OUT=40-map-smalt
DIR=10-cutadapt
find $DIR -name '*.fastq' | xargs -n1 basename | sed s/.fastq// | awk '{OFS="\t";lb=$0;sub(/\..*$/,"",lb);print "@RG", "ID:" $0, "SM:??", "LB:" lb, "PL:LS454", "DS:??";}' > $OUT/readgroups.txt
# edit the file (ctrl-o enter ctrl-x saves and exits the editor)
nano $OUT/readgroups.txt
# sort the readgroups according to species
<$OUT/readgroups.txt sort -k6,6 > $OUT/readgroups-s.txt
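Optionally, check that no stray spaces crept in while editing; every line generated above should still have six tab separated fields (a suggested check, not part of the original protocol):
# report lines that do not have exactly six tab-separated fields
awk -F'\t' 'NF != 6 {print "check line " NR ": " $0}' $OUT/readgroups-s.txt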
Prepare the sam files
Extract the sequence headers from the first .sam
file (other files should have identical headers):
SAMFILE=$( echo $OUT/*.sam | awk '{print $1;}' )
samtools view -S -t $SCAFFOLD.fai -H $SAMFILE > $OUT/sequences.txt
cat $OUT/sequences.txt $OUT/readgroups-s.txt > $OUT/sam-header.txt
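A quick sanity check of the combined header (suggestion only): the number of @SQ lines should match the number of scaffold sequences, and the number of @RG lines should match the number of input fastq files.
# number of reference sequences and read groups in the combined header
grep -c "^@SQ" $OUT/sam-header.txt
grep -c "^@RG" $OUT/sam-header.txt
# compare with the number of scaffold sequences and input fastq files
wc -l < $SCAFFOLD.fai
ls $INFILES | wc -l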
samtools merge requires sorted alignments; sort them in parallel. This creates .bam files in the output directory:
parallel -j $CPUS "samtools view -but $SCAFFOLD.fai {} | samtools sort - {.}" ::: $OUT/*.sam
Merge
Merge all the alignments. Do not remove duplicates because the duplicate detection algorithm is based on the read properties of genomic DNA ([1], [2]).
samtools merge -ru -h $OUT/sam-header.txt - $OUT/*.bam | samtools sort - $OUT/alldup
samtools index $OUT/alldup.bam
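To verify the merge (a suggested check, not part of the original protocol), confirm that the read groups made it into the merged file and look at the overall flag statistics:
# read groups present in the merged bam
samtools view -H $OUT/alldup.bam | grep -c "^@RG"
# summary of mapped, paired and duplicate reads
samtools flagstat $OUT/alldup.bam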
Check the results
Count the unmapped reads in each .sam file:
parallel -j $CPUS 'echo $( cut -f2 {}|grep -c "^4$" ) {}' ::: $OUT/*.sam
Mapping statistics, the overall ratio of unmapped to mapped reads:
samtools idxstats $OUT/alldup.bam | awk '{map += $3; unmap += $4;} END {print unmap/map;}'
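The unmapped reads can also be counted from the merged file (suggestion only); note that -f 4 selects every alignment with the ‘read unmapped’ bit set, while the grep above matches only flags equal to exactly 4:
# count alignments with the unmapped flag bit set
samtools view -c -f 4 $OUT/alldup.bam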
Calculate coverage sums for viewing the data in IGV:
igvtools count -z 5 -w 25 -e 250 $OUT/alldup.bam $OUT/alldup.bam.tdf ${CONTIGS%.*}.genome
[1] http://seqanswers.com/forums/showthread.php?t=6543
[2] http://seqanswers.com/forums/showthread.php?t=5424