Detect and choose variants

Call variants with samtools

Set up input and output for the current step:


# outputs
mkdir -p $OUT

Run the variant calling in parallel. Takes 3 hours for 15 samples on a single Intel Xeon E5620: splitchr $SCAFFOLD.fai | parallel -j $CPUS "samtools mpileup -DSu -L 10000 -f $SCAFFOLD -r {} $ALIGNS | bcftools view -bvcg - > $OUT/part-{}.bcf"

# samtools > 0.1.19 splitchr $SCAFFOLD.fai | parallel -j $CPUS "samtools mpileup -u -t DP,SP -L 10000 -f $SCAFFOLD -r {} $ALIGNS | bcftools call -O b -vm - > $OUT/part-{}.bcf"

Merge the intermediate results. is used to generate the correct ordering of parts, as in the .fai: splitchr $SCAFFOLD.fai | xargs -i echo $OUT/part-{}.bcf | xargs -x bcftools cat > $VARIANTS-raw.bcf
bcftools index $VARIANTS-raw.bcf

# samtools > 0.1.19
# workaround for splitchr $SCAFFOLD.fai | parallel "bcftools view -O z $OUT/part-{}.bcf > $OUT/{.}.vcf.gz && bcftools index $OUT/{.}.vcf.gz"
bcftools concat -O b $OUT/*.vcf.gz > $VARIANTS-raw.bcf
bcftools index $VARIANTS-raw.bcf

Remove the intermediate results, if the merge was ok: splitchr $SCAFFOLD.fai | xargs -i echo $OUT/part-{}.bcf | xargs rm

Call variants with FreeBayes

This is an alternative to the previous section. FreeBayes uses local realignment around INDELs, so the variant calling for 454 data should be better.


GNAME=$( echo ${SCAFFOLD##*/} | cut -d. -f1 )
mkdir -p $OUT splitchr $SCAFFOLD.fai | parallel -j $CPUS "freebayes -f $SCAFFOLD -r {} $ALIGNS > $OUT/${GNAME}-{}.vcf"

# join the results

# headers
FILE=$( find $OUT -name ${GNAME}-*.vcf | sort | head -1 )
<$FILE egrep '^#' > $OFILE
# the rest in .fai order splitchr $SCAFFOLD.fai | parallel -j 1 "egrep -v '^#' $OUT/${GNAME}-{}.vcf >> $OFILE"

# filter the variants on quality (ignore the warning messages)
<$OFILE bcftools view -i 'QUAL > 20' -O z > $OUT/variants-qual.vcf.gz
tabix -p vcf $OUT/variants-qual.vcf.gz

Filter variants

We’re interested in two kinds of variant qualities

  • all possible variants so they can be avoided in primer design
  • high confidence variants that can be used to answer our questions

Filtering strategy:

  • use predefined samtools filtering
    • indels caused by 454 homopolymer problems generally have low quality scores, so they should be filtered at this stage
  • remove uninteresting information (for convenient viewing in IGV)
    • overall low coverage sites (less than 3 reads per sample - averaged, to avoid discarding some otherwise interesting information because of one bad sample)
  • select the interesting variants, leave the rest in the file flagged as ‘uninteresting’
    • only SNPs
    • at least 3 reads per sample
    • no shared variants between the two species

Samtools filtering

Quite high strand bias in RNASeq data can be expected, so don’t filter on strand bias (-1 0). Use the defaults for other settings of vcfutils varFilter command:

  • minimum RMS mapping quality for SNPs [10]
  • minimum read depth [2]
  • maximum read depth [10000000]
  • minimum number of alternate bases [2]
  • SNP within INT bp around a gap to be filtered [3]
  • window size for filtering adjacent gaps [10]
  • min P-value for baseQ bias [1e-100]
  • min P-value for mapQ bias [0]
  • min P-value for end distance bias [0.0001]
  • FLOAT min P-value for HWE (plus F<0) [0.0001]
bcftools view $VARIANTS-raw.bcf | varFilter -1 0 | bgzip > $VARIANTS-filtered.vcf.gz
tabix -p vcf $VARIANTS-filtered.vcf.gz

Convenience filtering

The required average depth per sample can be adjusted here. Using pv as a progress meter. pv can be substituted by cat:

# filter on average read depth and site quality
pv -p $VCFINPUT | bgzip -d | --no-filtered - avg-dps --avg-depth-per-sample 5 sq --site-quality 30 | bgzip > $VCFOUTPUT

# samtools > 0.1.19 produce conflicting info tags, get rid of it if the above filtering fails
pv -p $VCFINPUT | bgzip -d | sed 's/,Version="3"//' | --no-filtered - avg-dps --avg-depth-per-sample 5 sq --site-quality 30 | bgzip > $VCFOUTPUT

# freebayes output
zcat $OUT/variants-qual.vcf.gz| varFilter -1 0 | --no-filtered - avg-dps --avg-depth-per-sample 5 sq --site-quality 30 | bgzip > $OUT/demo-filt1.vcf.gz

Interesting variants

The filtered out variants are kept in the file, only marked as filtered out. This way both the selected and non-selected variants can be checked in IGV. Required minumum depth per sample can be adjusted here:

# samtools files

# freebayes files

<$VCFINPUT pv -p | zcat | - dps --depth-per-sample 3 snp-only contrast-samples --sample-names lu02 lu14 lu15 | bgzip > $VCFOUTPUT
tabix -p vcf $VCFOUTPUT

Check the results

Extract calculated variant qualities, so the distribution can be checked (-> common power law distribution, additional peak at 999):

zcat $VCFINPUT | grep -v '^#' | cut -f6 > $VCFINPUT.qual
# and visualize externally

# or directly in terminal, using data_hacks tools
zcat $VCFINPUT | grep -v '^#' | cut -f6 | -b 30

Count selected variants:

zcat -d $VCFOUTPUT | grep -c PASS

Count variants on chromosome Z:

zcat -d $VCFOUTPUT | grep PASS | grep -c ^chrZ