RNAseq short variant discovery (SNPs + Indels)
Best Practices Workflows | Created 2018-01-09 | Last updated 2019-06-03


Comments (8)

Purpose

Identify short variants (SNPs and Indels) in RNAseq data.



Reference Implementations

Pipeline Summary Notes Github Terra
RNAseq short variant per-sample calling BAM to VCF universal (expected) yes TBD

Expected input

This workflow is designed to operate on a set of samples one sample at a time; joint calling RNAseq is not supported.


Main Steps

Mapping to the Reference

Tools involved: STAR

We begin with mapping RNA reads to a reference, we recommend using STAR aligner because it increased sensitivity (especially for INDELS). We use STAR’s two-pass mode to get better alignments around novel splice junctions.

Data Cleanup

Tools involved: MergeBamAlignment, MarkDuplicates

We use MergeBamAlignment and MarkDuplicates (similarly to our DNA best practices pipelines)

SplitNCigarReads

Tools involved: SplitNCigarReads

Because RNA aligners have different conventions than DNA aligners, we need reformat some of the alignments that span introns for HaplotypeCaller. This step splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default this step also reassigns mapping qualities for good alignments to match DNA conventions.

Base Quality Recalibration

Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)

This step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.

Variant Calling

Tools involved: HaplotypeCaller

HaplotypeCaller doesn’t need any specific changes to run with RNA once the bam has been run through SplitNCigarReads. We do adjust the minimum phred-scaled confidence threshold for calling variants to 20, but this value will depend on your specific use case.

Variant Filtering

Tools involved: VariantFiltration

We recommend specific hard filters, since VQSR and CNNScoreVariants require truth data for training that we don’t yet have for RNA.



Return to top Comment on this article