SraToFastQ (v1)

Converts data in SRA format to FASTQ format using the SRA Toolkit, version 2.3.5-2.

Author: NCBI

Contact:

gp-help@broadinstitute.org

Algorithm Version: 2.3.5-2

Introduction

The Sequence Read Archive (SRA) stores raw sequence data from "next-generation" sequencing technologies including 454, IonTorrent, Illumina, SOLiD, Helicos and Complete Genomics. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence.
 
SRA is NIH’s primary archive of high-throughput sequencing data and is part of the international partnership of archives (INSDC) at the NCBI, the European Bioinformatics Institute and the DNA Database of Japan. Data submitted to any of the three organizations are shared among them.
 
The SraToFastQ module converts SRA data into FASTQ format using the SRAToolkit.

Parameters

Name Description 
accession

A Sequence Read Archive accession

sra file

A Sequence Read Archive (.sra) file

paired end *

Whether the reads are paired end. This determines whether each read is separated into its left and right ends. If yes, the forward and reverse reads are written to two separate files.

Rationale: Some of the reads in SRA are paired-end reads, sequenced from (for example) the left and right ends of a fragment, with an estimated gap size between the ends (i.e., the average length of the fragments being sequenced). It is important to know whether the sequences are paired end for your downstream analysis, because most programs take the pairs into consideration.

This is the --split-files option of fastq-dump.

(source for description & rationale: Edwards Lab)

quality score offset

-Q | --offset <integer> Offset to use for ASCII quality scores. Default is 33 (“!”).

Rationale: Older Illumina pipelines calculated quality scores slightly differently and used a different offset (64 rather than 33). Leave this at the default value for almost all applications.

(source for description & rationale: Edwards Lab)
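To illustrate what the offset means, here is a minimal sketch that converts a FASTQ quality string into numeric Phred scores. The function name `decode_qualities` is hypothetical and is not part of the module or the SRA Toolkit; it only shows the arithmetic behind the offset.

```python
from typing import List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into Phred scores.

    Each character encodes one score as (ASCII code - offset).
    The default offset of 33 matches the module default.
    """
    scores = [ord(ch) - offset for ch in quality_line]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was chosen.
        raise ValueError("quality character below the offset; check the offset")
    return scores


# '!' is ASCII 33, so with the default offset it decodes to a score of 0.
print(decode_qualities("!5I"))
```

Decoding the same string with the wrong offset gives scores that are shifted by the difference between the two offsets, which is why the default should be changed only when the data are known to use a different encoding.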

alignment status
--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat.
--unaligned Dump only unaligned sequences. Dumps all reads for unaligned datasets.

Rationale: If you are looking for reads that map to the human genome, for example, you may want only the aligned or unaligned part. This is optional and up to you.

(source for description & rationale: Edwards Lab)

biological reads only
 --skip-technical Dump only biological reads.

Rationale: If the sequencing was done with the “Illumina multiplex library construction protocol” the SRA entry ends up with application reads and technical reads like this:

Application Read Forward -> Technical Read Forward <- Application Read Reverse - Technical Read Reverse.

You don’t want the technical reads; you only want the biological reads, so include --skip-technical to remove the technical parts. If you omit this option and include --split-files, you actually end up with three or four files per SRA archive!

(source for description & rationale: Edwards Lab)

output compression

You can compress the sequence files using one of two standard compression algorithms, gzip or bzip2. Gzip is probably more widely supported (but only just), and several common downstream programs, such as bowtie2, can read both gzip and bzip2 files directly.

(source for description: Edwards Lab)
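If you need to inspect compressed output yourself, the following sketch reads a plain, gzip-compressed, or bzip2-compressed FASTQ file using only the Python standard library. The helper names `open_fastq` and `count_reads` are hypothetical, and the sketch assumes that compressed files end in .gz or .bz2 and that every record occupies exactly four lines.

```python
import bz2
import gzip


def open_fastq(path: str):
    """Open a FASTQ file in text mode, choosing the reader by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_reads(path: str) -> int:
    """Count records, assuming the usual four lines per FASTQ record."""
    with open_fastq(path) as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4
```

For example, `count_reads("reads.fastq.gz")` would return the number of reads in a gzip-compressed file named reads.fastq.gz (an illustrative file name).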

* - required
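The parameters above correspond to fastq-dump options. The sketch below shows one way the equivalent command line could be assembled. It is an illustration only, not the module's actual implementation: the function `build_fastq_dump_args` and its argument names are hypothetical, and the use of the fastq-dump flags --gzip and --bzip2 for the output compression parameter is an assumption, since this page does not name them.

```python
from typing import List, Optional


def build_fastq_dump_args(
    source: str,                        # accession or path to a .sra file
    paired_end: bool,
    offset: int = 33,
    alignment: Optional[str] = None,    # "aligned", "unaligned", or None
    biological_only: bool = False,
    compression: Optional[str] = None,  # "gzip", "bzip2", or None
) -> List[str]:
    """Assemble a fastq-dump argument list from the module-style parameters."""
    args = ["fastq-dump"]
    if paired_end:
        args.append("--split-files")
    if offset != 33:
        # 33 is the documented default, so the flag is only needed otherwise.
        args += ["--offset", str(offset)]
    if alignment == "aligned":
        args.append("--aligned")
    elif alignment == "unaligned":
        args.append("--unaligned")
    elif alignment is not None:
        raise ValueError("alignment must be 'aligned', 'unaligned', or None")
    if biological_only:
        args.append("--skip-technical")
    if compression == "gzip":
        args.append("--gzip")    # assumed mapping for output compression
    elif compression == "bzip2":
        args.append("--bzip2")   # assumed mapping for output compression
    elif compression is not None:
        raise ValueError("compression must be 'gzip', 'bzip2', or None")
    args.append(source)
    return args


# Paired-end example accession from this page, biological reads only.
print(" ".join(build_fastq_dump_args("ERR497829", True, biological_only=True)))
```

The returned list can be passed to `subprocess.run` on a machine where fastq-dump is installed and on the PATH.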

Output Files

  1. <accession>.fastq - if accession is provided and data is not paired end
  2. <accession>_1.fastq, <accession>_2.fastq - if accession is provided and data is paired end
  3. <sra.file>.fastq - if sra file is provided and data is not paired end
  4. <sra.file>_1.fastq, <sra.file>_2.fastq - if sra file is provided and data is paired end
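
The naming rules in this list can be expressed as a small helper, which may be useful when scripting downstream steps. The function `expected_outputs` is hypothetical. It takes whatever base name you supply (an accession or the sra file name), because this page does not state whether the .sra extension is kept in the output name, and it does not account for any suffix added by output compression.

```python
from typing import List


def expected_outputs(base_name: str, paired_end: bool) -> List[str]:
    """Return the FASTQ file names expected for a given base name."""
    if paired_end:
        # Forward and reverse reads go to separate _1 and _2 files.
        return [f"{base_name}_1.fastq", f"{base_name}_2.fastq"]
    return [f"{base_name}.fastq"]


print(expected_outputs("ERR419271", paired_end=False))
print(expected_outputs("ERR497829", paired_end=True))
```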

Example Data

  1. accession - ERR419271; paired end - no
  2. accession - ERR497829; paired end - yes

Platform Dependencies

Task Type:
RNA-seq

CPU Type:
any

Operating System:
Linux, Mac

Language:
any

Version Comments

Version Release Date Description
1 2014-08-19