Showing tool doc from version 4.1.4.0 | The latest version is 4.1.4.0

IlluminaBasecallsToSam (Picard)

Transforms raw Illumina sequencing data into an unmapped SAM or BAM file.

The IlluminaBasecallsToSam program collects, demultiplexes, and sorts reads across all of the tiles of a lane by barcode to produce an unmapped SAM/BAM file. An unmapped BAM file is often referred to as a uBAM. All barcode, sample, and library data is provided in the LIBRARY_PARAMS file, which must be formatted according to the specification given in the LIBRARY_PARAMS argument description below. The following is an example of a properly formatted LIBRARY_PARAMS file:

BARCODE_1	OUTPUT	SAMPLE_ALIAS	LIBRARY_NAME
AAAAAAAA	SA_AAAAAAAA.bam	SA_AAAAAAAA	LN_AAAAAAAA
AAAAGAAG	SA_AAAAGAAG.bam	SA_AAAAGAAG	LN_AAAAGAAG
AACAATGG	SA_AACAATGG.bam	SA_AACAATGG	LN_AACAATGG
N	SA_non_indexed.bam	SA_non_indexed	LN_NNNNNNNN
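
A LIBRARY_PARAMS file like the example above can be generated rather than typed by hand. The following is a minimal Python sketch, not part of Picard. The helper name write_library_params and the convention of naming each output file after its sample alias are assumptions made for illustration.

```python
import csv
import io

# Column order taken from the example LIBRARY_PARAMS file above.
COLUMNS = ["BARCODE_1", "OUTPUT", "SAMPLE_ALIAS", "LIBRARY_NAME"]


def write_library_params(samples, handle):
    """Write one tab-separated row per (barcode, sample_alias, library_name).

    Naming each output file "<sample_alias>.bam" is an assumption of this sketch.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for barcode, sample_alias, library_name in samples:
        writer.writerow([barcode, sample_alias + ".bam", sample_alias, library_name])


samples = [
    ("AAAAAAAA", "SA_AAAAAAAA", "LN_AAAAAAAA"),
    ("AAAAGAAG", "SA_AAAAGAAG", "LN_AAAAGAAG"),
    ("AACAATGG", "SA_AACAATGG", "LN_AACAATGG"),
    ("N", "SA_non_indexed", "LN_NNNNNNNN"),  # row for reads with no barcode match
]

buffer = io.StringIO()
write_library_params(samples, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass a handle opened with open("library.params", "w", newline="") instead of the in-memory buffer.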

The _barcode.txt files in BARCODES_DIR are produced by the ExtractIlluminaBarcodes tool for each lane of a flow cell.

Usage example:

java -jar picard.jar IlluminaBasecallsToSam \
BASECALLS_DIR=/BaseCalls/ \
LANE=001 \
READ_STRUCTURE=25T8B25T \
RUN_BARCODE=run15 \
IGNORE_UNEXPECTED_BARCODES=true \
LIBRARY_PARAMS=library.params

Category Base Calling


Overview

IlluminaBasecallsToSam transforms a lane of Illumina data file formats (bcl, locs, clocs, qseqs, etc.) into SAM or BAM file format.

In this application, barcode data is read from Illumina data file groups, each of which is associated with a tile. Each tile may contain data for any number of barcodes, and a single barcode's data may span multiple tiles. Once the barcode data is collected from files, each barcode's data is written to its own SAM/BAM. The barcode data must be written in order; this means that barcode data from each tile is sorted before it is written to file, and that if a barcode's data does span multiple tiles, data collected from each tile must be written in the order of the tiles themselves.

This class employs a number of private subclasses to achieve this goal. The TileReadAggregator controls the flow of operation. It is fed a number of Tiles, which it uses to spawn TileReaders. TileReaders are responsible for reading Illumina data for their respective tiles from disk, and as they collect that data, it is fed back into the TileReadAggregator. When a TileReader completes a tile, it notifies the TileReadAggregator, which reviews what was read and conditionally queues its writing to disk, bearing in mind the write-order requirements described in the previous paragraph. As writes complete, the TileReadAggregator re-evaluates the state of reads/writes and may queue more writes. When all barcodes for all tiles have been written, the TileReadAggregator shuts down.

The TileReadAggregator controls task execution using a specialized ThreadPoolExecutor. It accepts special Runnables of type PriorityRunnable which allow a priority to be assigned to the runnable. When the ThreadPoolExecutor is assigning threads, it gives priority to those PriorityRunnables with higher priority values. In this application, TileReaders are assigned lowest priority, and write tasks are assigned high priority. It is designed in this fashion to minimize the amount of time data must remain in memory (write the data as soon as possible, then discard it from memory) while maximizing CPU usage.
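
The scheduling idea in the paragraph above (write tasks run before read tasks whenever both are waiting) can be illustrated with a small priority-ordered worker pool. This is a Python sketch of the general technique, not Picard's Java implementation. The class name PriorityPool and the numeric priority values are hypothetical.

```python
import itertools
import queue
import threading

READ_PRIORITY = 1    # hypothetical value: tile reads get the lowest priority
WRITE_PRIORITY = 10  # hypothetical value: writes get a higher priority


class PriorityPool:
    """Worker pool that always runs the highest-priority waiting task next."""

    def __init__(self, num_workers):
        self._tasks = queue.PriorityQueue()
        # Unique counter: keeps submission order within a priority and
        # guarantees that the task callables themselves are never compared.
        self._counter = itertools.count()
        self._workers = [threading.Thread(target=self._run) for _ in range(num_workers)]

    def submit(self, priority, task):
        # PriorityQueue pops the smallest item first, so negate the priority.
        self._tasks.put((-priority, next(self._counter), task))

    def start(self):
        for worker in self._workers:
            worker.start()

    def shutdown(self):
        # One sentinel per worker, sorted after every real task.
        for _ in self._workers:
            self._tasks.put((float("inf"), next(self._counter), None))
        for worker in self._workers:
            worker.join()

    def _run(self):
        while True:
            _, _, task = self._tasks.get()
            if task is None:
                return
            task()


order = []
pool = PriorityPool(num_workers=1)
pool.submit(READ_PRIORITY, lambda: order.append("read tile 1"))
pool.submit(WRITE_PRIORITY, lambda: order.append("write barcode A"))
pool.submit(READ_PRIORITY, lambda: order.append("read tile 2"))
pool.submit(WRITE_PRIORITY, lambda: order.append("write barcode B"))
pool.start()
pool.shutdown()
print(order)
```

With a single worker and all tasks queued before the pool starts, both writes run before either read, which mirrors the goal of flushing data from memory as soon as possible.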

IlluminaBasecallsToSam (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s) Default value Summary
Required Arguments
--BARCODE_PARAMS
null Deprecated (use LIBRARY_PARAMS). Tab-separated file for creating all output BAMs for a barcoded run with a single IlluminaBasecallsToSam invocation. Columns are BARCODE, OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME. The row with BARCODE=N is used to specify a file for reads with no barcode match.
--BASECALLS_DIR
 -B
null The basecalls directory.
--LANE
 -L
null Lane number.
--LIBRARY_PARAMS
null Tab-separated file for creating all output BAMs for a lane with a single IlluminaBasecallsToSam invocation. The columns are OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, and BARCODE_1, BARCODE_2 ... BARCODE_X, where X is the number of barcodes per cluster (the barcode columns are optional). The row with BARCODE_1 set to 'N' is used to specify a file for reads with no barcode match. You may also provide any 2-letter RG header attributes (excluding PU, CN, PL, and DT) as columns in this file; the values in those columns will be inserted into the RG tag of the BAM file created for a given row.
--OUTPUT
 -O
null Deprecated (use LIBRARY_PARAMS). The output SAM or BAM file. Format is determined by extension.
--READ_STRUCTURE
 -RS
null A description of the logical structure of clusters in an Illumina run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for sample barcode, M for molecular barcode, T for template, and S for skip). For example, if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; and read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.
--RUN_BARCODE
null The barcode of the run. Prefixed to read names.
--SAMPLE_ALIAS
 -ALIAS
null Deprecated (use LIBRARY_PARAMS). The name of the sequenced sample
--SEQUENCING_CENTER
null The name of the sequencing center that produced the reads. Used to set the @RG->CN header tag.
Optional Tool Arguments
--ADAPTERS_TO_CHECK
[INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] Which adapters to look for in the read.
--APPLY_EAMSS_FILTER
true Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.
--arguments_file
[] read one or more arguments files and add them to the command line
--BARCODE_POPULATION_STRATEGY
ORPHANS_ONLY When should the sample barcode (as read by the sequencer) be placed on the reads in the BC tag?
--BARCODES_DIR
 -BCD
null The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.
--FIRST_TILE
null If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.
--FIVE_PRIME_ADAPTER
null For specifying adapters other than standard Illumina
--FORCE_GC
true If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.
--help
 -h
false display the help message
--IGNORE_UNEXPECTED_BARCODES
 -IGNORE_UNEXPECTED
false Whether to ignore reads whose barcodes are not found in LIBRARY_PARAMS. Useful when outputting BAMs for only a subset of the barcodes in a lane.
--INCLUDE_BARCODE_QUALITY
false Should the barcode quality be included when the sample barcode is included?
--INCLUDE_BC_IN_RG_TAG
false Whether to include the barcode information in the @RG->BC header tag. Defaults to false until included in the SAM spec.
--INCLUDE_NON_PF_READS
 -NONPF
true Whether to include non-PF reads
--LIBRARY_NAME
 -LIB
null Deprecated (use LIBRARY_PARAMS). The name of the sequenced library
--MAX_READS_IN_RAM_PER_TILE
1200000 Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices.
--MINIMUM_QUALITY
2 The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what the Illumina spec describes as the minimum, but in practice lower values have been observed.
--MOLECULAR_INDEX_BASE_QUALITY_TAG
QX The tag to use to store any molecular index base qualities. If more than one molecular index is found (i.e. the READ_STRUCTURE contains more than one "M" operator), their qualities will be concatenated and stored here.
--MOLECULAR_INDEX_TAG
RX The tag to use to store any molecular indexes. If more than one molecular index is found, they will be concatenated and stored here.
--NUM_PROCESSORS
0 The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads will be the number of cores available on the machine less the absolute value of NUM_PROCESSORS.
--PLATFORM
ILLUMINA The name of the sequencing technology that produced the read.
--PROCESS_SINGLE_TILE
null If set, process only the tile number given and prepend the tile number to the output file name.
--READ_GROUP_ID
 -RG
null ID used to link RG header record with RG tag in SAM record. If these are unique in SAM files that get merged, merge performance is better. If not specified, READ_GROUP_ID will be set to <first 5 chars of RUN_BARCODE>.<LANE>.
--RUN_START_DATE
null The start date of the run.
--TAG_PER_MOLECULAR_INDEX
[] The list of tags to store each molecular index. The number of tags should match the number of molecular indexes.
--THREE_PRIME_ADAPTER
null For specifying adapters other than standard Illumina
--TILE_LIMIT
null If set, process no more than this many tiles (used for debugging).
--version
false display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL
5 Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX
false Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE
false Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS
client_secrets.json Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM
500000 When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET
false Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE
 -R
null Reference sequence file.
--TMP_DIR
[] One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER
 -use_jdk_deflater
false Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER
 -use_jdk_inflater
false Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY
STRICT Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--ADAPTERS_TO_CHECK / NA

Which adapters to look for in the read.

List[IlluminaAdapterPair]  [INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM]


--APPLY_EAMSS_FILTER / NA

Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.

boolean  true


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--BARCODE_PARAMS / NA

Deprecated (use LIBRARY_PARAMS). Tab-separated file for creating all output BAMs for a barcoded run with a single IlluminaBasecallsToSam invocation. Columns are BARCODE, OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME. The row with BARCODE=N is used to specify a file for reads with no barcode match.

Exclusion: This argument cannot be used at the same time as OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, LIBRARY_PARAMS.

R File  null


--BARCODE_POPULATION_STRATEGY / NA

When should the sample barcode (as read by the sequencer) be placed on the reads in the BC tag?

The --BARCODE_POPULATION_STRATEGY argument is an enumerated type (PopulateBarcode), which can have one of the following values:

ORPHANS_ONLY
INEXACT_MATCH
ALWAYS

PopulateBarcode  ORPHANS_ONLY


--BARCODES_DIR / -BCD

The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.

File  null


--BASECALLS_DIR / -B

The basecalls directory.

R File  null


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--FIRST_TILE / NA

If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.

Exclusion: This argument cannot be used at the same time as PROCESS_SINGLE_TILE.

Integer  null


--FIVE_PRIME_ADAPTER / NA

For specifying adapters other than standard Illumina

String  null


--FORCE_GC / NA

If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.

Boolean  true


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--help / -h

display the help message

boolean  false


--IGNORE_UNEXPECTED_BARCODES / -IGNORE_UNEXPECTED

Whether to ignore reads whose barcodes are not found in LIBRARY_PARAMS. Useful when outputting BAMs for only a subset of the barcodes in a lane.

boolean  false


--INCLUDE_BARCODE_QUALITY / NA

Should the barcode quality be included when the sample barcode is included?

boolean  false


--INCLUDE_BC_IN_RG_TAG / NA

Whether to include the barcode information in the @RG->BC header tag. Defaults to false until included in the SAM spec.

boolean  false


--INCLUDE_NON_PF_READS / -NONPF

Whether to include non-PF reads

boolean  true


--LANE / -L

Lane number.

R Integer  null


--LIBRARY_NAME / -LIB

Deprecated (use LIBRARY_PARAMS). The name of the sequenced library

Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.

String  null


--LIBRARY_PARAMS / NA

Tab-separated file for creating all output BAMs for a lane with a single IlluminaBasecallsToSam invocation. The columns are OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, and BARCODE_1, BARCODE_2 ... BARCODE_X, where X is the number of barcodes per cluster (the barcode columns are optional). The row with BARCODE_1 set to 'N' is used to specify a file for reads with no barcode match. You may also provide any 2-letter RG header attributes (excluding PU, CN, PL, and DT) as columns in this file; the values in those columns will be inserted into the RG tag of the BAM file created for a given row.

Exclusion: This argument cannot be used at the same time as OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, BARCODE_PARAMS.

R File  null


--MAX_READS_IN_RAM_PER_TILE / NA

Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices.

int  1200000  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--MINIMUM_QUALITY / NA

The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what the Illumina spec describes as the minimum, but in practice lower values have been observed.

int  2  [ [ -∞  ∞ ] ]


--MOLECULAR_INDEX_BASE_QUALITY_TAG / NA

The tag to use to store any molecular index base qualities. If more than one molecular index is found (i.e. the READ_STRUCTURE contains more than one "M" operator), their qualities will be concatenated and stored here.

String  QX


--MOLECULAR_INDEX_TAG / NA

The tag to use to store any molecular indexes. If more than one molecular index is found, they will be concatenated and stored here.

String  RX


--NUM_PROCESSORS / NA

The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads will be the number of cores available on the machine less the absolute value of NUM_PROCESSORS.

Integer  0  [ [ -∞  ∞ ] ]
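
The three cases of NUM_PROCESSORS can be written out as a small function. This is a Python sketch of the rule as described, not the tool's code. The function name resolve_num_threads is hypothetical, and it assumes that a negative value reserves that many cores (so -2 on an 8-core machine yields 6 threads).

```python
import os


def resolve_num_threads(num_processors, available_cores):
    """Return the thread count implied by a NUM_PROCESSORS setting."""
    if num_processors == 0:
        # Use every available core.
        return available_cores
    if num_processors < 0:
        # Assumption: leave abs(num_processors) cores free.
        return available_cores + num_processors
    # A positive value is used as given.
    return num_processors


cores = os.cpu_count() or 1  # os.cpu_count() may return None
for setting in (0, -1, 4):
    print(setting, resolve_num_threads(setting, cores))
```

The sketch does not guard against a negative setting whose magnitude is at least the core count, which would give a non-positive result.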


--OUTPUT / -O

Deprecated (use LIBRARY_PARAMS). The output SAM or BAM file. Format is determined by extension.

Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.

R File  null


--PLATFORM / NA

The name of the sequencing technology that produced the read.

String  ILLUMINA


--PROCESS_SINGLE_TILE / NA

If set, process only the tile number given and prepend the tile number to the output file name.

Exclusion: This argument cannot be used at the same time as FIRST_TILE.

Integer  null


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--READ_GROUP_ID / -RG

ID used to link RG header record with RG tag in SAM record. If these are unique in SAM files that get merged, merge performance is better. If not specified, READ_GROUP_ID will be set to <first 5 chars of RUN_BARCODE>.<LANE>.

String  null


--READ_STRUCTURE / -RS

A description of the logical structure of clusters in an Illumina run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for sample barcode, M for molecular barcode, T for template, and S for skip). For example, if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads:

* read one with 28 cycles (bases) of template
* read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode)
* read three with 8 cycles (bases) of sample barcode
* 8 cycles (bases) skipped
* read four with 28 cycles (bases) of template

The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.

R String  null
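
To make the READ_STRUCTURE notation concrete, the following Python sketch splits a read structure string into its (cycle count, type) segments. It is an illustration only and is not how Picard parses the argument. The function name parse_read_structure is hypothetical.

```python
import re

# One segment is a cycle count followed by a type letter:
# T = template, B = sample barcode, M = molecular barcode, S = skip.
SEGMENT = re.compile(r"(\d+)([TBMS])")


def parse_read_structure(read_structure):
    """Split e.g. "28T8M8B8S28T" into [(28, "T"), (8, "M"), ...]."""
    segments = []
    position = 0
    for match in SEGMENT.finditer(read_structure):
        if match.start() != position:
            # Characters between segments that are not part of any segment.
            raise ValueError("invalid read structure: " + read_structure)
        segments.append((int(match.group(1)), match.group(2)))
        position = match.end()
    if not segments or position != len(read_structure):
        raise ValueError("invalid read structure: " + read_structure)
    return segments


segments = parse_read_structure("28T8M8B8S28T")
total_cycles = sum(length for length, _ in segments)
reads_kept = [segment for segment in segments if segment[1] != "S"]
print(segments, total_cycles, len(reads_kept))
```

For the example string, the segment lengths add up to the 80 cycles of the cluster, and dropping the skipped segment leaves the four reads listed above.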


--REFERENCE_SEQUENCE / -R

Reference sequence file.

File  null


--RUN_BARCODE / NA

The barcode of the run. Prefixed to read names.

R String  null


--RUN_START_DATE / NA

The start date of the run.

Date  null


--SAMPLE_ALIAS / -ALIAS

Deprecated (use LIBRARY_PARAMS). The name of the sequenced sample

Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.

R String  null


--SEQUENCING_CENTER / NA

The name of the sequencing center that produced the reads. Used to set the @RG->CN header tag.

R String  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--TAG_PER_MOLECULAR_INDEX / NA

The list of tags to store each molecular index. The number of tags should match the number of molecular indexes.

List[String]  []


--THREE_PRIME_ADAPTER / NA

For specifying adapters other than standard Illumina

String  null


--TILE_LIMIT / NA

If set, process no more than this many tiles (used for debugging).

Integer  null


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false


GATK version 4.1.4.0 built at Wed, 9 Oct 2019 15:19:59 -0400.