Showing tool doc from version 4.1.4.0 | The latest version is 4.1.4.1

MergeBamAlignment (Picard)

Merge alignment data from a SAM or BAM with data in an unmapped BAM file.

Category Read Data Manipulation


Overview

Summary

A command-line tool that merges alignment data (SAM or BAM) produced by a third-party aligner with the data in an unmapped BAM file. The result is a third file that contains the alignment data from the aligner and all the remaining data from the unmapped BAM. Note that this tool does not combine multiple SAM files into one larger file; for that use case, see MergeSamFiles.

Details

Many alignment tools still require FASTQ input. The unmapped BAM may contain useful information that is lost in the conversion to FASTQ: metadata such as sample alias, library, and barcodes, as well as read-level tags. This tool takes the unaligned BAM with its metadata, together with the aligned BAM produced by running SamToFastq and passing the result to an aligner/mapper. It produces a new SAM or BAM file that includes all aligned and unaligned reads and carries forward the additional read attributes from the unmapped BAM. The resulting file is valid for use by Picard and GATK tools. The output may be coordinate-sorted, in which case the NM, MD, and UQ tags are calculated and populated, or queryname-sorted, in which case these tags are neither calculated nor populated.
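The round trip described above can be sketched as three steps. This is a dry run that only prints the commands. The file names are hypothetical, and bwa mem stands in for "an aligner/mapper" purely as an assumption (this page does not name one), so check that aligner's own documentation for its exact options. The reference is assumed to be already indexed for the aligner.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
# Replace the body with "$@" to run for real (needs java, picard.jar, bwa).
run() { echo "$*"; }

REF=reference_sequence.fasta   # hypothetical file name

# 1. Convert the unmapped BAM to interleaved FASTQ for the aligner.
run java -jar picard.jar SamToFastq \
    I=unmapped.bam \
    FASTQ=reads.fastq \
    INTERLEAVE=true

# 2. Align. bwa mem is an assumed aligner: -p reads interleaved pairs,
#    -o names the output SAM file.
run bwa mem -p -o aligned.sam "$REF" reads.fastq

# 3. Merge the alignments with the metadata kept in the unmapped BAM.
run java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.sam \
    UNMAPPED=unmapped.bam \
    O=merge_alignments.bam \
    R="$REF"
```

The unmapped BAM is read twice: once as the source of the FASTQ, and again in step 3 as the source of the metadata that the FASTQ could not carry.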

Usage example:

 java -jar picard.jar MergeBamAlignment \
      ALIGNED=aligned.bam \
      UNMAPPED=unmapped.bam \
      O=merge_alignments.bam \
      R=reference_sequence.fasta
 

Caveats

This tool has been in development for a long time, and many arguments have been added to it over the years. The following (partial) list covers the ones most likely to be of interest:
  • CLIP_ADAPTERS -- Whether to (soft-)clip the ends of the reads that are identified as belonging to adapters
  • IS_BISULFITE_SEQUENCE -- Whether the sequencing originated from bisulfite sequencing, in which case NM will be calculated differently
  • ALIGNER_PROPER_PAIR_FLAGS -- Whether to trust the "proper pair" flag set by the aligner. When false (the default), the tool sets this flag itself based on orientation and distance between pairs.
  • ADD_MATE_CIGAR -- Whether to use this opportunity to add the MC tag to each read.
  • UNMAP_CONTAMINANT_READS (and MIN_UNCLIPPED_BASES) -- Whether to identify extremely short alignments (with clipping on both sides) as cross-species contamination and unmap the reads.
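As a minimal sketch of how some of these arguments might be combined, the dry run below overrides three documented defaults: it turns off adapter clipping, skips the MC tag, and trusts the aligner's proper-pair flag. The file names are hypothetical and the particular combination is only an illustration, not a recommendation.

```shell
#!/bin/sh
# Dry-run helper: prints the command instead of executing it.
# Replace the body with "$@" to run for real (needs java and picard.jar).
run() { echo "$*"; }

run java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merge_alignments.bam \
    R=reference_sequence.fasta \
    CLIP_ADAPTERS=false \
    ADD_MATE_CIGAR=false \
    ALIGNER_PROPER_PAIR_FLAGS=true
```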

MergeBamAlignment (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

Argument name(s) | Default value | Summary

Required Arguments
--OUTPUT, -O | null | Merged SAM or BAM file to write to.
--REFERENCE_SEQUENCE, -R | null | Reference sequence file.
--UNMAPPED_BAM, -UNMAPPED | null | Original SAM or BAM file of unmapped reads, which must be in queryname order. Reads MUST be unmapped.

Optional Tool Arguments
--ADD_MATE_CIGAR, -MC | true | Adds the mate CIGAR tag (MC) if true, does not if false.
--ALIGNED_BAM, -ALIGNED | [] | SAM or BAM file(s) with alignment data.
--ALIGNED_READS_ONLY | false | Whether to output only aligned reads.
--ALIGNER_PROPER_PAIR_FLAGS | false | Use the aligner's idea of what a proper pair is rather than computing in this program.
--arguments_file | [] | read one or more arguments files and add them to the command line
--ATTRIBUTES_TO_REMOVE | [] | Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.
--ATTRIBUTES_TO_RETAIN | [] | Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.
--ATTRIBUTES_TO_REVERSE, -RV | [OQ, U2] | Attributes on negative strand reads that need to be reversed.
--ATTRIBUTES_TO_REVERSE_COMPLEMENT, -RC | [E2, SQ] | Attributes on negative strand reads that need to be reverse complemented.
--CLIP_ADAPTERS | true | Whether to clip adapters where identified.
--CLIP_OVERLAPPING_READS | true | For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate.
--EXPECTED_ORIENTATIONS, -ORIENTATIONS | [] | The expected orientation of proper read pairs. Replaces JUMP_SIZE
--help, -h | false | display the help message
--INCLUDE_SECONDARY_ALIGNMENTS | true | If false, do not write secondary alignments to output.
--IS_BISULFITE_SEQUENCE | false | Whether the lane is bisulfite sequence (used when calculating the NM tag).
--MATCHING_DICTIONARY_TAGS | [M5, LN] | List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.
--MAX_INSERTIONS_OR_DELETIONS, -MAX_GAPS | 1 | The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.
--MIN_UNCLIPPED_BASES | 32 | If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.
--PRIMARY_ALIGNMENT_STRATEGY | BestMapq | Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.
--PROGRAM_GROUP_COMMAND_LINE, -PG_COMMAND | null | The command line of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_NAME, -PG_NAME | null | The name of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_VERSION, -PG_VERSION | null | The version of the program group (if not supplied by the aligned file).
--PROGRAM_RECORD_ID, -PG | null | The program group ID of the aligner (if not supplied by the aligned file).
--READ1_ALIGNED_BAM, -R1_ALIGNED | [] | SAM or BAM file(s) with alignment data from the first read of a pair.
--READ1_TRIM, -R1_TRIM | 0 | The number of bases trimmed from the beginning of read 1 prior to alignment
--READ2_ALIGNED_BAM, -R2_ALIGNED | [] | SAM or BAM file(s) with alignment data from the second read of a pair.
--READ2_TRIM, -R2_TRIM | 0 | The number of bases trimmed from the beginning of read 2 prior to alignment
--SORT_ORDER, -SO | coordinate | The order in which the merged reads should be output.
--UNMAP_CONTAMINANT_READS, -UNMAP_CONTAM | false | Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.
--UNMAPPED_READ_STRATEGY | DO_NOT_CHANGE | How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination.) Currently ignored unless UNMAP_CONTAMINANT_READS = true. Note that the DO_NOT_CHANGE strategy will actually reset the cigar and set the mapping quality on unmapped reads since otherwise the result will be an invalid record. To force no change use the DO_NOT_CHANGE_INVALID strategy.
--version | false | display the version number for this tool

Optional Common Arguments
--ADD_PG_TAG_TO_READS | true | Add PG tag to each read in a SAM or BAM
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER, -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER, -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.

Advanced Arguments
--showHidden | false | display hidden arguments

Deprecated Arguments
--JUMP_SIZE, -JUMP | null | The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead
--PAIRED_RUN, -PE | true | DEPRECATED. This argument is ignored and will be removed.

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Optional Common Arguments in the table above.


--ADD_MATE_CIGAR / -MC

Adds the mate CIGAR tag (MC) if true, does not if false.

Boolean  true


--ADD_PG_TAG_TO_READS / NA

Add PG tag to each read in a SAM or BAM

boolean  true


--ALIGNED_BAM / -ALIGNED

SAM or BAM file(s) with alignment data.

Exclusion: This argument cannot be used at the same time as READ1_ALIGNED_BAM, READ2_ALIGNED_BAM, R1_ALIGNED, R2_ALIGNED.

List[File]  []


--ALIGNED_READS_ONLY / NA

Whether to output only aligned reads.

boolean  false


--ALIGNER_PROPER_PAIR_FLAGS / NA

Use the aligner's idea of what a proper pair is rather than computing in this program.

boolean  false


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--ATTRIBUTES_TO_REMOVE / NA

Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.

List[String]  []


--ATTRIBUTES_TO_RETAIN / NA

Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.

List[String]  []


--ATTRIBUTES_TO_REVERSE / -RV

Attributes on negative strand reads that need to be reversed.

Set[String]  [OQ, U2]


--ATTRIBUTES_TO_REVERSE_COMPLEMENT / -RC

Attributes on negative strand reads that need to be reverse complemented.

Set[String]  [E2, SQ]
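To make the difference between "reversed" and "reverse complemented" concrete, here is a small sketch on plain strings. The helper names and sample values are hypothetical, and the sketch only mimics the string operations, not how the tool applies them to tags internally.

```shell
#!/bin/sh
# Reverse a string, as a per-base attribute such as OQ (original
# qualities) would need for a read on the negative strand.
reverse() {
    printf '%s\n' "$1" |
        awk '{ s = ""; for (i = length($0); i > 0; i--) s = s substr($0, i, 1); print s }'
}

# Reverse a base string and swap each base for its complement, as a
# base-valued attribute such as E2 would need.
reverse_complement() {
    reverse "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

reverse 'IIIHG#'             # prints #GHIII
reverse_complement 'AACGT'   # prints ACGTT
```

Quality strings only change order, whereas base strings change order and content, which is why the two attribute sets are configured separately.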


--CLIP_ADAPTERS / NA

Whether to clip adapters where identified.

boolean  true


--CLIP_OVERLAPPING_READS / NA

For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate.

boolean  true


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--EXPECTED_ORIENTATIONS / -ORIENTATIONS

The expected orientation of proper read pairs. Replaces JUMP_SIZE

Exclusion: This argument cannot be used at the same time as JUMP_SIZE.

List[PairOrientation]  []


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--help / -h

display the help message

boolean  false


--INCLUDE_SECONDARY_ALIGNMENTS / NA

If false, do not write secondary alignments to output.

boolean  true


--IS_BISULFITE_SEQUENCE / NA

Whether the lane is bisulfite sequence (used when calculating the NM tag).

boolean  false


--JUMP_SIZE / -JUMP

The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead

Exclusion: This argument cannot be used at the same time as EXPECTED_ORIENTATIONS, ORIENTATIONS.

Integer  null


--MATCHING_DICTIONARY_TAGS / NA

List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.

List[String]  [M5, LN]


--MAX_INSERTIONS_OR_DELETIONS / -MAX_GAPS

The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.

int  1  [ [ -∞  ∞ ] ]
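The filter can be pictured with a short sketch that counts insertion and deletion operations in a CIGAR string. It assumes that each I or D operation counts once regardless of its length, which this page does not state explicitly; the helper names and CIGAR strings are hypothetical.

```shell
#!/bin/sh
# Count the I and D operations in a CIGAR string.
# Assumption: an operation counts once, whatever its length.
count_indels() {
    printf '%s' "$1" | tr -cd 'ID' | wc -c | tr -d ' '
}

# Succeed if an alignment with this CIGAR would be kept for the given
# MAX_INSERTIONS_OR_DELETIONS value.
keep_alignment() {
    cigar=$1
    max_gaps=$2
    [ "$max_gaps" -eq -1 ] && return 0   # -1 allows any number
    [ "$(count_indels "$cigar")" -le "$max_gaps" ]
}

keep_alignment 50M2I48M 1 && echo "50M2I48M: kept"
keep_alignment 30M1I20M3D47M 1 || echo "30M1I20M3D47M: ignored"
keep_alignment 30M1I20M3D47M -1 && echo "30M1I20M3D47M: kept with -1"
```

With the default of 1, an alignment containing both an insertion and a deletion is ignored unless the limit is raised or set to -1.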


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]
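A rough back-of-the-envelope calculation shows the trade-off. It assumes, as a simplification, that every full batch of MAX_RECORDS_IN_RAM records is spilled to one temporary file; the record count of 200 million is a hypothetical input size.

```shell
#!/bin/sh
# Ceiling division: how many batches of size $2 are needed for $1 records.
# Assumption: one temporary file (and file handle) per spilled batch.
spill_files() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

spill_files 200000000 500000     # default MAX_RECORDS_IN_RAM: prints 400
spill_files 200000000 2000000    # four times larger: prints 100
```

Quadrupling the setting cuts the estimated number of temporary files to a quarter, at the cost of holding four times as many records in memory.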


--MIN_UNCLIPPED_BASES / NA

If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.

int  32  [ [ -∞  ∞ ] ]


--OUTPUT / -O

Merged SAM or BAM file to write to.

File  null  (required)


--PAIRED_RUN / -PE

DEPRECATED. This argument is ignored and will be removed.

Boolean  true


--PRIMARY_ALIGNMENT_STRATEGY / NA

Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.

The --PRIMARY_ALIGNMENT_STRATEGY argument is an enumerated type (PrimaryAlignmentStrategy), which can have one of the following values:

BestMapq
EarliestFragment
BestEndMapq
MostDistant

PrimaryAlignmentStrategy  BestMapq


--PROGRAM_GROUP_COMMAND_LINE / -PG_COMMAND

The command line of the program group (if not supplied by the aligned file).

String  null


--PROGRAM_GROUP_NAME / -PG_NAME

The name of the program group (if not supplied by the aligned file).

String  null


--PROGRAM_GROUP_VERSION / -PG_VERSION

The version of the program group (if not supplied by the aligned file).

String  null


--PROGRAM_RECORD_ID / -PG

The program group ID of the aligner (if not supplied by the aligned file).

String  null


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--READ1_ALIGNED_BAM / -R1_ALIGNED

SAM or BAM file(s) with alignment data from the first read of a pair.

Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.

List[File]  []


--READ1_TRIM / -R1_TRIM

The number of bases trimmed from the beginning of read 1 prior to alignment

int  0  [ [ -∞  ∞ ] ]


--READ2_ALIGNED_BAM / -R2_ALIGNED

SAM or BAM file(s) with alignment data from the second read of a pair.

Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.

List[File]  []
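When the two reads of each pair were aligned separately, the per-read arguments take the place of ALIGNED_BAM. The dry run below is a sketch with hypothetical file names, using the short names listed on this page in the same KEY=value style as the usage example.

```shell
#!/bin/sh
# Dry-run helper: prints the command instead of executing it.
# Replace the body with "$@" to run for real (needs java and picard.jar).
run() { echo "$*"; }

# R1_ALIGNED and R2_ALIGNED cannot be combined with ALIGNED.
run java -jar picard.jar MergeBamAlignment \
    R1_ALIGNED=read1_aligned.bam \
    R2_ALIGNED=read2_aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merge_alignments.bam \
    R=reference_sequence.fasta
```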


--READ2_TRIM / -R2_TRIM

The number of bases trimmed from the beginning of read 2 prior to alignment

int  0  [ [ -∞  ∞ ] ]


--REFERENCE_SEQUENCE / -R

Reference sequence file.

File  null  (required)


--showHidden / -showHidden

display hidden arguments

boolean  false


--SORT_ORDER / -SO

The order in which the merged reads should be output.

The --SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values:

unsorted
queryname
coordinate
duplicate
unknown

SortOrder  coordinate
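The two sort orders most likely to matter in practice can be sketched side by side. The file names are hypothetical, and pairing CREATE_INDEX=true with the coordinate-sorted run is only an illustration of an option that applies to coordinate-sorted BAM output.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
# Replace the body with "$@" to run for real (needs java and picard.jar).
run() { echo "$*"; }

# Coordinate-sorted output (the default), with a BAM index.
run java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merged.coordinate.bam \
    R=reference_sequence.fasta \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true

# Queryname-sorted output.
run java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merged.queryname.bam \
    R=reference_sequence.fasta \
    SORT_ORDER=queryname
```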


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--UNMAP_CONTAMINANT_READS / -UNMAP_CONTAM

Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.

boolean  false


--UNMAPPED_BAM / -UNMAPPED

Original SAM or BAM file of unmapped reads, which must be in queryname order. Reads MUST be unmapped.

File  null  (required)


--UNMAPPED_READ_STRATEGY / NA

How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination). Currently ignored unless UNMAP_CONTAMINANT_READS = true. Note that the DO_NOT_CHANGE strategy will actually reset the CIGAR and set the mapping quality on unmapped reads, since otherwise the result would be an invalid record. To force no change, use the DO_NOT_CHANGE_INVALID strategy.

The --UNMAPPED_READ_STRATEGY argument is an enumerated type (UnmappingReadStrategy), which can have one of the following values:

COPY_TO_TAG
DO_NOT_CHANGE
DO_NOT_CHANGE_INVALID
MOVE_TO_TAG

UnmappingReadStrategy  DO_NOT_CHANGE
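Because this argument is ignored unless UNMAP_CONTAMINANT_READS is true, the two are normally set together. The dry run below is a sketch with hypothetical file names; COPY_TO_TAG is picked only as one of the listed values, and MIN_UNCLIPPED_BASES is shown at its documented default.

```shell
#!/bin/sh
# Dry-run helper: prints the command instead of executing it.
# Replace the body with "$@" to run for real (needs java and picard.jar).
run() { echo "$*"; }

run java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merge_alignments.bam \
    R=reference_sequence.fasta \
    UNMAP_CONTAMINANT_READS=true \
    MIN_UNCLIPPED_BASES=32 \
    UNMAPPED_READ_STRATEGY=COPY_TO_TAG
```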


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false



GATK version 4.1.4.0 built at Wed, 9 Oct 2019 15:19:59 -0400.