How to concatenate multiple text files and convert MAF to MUT format

 

Dragging and dropping lets you open multiple selected files, such as TCGA mutation annotation format (MAF) files, in IGV. However, you may want to merge multiple files into a single file. The instructions below show how to do this with text-based MAF files and cover two related file manipulations: converting a MAF file to MUT format, and altering a MUT file so that samples display as collapsed tracks.

MAF files from Firebrowse.org download as a folder containing a manifest and an individual maf.txt file for each patient. For example, at the time of this writing the bladder cancer (BLCA) cohort gives 130 individual MAF files, one for each TCGA patient.

The MUT format allows for more predictable file manipulations. IGV treats the first row as a header and, starting with the second row, reads only the first five columns. In order, these columns list chromosome, start location, end location, sample name, and variant classification.

Instead of displaying each sample as a separate track, you can collapse the tracks to display together in a single track or in groups of collapsed tracks. To identify the sample origin of a mutation, click on the mutation to open the Mutation Information Panel, which displays information from the remaining columns of the MUT file, including the column to which the original sample names were moved.

 

Merge multiple .txt files into a single file

Here are some command-line instructions, written in February 2015 by a non-programming user on Mac OS X Mavericks, that merge multiple text files within a folder into a single file. The example uses MAF files, but you can do the same for any text-based file.

  1. Press Command + Spacebar on the Mac to search for the Terminal application. Open it.
  2. At the prompt, assuming the MAF files folder is in your user Downloads folder, type cd ~/Downloads/gdac.broadinstitute.org_BLCA.Mutation_Packager_Calls.. This puts you in the folder containing the MAFs.
  3. Enter cat *.maf.txt >> merged.maf.txt to concatenate all the .maf.txt files into a single file and save it in the same folder. IGV recognizes the .maf.txt extension as a MAF format file. Run the command only once: merged.maf.txt also matches *.maf.txt, and >> appends to an existing file, so delete merged.maf.txt before rerunning.
  4. Test that the file merged correctly by opening it in IGV.
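
For readers who prefer a script to the Terminal, the merge step above can be sketched in Python. This is a minimal sketch, not part of the original instructions: the function name merge_text_files and its parameters are made up for illustration. It skips the output file when collecting inputs, so it is safe to rerun, and, unlike cat, it adds a line break after any file that does not end with one.

```python
from pathlib import Path


def merge_text_files(folder, pattern="*.maf.txt", output_name="merged.maf.txt"):
    """Concatenate every file in `folder` matching `pattern` into `output_name`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    folder = Path(folder)
    output_path = folder / output_name
    # Collect the inputs first, in sorted order, and skip the output file
    # itself because "merged.maf.txt" also matches "*.maf.txt".
    inputs = sorted(p for p in folder.glob(pattern) if p.name != output_name)
    with open(output_path, "w", newline="") as out:
        for path in inputs:
            text = path.read_text()
            out.write(text)
            # Make sure the next file starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return output_path
```

Calling merge_text_files with the path of the folder that holds the MAFs writes merged.maf.txt into that same folder, header rows included.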

Do I need to remove duplicate headers leftover from the merging?

Not for MAF files, but yes for MUT files.
IGV treats only the first row of a MUT file as a header, so any repeated header rows left over from merging must be removed.

Do I need to convert my MAF file to a MUT file to visualize in IGV?

Not for MAF files that originate from TCGA: the first 33 columns are consistent across files, and IGV recognizes the appropriate columns for visualization and overlay onto other tracks.

All other MAF files need individual testing. If the file does not visualize or overlay, the format should be changed to match TCGA conventions or converted to MUT format.

 

Convert MAF to a MUT format file

MAF and MUT formats are described in the links.

Start with the merged file from the previous section, which contains multiple header rows. The following steps first remove the duplicate headers, then rearrange the columns to match the MUT format.

  • If you do not remove duplicate headers, IGV gives a ‘column 2 must be a numeric value’ error.
  • IGV ignores blank columns, so the instructions do not remove them.
  • Columns beyond the five required columns are kept because IGV displays their contents when you click on a mutation in a track window.

The instructions below use the merged TCGA BLCA MAF file as the example.

Import into Excel

  1. Open a new Excel window, then open the merged MAF.TXT file. This prompts Excel to do a Text Import.
  2. Choose Delimited, then click Next >.
  3. On the next screen, click Next >.
  4. On the next screen, highlight the first column (or whichever column contains gene symbols) and select the Text option. This ensures Excel does not convert gene symbols to dates. Click Finish.

Use Excel Sort function to list and delete duplicate header rows together at top

  1. Select the entire sheet. Go to Data>Sort and check My list has headers. Under Column, select Strand, and under Order, select Z to A. Because Strand values in data rows are either + or –, the duplicate header rows, which contain the text Strand in that column, are listed first.
  2. Delete the duplicate header rows, keeping only the first row as the header.
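
The same cleanup can be sketched in Python as an alternative to sorting in Excel. This is an illustrative sketch only: the function name drop_duplicate_headers is made up, and it assumes the first line of the merged file is the header and that every repeated header is an exact copy of that line. Unlike the Excel sort, it leaves the data rows in their original order.

```python
def drop_duplicate_headers(lines):
    """Keep the first line as the header and drop later exact repeats of it.

    Hypothetical helper; assumes line 1 is the header and that repeated
    headers are identical to it apart from the line ending.
    """
    header = None
    kept = []
    for line in lines:
        content = line.rstrip("\r\n")
        if header is None:
            # The first line becomes the reference header.
            header = content
            kept.append(line)
        elif content != header:
            # Any line that differs from the header is a data row.
            kept.append(line)
    return kept


# Example usage (file names are placeholders):
# with open("merged.maf.txt") as src:
#     cleaned = drop_duplicate_headers(src.readlines())
# with open("merged.clean.maf.txt", "w") as dst:
#     dst.writelines(cleaned)
```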

Use Excel to rearrange columns to conform to MUT format

The screenshot represents the starting columns of a conventional TCGA MAF file.

  1. Insert five blank columns at the beginning of the sheet.
  2. Cut and paste the following columns into the five new columns 1–5 in order so it looks like the next screenshot.
    1. Chromosome
    2. Start_position
    3. End_position
    4. Tumor_Sample_Barcode
    5. Variant_Classification

  3. For the given sheet, save as a tab-delimited text file with extension .mut.txt.
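
As a scripted alternative to the Excel steps, the column rearrangement can be sketched in Python. This is a sketch under assumptions, not a definitive converter: the function names are made up, the column names are taken from the list above and must match the file's header exactly, and duplicate header rows are assumed to have been removed already. Unlike the cut-and-paste approach, it does not leave blank columns behind; the remaining columns simply follow the first five.

```python
# Column names as listed in the steps above; matching is exact.
MUT_COLUMNS = [
    "Chromosome",
    "Start_position",
    "End_position",
    "Tumor_Sample_Barcode",
    "Variant_Classification",
]


def maf_rows_to_mut(rows):
    """Reorder rows (header first) so the five MUT columns come first.

    Hypothetical helper. Raises ValueError if a required column is missing.
    Cells beyond the header's width are dropped; short rows are padded.
    """
    header = rows[0]
    first = [header.index(name) for name in MUT_COLUMNS]
    rest = [i for i in range(len(header)) if i not in first]
    order = first + rest
    out = []
    for row in rows:
        padded = row + [""] * (len(header) - len(row))
        out.append([padded[i] for i in order])
    return out


def convert_maf_file(maf_path, mut_path):
    """Read a tab-delimited MAF and write the reordered table to mut_path."""
    with open(maf_path) as src:
        rows = [line.rstrip("\r\n").split("\t") for line in src if line.strip()]
    with open(mut_path, "w") as dst:
        for row in maf_rows_to_mut(rows):
            dst.write("\t".join(row) + "\n")
```

Give the output file the .mut.txt extension, as in the Excel steps, and test it by opening it in IGV.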

Save your Excel worksheets separately in Excel format in case you need to come back and make alterations such as those outlined in the next section.

 

Alter MUT file to display collapsed multi-sample tracks

These instructions retain individual sample barcodes for mutations and allow you to collapse tracks to a single track or groups of tracks.

Start with the merged MUT.txt file from the previous section opened in Excel.

  1. Cut column four, which contains the sample names, and paste it into any blank or new column from column six onward.
  2. If you leave column four blank, IGV loads the MUT.txt file without a sample name. All the mutation data will display as a single collapsed track.
  3. To name the track, fill in column four. For a single track, use the same name in all data rows. Alternatively, to display groups of collapsed tracks, give every row in a group the same name and use a different name for each group.
    1. For example, fill in cell D2 (column 4, row 2) with a name, select D2, mouse over the lower-right corner of the selected cell until the mouse pointer is a black + sign, and double-click. This fills all the remaining rows of the column with the same name.
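
The steps above can also be sketched in Python for a table that is already in MUT column order. This is an illustrative sketch: the function name collapse_samples, the track_name parameter, and the Original_Sample header for the relocated column are all assumptions made for the example. It covers the single-track case; for groups of collapsed tracks, the fixed name would be replaced by a per-sample group name.

```python
def collapse_samples(rows, track_name, sample_header="Original_Sample"):
    """Move column four (sample names) to a new last column and fill
    column four with one shared track name.

    Hypothetical helper; `rows` is a list of lists with the header first,
    already in MUT column order (sample name in column four).
    """
    header, data = rows[0], rows[1:]
    # The header row keeps its column-four label and gains a new last column.
    out = [header + [sample_header]]
    for row in data:
        # Columns 1-3 stay, column 4 becomes the track name, and the
        # original sample name is appended at the end of the row.
        out.append(row[:3] + [track_name] + row[4:] + [row[3]])
    return out
```

Because every data row then carries the same name in column four, IGV shows the mutations together, while the original sample name remains available in the appended column.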

To identify the sample origin of a mutation, click on the mutation to open the Mutation Information Panel, which displays information from the remaining columns of the MUT file.