TCGAImporter (v5)

Summary

This module imports data from TCGA by taking in a GDC manifest file, downloading the files listed on that manifest, renaming them to be human-friendly, and compiling them into a GCT file to be computer-friendly.

Remember that you will need to download a manifest file and a metadata file from the GDC data portal (https://portal.gdc.cancer.gov/). To download these two files, follow these instructions: https://github.com/genepattern/TCGAImporter/blob/master/how_to_download_a_manifest_and_metadata.pdf

If you'd like a more comprehensive tutorial of the GDC website, you can find it here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/

Version comments:

  • Version 5.0: Utilizes the new Docker container (0.2)
  • Version 4.0: Improved performance of translating gene names.
  • Version 3.2: Changed the module name (from download_from_gdc to TCGAImporter) and updated the code to read metadata files downloaded after February 2018 (following GDC's metadata reformatting). This change is backwards compatible.

Functionality yet to be implemented:

  • Parse copy number variation

Technical notes:

  • This module has been tested to run in the Docker container genepattern/docker-download-from-gdc:0.2, which has build code b2l3ixgs675rmow9n3dhgfp.
  • To create a conda environment (called GP_dfgdc_env) with the required dependencies, download the requirements.txt file from the GitHub repository genepattern/docker-python36 (the file's URL is https://raw.githubusercontent.com/genepattern/docker-python36/master/requirements.txt) and run these three commands in the folder where requirements.txt is located:

conda create --name GP_dfgdc_env pip
source activate GP_dfgdc_env
pip install -r requirements.txt

Note that you will need to have the GDC download client in the same folder. If you don't know what this means, read more here: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started
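
To illustrate the note above, here is a minimal sketch that reads a manifest and checks that the download client sits in the same folder before building a download command. It is not the module's own code. It assumes the client executable is named gdc-client, that it is invoked as "gdc-client download -m <manifest>", and that the manifest is a tab-separated file with a header row (id, filename, md5, size, state). The function names read_manifest and build_download_command are hypothetical.

```python
import csv
from pathlib import Path


def read_manifest(manifest_path):
    """Return the rows of a GDC manifest as a list of dicts.

    Assumes a tab-separated file with a header row
    (id, filename, md5, size, state).
    """
    with open(manifest_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def build_download_command(manifest_path, client_name="gdc-client"):
    """Build (but do not run) the download command for a manifest.

    Raises FileNotFoundError if the client is not in the same folder
    as the manifest.
    """
    manifest_path = Path(manifest_path)
    client = manifest_path.resolve().parent / client_name
    if not client.exists():
        raise FileNotFoundError(
            "{} not found in {}".format(client_name, client.parent)
        )
    return [str(client), "download", "-m", str(manifest_path)]
```

The returned list can be passed to subprocess.run(..., check=True) to start the download; building the command separately makes it easy to inspect before anything is fetched.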

Parameters

Name                 Description
imanifest *          The relative path of the manifest used to download the data. This file is obtained from the GDC data portal (https://portal.gdc.cancer.gov/).
metadata *           The metadata file obtained from the GDC data portal (https://portal.gdc.cancer.gov/).
output_file_name *   The base name to use for output files. E.g., if you type "TCGA_dataset", the GCT file will be named "TCGA_dataset.gct".
gct *                Whether or not to create a GCT file.
translate_gene_id *  Whether or not to translate ENSEMBL IDs (e.g., ENSG00000012048) to HUGO gene symbols (e.g., BRCA1).
cls *                Whether or not to create a CLS file separating Normal and Tumor classes based on the TCGA Sample ID.

* - required
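
As a rough illustration of what translate_gene_id does, the sketch below maps an ENSEMBL ID to a gene symbol with a lookup table. It is not the module's implementation. The one-entry table, the function name ensembl_to_symbol, the stripping of a ".N" version suffix, and the fallback of returning unknown IDs unchanged are all assumptions made for the example.

```python
def ensembl_to_symbol(ensembl_id, id_to_symbol):
    """Translate an ENSEMBL gene ID to a gene symbol.

    A version suffix such as ".19" is dropped before the lookup.
    IDs missing from the table are returned unchanged (an assumption;
    the module may handle them differently).
    """
    base_id = ensembl_id.split(".")[0]
    return id_to_symbol.get(base_id, ensembl_id)


# Toy lookup table for illustration only; a real table covers all genes.
ID_TO_SYMBOL = {"ENSG00000012048": "BRCA1"}

print(ensembl_to_symbol("ENSG00000012048", ID_TO_SYMBOL))     # BRCA1
print(ensembl_to_symbol("ENSG00000012048.19", ID_TO_SYMBOL))  # BRCA1
```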

Output Files

  1. GCT file (if gct was set to True)
    Contains all the data downloaded from GDC.
  2. TXT files (if gct was set to False)
    Contain the data downloaded from GDC, spread across multiple files.
  3. CLS file (if cls was set to True)
    Contains the classification of the samples into either normal tissue or cancer tissue, based on the TCGA ID.
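
The GCT and CLS outputs listed above can be sketched as follows. This is an illustration under stated assumptions, not the module's code. It assumes the standard TCGA barcode layout, in which the fourth dash-separated field begins with a two-digit sample type code (01-09 tumor, 10-19 normal). It also assumes the GenePattern GCT 1.2 and CLS layouts. The sample IDs, expression values, and function names are made up for the example.

```python
def sample_class(tcga_id):
    """Return "Tumor" or "Normal" from a TCGA barcode.

    Assumes the fourth dash-separated field starts with a two-digit
    sample type code: 01-09 are tumor samples, 10-19 are normal samples.
    """
    code = int(tcga_id.split("-")[3][:2])
    if 1 <= code <= 9:
        return "Tumor"
    if 10 <= code <= 19:
        return "Normal"
    raise ValueError("unsupported sample type code in {!r}".format(tcga_id))


def cls_text(sample_ids):
    """Build the three lines of a CLS file for the given samples."""
    classes = [sample_class(s) for s in sample_ids]
    # Name classes in order of first appearance, so that label 0 always
    # refers to the first name on the second line.
    class_names = []
    for name in classes:
        if name not in class_names:
            class_names.append(name)
    labels = [str(class_names.index(name)) for name in classes]
    lines = [
        "{} {} 1".format(len(sample_ids), len(class_names)),
        "# " + " ".join(class_names),
        " ".join(labels),
    ]
    return "\n".join(lines) + "\n"


def gct_text(sample_ids, rows):
    """Build GCT (version 1.2) text.

    rows is a list of (name, description, values) tuples, with one
    value per sample.
    """
    lines = ["#1.2", "{}\t{}".format(len(rows), len(sample_ids))]
    lines.append("\t".join(["Name", "Description"] + list(sample_ids)))
    for name, description, values in rows:
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Hypothetical sample IDs and values, for illustration only.
samples = ["TCGA-XX-0001-01A", "TCGA-XX-0002-11A"]
print(cls_text(samples), end="")
print(gct_text(samples, [("BRCA1", "ENSG00000012048", [5.0, 7.5])]), end="")
```

Ordering the class names by first appearance is a design choice for this sketch: it keeps the numeric labels on the third line consistent with the names on the second line regardless of which class comes first in the dataset.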

License

TCGAImporter is distributed under a modified BSD license available at https://raw.githubusercontent.com/genepattern/TCGAImporter/master/LICENSE

Platform Dependencies

Task Type:
Download dataset

CPU Type:
any

Operating System:
any

Language:
Python 3.6

Version Comments

Version  Release Date  Description
5        2018-08-06    Fixing small bugs and increasing performance of gene name translation
4        2018-05-16    Renaming the module from download_from_gdc to TCGAImporter
3        2018-04-16    Preparing for prebuild
1        2018-04-16    Initial version