Genomic data submission - meta-data, files and file formats

Overview

This document describes how to submit genomic data to CMDGA. It covers genomic experiments, genomic annotations, genomic perturbations and single cell embeddings, including the meta-data fields and data files expected for each and how those files should be formatted.

Genomic experiments describe molecular assays performed by a research group, for example RNA-seq, ATAC-seq, scRNA-seq, snATAC-seq, Hi-C, STARR-seq, Perturb-seq and ChIP-seq.

Genomic annotations describe analyses derived from experiments, often combining multiple experiments, for example chromatin states, accessible chromatin sites, ABC links and gene expression levels.

Genomic perturbations describe manipulations of gene or genomic activity, for example shRNA gene knockdown, CRISPR knockout, CRISPRi and gene over-expression.

Single cell embeddings describe intermediate files involving data combined across multiple single cell experiments.

If your data type is not included in this document, please contact us and we will add it.

 

IMPORTANT - Data sharing restrictions

Before submitting data to CMDGA, check whether your data are subject to data sharing restrictions.

If you plan to share data files that contain sensitive information, a Material Transfer Agreement (MTA) will need to be completed and signed before these data are submitted. These data will be stored in CMDGA but will not be made available for download or external access.

If there are no restrictions on data sharing then the raw data can be shared with CMDGA if desired.

Data types that contain individual-level information, and may therefore be covered by data sharing restrictions, include raw sequence reads, genotyping data and sequence alignments. These files are highlighted in orange. For sequence alignments, the sequence can be removed from the BAM file, or the variant bases masked, at which point the alignments can be shared without restriction.

Data types that do not contain individual-level information, and can therefore be shared regardless of data sharing restrictions, are highlighted in blue.
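As an illustration of removing sequence from alignments, the sketch below blanks the SEQ and QUAL fields of SAM-format text records (the text form of a BAM file). It relies on the SAM convention that `*` marks an unstored sequence or quality string. Optional tags can also carry base-level information (for example, MD records mismatched reference bases), so the sketch lets the caller drop named tags. The function name `strip_sequence` and the default tag list are hypothetical, and this is not necessarily the procedure CMDGA expects.

```python
def strip_sequence(sam_line, drop_tags=("MD",)):
    """Return a SAM record with SEQ and QUAL replaced by '*'.

    Header lines (starting with '@') are returned unchanged.
    `drop_tags` is an illustrative choice of optional tags to remove.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # header lines carry no read sequence
    fields = line.split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment record needs at least 11 fields")
    fields[9] = "*"   # SEQ: read sequence
    fields[10] = "*"  # QUAL: base qualities
    mandatory, optional = fields[:11], fields[11:]
    # Optional fields look like TAG:TYPE:VALUE; keep those not listed in drop_tags.
    kept = [tag for tag in optional if tag.split(":", 1)[0] not in drop_tags]
    return "\t".join(mandatory + kept)


# Made-up record for illustration only.
record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:1\tMD:Z:2A1"
print(strip_sequence(record))
```

Masking only the bases at known variant positions would additionally require the variant coordinates and the CIGAR string of each read, which this sketch does not attempt.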

Genomic experiments

RNA-seq

Description:  RNA-seq experiments provide information on the abundance of gene transcripts.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads 

Sequence alignment - BAM file of aligned reads (unfiltered, filtered).  If data are protected, alignments can be stripped of sequence, or of sequence at known variants only.

Mapped read file - tagAlign file of reads (unfiltered, filtered and de-duplicated)

Gene quantifications - TSV file of quantified gene expression

Transcript quantifications - TSV file of quantified transcript expression

Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.

File formats: CMDGA - genomic experiment file formats
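The two read depth signal tracks listed above can be illustrated with a short calculation. The sketch below assumes that the p-value is the upper-tail probability P(X >= count) under a Poisson distribution whose mean is the expected background count; how the count and the background are obtained for each genomic position depends on the processing pipeline and is not specified here. The function names are hypothetical.

```python
import math


def fold_enrichment(count, background):
    """Ratio of the observed count to the expected background count."""
    if background <= 0:
        raise ValueError("background must be positive")
    return count / background


def neg_log10_poisson_pvalue(count, background):
    """-log10 of P(X >= count) for X ~ Poisson(background)."""
    if background <= 0:
        raise ValueError("background must be positive")
    if count <= 0:
        return 0.0  # P(X >= 0) is 1, so -log10 is 0
    # First upper-tail term: the Poisson pmf at `count`, computed in log space.
    term = math.exp(-background + count * math.log(background)
                    - math.lgamma(count + 1))
    total, k = 0.0, count
    # Add successive pmf terms until they no longer change the sum.
    while term > 0.0 and term > total * 1e-16:
        total += term
        k += 1
        term *= background / k
    if total == 0.0:
        return math.inf  # p-value underflowed double precision
    return -math.log10(min(total, 1.0))


# Example: 20 reads observed where 5 are expected from background.
print(fold_enrichment(20, 5.0))
print(neg_log10_poisson_pvalue(20, 5.0))
```

Summing the upper tail directly, rather than computing one minus the lower tail, keeps small p-values from being rounded to zero.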

ATAC-seq/DHS-seq

Description:  ATAC-seq and DHS-seq experiments provide information on regions of accessible chromatin.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads 

Sequence alignment - BAM file of aligned reads (unfiltered and filtered).  If data are protected, alignments can be stripped of sequence, or of sequence at known variants only.

Mapped read file - tagAlign file of aligned reads  (unfiltered, filtered and de-duplicated)

Peak calls - BED file of peak calls

Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.

File formats: CMDGA - genomic experiment file formats

 

ChIP-seq

Description:  ChIP-seq experiments provide information on regions of histone modification or transcription factor binding.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads 

Sequence alignment - BAM file of aligned reads (unfiltered, filtered).  If data are protected, alignments can be stripped of sequence, or of sequence at known variants only.

Mapped read file - tagAlign file of aligned reads (filtered, de-duplicated)

Peak calls - BED file of peak calls

Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.

File formats:  CMDGA - genomic experiment file formats

 

sc/snRNA-seq

Description:  Single cell or single nucleus RNA-seq experiments provide information on the abundance of gene transcripts from individual cells or nuclei.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads 

Sequence alignment - BAM file of aligned reads.  If data are protected, alignments can be stripped of non-barcode sequence, or of sequence at known variants only.

Feature file - TSV file with list of features

Barcode list - TSV file with list of barcodes

Feature matrix - Matrix file with UMI counts for all barcodes (unfiltered)

File formats:  CMDGA - genomic experiment file formats
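Because the feature file, barcode list and feature matrix describe the same data, their dimensions should agree. The sketch below checks this under two assumptions that are not stated in this document: that the matrix is in Matrix Market coordinate text format, and that rows are features and columns are barcodes. The function names and the example identifiers are made up for illustration.

```python
def read_matrix_market_shape(lines):
    """Return (rows, columns, entries) from Matrix Market coordinate text."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        rows, cols, entries = (int(x) for x in line.split())
        return rows, cols, entries
    raise ValueError("no size line found in matrix file")


def check_consistency(feature_lines, barcode_lines, matrix_lines):
    """List mismatches between matrix dimensions and the two list files."""
    n_features = sum(1 for line in feature_lines if line.strip())
    n_barcodes = sum(1 for line in barcode_lines if line.strip())
    rows, cols, _ = read_matrix_market_shape(matrix_lines)
    problems = []
    if rows != n_features:
        problems.append(f"matrix has {rows} rows but {n_features} features listed")
    if cols != n_barcodes:
        problems.append(f"matrix has {cols} columns but {n_barcodes} barcodes listed")
    return problems


# Made-up example content standing in for the three files.
features = ["GENE0001\tGeneA", "GENE0002\tGeneB"]
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
matrix = [
    "%%MatrixMarket matrix coordinate integer general",
    "2 3 2",
    "1 1 5",
    "2 3 1",
]
print(check_consistency(features, barcodes, matrix))
```

With real files, the same functions can be passed open file handles, since they only iterate over lines.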

 

snATAC-seq/sci-ATAC-seq

Description: Single nucleus ATAC-seq experiments provide information on regions of accessible chromatin from individual nuclei.

Meta-data:

Meta-data format: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads 

Sequence alignment - BAM file of aligned reads.  If data are protected, alignments can be stripped of non-barcode sequence, or of sequence at known variants only.

Fragment file - BED file of fragments with barcode information (unfiltered)

Barcode list - TSV file with list of barcodes

File formats:  CMDGA - genomic experiment file formats
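A quick check that the fragment file and barcode list agree can be sketched as follows. It assumes tab-separated fragment lines whose first four columns are chromosome, start, end and barcode; any further columns are ignored. This layout is an assumption for illustration, and the authoritative column definitions are those in the file format reference. The function names are hypothetical.

```python
from collections import Counter


def count_fragments_per_barcode(fragment_lines):
    """Count fragments for each barcode in BED-like fragment lines."""
    counts = Counter()
    for line in fragment_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, start, end, barcode = line.rstrip("\n").split("\t")[:4]
        if int(end) <= int(start):
            raise ValueError(f"fragment end must exceed start: {line!r}")
        counts[barcode] += 1
    return counts


def barcodes_missing_from_list(fragment_counts, barcode_lines):
    """Return barcodes seen in fragments but absent from the barcode list."""
    listed = {line.strip() for line in barcode_lines if line.strip()}
    return sorted(set(fragment_counts) - listed)


# Made-up example content.
fragments = [
    "chr1\t100\t250\tAAAC-1\t2",
    "chr1\t300\t420\tAAAC-1\t1",
    "chr2\t50\t180\tAAAG-1\t1",
]
counts = count_fragments_per_barcode(fragments)
print(counts)
print(barcodes_missing_from_list(counts, ["AAAC-1"]))
```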

 

Single cell multiome

Description: Single cell multiome experiments provide information on regions of accessible chromatin and gene expression levels from individual nuclei.

Meta-data:

Meta-data format: CMDGA - meta-data reference

Files for each replicate:

Raw RNA-seq sequence data - FASTQ file of sequence reads from RNA-seq

Raw ATAC-seq sequence data - FASTQ file of sequence reads from ATAC-seq

RNA-seq sequence alignment - BAM file of aligned reads from RNA-seq.  If data are protected, alignments can be stripped of non-barcode sequence, or of sequence at known variants only.

ATAC-seq sequence alignment - BAM file of aligned reads from ATAC-seq.  If data are protected, alignments can be stripped of non-barcode sequence, or of sequence at known variants only.

RNA feature file - TSV file with list of features for RNA

RNA feature matrix - Matrix file with UMI counts for all barcodes (unfiltered)

ATAC fragment file - BED file of ATAC fragments with barcode information (unfiltered)

Barcode list - TSV file with list of all barcodes

File formats:  CMDGA - genomic experiment file formats

 

Hi-C

Description: Hi-C experiments provide information on interactions between genomic regions.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads

Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated).  If data are protected, alignments can be stripped of non-barcode sequence information, or of sequence at known variant bases only.

Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)

Contact matrix - HiC file with contact matrix

Chromatin interactions - BEDPE file of chromatin interactions

File formats:  CMDGA - genomic experiment file formats
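To show how a chromatin interactions file might be read, the sketch below parses the first six BEDPE columns (chromosome, start and end for each of the two anchors) and computes the distance between anchor midpoints. Columns beyond the sixth vary between pipelines and are ignored here. The class and function names are hypothetical, and the example line is made up.

```python
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def parse_bedpe_line(line):
    """Parse the first six tab-separated BEDPE columns into an Interaction."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("a BEDPE line needs at least 6 columns")
    c1, s1, e1, c2, s2, e2 = fields[:6]
    return Interaction(c1, int(s1), int(e1), c2, int(s2), int(e2))


def anchor_distance(interaction) -> Optional[int]:
    """Distance between anchor midpoints, or None for different chromosomes."""
    if interaction.chrom1 != interaction.chrom2:
        return None
    mid1 = (interaction.start1 + interaction.end1) // 2
    mid2 = (interaction.start2 + interaction.end2) // 2
    return abs(mid2 - mid1)


example = parse_bedpe_line("chr1\t1000\t2000\tchr1\t51000\t52000\t.\t12")
print(example)
print(anchor_distance(example))
```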

 

pcHi-C

Description: Promoter capture Hi-C (pcHi-C) experiments provide information on interactions between genomic regions, using baits to enrich for interactions involving gene promoters.

Meta-data:

Meta-data format: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads

Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated).  If data are protected, alignments can be stripped of non-barcode sequence information, or of sequence at known variant bases only.

Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)

Contact matrix - HiC file with contact matrix

Chromatin interactions - BEDPE file of chromatin interactions

Baits - TSV file of baits used for promoter capture

File formats:  CMDGA - genomic experiment file formats

 

HiChIP/PLAC-seq

Description: HiChIP and PLAC-seq experiments provide information on interactions between genomic regions involving specific histone modifications or DNA-binding proteins.

Meta-data:

Meta-data format: CMDGA - meta-data reference

Files for each replicate:

Raw sequence data - FASTQ file of sequence reads

Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated).  If data are protected, alignments can be stripped of non-barcode sequence information, or of sequence at known variant bases only.

Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)

Contact matrix - HiC file with contact matrix

Chromatin interactions - BEDPE file of chromatin interactions

Peaks - BED file of peaks used to identify interactions

File formats:  CMDGA - genomic experiment file formats

Other experimental assays in progress:

SELEX-seq, STARR-seq, Perturb-seq

Genomic annotations

i. Target gene predictions

Description: Genomic elements linked to putative target genes, for example chromatin interactions from Hi-C, ABC predictions and Cicero co-accessibility.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Target gene file - BED file of target gene predictions

File formats:  CMDGA - annotation file formats

 

ii. Variant allelic effects

Description: Genetic variants with allelic effects on a molecular phenotype, for example as measured by MPRA assays.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Variant effects file - TSV file of variant allelic effects

File formats:  CMDGA - annotation file formats

 

iii. Variant to gene predictions

Description: Genetic variants linked to putative target genes

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Variant to gene file - TSV file of variant to gene predictions

File formats:  CMDGA - annotation file formats

 

iv. QTLs

Description: Summary statistics from QTL studies (eQTL, caQTL, meQTL, hQTL, etc.).

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Summary statistics - TSV file of marginal QTL summary statistics

Conditional signals - RData or TSV file containing statistics for all conditional signals

File formats:  CMDGA - annotation file formats

 

v. Chromatin states

Description: Genomic regions assigned to predicted epigenomic states such as enhancers, promoters and insulators.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Chromatin state file - BED file of chromatin states

File formats:  CMDGA - annotation file formats

 

vi. Accessible chromatin sites

Description: Genomic regions of accessible or open chromatin

Meta-data:

Meta-data format: CMDGA - meta-data reference

Files:

Accessible chromatin site file - BED file of accessible chromatin sites

Read depth signal - BigWig file of read depth signal for accessible chromatin

File formats:  CMDGA - annotation file formats

 

vii. Transcription factor binding sites

Description: Genomic regions of transcription factor binding

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

TF binding site file - BED file of TF binding sites

File formats:  CMDGA - annotation file formats

 

viii. Histone modification sites

Description: Genomic regions of histone modifications

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Histone modification site file - BED file of histone modification sites

File formats:  CMDGA - annotation file formats

ix. Gene expression levels

Description: Normalized expression level of genes across a set of samples

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Gene expression level file - TSV file of gene expression levels

File formats:  CMDGA - annotation file formats

 

x. Candidate cis-regulatory elements

Description: Genomic regions predicted to have cis-regulatory element activity

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

cCRE file - TSV file of cCREs

File formats:  CMDGA - annotation file formats

 

Genomic perturbations

i. Gene perturbation experiments

Description: Any type of gene perturbation experiment, for example shRNA, gene over-expression, CRISPR deletion, CRISPR base editing, mouse knockout or CRISPRi.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

Phenotypic effects file - TSV file containing perturbations with a significant effect on a molecular, cell or organismal phenotype

Perturbations file - TSV file containing all perturbations targeted in the experiment including their effects on all tested phenotypes regardless of significance

File formats: CMDGA - genome perturbation file formats

Single cell embeddings

i. Single cell embeddings

Description: Intermediate files, such as clusterings and cell matrices, composed of single cell profiles from one or more samples.

Meta-data:

Meta-data formats: CMDGA - meta-data reference

Files:

The files can either be provided as individual components:

Fragment files - BED file of fragments with barcode information for each read

Feature matrix - Matrix of feature counts for each cell 

Features - TSV list of all features used to create the matrix

Cell meta-data - File consisting of barcodes and associated meta-data such as cell type labels

Embedding - Representation of cells in low-dimensional space (could be multiple embeddings)

Or as a single container or object containing all of these components:

Single cell container/object - H5AD, RDS, Loom, or SCE format file containing the above files

File formats: CMDGA - single cell file formats