Genomic data submission - meta-data, files and file formats
Overview
This document describes how to submit genomic data to CMDGA. It covers genomic experiments, genomic annotations, genomic perturbations, and single cell embeddings, and for each it lists the meta-data fields, the data files, and how those files should be formatted.
Genomic experiments describe molecular assays performed by a research group, for example RNA-seq, ATAC-seq, scRNA-seq, snATAC-seq, Hi-C, STARR-seq, Perturb-seq, and ChIP-seq.
Genomic annotations describe analytical distillations of experiments, often combining multiple experiments, for example chromatin states, accessible chromatin sites, ABC links, and gene expression levels.
Genomic perturbations describe manipulations of gene or genomic activity, for example shRNA gene knockdown, CRISPR knockout, CRISPRi, and gene over-expression.
Single cell embeddings describe intermediate files that combine data across multiple single cell experiments.
If your data type is not included in this document, please contact us and we will add it.
IMPORTANT - Data sharing restrictions
Before submitting data to CMDGA, check whether your data are subject to data sharing restrictions.
If you are planning to share data files that contain sensitive information, then a Material Transfer Agreement (MTA) will need to be completed and signed prior to submission of these data. These data will be stored in CMDGA but otherwise will not be made available for download or external access.
If there are no restrictions on data sharing then the raw data can be shared with CMDGA if desired.
Data types that contain individual-level information and may therefore be covered under data sharing restrictions include raw sequence reads, genotyping data, and sequence alignments. These files are highlighted in orange. For sequence alignments, the sequence can be removed from the BAM file, or variant bases masked, at which point the alignments can be shared without restriction.
Data types that do not contain individual-level information, and can therefore be shared regardless of data restrictions, are highlighted in blue.
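As an illustration of removing sequence from alignments, the sketch below works on SAM text records (the text form of a BAM file) and replaces the SEQ and QUAL columns with "*", which the SAM format allows when sequence is not stored. The function name is hypothetical and this is not a CMDGA tool. It is a minimal sketch that leaves optional tags untouched; tags such as MD may still describe mismatch positions and may need separate handling.

```python
def strip_sequence(sam_line: str) -> str:
    """Return a SAM record with SEQ and QUAL (columns 10 and 11) set to '*'.

    Header lines (starting with '@') are returned unchanged.
    Optional tags after column 11 are kept as they are.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment records need at least 11 columns")
    fields[9] = "*"   # SEQ: read bases
    fields[10] = "*"  # QUAL: base qualities
    return "\t".join(fields)


# Example record with a 4 bp read; names and values are made up.
record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
stripped = strip_sequence(record)
```

Position, CIGAR and mapping quality are preserved, so the stripped alignments remain usable for coverage and peak analyses.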
Genomic experiments
RNA-seq
Description: RNA-seq experiments provide information on the abundance of gene transcripts.
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, filtered). If data are protected, alignments can be stripped of all sequence, or of sequence at known variants only
Mapped read file - tagAlign file of aligned reads (unfiltered, filtered and de-duplicated)
Gene quantifications - TSV file of quantified gene expression
Transcript quantifications - TSV file of quantified transcript expression
Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.
File formats: CMDGA - genomic experiment file formats
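The two read depth signal values can be sketched for a single genomic bin as follows. This is an illustration only, not the CMDGA pipeline: the function names, the pseudocount, and the choice of an upper-tail test (probability of observing at least the given count under the background rate) are assumptions.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), by summing P(X = 0..k-1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(1.0 - cdf, 0.0)


def signal_values(count: int, background: float, pseudocount: float = 1.0):
    """Return (fold enrichment, -log10 Poisson p-value) for one bin.

    The pseudocount avoids division by zero and is an assumption.
    """
    fold = (count + pseudocount) / (background + pseudocount)
    p = poisson_upper_tail(count, background)
    neg_log10_p = -math.log10(p) if p > 0 else float("inf")
    return fold, neg_log10_p


fold, score = signal_values(count=12, background=3.0)
```

Direct summation loses precision for very small p-values, so a production implementation would normally use a statistics library for the tail probability.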
ATAC-seq/DHS-seq
Description: ATAC-seq and DHS-seq experiments provide information on regions of accessible chromatin.
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, filtered). If data are protected, alignments can be stripped of all sequence, or of sequence at known variants only
Mapped read file - tagAlign file of aligned reads (unfiltered, filtered and de-duplicated)
Peak calls - BED file of peak calls
Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.
File formats: CMDGA - genomic experiment file formats
ChIP-seq
Description: ChIP-seq experiments provide information on regions of histone modification or transcription factor binding.
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, filtered). If data are protected, alignments can be stripped of all sequence, or of sequence at known variants only
Mapped read file - tagAlign file of aligned reads (filtered, de-duplicated)
Peak calls - BED file of peak calls
Read depth signal - BigWig file of: (1) Fold-enrichment ratio of counts relative to background. (2) negative log10 of the Poisson p-value of counts relative to background.
File formats: CMDGA - genomic experiment file formats
sc/snRNA-seq
Description: Single cell or single nuclear RNA-seq experiments provide information on the abundance of gene transcripts from individual cells or nuclei
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads. If data are protected, alignments can be stripped of all non-barcode sequence, or of sequence at known variants only
Feature file - TSV file with list of features
Barcode list - TSV file with list of barcodes
Feature matrix - Matrix file with UMI counts for all barcodes (unfiltered)
File formats: CMDGA - genomic experiment file formats
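Before submission it can help to confirm that the feature file, barcode list and feature matrix agree with each other. The sketch below assumes the matrix is in Matrix Market coordinate format with features as rows and barcodes as columns, which is a common convention for single cell count matrices but is not stated in this document; check the file format reference for the required layout. The function names are hypothetical.

```python
def read_mtx(lines):
    """Parse Matrix Market coordinate text.

    Returns (n_rows, n_cols, counts), where counts maps 0-based
    (row, column) pairs to integer values.
    """
    # Comment and banner lines start with '%'.
    body = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    counts = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        counts[(int(row) - 1, int(col) - 1)] = int(value)  # file is 1-based
    if len(counts) != n_entries:
        raise ValueError("number of entries does not match the size line")
    return n_rows, n_cols, counts


def dimensions_match(n_rows, n_cols, features, barcodes):
    """True when rows equal the feature count and columns the barcode count."""
    return n_rows == len(features) and n_cols == len(barcodes)


mtx_text = [
    "%%MatrixMarket matrix coordinate integer general",
    "2 3 3",
    "1 1 5",
    "2 1 1",
    "2 3 7",
]
n_rows, n_cols, counts = read_mtx(mtx_text)
ok = dimensions_match(n_rows, n_cols, ["GeneA", "GeneB"], ["BC1", "BC2", "BC3"])
```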
snATAC-seq/sci-ATAC-seq
Description: Single nuclear ATAC-seq experiments provide information on regions of accessible chromatin from individual nuclei
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads. If data are protected, alignments can be stripped of all non-barcode sequence, or of sequence at known variants only
Fragment file - BED file of fragments with barcode information (unfiltered)
Barcode list - TSV file with list of barcodes
File formats: CMDGA - genomic experiment file formats
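A simple consistency check between the fragment file and the barcode list can be sketched as below. It assumes the first four tab-separated columns of the fragment file are chromosome, start, end and barcode; this column order is an assumption, so check the file format reference for the required layout. The function names are hypothetical.

```python
from collections import Counter


def fragments_per_barcode(lines):
    """Count fragments for each barcode in tab-separated fragment records."""
    counts = Counter()
    for ln in lines:
        if not ln.strip() or ln.startswith("#"):
            continue  # skip blank and comment lines
        chrom, start, end, barcode = ln.rstrip("\n").split("\t")[:4]
        if int(end) <= int(start):
            raise ValueError(f"fragment end must exceed start: {ln!r}")
        counts[barcode] += 1
    return counts


def barcodes_not_listed(counts, barcode_list):
    """Barcodes seen in the fragment file but absent from the barcode list."""
    return sorted(set(counts) - set(barcode_list))


fragments = [
    "chr1\t100\t250\tAAAC\t1",
    "chr1\t300\t420\tAAAC\t2",
    "chr2\t50\t180\tGGTT\t1",
]
counts = fragments_per_barcode(fragments)
unlisted = barcodes_not_listed(counts, ["AAAC"])
```

Because the fragment file is unfiltered, barcodes with very few fragments are expected; the check only reports barcodes that are missing from the list.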
Single cell multiome
Description: Single cell multiome experiments provide information on regions of accessible chromatin and gene expression levels from individual nuclei
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw RNA-seq sequence data - FASTQ file of sequence reads from RNA-seq
Raw ATAC-seq sequence data - FASTQ file of sequence reads from ATAC-seq
RNA-seq sequence alignment - BAM file of aligned reads from RNA-seq. If data are protected, alignments can be stripped of all non-barcode sequence, or of sequence at known variants only
ATAC-seq sequence alignment - BAM file of aligned reads from ATAC-seq. If data are protected, alignments can be stripped of all non-barcode sequence, or of sequence at known variants only
RNA feature file - TSV file with list of features for RNA
RNA feature matrix - Matrix file with UMI counts for all barcodes (unfiltered)
ATAC fragment file - BED file of ATAC fragments with barcode information (unfiltered)
Barcode list - TSV file with list of all barcodes
File formats: CMDGA - genomic experiment file formats
Hi-C
Description: Hi-C experiments provide information on interactions between genomic regions
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated). If data are protected, alignments can be stripped of all non-barcode sequence information, or of sequence at known variant bases only
Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)
Contact matrix - HiC file with contact matrix
Chromatin interactions - BEDPE file of chromatin interactions
File formats: CMDGA - genomic experiment file formats
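For orientation, the sketch below reads the six coordinate columns that begin every BEDPE record (chrom1, start1, end1, chrom2, start2, end2) and derives two simple summaries of a chromatin interaction. The class and function names are hypothetical, and any further columns required by CMDGA are described in the file format reference rather than here.

```python
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def parse_bedpe_line(line: str) -> Interaction:
    """Parse the first six tab-separated columns of a BEDPE record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("BEDPE records need at least six columns")
    return Interaction(fields[0], int(fields[1]), int(fields[2]),
                       fields[3], int(fields[4]), int(fields[5]))


def anchor_distance(ix: Interaction) -> Optional[int]:
    """Distance between anchor midpoints, or None for inter-chromosomal pairs."""
    if ix.chrom1 != ix.chrom2:
        return None
    mid1 = (ix.start1 + ix.end1) // 2
    mid2 = (ix.start2 + ix.end2) // 2
    return abs(mid2 - mid1)


ix = parse_bedpe_line("chr1\t100\t200\tchr1\t1100\t1200\tloop1\t8.5")
distance = anchor_distance(ix)  # 1000
```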
pcHi-C
Description: pcHi-C experiments provide information on interactions between genomic regions using baits to enrich for interactions involving gene promoters
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated). If data are protected, alignments can be stripped of all non-barcode sequence information, or of sequence at known variant bases only
Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)
Contact matrix - HiC file with contact matrix
Chromatin interactions - BEDPE file of chromatin interactions
Baits - TSV file of baits used for promoter capture
File formats: CMDGA - genomic experiment file formats
HiChIP/PLAC-seq
Description: HiChIP and PLAC-seq experiments provide information on interactions between genomic regions involving specific histone modifications or DNA-binding proteins
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files for each replicate:
Raw sequence data - FASTQ file of sequence reads
Sequence alignment - BAM file of aligned reads (unfiltered, and filtered and de-duplicated). If data are protected, alignments can be stripped of all non-barcode sequence information, or of sequence at known variant bases only
Paired reads - Pairs file of aligned read pairs (filtered, de-duplicated)
Contact matrix - HiC file with contact matrix
Chromatin interactions - BEDPE file of chromatin interactions
Peaks - BED file of peaks used to identify interactions
File formats: CMDGA - genomic experiment file formats
Other experimental assays in progress:
SELEX-seq, STARR-seq, Perturb-seq
Genomic annotations
i. Target gene predictions
Description: Genomic elements linked to putative target genes, for example chromatin interactions from Hi-C, ABC predictions, and Cicero co-accessibility
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Target gene file - BED file of target gene predictions
File formats: CMDGA - annotation file formats
ii. Variant allelic effects
Description: Genetic variants with allelic effects on a molecular phenotype, for example as measured by MPRA
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Variant effects file - TSV file of variant allelic effects
File formats: CMDGA - annotation file formats
iii. Variant to gene predictions
Description: Genetic variants linked to putative target genes
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Variant to gene file - TSV file of variant to gene predictions
File formats: CMDGA - annotation file formats
iv. QTLs
Description: Summary statistics from QTL studies (eQTL, caQTL, meQTL, hQTL, etc.)
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Summary statistics - TSV file of marginal QTL summary statistics
Conditional signals - RData or TSV file containing statistics for all conditional signals
File formats: CMDGA - annotation file formats
v. Chromatin states
Description: Genomic regions assigned to predicted epigenomic states such as enhancers, promoters and insulators
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Chromatin state file - BED file of chromatin states
File formats: CMDGA - annotation file formats
vi. Accessible chromatin sites
Description: Genomic regions of accessible or open chromatin
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Accessible chromatin site file - BED file of accessible chromatin sites
Read depth signal - BigWig file of read depth signal for accessible chromatin
File formats: CMDGA - annotation file formats
vii. Transcription factor binding sites
Description: Genomic regions of transcription factor binding
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
TF binding site file - BED file of TF binding sites
File formats: CMDGA - annotation file formats
viii. Histone modification sites
Description: Genomic regions of histone modifications
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Histone modification site file - BED file of histone modification sites
File formats: CMDGA - annotation file formats
ix. Gene expression levels
Description: Normalized expression level of genes across a set of samples
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Gene expression level file - TSV file of gene expression levels
File formats: CMDGA - annotation file formats
x. Candidate cis-regulatory elements
Description: Genomic regions predicted to have cis-regulatory element activity
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
cCRE file - TSV file of cCREs
File formats: CMDGA - annotation file formats
Genomic perturbations
i. Gene perturbation experiments
Description: Any type of gene perturbation experiment, for example shRNA, gene over-expression, CRISPR deletion, CRISPR base editing, mouse knockout, or CRISPRi.
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
Phenotypic effects file - TSV file containing perturbations with a significant effect on a molecular, cell or organismal phenotype
Perturbations file - TSV file containing all perturbations targeted in the experiment including their effects on all tested phenotypes regardless of significance
File formats: CMDGA - genome perturbation file formats
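The relationship between the two files can be sketched as a filter: the phenotypic effects file holds the rows of the perturbations file that pass a significance threshold. The column names (perturbation, phenotype, p_value) and the 0.05 threshold below are hypothetical placeholders; the required columns are given in the file format reference, and the definition of significance is up to the submitting study.

```python
import csv
import io


def significant_rows(tsv_text: str, p_column: str = "p_value", alpha: float = 0.05):
    """Return rows of a tab-separated table whose p-value is below alpha."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[p_column]) < alpha]


# Made-up perturbations table covering all tested phenotypes.
perturbations_tsv = (
    "perturbation\tphenotype\teffect\tp_value\n"
    "GENE1_knockdown\tinsulin_secretion\t-0.8\t0.001\n"
    "GENE1_knockdown\tcell_viability\t0.1\t0.62\n"
    "GENE2_knockout\tinsulin_secretion\t0.4\t0.03\n"
)
phenotypic_effects = significant_rows(perturbations_tsv)
```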
Single cell embeddings
i. Single cell embeddings
Description: Intermediate files, such as clustering results and cell matrices, built from single cell profiles from one or more samples
Meta-data:
Meta-data formats: CMDGA - meta-data reference
Files:
The files can either be provided as individual components:
Fragment files - BED file of fragments with barcode information for each read
Feature matrix - Matrix of feature counts for each cell
Features - TSV list of all features used to create the matrix
Cell meta-data - File consisting of barcodes and associated meta-data such as cell type labels
Embedding - Representation of cells in low-dimensional space (could be multiple embeddings)
Or as a single container or object containing all of these components:
Single cell container/object - H5AD, RDS, Loom, or SCE format file containing the above files
File formats: CMDGA - single cell file formats
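When the components are provided individually, the cell meta-data and each embedding should refer to the same set of barcodes. The sketch below shows one way to check this; the function name and the in-memory layout (dictionaries keyed by barcode) are assumptions made for illustration, not a CMDGA requirement.

```python
def unmatched_barcodes(embedding, cell_metadata):
    """Report barcodes that appear in only one of the two tables.

    embedding: dict mapping barcode -> coordinates in low-dimensional space
    cell_metadata: dict mapping barcode -> meta-data such as a cell type label
    """
    embedded = set(embedding)
    annotated = set(cell_metadata)
    return {
        "missing_metadata": sorted(embedded - annotated),
        "missing_embedding": sorted(annotated - embedded),
    }


# Made-up example: one cell lacks meta-data, one lacks coordinates.
embedding = {"AAAC": (0.1, 2.3), "GGTT": (1.5, -0.4), "CCTA": (3.2, 0.9)}
cell_metadata = {"AAAC": "beta", "GGTT": "alpha", "TTGA": "delta"}
report = unmatched_barcodes(embedding, cell_metadata)
```

An empty list under both keys indicates that the two components describe the same cells.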