@ipstone
Forked from elucify/README.md
Created April 17, 2018 18:52
ClinVar sample README.md

ClinVar

This directory contains:

  • ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) dataset reports, and ClinVar development documents
  • documents related to the NCBI collaboration with ClinGen (http://www.clinicalgenome.org/)
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/ClinGen/ExpertPanelRequestForm.docx - how to apply for expert panel status
  • data common to ClinVar and GTR
      • ftp://ftp.ncbi.nlm.nih.gov/pub/GTR/standard_terms - terminology used by both GTR and ClinVar

Go to: ClinVar Home - Submit Data to ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/docs/submit/) - Genetic Testing Registry Home

--

Submissions

You may submit data to ClinVar using Excel spreadsheets or XML files.

  • Excel Submission Templates
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/submission_templates/SubmissionTemplate.xlsx - standard submission template
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/submission_templates/SubmissionTemplateLite.xlsx - for submissions with less supporting evidence
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/submission_templates/SubmissionTemplate_version3.xlsx - beta version of the updated standard template (please use the standard template if you are time-constrained)
  • XML Submission Schema Files
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/clinvar_submission.xsd - current XML submission document schema
      • ftp://ftp.ncbi.nih.gov/pub/clinvar/xsd_submission/ - folder of previous schema versions
      • Please direct XML data submission questions to [email protected].

ClinVar Data Downloads

disease_names

URL: ftp://ftp.ncbi.nih.gov/pub/clinvar/disease_names
Format: tab-separated values
Updated: daily

Reports names and identifiers used in GTR and ClinVar. When a name is used by more than one source, there may be more than one line per condition. Unlike the gene_condition_source_id file, it is comprehensive, and does not require knowledge of any gene-to-disease relationship.

Columns:

| Col | Name | Description |
| --- | --- | --- |
| 1 | DiseaseName | The name preferred by GTR and ClinVar |
| 2 | SourceName | Sources that also use this preferred name |
| 3 | ConceptID | The identifier assigned to the disorder (1) |
| 4 | SourceID | Identifier used by the source reported in column 2 |
| 5 | DiseaseMIM | MIM number for the condition |
| 6 | LastUpdated | Last time this record was modified by NCBI staff |
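
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official parser: it assumes the file follows the six-column order listed in the table, that lines beginning with `#` are header or comment lines (an assumption, not stated above), and it uses made-up sample rows rather than real ClinVar data. The names `DiseaseName`, `read_disease_names`, and `names_by_concept` are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class DiseaseName(NamedTuple):
    """One line of disease_names, in the column order given in the table."""
    disease_name: str
    source_name: str
    concept_id: str
    source_id: str
    disease_mim: str
    last_updated: str


def read_disease_names(handle):
    """Yield one DiseaseName per data line of a tab-separated handle."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumption: blank lines and '#'-prefixed lines carry no data.
        if not row or row[0].startswith("#"):
            continue
        # Pad or trim so every record has exactly six fields.
        yield DiseaseName(*(row + [""] * 6)[:6])


def names_by_concept(records):
    """Group records by ConceptID; a condition may span several lines."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.concept_id].append(rec)
    return dict(grouped)


# Made-up rows for illustration only; not real ClinVar content.
SAMPLE = (
    "#DiseaseName\tSourceName\tConceptID\tSourceID\tDiseaseMIM\tLastUpdated\n"
    "Example condition A\tSource X\tC0000001\tX:1\t100001\t2018-01-01\n"
    "Example condition A\tSource Y\tC0000001\tY:9\t100001\t2018-01-01\n"
    "Example condition B\tSource X\tCN000002\tX:2\t\t2018-02-01\n"
)

grouped = names_by_concept(read_disease_names(io.StringIO(SAMPLE)))
print({cid: len(rows) for cid, rows in grouped.items()})
```

Grouping by ConceptID reflects the remark that a name used by more than one source may produce more than one line per condition.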

Notes:

(1) If the value starts with a C and is followed by digits, the ConceptID is a value from UMLS; if a value begins with CN, it was created by NCBI-based processing.
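
The rule in note (1) can be expressed as a small check. This is a sketch under the assumption that the two patterns described in the note are the only ones of interest; the function name `concept_id_origin` and the example identifiers are hypothetical.

```python
import re

# "C" followed only by digits, per note (1).
_UMLS_PATTERN = re.compile(r"C\d+")


def concept_id_origin(concept_id: str) -> str:
    """Classify a ConceptID as 'UMLS', 'NCBI', or 'unknown'."""
    if concept_id.startswith("CN"):
        # Identifier created by NCBI-based processing.
        return "NCBI"
    if _UMLS_PATTERN.fullmatch(concept_id):
        # Identifier taken from UMLS.
        return "UMLS"
    return "unknown"


# Made-up identifiers for illustration only.
for example in ("C0000001", "CN000002", "X123"):
    print(example, concept_id_origin(example))
```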

gene_condition_source_id

URL: ftp://ftp.ncbi.nih.gov/pub/clinvar/gene_condition_source_id
Format: tab-separated values
Updated: daily

Reports gene-disease relationships used in ClinVar, Gene, GTR and MedGen. The sources of information for the gene-disease relationships include OMIM, GeneReviews, and a limited amount of curation by NCBI staff. The scope of disorders reported in this file is a subset of the disease_names file because a gene-to-disease relationship is required.

Columns:

| Col | Name | Description |
| --- | --- | --- |
| 1 | GeneID | The NCBI GeneID |
| 2 | GeneSymbol | The preferred symbol corresponding to the GeneID |
| 3 | ConceptID | The identifier assigned to a disorder associated with this gene (1) |
| 4 | SourceName | Sources that use this name |
| 5 | SourceID | The identifier used by this source |
| 6 | DiseaseMIM | MIM number for the condition |
| 7 | LastUpdated | Last time this record was modified by NCBI staff |
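
As an illustration of how these columns might be used, the sketch below collects the ConceptIDs reported for each gene. It assumes the seven-column order listed in the table and that `#`-prefixed lines are headers (an assumption); the function name `conditions_by_gene` and all sample values are made up.

```python
import csv
import io
from collections import defaultdict


def conditions_by_gene(handle):
    """Map (GeneID, GeneSymbol) to a sorted list of ConceptIDs."""
    genes = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip headers, comments, and rows too short to hold columns 1-3.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        gene_id, symbol, concept_id = row[0], row[1], row[2]
        # A set removes repeats when several sources report the same pair.
        genes[(gene_id, symbol)].add(concept_id)
    return {gene: sorted(ids) for gene, ids in genes.items()}


# Made-up rows for illustration only; not real ClinVar content.
SAMPLE = (
    "#GeneID\tGeneSymbol\tConceptID\tSourceName\tSourceID\tDiseaseMIM\tLastUpdated\n"
    "1001\tGENE1\tC0000001\tSource X\tX:1\t100001\t2018-01-01\n"
    "1001\tGENE1\tCN000002\tSource Y\tY:2\t\t2018-01-02\n"
    "1001\tGENE1\tC0000001\tSource Y\tY:7\t100001\t2018-01-03\n"
    "1002\tGENE2\tC0000001\tSource X\tX:1\t100001\t2018-01-01\n"
)

gene_map = conditions_by_gene(io.StringIO(SAMPLE))
for gene, concept_ids in gene_map.items():
    print(gene, concept_ids)
```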

Notes:

(1) If the value starts with a C and is followed by digits, the ConceptID is a value from UMLS; if a value begins with CN, it was created by NCBI-based processing.

DiseaseName: full name for the condition

ConceptID_history.txt

URL: ftp://ftp.ncbi.nih.gov/pub/clinvar/ConceptID_history.txt
Format: tab-separated values
Updated: daily

Tracks changes in identifiers assigned to phenotypes over time. The ConceptID values in the first column are no longer active, and are either discontinued (the value in column 2 is 'No longer reported'), or replaced by a record with a different identifier. A replacement may be either the result of a merge (one record becoming secondary to another) or because of a change in numbering, usually because an identifier assigned by NCBI (starting with CN) is now thought to be represented by a ConceptID from UMLS (starting with C followed by numerals).

Columns:

| Col | Name | Description |
| --- | --- | --- |
| 1 | Previous ConceptID | The outdated identifier |
| 2 | Current ConceptID | The current identifier |
| 3 | Date of Action | The date this change occurred |
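
One way to use this file is to map an outdated ConceptID to its active replacement. The sketch below assumes the three-column order above, that `#`-prefixed lines are headers, and that a replacement may itself have been replaced later (so the lookup follows a chain); none of these are confirmed by the description. The names `load_history` and `resolve` and the sample identifiers are hypothetical.

```python
import csv
import io

# Marker used in column 2 for discontinued identifiers.
NO_LONGER_REPORTED = "No longer reported"


def load_history(handle):
    """Return a dict mapping previous ConceptID to the column 2 value."""
    history = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        history[row[0]] = row[1]
    return history


def resolve(concept_id, history):
    """Return the active ConceptID, or None if it was discontinued."""
    seen = set()
    current = concept_id
    while current in history:
        if current in seen:
            # Guard against a malformed file with circular replacements.
            raise ValueError(f"cycle in history at {current}")
        seen.add(current)
        current = history[current]
        if current == NO_LONGER_REPORTED:
            return None
    return current


# Made-up rows for illustration only; not real ClinVar content.
SAMPLE = (
    "#Previous ConceptID\tCurrent ConceptID\tDate of Action\n"
    "CN000001\tC0000010\t2017-05-01\n"
    "C0000010\tC0000020\t2018-01-15\n"
    "CN000003\tNo longer reported\t2018-02-01\n"
)

history = load_history(io.StringIO(SAMPLE))
print(resolve("CN000001", history))
print(resolve("CN000003", history))
print(resolve("C9999999", history))
```

An identifier that never appears in column 1 is returned unchanged, on the reading that only inactive identifiers are listed there.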

Subdirectories

| Subdirectory | Description (notes) |
| --- | --- |
| presentations | slides or other documents about ClinVar |
| submission_templates | templates for submission by spreadsheet |
| tab_delimited | summary data of several types |
| vcf_GRCh37 | vcf files generated by dbSNP based on GRCh37/hg19 (2) |
| vcf_GRCh38 | vcf files generated by dbSNP based on GRCh38/hg38 (1,2) |
| xml | An extraction of data in ClinVar as xml (3) |
| xsd_public | current and previous versions of XSD schema files for XML data |

Notes:

(1) For more about the conventions used to process and report the vcf data, see also: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

(2) Please note that until the new data from 1000 Genomes are processed, there will be no files in GRCh38 coordinates that report common variants (common_all.vcf.gz) or common variants not known to contribute to phenotype (common_no_known_medical_impact-latest.vcf). These are available only in the vcf_GRCh37 subdirectory. Note: This notice should be in the README for GRCh38!

(3) The schema for the files in the xml directory is ftp://ftp.ncbi.nih.gov/pub/clinvar/xsd_public/clinvar_public.xsd
