To inform the discussion of entity mapping and harmonization, we surveyed existing schemas.
Methodology: import the relevant schemas, then apply minor synonyms (e.g. Case renamed to Subject).
- Aliquot (13%)
- Demographic (29%)
- Diagnosis (29%)
- Program (9%)
- Project (3%)
- Publication (0%)
- Sample (44%)
- Subject (6%)
- Protocol (42%)
- Study (6%)
- StudyRunMetadata (25%)
- AlignedReads (22%)
- AlignedReadsIndex (25%)
- AlignmentCocleaningWorkflow (26%)
- AlignmentWorkflow (26%)
- CopyNumberEstimate (24%)
- CopyNumberSegment (20%)
- CopyNumberVariationWorkflow (14%)
- FollowUp (6%)
- GeneExpression (22%)
- ProteinExpression (28%)
- ReadGroup (80%)
- ReadGroupQc (61%)
- RnaExpressionWorkflow (29%)
- SubmittedAlignedReads (25%)
- SubmittedGenomicProfile (25%)
- SubmittedUnalignedReads (28%)
- Treatment (21%)
- Clinical (9%)
- File (7%)
- Acknowledgement
- AggregatedGenotypingArray
- CoreMetadataCollection
- DrugAttribute
- DrugResponse
- GenotypingArray
- GenotypingArrayWorkflow
- Keyword
- MirnaMicroarray
- MrnaMicroarray
- MzmlProteinMassSpectrometry
- OncomapAssay
- OncomapPanel
- ProteomicWorkflow
- PsmProteinMassSpectrometry
- RawProteinMassSpectrometry
- SubmittedMethylation
- SummaryDrugResponse
- TangentCopyNumber
- AliquotRunMetadata
- Biospecimen
- CasePerFile
- ClinicalMetadata
- ExperimentProjects
- ExperimentType
- ExperimentalMetadata
- FileCount
- FileMetadata
- FilePerStudy
- Filter
- FilterElement
- Gene
- GeneStudySpectralCount
- Paginated
- Pagination
- PdcDataStats
- Ptm
- QuantitiveData
- Query
- SearchRecord
- Spectral_count
- StudyExperimentalDesign
- Sunburst
- WorkflowMetadata
- AggregatedSomaticMutation
- AnalysisMetadata
- Analyte
- AnnotatedSomaticMutation
- Annotation
- Archive
- BiospecimenSupplement
- Center
- ClinicalSupplement
- CopyNumberLiftoverWorkflow
- DataFormat
- DataSubtype
- DataType
- ExperimentMetadata
- ExperimentalStrategy
- Exposure
- FamilyHistory
- FilteredCopyNumberSegment
- GenomicProfileHarmonizationWorkflow
- GermlineMutationCallingWorkflow
- MaskedSomaticMutation
- MethylationArrayHarmonizationWorkflow
- MethylationBetaValue
- MethylationLiftoverWorkflow
- MirnaExpression
- MirnaExpressionWorkflow
- MolecularTest
- PathologyReport
- Platform
- Portion
- RawMethylationArray
- RunMetadata
- SimpleGermlineVariation
- SimpleSomaticMutation
- Slide
- SlideImage
- SomaticAggregationWorkflow
- SomaticAnnotationWorkflow
- SomaticCopyNumberWorkflow
- SomaticMutationCallingWorkflow
- SomaticMutationIndex
- StructuralVariantCallingWorkflow
- StructuralVariation
- SubmittedGenotypingArray
- SubmittedMethylationBetaValue
- SubmittedTangentCopyNumber
- Tag
- TissueSourceSite
- GDC
$ git remote -v
origin https://github.com/NCI-GDC/gdcdictionary (fetch)
origin https://github.com/NCI-GDC/gdcdictionary (push)
$ git status
On branch develop
Your branch is up to date with 'origin/develop'.
nothing to commit, working tree clean
$ git log -1 | head -1
commit efef28495bbbdb11efc6580f9fc06d6d68d6a3bd
- crdc
# see https://github.com/uc-cdis/cdis-manifest <project>/
import requests  # AttrDict and to_camel_case are assumed to be defined or imported elsewhere

r = requests.get("https://s3.amazonaws.com/dictionary-artifacts/dcfdictionary/3.1.2/schema.json")
crdc = AttrDict(r.json())
# map each entity (CamelCase id) to its property names, skipping '$ref' entries
crdc_entities = {}
for k, e in crdc.items():
    if 'id' in e and 'properties' in e:
        crdc_entities[to_camel_case(e['id'])] = [p for p in e['properties'].keys() if p != '$ref']
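The crdc snippet calls `to_camel_case`, which is not shown here. A minimal sketch of such a helper, assuming the dictionary ids are snake_case (e.g. `aligned_reads_index`) and the target is the UpperCamelCase form used in the entity list above; the actual helper used for the survey may differ:

```python
def to_camel_case(snake: str) -> str:
    """Hypothetical helper: convert a snake_case id to UpperCamelCase."""
    # Capitalize the first letter of each non-empty part; leave the rest untouched.
    return "".join(part[:1].upper() + part[1:] for part in snake.split("_") if part)


entity_name = to_camel_case("aligned_reads_index")
```

Under these assumptions, `read_group_qc` becomes `ReadGroupQc`, which matches the spelling in the entity list.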
- pdc
r = requests.get('https://pdc.esacinc.com/graphql?query={ __schema { types { name kind fields { name } } } }')
pdc = r.json()
# xform: keep OBJECT types, drop introspection types (leading '_'), strip 'UI' from names
pdc_entities = {
    t['name'].replace('UI', ''): [f['name'] for f in t['fields']]
    for t in pdc['data']['__schema']['types']
    if t['kind'] == 'OBJECT' and not t['name'].startswith('_')
}
# apply synonyms: Case -> Subject, Experiment -> Study
pdc_entities['Subject'] = pdc_entities.pop('Case')
pdc_entities['Study'] = pdc_entities.pop('Experiment')
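Once each schema is reduced to an `{entity: [property, ...]}` dictionary, entities can be compared by name. The sketch below is one possible comparison, not necessarily the one behind the percentages in the entity list: it assumes a Jaccard overlap of property names, and `property_overlap` and the toy dictionaries are hypothetical.

```python
def property_overlap(a: dict, b: dict) -> dict:
    """For entities present in both schemas, return the Jaccard overlap of property names."""
    result = {}
    for name in sorted(a.keys() & b.keys()):
        props_a, props_b = set(a[name]), set(b[name])
        union = props_a | props_b
        result[name] = len(props_a & props_b) / len(union) if union else 0.0
    return result


# Toy data for illustration only, not the real schema contents.
crdc_demo = {"Subject": ["id", "species", "breed"], "Sample": ["id", "sample_type"]}
pdc_demo = {"Subject": ["id", "case_status"], "Gene": ["name"]}

overlap = property_overlap(crdc_demo, pdc_demo)
```

With the real data, the same call would take `crdc_entities` and `pdc_entities` after the synonyms have been applied, so that `Case` and `Subject` line up.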
crdc.subject | pdc.case | gdc.case | notes |
---|---|---|---|
breed | harmonize to ontology | ||
days_to_lost_to_followup | days_to_lost_to_followup | ||
disease_type | disease_type | disease_type | harmonize to ontology |
id | case_id | id | |
index_date | index_date | ||
lost_to_followup | lost_to_followup | ||
primary_site | primary_site | primary_site | harmonize to ontology |
project_id | normalize to subject.project | ||
species | harmonize to ontology | ||
state | case_status | state | |
studies | normalize to subject.study | ||
submitter_id | case_submitter_id | submitter_id | |
tissue_source_site_code | tissue_source_sites | ||
aliquot_id | normalize to subject.sample.aliquot | ||
aliquot_status | normalize to subject.sample.aliquot | ||
aliquot_submitter_id | normalize to subject.sample.aliquot | ||
program_name | normalize to subject.project.program | ||
project_name | normalize to subject.project | ||
sample_id | normalize to subject.sample | ||
sample_status | normalize to subject.sample | ||
sample_submitter_id | normalize to subject.sample | ||
sample_type | normalize to subject.sample | ||
batch_id |
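
The rows of the table where all three sources are populated can be turned into a rename map. This is a minimal sketch, assuming the crdc/gdc spelling is the harmonized target name; `FIELD_MAP`, `harmonize`, and the sample record are hypothetical and only cover rows whose column placement is unambiguous.

```python
# Source-specific field name -> harmonized Subject field name.
FIELD_MAP = {
    "pdc": {
        "case_id": "id",
        "case_status": "state",
        "case_submitter_id": "submitter_id",
    },
    # crdc and gdc already use the target names for these rows.
    "crdc": {},
    "gdc": {},
}


def harmonize(record: dict, source: str) -> dict:
    """Rename the keys of a source record to the harmonized names; other keys pass through."""
    renames = FIELD_MAP.get(source, {})
    return {renames.get(key, key): value for key, value in record.items()}


# Made-up record for illustration.
pdc_case = {"case_id": "c-1", "case_status": "released", "disease_type": "example"}
subject = harmonize(pdc_case, "pdc")
```

Fields marked "normalize to ..." in the notes column would need to move to a related entity rather than be renamed, which this sketch does not attempt.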