CDA Entity and Property Intersection

To inform discussions of entity mapping and harmonization, we surveyed the existing gdc, pdc, and crdc schemas.

Methodology: import the relevant schemas, then apply minor synonyms (e.g. Case renamed to Subject, Experiment renamed to Study).

Projects with entities in common (field match %):

gdc,pdc,crdc

Aliquot (13%)
Demographic (29%)
Diagnosis (29%)
Program (9%)
Project (3%)
Publication (0%)
Sample (44%)
Subject (6%)

pdc,crdc

Protocol (42%)
Study (6%)
StudyRunMetadata (25%)

gdc,crdc

AlignedReads (22%)
AlignedReadsIndex (25%)
AlignmentCocleaningWorkflow (26%)
AlignmentWorkflow (26%)
CopyNumberEstimate (24%)
CopyNumberSegment (20%)
CopyNumberVariationWorkflow (14%)
FollowUp (6%)
GeneExpression (22%)
ProteinExpression (28%)
ReadGroup (80%)
ReadGroupQc (61%)
RnaExpressionWorkflow (29%)
SubmittedAlignedReads (25%)
SubmittedGenomicProfile (25%)
SubmittedUnalignedReads (28%)
Treatment (21%)

gdc,pdc

Clinical (9%)
File (7%)

Projects with unique entities:

crdc

Acknowledgement
AggregatedGenotypingArray
CoreMetadataCollection
DrugAttribute
DrugResponse
GenotypingArray
GenotypingArrayWorkflow
Keyword
MirnaMicroarray
MrnaMicroarray
MzmlProteinMassSpectrometry
OncomapAssay
OncomapPanel
ProteomicWorkflow
PsmProteinMassSpectrometry
RawProteinMassSpectrometry
SubmittedMethylation
SummaryDrugResponse
TangentCopyNumber

pdc

AliquotRunMetadata
Biospecimen
CasePerFile
ClinicalMetadata
ExperimentProjects
ExperimentType
ExperimentalMetadata
FileCount
FileMetadata
FilePerStudy
Filter
FilterElement
Gene
GeneStudySpectralCount
Paginated
Pagination
PdcDataStats
Ptm
QuantitiveData
Query
SearchRecord
Spectral_count
StudyExperimentalDesign
Sunburst
WorkflowMetadata

gdc

AggregatedSomaticMutation
AnalysisMetadata
Analyte
AnnotatedSomaticMutation
Annotation
Archive
BiospecimenSupplement
Center
ClinicalSupplement
CopyNumberLiftoverWorkflow
DataFormat
DataSubtype
DataType
ExperimentMetadata
ExperimentalStrategy
Exposure
FamilyHistory
FilteredCopyNumberSegment
GenomicProfileHarmonizationWorkflow
GermlineMutationCallingWorkflow
MaskedSomaticMutation
MethylationArrayHarmonizationWorkflow
MethylationBetaValue
MethylationLiftoverWorkflow
MirnaExpression
MirnaExpressionWorkflow
MolecularTest
PathologyReport
Platform
Portion
RawMethylationArray
RunMetadata
SimpleGermlineVariation
SimpleSomaticMutation
Slide
SlideImage
SomaticAggregationWorkflow
SomaticAnnotationWorkflow
SomaticCopyNumberWorkflow
SomaticMutationCallingWorkflow
SomaticMutationIndex
StructuralVariantCallingWorkflow
StructuralVariation
SubmittedGenotypingArray
SubmittedMethylationBetaValue
SubmittedTangentCopyNumber
Tag
TissueSourceSite

Version info

gdc

$ git remote -v
origin	https://github.com/NCI-GDC/gdcdictionary (fetch)
origin	https://github.com/NCI-GDC/gdcdictionary (push)
$ git status
On branch develop
Your branch is up to date with 'origin/develop'.

nothing to commit, working tree clean
$ git log -1 | head -1
commit efef28495bbbdb11efc6580f9fc06d6d68d6a3bd

crdc

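# see https://github.com/uc-cdis/cdis-manifest  <project>/
import requests

def to_camel_case(name):
    # Assumed helper (not defined in the original): 'aligned_reads' -> 'AlignedReads'
    return ''.join(part.capitalize() for part in name.split('_'))

# Fetch the published dictionary schema; a plain dict is enough for the loop below
r = requests.get("https://s3.amazonaws.com/dictionary-artifacts/dcfdictionary/3.1.2/schema.json")
crdc = r.json()

# Map each entity name to its property names, skipping '$ref' entries
crdc_entities = {}
for k, e in crdc.items():
    if 'id' in e and 'properties' in e:
        crdc_entities[to_camel_case(e['id'])] = [p for p in e['properties'] if p != '$ref']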

pdc

# Introspect the PDC GraphQL schema for type and field names
r = requests.get('https://pdc.esacinc.com/graphql?query={ __schema { types { name kind fields { name } } } }')
pdc = r.json()

# xform: keep non-introspection OBJECT types and strip 'UI' from their names
pdc_entities = {
    t['name'].replace('UI', ''): [f['name'] for f in t['fields']]
    for t in pdc['data']['__schema']['types']
    if t['kind'] == 'OBJECT' and not t['name'].startswith('_')
}

# Apply synonyms: Case -> Subject, Experiment -> Study
pdc_entities['Subject'] = pdc_entities.pop('Case')
pdc_entities['Study'] = pdc_entities.pop('Experiment')
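
The original does not show how the entity groupings and field match percentages were derived from these dictionaries. The following is a minimal sketch of one way it could be done, assuming that "field match %" means the share of distinct property names that every participating schema has in common (intersection over union). The function names `field_match_pct` and `group_by_projects` and the toy `schemas` data are hypothetical and are not taken from the survey.

```python
def field_match_pct(field_lists):
    """Percent of distinct field names present in every list (intersection over union)."""
    sets = [set(fields) for fields in field_lists]
    union = set.union(*sets)
    if not union:
        return 0
    return round(100 * len(set.intersection(*sets)) / len(union))


def group_by_projects(schemas):
    """Group entities by the tuple of projects that define them, with a match % for each."""
    groups = {}
    all_entities = set().union(*(entities.keys() for entities in schemas.values()))
    for entity in sorted(all_entities):
        # Projects are listed in the order they appear in `schemas`.
        owners = tuple(name for name, entities in schemas.items() if entity in entities)
        pct = field_match_pct([schemas[name][entity] for name in owners])
        groups.setdefault(owners, {})[entity] = pct
    return groups


# Toy data for illustration only; the real inputs would be dictionaries
# shaped like crdc_entities and pdc_entities (entity name -> property names).
schemas = {
    "gdc": {"Sample": ["id", "sample_type", "tumor_code"], "Slide": ["id"]},
    "pdc": {"Sample": ["id", "sample_type", "status"]},
    "crdc": {"Sample": ["id", "sample_type", "tumor_code"], "Keyword": ["name"]},
}
groups = group_by_projects(schemas)
```

Entities owned by a single project land in a one-element group, which corresponds to the "unique entities" lists; their match value is trivially 100 and can be ignored.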

Sample mapping `Subject`

| crdc.subject | pdc.case | gdc.case | notes |
| --- | --- | --- | --- |
| breed | | | harmonize to ontology |
| days_to_lost_to_followup | | days_to_lost_to_followup | |
| disease_type | disease_type | disease_type | harmonize to ontology |
| id | case_id | id | |
| index_date | | index_date | |
| lost_to_followup | | lost_to_followup | |
| primary_site | primary_site | primary_site | harmonize to ontology |
| project_id | | | normalize to subject.project |
| species | | | harmonize to ontology |
| state | case_status | state | |
| studies | | | normalize to subject.study |
| submitter_id | case_submitter_id | submitter_id | |
| tissue_source_site_code | | tissue_source_sites | |
| | aliquot_id | | normalize to subject.sample.aliquot |
| | aliquot_status | | normalize to subject.sample.aliquot |
| | aliquot_submitter_id | | normalize to subject.sample.aliquot |
| | program_name | | normalize to subject.project.program |
| | project_name | | normalize to subject.project |
| | sample_id | | normalize to subject.sample |
| | sample_status | | normalize to subject.sample |
| | sample_type | | normalize to subject.sample |
| | | batch_id | |
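
As an illustration of how a mapping like this might be applied, here is a small sketch that renames source fields to a harmonized `Subject` record. It assumes the crdc.subject column supplies the harmonized field names, and it covers only a few rows of the mapping. `SUBJECT_FIELD_MAP`, `to_subject`, and the record values are hypothetical.

```python
# Source field name -> harmonized Subject field name, per project.
# Only a subset of the mapping rows is included here.
SUBJECT_FIELD_MAP = {
    "pdc": {
        "case_id": "id",
        "case_submitter_id": "submitter_id",
        "case_status": "state",
        "disease_type": "disease_type",
        "primary_site": "primary_site",
    },
    "gdc": {
        "id": "id",
        "submitter_id": "submitter_id",
        "state": "state",
        "disease_type": "disease_type",
        "primary_site": "primary_site",
    },
}


def to_subject(source, record):
    """Split a source record into mapped Subject fields and unmapped leftovers."""
    mapping = SUBJECT_FIELD_MAP[source]
    subject = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Leftovers are fields the notes column would normalize elsewhere
    # (for example to subject.sample) or that have no counterpart.
    leftover = {key: value for key, value in record.items() if key not in mapping}
    return subject, leftover


# Made-up record values for illustration.
pdc_case = {"case_id": "c-1", "case_status": "released", "sample_type": "Primary Tumor"}
subject, leftover = to_subject("pdc", pdc_case)
```

Returning the leftovers separately keeps fields such as `sample_type` available for the normalization steps named in the notes column instead of silently dropping them.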

bwalsh/CDA-Intersection.md
