Notes
- UNION removes duplicates from the set while UNION ALL does not which makes it much faster
- Since this technique is being used for data integration, the UNION ALL is the
- The number of columns must be the same for each select statement
- A trick to fill in columns is to select a constant value such as null or ''
Example
table1.sex
, table2.gender
, and table3.gender
are semantically similar.
Table 1
1 | male
2 | female
3 | male
4 | male
Table2
1 | Male
2 | Male
3 | Female
Table3
1 | f
2 | m
3 | f
4 | f
Query
SELECT id, sex FROM table1 WHERE lower(sex) like 'm%'
UNION ALL
SELECT id, gender FROM table2 WHERE lower(gender) like 'm%'
UNION ALL
SELECT id, gender FROM table3 WHERE lower(gender) like 'm%'
Result
1 | male
3 | male
4 | male
1 | Male
2 | Male
2 | m
Discussion
- The first SELECT can be thought of as the canonical one. It's column
names are used to label the resulting table, i.e.
id
andsex
. - The queries can be independently construct and concatentated prior to the query being executed.
Implementation
- A graph database can be used to keep the relationship between fields
table1.sex
reltotable2.gender
table1.sex
reltotable3.gender
- Query graph database to find all field realted to canonical