Last active
December 20, 2015 16:18
Fix problems with mis-capitalized names, with XQuery
xquery version "3.0";

declare namespace fn="http://www.w3.org/2005/xpath-functions";

(: Fix problems with mis-capitalized names. For example:
     Before: MACARTHUR, Douglas II
     After:  MacArthur, Douglas II
:)
declare function local:fix-name-capitalization($name as xs:string) as xs:string {
    (:
        We'll use analyze-string() to split the name string up into "words".
        We're defining "words" as strings of one or more upper- or lower-case letters and hyphens.
        E.g.:
            In "MAO Tse-tung", "MAO" and "Tse-tung" are the two words.
            In "MCCLURKIN, Robert J. G.", the words are "MCCLURKIN", "Robert", "J", and "G".
        The analyze-string() function will return results like:
            <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
                <fn:match>MCCLURKIN</fn:match>
                <fn:non-match>, </fn:non-match>
                <fn:match>Robert</fn:match>
                <fn:non-match> </fn:non-match>
                <fn:match>J</fn:match>
                <fn:non-match>. </fn:non-match>
                <fn:match>G</fn:match>
                <fn:non-match>.</fn:non-match>
            </fn:analyze-string-result>
        We'll keep the "non-matches" unchanged, only looking carefully at each "match" to make
        sure we apply the right capitalization rules.
    :)
    let $word-vs-non-word-pattern := '[A-Za-z-]+'
    let $analyze-string-result := analyze-string($name, $word-vs-non-word-pattern)
    return
        string-join(
            for $node in $analyze-string-result/*
            return
                (: let punctuation, spaces, etc. through unchanged :)
                if ($node/self::fn:non-match) then
                    $node/string()
                (: otherwise it is a word: examine it and apply the first rule that fits :)
                else
                    (: MACARTHUR -> MacArthur :)
                    if (starts-with(lower-case($node), 'mac')) then
                        concat('Mac', upper-case(substring($node, 4, 1)), lower-case(substring($node, 5)))
                    (: MCCARTHY -> McCarthy :)
                    else if (starts-with(lower-case($node), 'mc')) then
                        concat('Mc', upper-case(substring($node, 3, 1)), lower-case(substring($node, 4)))
                    (: II -> II (leave words made up only of I, V, and X unchanged) :)
                    else if (matches($node, '^[IVX]+$')) then
                        $node/string()
                    (: otherwise, just capitalize the word :)
                    else
                        concat(upper-case(substring($node, 1, 1)), lower-case(substring($node, 2)))
        )
};

let $names :=
    (
        'MCCARTHY, Senator Joseph R.' (: potential problem because of "Mc" prefix :),
        'MACARTHUR, Douglas II' (: potential problem because of "Mac" prefix and generational name "II" :),
        'O’CONNOR, Roderic L' (: potential problem because of apostrophe in surname :),
        'VAN Hollen, Christopher' (: potential problem because the last name is in two parts :),
        'CHERWELL, Lord (Frederick Alexander Lindemann)' (: potential problem because of the parentheses :),
        'LINDEMANN, Frederick Alexander.' (: potential problem because of the period :),
        'MAO TSE-TUNG' (: potential problem because names are not comma-delimited :)
    )
return
    element results {
        for $name in $names
        return
            element result {
                element source { $name },
                element repair { local:fix-name-capitalization($name) }
            }
    }
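The word-splitting step described in the function's long comment can be tried on its own. The following is a minimal sketch, not part of the gist: `local:words` is a hypothetical helper name introduced here for illustration, and it reuses the same `[A-Za-z-]+` pattern to list only the matched "words" of a name.

```xquery
xquery version "3.0";

declare namespace fn="http://www.w3.org/2005/xpath-functions";

(: Hypothetical helper, not part of the gist: return only the "words"
   (the fn:match elements) that analyze-string() finds in a name. :)
declare function local:words($name as xs:string) as xs:string* {
    analyze-string($name, '[A-Za-z-]+')/fn:match/string()
};

(
    local:words('MCCLURKIN, Robert J. G.'),  (: "MCCLURKIN", "Robert", "J", "G" :)
    local:words('MAO Tse-tung')              (: "MAO", "Tse-tung" :)
)
```

Because the hyphen is part of the pattern, "Tse-tung" comes back as a single word, which is why the capitalization rule leaves its second half in lower case.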
<results>
    <result>
        <source>MCCARTHY, Senator Joseph R.</source>
        <repair>McCarthy, Senator Joseph R.</repair>
    </result>
    <result>
        <source>MACARTHUR, Douglas II</source>
        <repair>MacArthur, Douglas II</repair>
    </result>
    <result>
        <source>O’CONNOR, Roderic L</source>
        <repair>O’Connor, Roderic L</repair>
    </result>
    <result>
        <source>VAN Hollen, Christopher</source>
        <repair>Van Hollen, Christopher</repair>
    </result>
    <result>
        <source>CHERWELL, Lord (Frederick Alexander Lindemann)</source>
        <repair>Cherwell, Lord (Frederick Alexander Lindemann)</repair>
    </result>
    <result>
        <source>LINDEMANN, Frederick Alexander.</source>
        <repair>Lindemann, Frederick Alexander.</repair>
    </result>
    <result>
        <source>MAO TSE-TUNG</source>
        <repair>Mao Tse-tung</repair>
    </result>
</results>
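The source/repair pairs in these results could also serve as a small regression check. The sketch below is one possible way to do that, under stated assumptions: the `case`, `checks`, and `check` element names and the `passed` attribute are invented here for illustration, and the function is a condensed copy of the gist's function, repeated only so the query runs on its own.

```xquery
xquery version "3.0";

declare namespace fn="http://www.w3.org/2005/xpath-functions";

(: Condensed copy of the gist's function, repeated so this sketch is self-contained. :)
declare function local:fix-name-capitalization($name as xs:string) as xs:string {
    string-join(
        for $node in analyze-string($name, '[A-Za-z-]+')/*
        return
            if ($node/self::fn:non-match) then
                $node/string()
            else if (starts-with(lower-case($node), 'mac')) then
                concat('Mac', upper-case(substring($node, 4, 1)), lower-case(substring($node, 5)))
            else if (starts-with(lower-case($node), 'mc')) then
                concat('Mc', upper-case(substring($node, 3, 1)), lower-case(substring($node, 4)))
            else if (matches($node, '^[IVX]+$')) then
                $node/string()
            else
                concat(upper-case(substring($node, 1, 1)), lower-case(substring($node, 2)))
    )
};

(: Hypothetical test data: each case pairs a source name with the repair
   shown in the results. The element and attribute names are assumptions. :)
let $cases :=
    (
        <case source="MCCARTHY, Senator Joseph R." expected="McCarthy, Senator Joseph R."/>,
        <case source="MACARTHUR, Douglas II" expected="MacArthur, Douglas II"/>,
        <case source="MAO TSE-TUNG" expected="Mao Tse-tung"/>
    )
return
    element checks {
        for $case in $cases
        let $actual := local:fix-name-capitalization($case/@source)
        return
            element check {
                (: compare the computed repair with the expected one :)
                attribute passed { $actual eq $case/@expected },
                $actual
            }
    }
```

If the function behaves as in the results shown, every `check` element should carry `passed="true"`. Adding a new rule then only requires adding a `case` for it and re-running the query.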