/*
 * One of the questions that recently came up is how to remove vowels from Hebrew characters in Unicode
 * (or any other similar language). A quick look at the Hebrew Unicode chart shows that the vowel points
 * (along with the cantillation marks and a few punctuation signs such as the maqaf) are all
 * located between 0x0591 (1425) and 0x05C7 (1479). With this and JavaScript's charCodeAt function, it
 * is trivial to strip them out with JavaScript as follows.
 *
 * Live demo is available here:
 * https://jsfiddle.net/js0ge7gn/
 */
function stripVowels(rawString)
{
    var newString = '';
    // Declare j locally so the loop does not leak a global variable.
    for (var j = 0; j < rawString.length; j++) {
        var code = rawString.charCodeAt(j);
        // Keep only characters outside the 0x0591-0x05C7 range.
        if (code < 1425 || code > 1479) {
            newString = newString + rawString.charAt(j);
        }
    }
    return newString;
}
/* @shimondoodkin suggested a much shorter way to do this */
function stripVowels2(rawString) {
    return rawString.replace(/[\u0591-\u05C7]/g, "");
}
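One caveat with the 0x0591-0x05C7 range: it also contains a few punctuation signs (maqaf U+05BE, paseq U+05C0, sof pasuq U+05C3, nun hafukha U+05C6), so both functions above delete those too. If that is not wanted, a narrower character class is one option. The following is only a sketch; the name `stripMarksOnly` is made up for illustration and is not part of the gist.

```javascript
// Hypothetical variant (not from the original gist): strip vowel points and
// cantillation marks but leave Hebrew punctuation in place.
function stripMarksOnly(rawString) {
  // Removed: U+0591-U+05BD (accents and points), U+05BF (rafe),
  // U+05C1/U+05C2 (shin and sin dots), U+05C4/U+05C5 (upper and lower dots),
  // U+05C7 (qamats qatan).
  // Kept on purpose: U+05BE maqaf, U+05C0 paseq, U+05C3 sof pasuq, U+05C6 nun hafukha.
  return rawString.replace(/[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g, "");
}

// Two pointed words joined by a maqaf, written with escapes so the exact
// code points are unambiguous.
var pointed = "\u05D0\u05B6\u05EA\u05BE\u05D9\u05D5\u05B9\u05E1\u05B5\u05E3";
console.log(stripMarksOnly(pointed));
```

Whether punctuation should survive depends on the use case, so treat the list of kept code points as an assumption to adjust.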
Golang example based on other comments on this gist: https://play.golang.org/p/B9VRe-L5lS3
When stripping nequdot from Torah passages (and essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes in more vavs (and maybe yods too). So I wrote this variation, which checks whether the o / u sound is already represented by a letter before deleting the vowel point. If it isn't represented, I ADD a vav.
`function stripVowels(rawString){
    var newString = '';
    for(var j=0; j<rawString.length; j++) {
        var code = rawString.charCodeAt(j);
        var next = rawString[j+1];
        //If it has an O that isn't otherwise represented, then add a vav
        if(code == 1465 && next != "ו" && next != "א" && next != "ה"){
            newString += "ו";
        }
        //If it has a U, also add a vav
        else if(code == 1467){
            newString += "ו";
        }
        //Turn Hebrew hyphen (maqaf) into space
        else if(code == 1470) newString += " ";
        //Keep everything outside the 1425-1479 range; the remaining points and marks are dropped
        else if(code<1425 || code>1479){
            newString += rawString.charAt(j);
        }
    }
    return(newString);
}
//O == 1465 (holam)
//hyphen == 1470 (maqaf)
//U == 1467 (qubuts)`
Here's an example of the difference:
וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים
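For comparison, the same rules can probably be written as a chain of `replace` calls instead of a loop. This is only a sketch under the assumption that the four rules are exactly the ones in the code above; the name `modernizeSpelling` is hypothetical, and the code points are written as escapes (U+05B9 = 1465, U+05BB = 1467, U+05BE = 1470).

```javascript
// Hypothetical regex-based rewrite of the loop version; not from the original comment.
function modernizeSpelling(rawString) {
  return rawString
    // Holam (U+05B9) not directly followed by vav, alef, or he: write the "o" as a vav.
    .replace(/\u05B9(?![\u05D5\u05D0\u05D4])/g, "\u05D5")
    // Qubuts (U+05BB): write the "u" as a vav.
    .replace(/\u05BB/g, "\u05D5")
    // Maqaf (U+05BE, the Hebrew hyphen) becomes a space.
    .replace(/\u05BE/g, " ")
    // Drop whatever is left in the 0x0591-0x05C7 range.
    .replace(/[\u0591-\u05C7]/g, "");
}

// kaf + shva + dagesh + tav + holam + nun + segol + tav
console.log(modernizeSpelling("\u05DB\u05B0\u05BC\u05EA\u05B9\u05E0\u05B6\u05EA"));
```

The order of the calls matters: the holam and qubuts rules have to run before the final range removal, otherwise there is nothing left to turn into a vav.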
I don't know why the code font didn't kick in.
Here's the variation required for a Word Find-and-Replace macro:
(You might want to comment out DisplayAlerts=False if you want to see what's going on, but if you have a lot of text then switching off the display will speed up the process considerably.)