Last active
May 23, 2022 19:43
Removing Vowels from Hebrew Unicode Text
/*
 * One of the questions that recently came up is how to remove vowels from Hebrew text in Unicode
 * (or from any similar script). A quick look at the Hebrew Unicode chart shows that the vowel points
 * and cantillation marks all fall between 0x0591 (1425) and 0x05C7 (1479). Note that this range also
 * covers a few punctuation marks, such as the maqaf (0x05BE), which get stripped along with them.
 * With this range and JavaScript's charCodeAt function, it is trivial to strip them out as follows.
 *
 * Live demo is available here:
 * https://jsfiddle.net/js0ge7gn/
 */
function stripVowels(rawString)
{
  var newString = '';
  // Declare j locally so the loop does not leak a global variable
  for (var j = 0; j < rawString.length; j++) {
    // Keep only characters outside the 0x0591-0x05C7 range
    if (rawString.charCodeAt(j) < 1425
        || rawString.charCodeAt(j) > 1479)
    { newString = newString + rawString.charAt(j); }
  }
  return newString;
}
/* @shimondoodkin suggested an even shorter way to do this */
function stripVowels2(rawString) {
  return rawString.replace(/[\u0591-\u05C7]/g, "");
}
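Because the 0x0591-0x05C7 range also contains punctuation (maqaf 0x05BE, paseq 0x05C0, sof pasuq 0x05C3, nun hafukha 0x05C6), a narrower character class may be preferable when only the points and cantillation marks should go. The following is a minimal sketch of that idea, not part of the original gist; the name `stripPointsOnly` is hypothetical.

```javascript
// Sketch: strip Hebrew points and cantillation marks but keep punctuation.
// The class skips 0x05BE (maqaf), 0x05C0 (paseq), 0x05C3 (sof pasuq)
// and 0x05C6 (nun hafukha), which sit inside the 0x0591-0x05C7 range.
function stripPointsOnly(rawString) {
  return rawString.replace(
    /[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g,
    ""
  );
}

// Usage: a pointed two-word phrase joined by a maqaf, written with escapes
// so the individual code points are visible.
var pointed = "\u05D0\u05B6\u05EA\u05BE\u05D9\u05B9\u05D5\u05E1\u05B5\u05E3";
console.log(stripPointsOnly(pointed));
```

Unlike `stripVowels2`, this leaves the maqaf between the two words in place.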
I don't know why the code font didn't kick in.
When stripping nequdot from Torah passages (essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes out more vavs (and maybe yods too). So I wrote this variation, which checks whether the o / u sound is already represented by a letter before deleting the vowel mark. If it isn't represented, I add a vav.
`function stripVowels(rawString){
  var newString = '';
  for (var j = 0; j < rawString.length; j++) {
    // If it has an O (holam) that isn't otherwise represented, then add a vav
    if (rawString.charCodeAt(j) == 1465 && rawString[j+1] != "ו" && rawString[j+1] != "א" && rawString[j+1] != "ה") {
      newString += "ו";
    }
    // If it has a U (qubuts), also add a vav
    else if (rawString.charCodeAt(j) == 1467) {
      newString += "ו";
    }
    // Turn the Hebrew hyphen (maqaf) into a space
    else if (rawString.charCodeAt(j) == 1470) newString += " ";
    // Keep anything outside the vowel range (normal letters and punctuation)
    else if (rawString.charCodeAt(j) < 1425 || rawString.charCodeAt(j) > 1479) {
      newString += rawString.charAt(j);
    }
  }
  return newString;
}
// O == 1465
// hyphen == 1470
// U == 1467`
Here's an example of the difference:
וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים
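The same rules can probably also be written as a chain of regular-expression replacements, in the spirit of the shorter `stripVowels2` in the gist. This is only a sketch under that assumption; the name `stripVowelsAddVav` is hypothetical, and the code points are written as escapes to sidestep bidirectional display issues.

```javascript
// Sketch: regex version of the vav-adding variant.
// 0x05B9 = holam (O), 0x05BB = qubuts (U), 0x05BE = maqaf (hyphen),
// 0x05D5 = vav, 0x05D0 = alef, 0x05D4 = he.
function stripVowelsAddVav(rawString) {
  return rawString
    // Holam not followed by vav, alef or he becomes a vav
    .replace(/\u05B9(?![\u05D5\u05D0\u05D4])/g, "\u05D5")
    // Qubuts always becomes a vav
    .replace(/\u05BB/g, "\u05D5")
    // Maqaf becomes a space
    .replace(/\u05BE/g, " ")
    // Everything left in the 0x0591-0x05C7 range is dropped
    .replace(/[\u0591-\u05C7]/g, "");
}

// Usage: a pointed word whose holam has no vav after it.
var word = "\u05DB\u05B0\u05BC\u05EA\u05B9\u05E0\u05B6\u05EA";
console.log(stripVowelsAddVav(word));
```

The order of the replacements matters: the holam, qubuts and maqaf rules have to run before the final range removal, since all three characters fall inside that range.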