/*
 * One of the questions that recently came up is how to remove vowels from Hebrew text in Unicode
 * (or any other similar language). A quick look at the Hebrew Unicode chart shows that the vowel
 * points (along with the cantillation marks and a few punctuation characters) are all located
 * between 0x0591 (1425) and 0x05C7 (1479). With this and JavaScript's charCodeAt function, it
 * is trivial to strip them out as follows.
 *
 * Live demo is available here:
 * https://jsfiddle.net/js0ge7gn/
 */
function stripVowels(rawString)
{
    var newString = '';
    // Declare j locally so the loop does not create a global variable
    for (var j = 0; j < rawString.length; j++) {
        var code = rawString.charCodeAt(j);
        // Keep only characters outside the 0x0591-0x05C7 range
        if (code < 1425 || code > 1479) {
            newString = newString + rawString.charAt(j);
        }
    }
    return newString;
}
/* @shimondoodkin suggested an even shorter way to do this */
function stripVowels2(rawString) {
    return rawString.replace(/[\u0591-\u05C7]/g, "");
}
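Note that the 0x0591-0x05C7 range also covers a few punctuation characters, such as the maqaf (0x05BE) and sof pasuq (0x05C3), and that some text stores pointed letters as precomposed presentation forms (for example 0xFB2A, shin with shin dot), which the range above does not touch. Here is a minimal sketch of a narrower variant, assuming you want to keep the punctuation and that String.prototype.normalize is available (ES2015+). The names stripMarksOnly and HEBREW_MARKS are made up for illustration and are not part of the original gist.

```javascript
// Hypothetical helper, not from the original gist.
// Combining marks only: accents and points (0591-05BD), rafe (05BF),
// shin/sin dots (05C1-05C2), upper/lower dots (05C4-05C5), qamats qatan (05C7).
// Deliberately excluded: maqaf (05BE), paseq (05C0), sof pasuq (05C3),
// nun hafukha (05C6), which are punctuation rather than marks.
var HEBREW_MARKS = /[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g;

function stripMarksOnly(rawString) {
    // NFD splits precomposed letters such as U+FB2A (shin with shin dot)
    // into a base letter plus combining marks, so the regex can catch them.
    return rawString.normalize("NFD").replace(HEBREW_MARKS, "");
}

// Example: the maqaf between the two words survives, the points do not.
console.log(stripMarksOnly("\u05D0\u05B6\u05EA\u05BE\u05D9\u05B9\u05D5\u05E1\u05B5\u05E3"));
```

One side effect to be aware of: normalize("NFD") also decomposes accented letters from other scripts, so mixed-language text may want a final .normalize("NFC") call.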
Here's the variation required for a Word Find-and-Replace macro:
Sub StripNkudot()
    Dim H As Long
    Application.DisplayAlerts = False
    ' Run one replace-all pass per code point in the 1425-1479 range
    For H = 1425 To 1479
        Selection.Find.ClearFormatting
        Selection.Find.Replacement.ClearFormatting
        With Selection.Find
            .Text = ChrW(H)
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchKashida = False
            .MatchDiacritics = False
            .MatchAlefHamza = False
            .MatchControl = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        Selection.Find.Execute Replace:=wdReplaceAll
    Next
    Application.DisplayAlerts = True
End Sub
(You might want to comment out DisplayAlerts = False if you want to see what's going on, but if you have a lot of text then suppressing the alerts will speed up the process considerably.)
Golang example based on other comments on this gist: https://play.golang.org/p/B9VRe-L5lS3
When stripping nequdot from Torah passages (essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes in more vavs (and maybe yuds too). So I wrote this variation, which checks whether the o / u sound is already represented by a letter before deleting its vowel point. If it isn't represented, I ADD a vav.
`function stripVowels(rawString){
    var newString = '';
    for(var j=0; j<rawString.length; j++) {
        //If it has an O that isn't otherwise represented, then add a vav
        if(rawString.charCodeAt(j) == 1465 && rawString[j+1] != "ו" && rawString[j+1] != "א" && rawString[j+1] != "ה"){
            newString += "ו";
        }
        //If it has a U, also add a vav
        else if(rawString.charCodeAt(j) == 1467){
            newString += "ו";
        }
        //Turn Hebrew hyphen into space
        else if(rawString.charCodeAt(j) == 1470) newString += " ";
        //Keep everything outside the 1425-1479 range (normal letters and punctuation)
        else if(rawString.charCodeAt(j)<1425 || rawString.charCodeAt(j)>1479){
            newString += rawString.charAt(j);
        }
    }
    return(newString);
}
//O == 1465
//hyphen == 1470
//U == 1467`
Here's an example of the difference:
וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים
I don't know why the code font didn't kick in.
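For comparison, the vav-inserting rules above can also be sketched with chained replace calls, in the style of stripVowels2. This is only a sketch under the same assumptions as the loop version (an O counts as already represented only when the very next character is vav, alef, or he); the name modernizeSpelling is made up for illustration.

```javascript
// Hypothetical regex rewrite of the vav-inserting variant; not the commenter's code.
function modernizeSpelling(rawString) {
    return rawString
        // O (1465 / 05B9) not directly followed by vav, alef or he becomes a vav
        .replace(/\u05B9(?![\u05D5\u05D0\u05D4])/g, "\u05D5")
        // U (1467 / 05BB) always becomes a vav
        .replace(/\u05BB/g, "\u05D5")
        // Hebrew hyphen (1470 / 05BE) becomes a space
        .replace(/\u05BE/g, " ")
        // Everything left in the 0591-05C7 range is dropped
        .replace(/[\u0591-\u05C7]/g, "");
}

// A pointed word with an unwritten O gains a vav once the points are removed.
console.log(modernizeSpelling("\u05DB\u05BC\u05B0\u05EA\u05B9\u05E0\u05B6\u05EA"));
```

The order of the calls matters: the range-wide replace has to come last, otherwise the O and U points would already be gone before they could be turned into a vav.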
Thanks for the ideas above. I imported all the data into an Excel spreadsheet and couldn't find an elegant way to do it with, say, a regex-replace function.
This is a bit brute-force, but it worked for me and only took about 10 seconds to convert around 10,000 words and sentences:
And then just type "=stripVowels(A2)" in the cell(s) where you want nikkud-less Hebrew text (obviously replace "A2" with whatever the cell is of the original text).
:)