/*
* One of the questions that recently came up is how to remove vowels from Hebrew characters in Unicode
* (or any other similar language). A quick look at Hebrew Unicode chart shows that the vowels are all
* located between 0x0591 (1425) and 0x05C7 (1479). With this and Javascript's charCodeAt function, it
* is trivial to strip them out with Javascript as follows
*
* Live demo is available here:
* https://jsfiddle.net/js0ge7gn/
*/
function stripVowels(rawString) | |
{ | |
var newString = ''; | |
for(var j=0; j<rawString.length; j++) {
if(rawString.charCodeAt(j)<1425 | |
|| rawString.charCodeAt(j)>1479) | |
{ newString = newString + rawString.charAt(j); } | |
} | |
return(newString); | |
} | |
/* @shimondoodkin suggested even a much shorter way to do this */ | |
function stripVowels2(rawString) { | |
return rawString.replace(/[\u0591-\u05C7]/g,"") | |
} | |
@flamholz You can just exclude the "Maqaf" (the "dash" connecting two words in the Hebrew text, Unicode U+05BE) in the regex: return rawString.replace(/[\u0591-\u05BD\u05BF-\u05C7]/g,"");
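To make the effect of that regex concrete, here is a minimal sketch. The function names `stripVowelsKeepMaqaf` and `stripVowelsMaqafToSpace` are made up for illustration, and the second one (turning the maqaf into a space) is an assumed variant, not something proposed in the comment above.

```javascript
// Strip U+0591..U+05C7 except U+05BE (maqaf), using the regex suggested above.
function stripVowelsKeepMaqaf(rawString) {
  return rawString.replace(/[\u0591-\u05BD\u05BF-\u05C7]/g, "");
}

// Hypothetical variant: strip the marks, then turn each maqaf into a space.
function stripVowelsMaqafToSpace(rawString) {
  return stripVowelsKeepMaqaf(rawString).replace(/\u05BE/g, " ");
}

// Two pointed words joined by a maqaf, written with escapes so the
// exact code points are visible: alef, segol, tav, maqaf, yod, vav,
// holam, samekh, tsere, final pe.
var sample = "\u05D0\u05B6\u05EA\u05BE\u05D9\u05D5\u05B9\u05E1\u05B5\u05E3";

console.log(stripVowelsKeepMaqaf(sample));    // letters with the maqaf kept
console.log(stripVowelsMaqafToSpace(sample)); // letters with a space instead
```

Which of the two you want probably depends on whether the output is meant for display (keep the maqaf) or for word-by-word processing (a space is easier to split on).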
Thanks for the Python script, @yakovsh!
Based on that script I created a more advanced script in Python.
It has a GUI, using Tkinter!
It can read and write text files, and it can also remove only the Taamei haMikra characters if you wish.
The script is part of an open source Bible software project I am writing, called Emmeth.
Check it out here: https://github.com/Emmeth/tools/blob/master/scripts/remove_nikkud.py
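For anyone who wants the "only Taamei haMikra" idea in JavaScript, here is a rough sketch. It is not a port of the Python script linked above (I have not reproduced its behavior). The function name `stripMarks` and its `options` argument are hypothetical, and the split of the range into cantillation marks and points follows the Unicode Hebrew block chart.

```javascript
// Cantillation marks (Taamei haMikra): U+0591..U+05AF.
var TAAMIM = /[\u0591-\u05AF]/g;
// Points (nikkud): U+05B0..U+05BD, U+05BF, U+05C1, U+05C2, U+05C4, U+05C5, U+05C7.
// The punctuation in the range (maqaf U+05BE, paseq U+05C0, sof pasuq U+05C3,
// nun hafukha U+05C6) is deliberately left out of both classes.
var NIKKUD = /[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g;

// Hypothetical helper: pass { nikkud: false } to remove only the taamim,
// or { taamim: false } to remove only the points. Both default to true.
function stripMarks(rawString, options) {
  var opts = Object.assign({ taamim: true, nikkud: true }, options);
  var result = rawString;
  if (opts.taamim) result = result.replace(TAAMIM, "");
  if (opts.nikkud) result = result.replace(NIKKUD, "");
  return result;
}

// A pointed word that also carries one cantillation mark (U+0596).
var word = "\u05D1\u05BC\u05B0\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u0596\u05D9\u05EA";
console.log(stripMarks(word, { nikkud: false })); // points kept, taamim gone
console.log(stripMarks(word));                    // bare letters
```

Unlike the single-range regex, this leaves the maqaf and sof pasuq in place unless you strip them separately.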
So this thread has been super helpful. I've been using the Google Sheets function suggested by yakovsh, with avrtau's modification to keep the maqqef and sof pasuq. Here's what I'm using: =REGEXREPLACE(B1,"[\x{0591}-\x{05BD}\x{05BF}-\x{05C2}\x{05C4}-\x{05C7}]","")
But what I want to know is whether and how I could incorporate similar code into a macro for Word, since that is ultimately what I am using this for--to remove nikkud from Hebrew words and text extracts in a Word document. Currently I copy the vocalized text from the Word doc, paste it into my Google sheet, and then copy/paste that output over the vocalized text back in Word. It would be faster and more efficient if I could just select the vocalized text and run a macro on it with a shortcut key. I'm fairly sure this is possible, but I don't know which, if any, of the code listed above would work. I'm pretty new to Word macros; I'm familiar with creating and editing them, and I understand the basics, but I don't know the syntax myself. Thanks for any suggestions!
Thanks for the ideas above. I imported all the data in an Excel spreadsheet and couldn't find a way to do it elegantly using a regexreplace function, say.
This is a bit brute-force, but it worked for me and only took about 10 seconds to convert around 10,000 words and sentences:
Function stripVowels(rawString)
Dim stripped As String
Dim H As Long
stripped = rawString
For H = 1425 To 1479
stripped = Replace(stripped, ChrW(H), "")
Next
stripVowels = stripped
End Function
And then just type "=stripVowels(A2)" in the cell(s) where you want nikkud-less Hebrew text (obviously replace "A2" with whatever the cell is of the original text).
:)
Here's the variation required for a Word Find-and-Replace macro:
Sub StripNkudot()
Application.DisplayAlerts = False
For H = 1425 To 1479
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
With Selection.Find
.Text = ChrW(H)
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchKashida = False
.MatchDiacritics = False
.MatchAlefHamza = False
.MatchControl = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute replace:=wdReplaceAll
Next
Application.DisplayAlerts = True
End Sub
(You might want to comment out DisplayAlerts=False if you want to see what's going on, but if you have a lot of text then switching off the display will speed up the process considerably.)
Golang example based on other comments on this gist: https://play.golang.org/p/B9VRe-L5lS3
When stripping nequdot from Torah passages (and essentially trying to modernize the spelling for myself), I realized that modern Hebrew actually writes in more vav's (and maybe yud's too). So I wrote this variation that checks to see if the o / u sound is represented before deleting one of its various representations. If it isn't represented, I ADD a vav.
`function stripVowels(rawString){
var newString = '';
for(var j=0; j<rawString.length; j++) {
//If it has an O that isn't otherwise represented, then add a vav
if(rawString.charCodeAt(j) == 1465 && rawString[j+1] != "ื" && rawString[j+1] != "ื" && rawString[j+1] != "ื"){
newString += "ו";
}
//If it has a U, also add a vav
else if(rawString.charCodeAt(j) == 1467){
newString += "ו";
}
//Turn Hebrew hyphen into space
else if(rawString.charCodeAt(j) == 1470) newString += " ";
//Get rid of anything that's not a normal letter or punctuation
else if(rawString.charCodeAt(j)<1425 || rawString.charCodeAt(j)>1479){
newString += rawString.charAt(j);
}
}
return(newString);
}
//O == 1465
//hyphen == 1470
//U == 1467`
Here's an example of the difference:
וְיִשְׂרָאֵל אָהַב אֶת־יֹוסֵף מִכָּל־בָּנָיו כִּי־בֶן־זְקֻנִים הוּא לֹו וְעָשָׂה לֹו כְּתֹנֶת פַּסִּים ׃
וישראל אהב את יוסף מכל בניו כי בן זקונים הוא לו ועשה לו כתונת פסים
I don't know why the code font didn't kick in.
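In case it helps anyone reading along, the same idea can be sketched with named constants and escaped code points, so nothing depends on how the Hebrew letters survive copy and paste. This is a simplified restatement, not the exact function above. The name `stripVowelsAddVav` is made up, and it assumes the "o" sound counts as already written only when a vav sits directly before or after the holam. The check on the preceding character is my own addition, to cover text that stores the vav before the holam.

```javascript
// Code points taken from the comments in the snippet above.
var HOLAM = 0x05B9;  // 1465, the "o" vowel point
var QUBUTS = 0x05BB; // 1467, the "u" vowel point
var MAQAF = 0x05BE;  // 1470, the Hebrew hyphen
var VAV = "\u05D5";

function stripVowelsAddVav(rawString) {
  var out = "";
  for (var i = 0; i < rawString.length; i++) {
    var code = rawString.charCodeAt(i);
    if (code === HOLAM) {
      // Assumption: only a neighbouring vav counts as "o already written".
      var hasVav = rawString[i - 1] === VAV || rawString[i + 1] === VAV;
      if (!hasVav) out += VAV;
    } else if (code === QUBUTS) {
      out += VAV;          // write the "u" sound as a vav
    } else if (code === MAQAF) {
      out += " ";          // Hebrew hyphen becomes a space
    } else if (code < 0x0591 || code > 0x05C7) {
      out += rawString.charAt(i); // keep everything outside the mark range
    }
  }
  return out;
}

// kaf, dagesh, sheva, tav, holam, nun, segol, tav: gains a vav after the tav.
console.log(stripVowelsAddVav("\u05DB\u05BC\u05B0\u05EA\u05B9\u05E0\u05B6\u05EA"));
```

Words where another letter carries the "o" sound would get an extra vav under this simplification, so treat it as a starting point only.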
And remove vowels in Java: