import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)      # set flag to allow verbose regexps
        ([A-Z])(\.[A-Z])+\.?    # abbreviations, e.g. U.S.A.
      | \w+(-\w+)*              # words with optional internal hyphens
      | \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
      | \.\.\.                  # ellipsis
      | [][.,;"'?():-_`]        # these are separate tokens
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

# Taken from Su Nam Kim Paper...
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)

print postoks

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')


def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.node=='NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted


def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(w) for w,t in leaf if acceptable_word(w) ]
        yield term

terms = get_terms(tree)

for term in terms:
    for word in term:
        print word,
I'm not sure if it's just me, but the verbose regular expression used for tokenization did not work for me. Here is a fix: I used parentheses for grouping the given expressions and changed all of them to non-capturing.
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'
thanks @anupamchoudhari!!
Thanks @alexbowe this is very useful for my current research, and thanks @anupamchoudhari for the fix!
Hm. I received this error: AttributeError: 'tuple' object has no attribute 'isdigit'. It seems to be a bug in the most recent nltk release; installing 3.0.5 fixes it.
Yeah, I'm getting the same error as @petulla. What's wrong?
@petulla and @Rich700000000000, it looks like it works fine if you make the changes mentioned by @anupamchoudhari and @tejasshah93. Thanks guys!!
Note that {<NBAR><IN><NBAR>} should come above {<NBAR>} for it to work.
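For illustration, this is what the gist's grammar looks like with that reordering applied (a sketch; only the order of the two NP rules changes):

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}      # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR><IN><NBAR>}      # the longer in/of pattern is tried first
        {<NBAR>}                # then bare NBARs
"""
chunker = nltk.RegexpParser(grammar)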
thanks @anupamchoudhari!!
This expression
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'
is not valid as is. Is nobody else getting this SyntaxError? It seems to be at the final ?, presumably from the ' closing the string.
@Phdntom: Yes I get a syntax error too. Did you find the solution?
Getting the same error as @Phdntom. Has anyone got a solution?
@Phdntom
For your syntax error, you have to escape the single quote ' like this:
r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"\'?():-_`])'
The syntax error disappears, but I get another error when parsing the regex: error: nothing to repeat
Hi guys,
I am also getting the same error, "sre_constants.error: nothing to repeat at position 48".
Can anyone suggest how to fix it?
Traceback (most recent call last):
  File "C:/Users/mohan.choudhary/Desktop/Copied_Shared/New folder/KeyTokenizer.py", line 24, in <module>
    toks = nltk.regexp_tokenize(text, sentence_re)
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 196, in regexp_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 119, in tokenize
    self._check_regexp()
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 116, in _check_regexp
    self._regexp = re.compile(self._pattern, self._flags)
  File "C:\Python_3.5.0\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\Python_3.5.0\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python_3.5.0\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Python_3.5.0\lib\sre_parse.py", line 778, in _parse
    p = _parse_sub(source, state)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Python_3.5.0\lib\sre_parse.py", line 638, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 50
thanks
@Mohan-kr did you solve that error? I am also getting the same error.
I'm currently working on a project that uses some of the natural language features present in NLTK. I know this post is 6 years old now, but as I've stumbled onto this gist I think it might be useful if @alexbowe posted (and edited) this gist again with the requirements for this script to run.
In my experience, to run it "out of the box" it needs (and this is by no means a complete list of requirements):
If you're running Python 2.7:
- Python 2.7+
- nltk
- The POS (Part of Speech) tagger with the identifier: maxent_treebank_pos_tagger
- A Model with the identifier: averaged_perceptron_tagger
- A Corpora with the identifier: stopwords
Using Python 2.7 (with an unmodified version of the script) it will run with some exceptions.
If you're running Python 3.5:
- Python 3.5+ (with some minor changes to the script to replace the old print construct with the newer print() function)
- nltk
- The POS (Part of Speech) tagger with the identifier: maxent_treebank_pos_tagger
- A Model with the identifier: averaged_perceptron_tagger
- A Corpora with the identifier: stopwords
Using Python 3.5 it will NOT run, failing with the exception "AttributeError: 'tuple' object has no attribute 'isdigit'"; as I'm not a Python developer, I don't know what to do about it. However, if you can install both versions of Python, it is better to run it on Python 2.7.
Take into account that you might need to switch from pip to pip3 (when installing Python modules) as the latter is used on Python 3.x installations.
The dependencies for nltk can be installed from the Python shell (>>>) with the utility nltk.download().
I hope these indications are useful for someone else.
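As an illustration, downloading those resources from the Python shell looks roughly like this (a sketch; the identifiers are the ones mentioned above, plus wordnet, which is my assumption since the script uses WordNetLemmatizer):

import nltk

# Download the tagger models and the stopwords corpus used by the script.
nltk.download('maxent_treebank_pos_tagger')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')  # assumption: needed by nltk.WordNetLemmatizer, not listed above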
Agree with @renaud.
Any new rules have to be placed before {<NBAR>}.
I actually added this as well: {} to capture all NPs that conjoin with each other.
I need to extract words that are verb phrases along with noun phrases. I have defined the grammar correctly, but I think that where we are checking t.node a simple "or" will not suffice, because it leads to the extracted words being printed twice, sometimes sentence-wise and sometimes consecutively, since my grammar has NP inside VP. I checked my tree and it seems okay. Does anyone have a solution to this?
Why not use NBAR:{<NN*|JJ><NN>}? Why are those dots there?
@Mohan-kr @hash-include did you solve the error you were getting for this problem?
For the error AttributeError: 'tuple' object has no attribute 'isdigit', see the solution below.
You need to uninstall higher versions of nltk; it works with version 3.0.
Solution:
The default tagger was changed to the Perceptron tagger in nltk 3.1, which is now the latest version. All my nltk.regexp_tokenize calls stopped functioning correctly and all my nltk.pos_tag calls started giving the above error.
The solution I currently have is to use the previous version, nltk 3.0.1, to make them work. I am not sure if this is a bug in the current release of nltk.
Installation instructions for the nltk 3.0.4 version on Ubuntu. From your home directory or any other directory, do the following steps.
$ wget https://github.com/nltk/nltk/archive/3.0.4.tar.gz
$ tar -xvzf 3.0.4.tar.gz
$ cd nltk-3.0.4
$ sudo python3.4 setup.py install
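After installing, a quick sanity check (a sketch, not part of the original instructions) to confirm the downgrade took effect:

import nltk
print(nltk.__version__)  # should print 3.0.4 after the downgrade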
You can use the following command to install nltk 3.0.4:
pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
It will automatically uninstall your latest version.
/****************************************************************************/
pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Collecting https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Downloading nltk-3.0.4.tar.gz (1.0MB)
100% |################################| 1.0MB 562kB/s
Building wheels for collected packages: nltk
Running setup.py bdist_wheel for nltk ... done
Stored in directory: C:\Users\1534038\AppData\Local\pip\Cache\wheels\8a\1e\1e\9f124d9995acdfd40f645da9592cd126f6fbe19b5e54b1c4b4
Successfully built nltk
Installing collected packages: nltk
Found existing installation: nltk 3.2.4
Uninstalling nltk-3.2.4:
Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.0.4
/**************************************************************************************************/
After this I am able to run the above code
Traceback (most recent call last):
  File "nltk-intro.py", line 31, in <module>
    toks = nltk.regexp_tokenize(text, sentence_re)
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 203, in regexp_tokenize
    return tokenizer.tokenize(text)
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 126, in tokenize
    self._check_regexp()
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 121, in _check_regexp
    self._regexp = compile_regexp_to_noncapturing(self._pattern, self._flags)
  File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 55, in compile_regexp_to_noncapturing
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
  File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 51, in convert_regexp_to_noncapturing_parsed
    parsed_pattern.pattern.groups = 1
AttributeError: can't set attribute
Error encountered after following this advice:
You can use the following command to install nltk 3.0.4:
pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
It will automatically uninstall your latest version.
/****************************************************************************/
pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Collecting https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Downloading nltk-3.0.4.tar.gz (1.0MB)
100% |################################| 1.0MB 562kB/s
Building wheels for collected packages: nltk
Running setup.py bdist_wheel for nltk ... done
Stored in directory: C:\Users\1534038\AppData\Local\pip\Cache\wheels\8a\1e\1e\9f124d9995acdfd40f645da9592cd126f6fbe19b5e54b1c4b4
Successfully built nltk
Installing collected packages: nltk
Found existing installation: nltk 3.2.4
Uninstalling nltk-3.2.4:
Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.0.4
/**************************************************************************************************/
I have made the changes suggested by @anupamchoudhari and @tejasshah93.
I am getting a syntax error in the regular expression @anupamchoudhari suggested. I am using Python 3.6.3. Any help fixing it is greatly appreciated, as I am a newbie in Python and NLTK.
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'
The following regular expression seems to work in Python 3.x
sentence_re = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
from https://stackoverflow.com/questions/36353125/nltk-regular-expression-tokenizer
Plus other fixes:
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
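A quick way to confirm the fixed pattern behaves (a sketch, assuming text and sentence_re are defined as in the gist above):

import nltk

toks = nltk.regexp_tokenize(text, sentence_re)
print(toks[:10])  # the first few tokens: words, abbreviations and punctuation as separate items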
@jamesballard Thanks! it works for me with Python 3.x
I am getting an error from running the code below:
postoks = nltk.tag.pos_tag(toks)
URLError:
Working for Python 3.6.
- line 44: change t.node to t.label()
- line 50: change stemmer.stem_word(word) to stemmer.stem(word)
Full working version:
import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*            # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
      | \.\.\.                  # ellipsis
      | [][.,;"'?():_`-]        # these are separate tokens; includes ], [
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

# Taken from Su Nam Kim Paper...
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)

print(postoks)

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')


def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted


def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(w) for w,t in leaf if acceptable_word(w) ]
        yield term

terms = get_terms(tree)

for term in terms:
    for word in term:
        print(word)
    print(term)
thank you
Thank you @Rich2020, worked for me :)
After the stemmer, the key phrases come out as things like "digit comput"?
These output key phrases do not make sense in some situations.
Does it make sense to comment out the stemmer step sometimes?
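If the stemmed phrases are too mangled, one option (a sketch; this deliberately changes the gist's normalise step, so it is an assumption about what you want rather than the original behaviour) is to skip the stemmer and keep only the lemmatizer:

def normalise(word):
    """Normalises a word to lowercase and lemmatizes it, without stemming."""
    word = word.lower()
    # word = stemmer.stem(word)  # skipped so phrases stay readable, e.g. "digital computer"
    word = lemmatizer.lemmatize(word)
    return word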