@chetkhatri
Created November 24, 2016 06:34
import nltk
import re

# sent_tokenize and word_tokenize need NLTK's Punkt tokenizer data to be installed.
with open('/home/chetan/Documents/sample-certificate.txt', 'r') as file:
    text = file.read()
# print(text)

# Split the text into sentences, then each sentence into word tokens.
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# print(tokenized_sentences)

# Map each sentence index to the token positions of "dated" and of commas.
sign_date = {}
for sentence_index, tokens in enumerate(tokenized_sentences):
    for token_index, token in enumerate(tokens):
        if token == "dated":
            sign_date.setdefault(sentence_index, []).append(token_index)
            print(token_index)
        elif token == ',':
            sign_date.setdefault(sentence_index, []).append(token_index)
    print(tokens)
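
# Agreement numbers are matched as "BP" followed by eight digits.
regex = re.compile(r"BP(\d{8})")
result = regex.search(text)
if result:
    print('Agreement Number: ' + result.group())
else:
    print('Agreement Number: not found')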
@chetkhatri

tokenized_sentences has that list I sent you earlier.
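
Given a list like that, one way to turn the position of "dated" into the date text is to take the tokens that follow it up to the next comma. This is only a rough sketch: the helper name tokens_after_dated and the sample sentence are made up for illustration, and it assumes the date itself contains no comma (a date such as "November 24, 2016" would be cut short).

```python
def tokens_after_dated(tokens, stop=","):
    """Return the tokens between "dated" and the next `stop` token.

    Returns an empty list when "dated" does not occur in `tokens`.
    """
    if "dated" not in tokens:
        return []
    start = tokens.index("dated") + 1
    try:
        end = tokens.index(stop, start)
    except ValueError:
        end = len(tokens)  # no stop token: take the rest of the sentence
    return tokens[start:end]


# Hypothetical sample, written by hand in the shape nltk.word_tokenize produces.
sample_tokens = ["This", "agreement", "dated", "24", "November", "2016", ",",
                 "is", "made", "between", "the", "parties", "."]
sign_date_text = " ".join(tokens_after_dated(sample_tokens))
print(sign_date_text)
```

Each entry of tokenized_sentences could be passed through the same helper in place of sample_tokens.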
