strikaco/2d19cf767b50578488f7ea28736293be
Last active January 21, 2017 12:08
Alternate implementation of NLTK's concordance() - no dependencies needed!
def concordance(string, search_term, width=25):
    """
    Alternative implementation of NLTK's concordance() that
    returns the matches, so you can print them to stdout or
    save them to a variable, and does not require NLTK.
    Just feed it a raw string, JSON string, etc. with any line
    breaks stripped out.
    """
    # An empty search term matches everywhere and would never advance
    if not search_term:
        return ()
    # Lowercase both once so the search is case-insensitive
    haystack = string.lower()
    needle = search_term.lower()
    # Offset tracks our progress as we parse through the string
    offset = 0
    # Indexes lets us store all the positions we find your term in
    indexes = []
    # Keep scanning through the string until we reach the end
    while offset < len(haystack):
        try:
            # From the current offset to the end of the string, find
            # the next position of your search term
            position = haystack.index(needle, offset)
        except ValueError:
            # Your term wasn't found; exit.
            break
        # Your term was found (position 0 is a valid match too).
        # Add it to the list of indexes
        indexes.append(position)
        # Now move the offset to just past the end of this match
        # so we resume scanning after it.
        offset = position + len(needle)
    # For each position where the term was found, return the leading and
    # trailing characters; max() keeps a match near the start of the
    # string from producing a negative slice index
    return tuple(string[max(0, index - width):index + width + len(search_term)]
                 for index in indexes)
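
For comparison, here is a minimal sketch of the same idea built on the standard library's `re.finditer`. It is not the gist author's method; the name `concordance_re` and the sample sentence are made up for illustration. It assumes you want literal, case-insensitive, non-overlapping matches, and it clamps the context window at the start of the string.

```python
import re


def concordance_re(text, search_term, width=25):
    """Hypothetical regex-based variant of concordance()."""
    # An empty term would match at every position, so return nothing.
    if not search_term:
        return ()
    # re.escape makes the term match literally; IGNORECASE roughly
    # mirrors the lower() comparison used in concordance().
    pattern = re.compile(re.escape(search_term), re.IGNORECASE)
    return tuple(
        # max(0, ...) stops a negative start index from wrapping around
        # to the end of the string when a match sits near the beginning.
        text[max(0, m.start() - width):m.end() + width]
        for m in pattern.finditer(text)
    )


# Illustrative input only.
sample = "The cat sat. A cat ran. Cat!"
for line in concordance_re(sample, "cat", width=4):
    print(repr(line))
```

Since the result is a plain tuple of strings, you can print it as above or keep it in a variable for later processing.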