Created
May 20, 2022 20:35
import re

text1 = """
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.
“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.”
Mr. Bennet made no answer.
“Do not you want to know who has taken it?” cried his wife impatiently.
“You want to tell me, and I have no objection to hearing it.”
This was invitation enough.
“Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise and four to see the place, and was so much delighted with it that he agreed with Mr. Morris immediately; that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune; four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so tiresome! You must know that I am thinking of his marrying one of them.”
“Is that his design in settling here?”
“Design! nonsense, how can you talk so! But it is very likely that he may fall in love with one of them, and therefore you must visit him as soon as he comes.”
“I see no occasion for that. You and the girls may go, or you may send them by themselves, which perhaps will be still better, for as you are as handsome as any of them, Mr. Bingley might like you the best of the party.”
“My dear, you flatter me. I certainly have had my share of beauty, but I do not pretend to be anything extraordinary now. When a woman has five grown-up daughters, she ought to give over thinking of her own beauty.”
“In such cases, a woman has not often much beauty to think of.”
“But, my dear, you must indeed go and see Mr. Bingley when he comes into the neighbourhood.”
“It is more than I engage for, I assure you.”
“But consider your daughters. Only think what an establishment it would be for one of them. Sir William and Lady Lucas are determined to go, merely on that account, for in general, you know, they visit no newcomers. Indeed you must go, for it will be impossible for us to visit him, if you do not.”
“You are over scrupulous, surely. I dare say Mr. Bingley will be very glad to see you; and I will send a few lines by you to assure him of my hearty consent to his marrying whichever he chooses of the girls; though I must throw in a good word for my little Lizzy.”
“I desire you will do no such thing. Lizzy is not a bit better than the others; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia. But you are always giving her the preference.”
“They have none of them much to recommend them,” replied he; “they are all silly and ignorant like other girls; but Lizzy has something more of quickness than her sisters.”
“Mr. Bennet, how can you abuse your own children in such a way? You take delight in vexing me. You have no compassion on my poor nerves.”
“You mistake me, my dear. I have a high respect for your nerves. They are my old friends. I have heard you mention them with consideration these twenty years at least.”
“Ah, you do not know what I suffer.”
“But I hope you will get over it, and live to see many young men of four thousand a year come into the neighbourhood.”
“It will be no use to us, if twenty such should come, since you will not visit them.”
“Depend upon it, my dear, that when there are twenty, I will visit them all.”
Mr. Bennet was so odd a mixture of quick parts, sarcastic humour, reserve, and caprice, that the experience of three-and-twenty years had been insufficient to make his wife understand his character. Her mind was less difficult to develop. She was a woman of mean understanding, little information, and uncertain temper. When she was discontented, she fancied herself nervous. The business of her life was to get her daughters married; its solace was visiting and news.
"""

text2 = """
The widely accepted idea of a cost-of-living crisis does not begin to capture the gravity of what may lie ahead. António Guterres, the un secretary general, warned on May 18th that the coming months threaten “the spectre of a global food shortage” that could last for years. The high cost of staple foods has already raised the number of people who cannot be sure of getting enough to eat by 440m, to 1.6bn. Nearly 250m are on the brink of famine. If, as is likely, the war drags on and supplies from Russia and Ukraine are limited, hundreds of millions more people could fall into poverty. Political unrest will spread, children will be stunted and people will starve.
"""

text3 = """
Bisulfite treatment of DNA followed by high-throughput sequencing (Bisulfite-seq) is an important method for studying DNA methylation and epigenetic gene regulation, yet current software tools do not adequately address single nucleotide polymorphisms (SNPs). Identifying SNPs is important for accurate quantification of methylation levels and for identification of allele-specific epigenetic events such as imprinting. We have developed a model-based bisulfite SNP caller, Bis-SNP, that results in substantially better SNP calls than existing methods, thereby improving methylation estimates. At an average 30× genomic coverage, Bis-SNP correctly identified 96% of SNPs using the default high-stringency settings.
"""

text4 = """
The United States has formal diplomatic relations with most nations. This includes all UN member and observer states other than Bhutan, Iran, North Korea and Syria, and the UN observer State of Palestine, the latter of which the U.S. does not recognize. Additionally, the U.S. has diplomatic relations with Kosovo and the European Union.
The United States federal statutes relating to foreign relations can be found in Title 22 of the United States Code. For several years, the United States had the most diplomatic posts of any state, but as of 2020, it is second to the People's Republic of China.
"""

text5 = """
In Kingdom of the Wicked, Helen Dale imagines an alt-history Roman Empire of cable news and airplanes — albeit one about to be greatly unsettled by a certain young, charismatic Jewish religious zealot from Palestine. How did Dale’s Roman Empire accelerate? Call it the Archimedes Mechanism. Rather than die at the Siege of Syracuse in 212 B.C., the brilliant Greek polymath Archimedes is captured by Roman forces and becomes the equivalent of the German rocket scientists scooped up by the American military after World War II. He invents calculus some 15 centuries before Isaac Newton — something he almost accomplished in our reality — and brought all manner of technical know-how to the Roman military, eventually triggering a wider technological revolution.
"""
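

def get_previous_word(text, i):
    """Return the run of word characters that ends just before index i."""
    if i == 0:
        return ''
    x = i - 1
    # Stop at the start of the text so x never wraps around to a negative index.
    while x >= 0 and re.match(r'\w', text[x]):
        x -= 1
    return text[x+1:i]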
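

# A minimal sketch, not part of the original script: the backwards scan in
# get_previous_word can also be written as one regex search over the prefix
# text[:i]. The name previous_word_regex is hypothetical and used only for
# illustration. The pattern uses \Z rather than $ so that a prefix ending in
# a newline yields '' instead of the word before the newline.
import re


def previous_word_regex(text, i):
    """Return the run of word characters that ends just before index i."""
    match = re.search(r'\w+\Z', text[:i])
    return match.group(0) if match else ''


# Example: previous_word_regex("Mr. Bennet", 2) looks at the prefix "Mr"
# and returns "Mr"; with i == 0 the prefix is empty and '' is returned.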


def get_sentences(text):
    """Split text into sentences using a punctuation-based scoring heuristic."""
    # Collapse repeated punctuation marks into a single mark.
    normalized_text = re.sub(r'\.+', '.', text)
    normalized_text = re.sub(r'\?+', '?', normalized_text)
    normalized_text = re.sub(r'!+', '!', normalized_text)
    # Replace mixed sequences of punctuation marks with a single dot.
    normalized_text = re.sub(r'[?.!]{2,}', '.', normalized_text)

    sentences = []
    sentence_start = 0
    for i, char in enumerate(normalized_text):
        # Only punctuation marks and closing quotes can end a sentence.
        if char not in ('.', '?', '!', '"', '”'):
            continue

        # A mark at the very end of the text closes the last sentence.
        if i == len(normalized_text) - 1:
            sentences.append(normalized_text[sentence_start:i+1].strip())
            break

        score = 0
        next_char = normalized_text[i+1]
        subsequent_char = None
        if i + 2 < len(normalized_text):
            subsequent_char = normalized_text[i+2]
        # Look up the word in normalized_text, because i indexes that string.
        prev_word = get_previous_word(normalized_text, i)

        # A punctuation mark right before a newline most probably ends a sentence.
        if next_char == '\n':
            score += 5

        # A space followed by a capital letter suggests a new sentence. Characters
        # without a lowercase form (digits, opening quotes) also pass this test.
        if (next_char == ' ' and subsequent_char is not None
                and subsequent_char.upper() == subsequent_char):
            score += 4

        # A "sentence" shorter than 3 characters is most probably an abbreviation.
        if i - sentence_start < 3:
            score -= 2

        # A previous word made only of consonants is probably an abbreviation.
        if re.match(r'^[b-df-hj-np-tv-z]+$', prev_word, flags=re.I):
            score -= 2

        # With a sufficient score, treat this position as the end of a sentence.
        if score >= 4:
            sentences.append(normalized_text[sentence_start:i+1].strip())
            sentence_start = i + 1

    return sentences
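

# A minimal sketch, not part of the original script: the scoring rules from
# get_sentences pulled out into a standalone function so that one candidate
# position can be inspected at a time. The names boundary_score and
# CONSONANTS_ONLY are hypothetical. Treating the end of the text like a
# newline is an assumption of this sketch; the weights (+5, +4, -2, -2) and
# the threshold of 4 are the ones used in get_sentences.
import re

CONSONANTS_ONLY = re.compile(r'^[b-df-hj-np-tv-z]+$', flags=re.I)


def boundary_score(text, i, sentence_start):
    """Score how likely the punctuation mark at text[i] ends a sentence."""
    next_char = text[i + 1] if i + 1 < len(text) else '\n'
    subsequent_char = text[i + 2] if i + 2 < len(text) else None
    match = re.search(r'\w+\Z', text[:i])
    prev_word = match.group(0) if match else ''

    score = 0
    if next_char == '\n':
        score += 5  # mark directly before a line break
    if (next_char == ' ' and subsequent_char is not None
            and subsequent_char.upper() == subsequent_char):
        score += 4  # space, then a character with no lowercase form
    if i - sentence_start < 3:
        score -= 2  # candidate sentence is very short
    if CONSONANTS_ONLY.match(prev_word):
        score -= 2  # previous word has no vowels, e.g. "Mr"
    return score


# Worked example on "Mr. Bennet replied. He left.\n":
#   the dot after "Mr" (index 2) gets +4 - 2 - 2 = 0, below the threshold;
#   the dot after "replied" (index 18) gets +4, which meets the threshold;
#   the final dot (index 27, sentence_start 19) gets +5 for the newline.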


for text in (text1, text2, text3, text4, text5):
    print("=" * 42)
    print("Text:")
    print(text)
    print("\n")
    sentences = get_sentences(text)
    print("{} sentences found:".format(len(sentences)))
    for sentence in sentences:
        print(sentence)
        print("")