@wesslen
Last active September 7, 2022 13:35
spaCy sentencizer script
import spacy
import srsly  # to easily read/write JSONL etc.

nlp = spacy.load("en_core_web_sm")  # or whatever you need
examples = srsly.read_jsonl("./data.jsonl")
texts = (eg["text"] for eg in examples)  # lazily pull the "text" field from each record
new_examples = []
for doc in nlp.pipe(texts):
    # doc.sents yields one span per detected sentence
    for sent in doc.sents:
        new_examples.append({"text": sent.text})
srsly.write_jsonl("./data-with-sentences.jsonl", new_examples)
{"text":"Prodigy is a scriptable annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration."}
{"text":"Today\u2019s transfer learning technologies mean you can train production-quality models with very few examples."}
{"text":"With Prodigy you can take full advantage of modern machine learning by adopting a more agile approach to data collection."}
{"text":"You'll move faster, be more independent and ship far more successful projects."}
{"text": "Prodigy is a scriptable annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Today’s transfer learning technologies mean you can train production-quality models with very few examples. With Prodigy you can take full advantage of modern machine learning by adopting a more agile approach to data collection. You'll move faster, be more independent and ship far more successful projects."}
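A quick way to sanity-check the script's output is to confirm that the sentence records, joined back together, reproduce the source text. The sketch below is not part of the original gist: it uses only the standard library `json` module in place of `srsly`, the helper names `read_jsonl_lines` and `sentences_cover_source` are hypothetical, and the sample records are shortened stand-ins rather than the files above. It assumes whitespace differences between the source and the split sentences are not significant, so both sides are whitespace-normalized before comparison.

```python
import json


def read_jsonl_lines(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def sentences_cover_source(source_records, sentence_records):
    """Return True if the sentence texts, joined by spaces, reproduce the
    concatenated source texts once whitespace is normalized."""
    def normalize(text):
        # Collapse any run of whitespace into a single space.
        return " ".join(text.split())

    source = normalize(" ".join(r["text"] for r in source_records))
    rebuilt = normalize(" ".join(r["text"] for r in sentence_records))
    return source == rebuilt


# Hypothetical sample data standing in for data.jsonl and
# data-with-sentences.jsonl.
source_lines = [
    '{"text": "Prodigy is a scriptable annotation tool. It is efficient."}',
]
sentence_lines = [
    '{"text": "Prodigy is a scriptable annotation tool."}',
    '{"text": "It is efficient."}',
]

ok = sentences_cover_source(
    read_jsonl_lines(source_lines),
    read_jsonl_lines(sentence_lines),
)
print(ok)
```

To run the same check on the real files, pass open file handles for `./data.jsonl` and `./data-with-sentences.jsonl` to `read_jsonl_lines`, since iterating a file yields its lines.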