Last active
September 7, 2022 13:35
Save wesslen/87281c3c8231b7d2237f79e58890cbe5 to your computer and use it in GitHub Desktop.
spaCy sentencizer script
| import spacy | |
| import srsly  # to easily read/write JSONL etc. | |
| nlp = spacy.load("en_core_web_sm")  # or whatever you need | |
| examples = srsly.read_jsonl("./data.jsonl")  # lazily yields one dict per line | |
| texts = (eg["text"] for eg in examples) | |
| new_examples = [] | |
| for doc in nlp.pipe(texts):  # process the texts as a stream | |
|     for sent in doc.sents: | |
|         new_examples.append({"text": sent.text}) | |
| srsly.write_jsonl("./data-with-sentences.jsonl", new_examples) |
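The title mentions a sentencizer, while the script above takes its sentence boundaries from whatever the loaded pipeline provides. As a hedged variant, the sketch below builds a blank pipeline with only spaCy's rule-based `sentencizer` component (assuming spaCy v3) and copies each record's other fields onto every sentence. The helper names `build_sentencizer` and `split_records` are hypothetical and not part of the gist, and carrying over extra fields is an assumption about what a reader might want.

```python
from typing import Any, Dict, Iterable, Iterator


def build_sentencizer(lang: str = "en"):
    """Return a blank spaCy pipeline containing only the rule-based sentencizer."""
    # Imported lazily so that split_records() below has no hard spaCy dependency.
    import spacy

    nlp = spacy.blank(lang)
    nlp.add_pipe("sentencizer")  # spaCy v3 API; splits on sentence-final punctuation
    return nlp


def split_records(examples: Iterable[Dict[str, Any]], nlp) -> Iterator[Dict[str, Any]]:
    """Yield one record per sentence, keeping the other fields of the source record."""
    # as_tuples=True makes nlp.pipe yield (doc, context) pairs, so every Doc
    # stays attached to the record it came from.
    pairs = ((eg["text"], eg) for eg in examples)
    for doc, eg in nlp.pipe(pairs, as_tuples=True):
        for sent in doc.sents:
            text = sent.text.strip()
            if text:  # skip whitespace-only spans
                yield {**eg, "text": text}
```

With srsly available, `srsly.write_jsonl("./data-with-sentences.jsonl", split_records(srsly.read_jsonl("./data.jsonl"), build_sentencizer()))` would stream the same input file through this variant. The rule-based splitter needs no model download, but its boundaries may differ from those of a trained pipeline.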
Contents of ./data-with-sentences.jsonl, the file the script writes (one sentence per record):
| {"text":"Prodigy is a scriptable annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration."} | |
| {"text":"Today\u2019s transfer learning technologies mean you can train production-quality models with very few examples."} | |
| {"text":"With Prodigy you can take full advantage of modern machine learning by adopting a more agile approach to data collection."} | |
| {"text":"You'll move faster, be more independent and ship far more successful projects."} |
Contents of ./data.jsonl, the file the script reads (one record holding the full paragraph):
| {"text": "Prodigy is a scriptable annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Today’s transfer learning technologies mean you can train production-quality models with very few examples. With Prodigy you can take full advantage of modern machine learning by adopting a more agile approach to data collection. You'll move faster, be more independent and ship far more successful projects."} |