Created November 25, 2017 20:41
Simple converter to CoNLL-2003 NER format for spaCy model training
# Each document is a list of [tokens, tags] pairs, one pair per sentence.
DATA = [
    [
        [['Who', 'is', 'Shaka', 'Khan', '?'], ['O', 'O', 'I-PER', 'I-PER', 'O']]
    ],
    [
        [['I', 'like', 'London', 'and', 'Berlin', '.'], ['O', 'O', 'I-LOC', 'O', 'I-LOC', 'O']]
    ]
]

with open('output.conll', 'w') as f:
    for doc in DATA:
        # Every document starts with a -DOCSTART- marker line.
        f.write('-DOCSTART- -X- O O\n')
        for sentence, sent_entities in doc:
            # One token per line: token, placeholder columns, NER tag.
            for token, BIO_tag in zip(sentence, sent_entities):
                f.write('{} -X- _ {}\n'.format(token, BIO_tag))
            # A blank line ends each sentence.
            f.write('\n')

## Result
# -DOCSTART- -X- O O
# Who -X- _ O
# is -X- _ O
# Shaka -X- _ I-PER
# Khan -X- _ I-PER
# ? -X- _ O
#
# -DOCSTART- -X- O O
# I -X- _ O
# like -X- _ O
# London -X- _ I-LOC
# and -X- _ O
# Berlin -X- _ I-LOC
# . -X- _ O
Hey, is there a way to go from CoNLL format to spaCy format?
You can check out my repo for the solution.
https://github.com/dipansh-girdhar/NLP/tree/master/NER/Spacy%20NER
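For a rough idea of what the reverse direction can look like, here is a minimal sketch. It is not the code from the linked repo. It assumes the file layout produced by the gist above (token in the first column, tag in the last), that tokens are re-joined with single spaces, and that the target is the `(text, {'entities': [(start, end, label)]})` tuple layout used in spaCy v2 training examples. The function names `sentence_to_example` and `read_conll` are made up for illustration.

```python
def sentence_to_example(tokens, tags):
    """Turn one tagged sentence into (text, {'entities': [...]})."""
    entities = []
    open_entity = None  # [start, end, label] of the entity being extended
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag == 'O':
            open_entity = None
        else:
            prefix, label = tag.split('-', 1)
            same_label = open_entity is not None and open_entity[2] == label
            if prefix == 'I' and same_label:
                # Continue the current entity.
                open_entity[1] = end
            else:
                # B- tag, or I- tag with no matching open entity: start a new one.
                open_entity = [start, end, label]
                entities.append(open_entity)
        offset = end + 1  # assumption: tokens are joined with one space
    text = ' '.join(tokens)
    return text, {'entities': [tuple(e) for e in entities]}


def read_conll(lines):
    """Parse CoNLL-style lines into a list of training examples."""
    examples = []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence.
        if not line or line.startswith('-DOCSTART-'):
            if tokens:
                examples.append(sentence_to_example(tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])
        tags.append(columns[-1])
    if tokens:
        examples.append(sentence_to_example(tokens, tags))
    return examples


sample = [
    '-DOCSTART- -X- O O',
    'Who -X- _ O',
    'is -X- _ O',
    'Shaka -X- _ I-PER',
    'Khan -X- _ I-PER',
    '? -X- _ O',
    '',
]
examples = read_conll(sample)
print(examples)
```

To run it on a real file, pass the open file object instead of the list, e.g. `read_conll(open('output.conll'))`. Because tokens are joined with spaces, the text comes out as `Who is Shaka Khan ?` rather than the original spacing, so the character offsets only match that re-joined text.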