Norwegian is slightly strange
- Two written forms
- Idiosyncratic capitalization rules
- Normativish grammar
- Semantically driven
Recommended tools
- UDPipe: http://ufal.mff.cuni.cz/udpipe
- fastText skipgram embeddings: https://fasttext.cc/docs/en/unsupervised-tutorial.html [fn:1]
- TensorFlow(?) [fn:3]
- BiLSTMs [fn:4]
Resources
- Stopwords: https://github.com/stopwords-iso/stopwords-no/blob/master/stopwords-no.txt [fn:2]
- Review corpus: https://github.com/ltgoslo/norec
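As a rough illustration of tailoring a stopword list to a use-case, here is a minimal Python sketch. The helper names (`tailor_stopwords`, `remove_stopwords`) and the six-word list are made up for the example and are not the contents of stopwords-no.txt; the idea is only that a sentiment task on reviews probably wants to keep negation.

```python
def tailor_stopwords(base, keep=(), extra=()):
    """Return a stopword set with `keep` removed and `extra` added."""
    return (set(base) - set(keep)) | set(extra)


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercased form is in `stopwords`."""
    return [t for t in tokens if t.lower() not in stopwords]


# Tiny illustrative list (an assumption), NOT the linked stopword file.
base = {"og", "i", "en", "det", "er", "ikke"}

# For sentiment, negation carries meaning, so keep "ikke" ("not").
sentiment_stopwords = tailor_stopwords(base, keep={"ikke"})

tokens = ["Det", "er", "ikke", "en", "god", "film"]
print(remove_stopwords(tokens, base))                 # negation is lost
print(remove_stopwords(tokens, sentiment_stopwords))  # negation survives
```

With the untailored list, "not a good film" collapses to "good film", which is exactly the kind of case where a “stopword” is useful.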
Other references
- Web64 Norwegian NLP resources: https://github.com/web64/norwegian-nlp-resources
Important concepts
- Text as sequences, not windows
- “Stopwords” can be useful
- Don’t measure the wrong thing!
In a sequence of words
1     2          3       4        5         6
det   adverb     adj     adj      noun      punct
A     perfectly  normal  looking  sentence  .
O     B          I       E        O         O
We want to extract the adjective phrase “perfectly normal looking”. We have 5 different categories: BIOES. If we measure the accuracy of each category, it is very difficult to say how well we are doing on phrases, because the O category will dominate. Instead, measure how many complete phrases are correct: each B, I, E sequence and each S. We don’t really care how O itself scores, since O tokens are never part of what we extract.
- (B)eginning
- (I)nside
- (O)utside
- (E)nd
- (S)ingle
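To make the difference between per-token and per-phrase measurement concrete, here is a minimal Python sketch. The function names (`extract_spans`, `token_accuracy`, `phrase_scores`) are hypothetical, and scoring phrases by exact span match with precision, recall and F1 is one common choice assumed here, not something these notes prescribe.

```python
def extract_spans(tags):
    """Return the set of (start, end) phrase spans in a BIOES sequence.

    Indices are 0-based and inclusive. An I or E with no open B, or a
    B that is never closed by an E, yields no span.
    """
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "I":
            pass  # still inside the phrase, if one is open
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        else:  # "O" (or anything unexpected) drops any open phrase
            start = None
    return spans


def token_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def phrase_scores(gold, pred):
    """Exact-match phrase precision, recall and F1."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["O", "B", "I", "E", "O", "O"]       # the example sentence
all_o = ["O", "O", "O", "O", "O", "O"]      # predicts nothing at all
near_miss = ["O", "B", "I", "I", "O", "O"]  # 5 of 6 tags right, phrase never closed

for pred in (all_o, near_miss, gold):
    print(token_accuracy(gold, pred), phrase_scores(gold, pred))
```

On this six-token sentence, predicting O everywhere already gets half the tokens right, and the near miss gets five of six right, yet neither recovers the phrase, so both score zero at the phrase level.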
[fn:1] Make your own from available sources – the ones that I have found are not satisfactory.
[fn:2] It is important to tailor stopwords to your use-case – sometimes it is better to use word embeddings.
[fn:3] Keras has poor support for text as sequences (accuracy doesn’t mean what you think it means).
[fn:4] Coolest thing since sliced bread.