Introduction to NLP for Norwegian text

Norwegian is slightly strange

Two written forms
Idiosyncratic capitalization rules
Normativish grammar
Semantically driven

Recommended tools

UDPipe: http://ufal.mff.cuni.cz/udpipe
fastText skipgram embeddings: https://fasttext.cc/docs/en/unsupervised-tutorial.html [fn:1]
TensorFlow(?) [fn:3]
BiLSTMs [fn:4]

Resources

Stopwords: https://github.com/stopwords-iso/stopwords-no/blob/master/stopwords-no.txt [fn:2]
Review corpus: https://github.com/ltgoslo/norec

Other references

Web64 Norwegian NLP resources: https://github.com/web64/norwegian-nlp-resources

Important concepts

Text as sequences, not windows
“Stopwords” can be useful
Don’t measure the wrong thing!
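
A minimal sketch of why “stopwords” can be useful. The stopword set below is a tiny hand-made assumption for illustration, not the list linked under Resources, and `remove_stopwords` is a hypothetical helper. Once the negation is filtered out, the two sentences become identical:

```python
from typing import Iterable, List, Set

# Hypothetical, hand-picked stopword set (illustration only).
STOPWORDS: Set[str] = {"var", "ikke", "og", "det"}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str] = STOPWORDS) -> List[str]:
    """Drop every token whose lowercased form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


positive = "Filmen var bra".split()        # "The film was good"
negative = "Filmen var ikke bra".split()   # "The film was not good"

# Both sentences collapse to the same tokens: the negation is gone.
print(remove_stopwords(positive))
print(remove_stopwords(negative))
```

For a task like sentiment on the review corpus, this is the kind of case where the stopword list would need tailoring.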

How to measure

In a sequence of words

1     2          3       4        5         6
det   adverb     adj     adj      noun      punct
A     perfectly  normal  looking  sentence  .
O     B          I       E        O         O

We want to extract the adjective phrase “perfectly normal looking”. The BIOES scheme has 5 different tags. If we measure the accuracy of each tag, it is very difficult to say how well we are doing on phrases, because the O tag will dominate. Instead, measure how many complete phrases are predicted correctly, where a phrase is either a full B, I, E sequence or a single S. We don’t really care how the O tokens are scored, as long as they do not end up in the extracted phrases.

(B)eginning
(I)nside
(O)utside
(E)nd
(S)ingle
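
A minimal sketch of phrase-level measurement, using the example sentence above. The function names (`bioes_spans`, `token_accuracy`, `phrase_scores`) are hypothetical, and exact-match precision/recall/F1 over phrases is one common choice, assumed here rather than prescribed by these notes:

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, inclusive


def bioes_spans(tags: List[str]) -> Set[Span]:
    """Collect the complete phrases (B ... E, or S) in a BIOES tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None  # index of the currently open B, if any
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        elif tag != "I":  # "O" (or an unknown tag) drops an unfinished phrase
            start = None
    return spans


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens with the right tag; dominated by O in practice."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def phrase_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over whole phrases."""
    gold_spans, pred_spans = bioes_spans(gold), bioes_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "A perfectly normal looking sentence ."
gold = ["O", "B", "I", "E", "O", "O"]
pred = ["O", "B", "I", "O", "O", "O"]  # the model misses the E on "looking"

print(token_accuracy(gold, pred))  # 5 of 6 tags are right
print(phrase_scores(gold, pred))   # but no complete phrase was found
```

One wrong tag leaves token accuracy at 5/6 while every phrase score drops to zero, which is the gap between the two ways of measuring.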

[fn:1] Make your own from available sources – the ones that I have found are not satisfactory.

[fn:2] It is important to tailor stopwords to your use-case – sometimes it is better to use word embeddings.

[fn:3] Keras has poor support for text as sequences (its accuracy doesn’t mean what you think it means).

[fn:4] Coolest thing since sliced bread.
