ljos/no_nlp.org

Last active May 2, 2019 13:50


Introduction to NLP for Norwegian text


Norwegian is slightly strange

Two written forms
Idiosyncratic capitalization rules
Normativish grammar
Semantically driven

Recommended tools

UDPipe: http://ufal.mff.cuni.cz/udpipe
fastText skipgram embeddings: https://fasttext.cc/docs/en/unsupervised-tutorial.html [fn:1]
tensorflow(?) [fn:3]
BiLSTMs [fn:4]

Resources

Stopwords: https://github.com/stopwords-iso/stopwords-no/blob/master/stopwords-no.txt [fn:2]
Review corpus: https://github.com/ltgoslo/norec

Other references

Web64 norwegian nlp resources: https://github.com/web64/norwegian-nlp-resources

Important concepts

Text as sequences, not windows
“Stopwords” can be useful
Don’t measure the wrong thing!

How to measure

In a sequence of words

1     2          3       4        5         6
det   adverb     adj     adj      noun      punct
A     perfectly  normal  looking  sentence  .
O     B          I       E        O         O

We want to extract the adjective phrase “perfectly normal looking”. The tagging scheme, BIOES, has five tags. If we measure the accuracy of each tag, it is very difficult to say how well we are doing on phrases, because the O tag will dominate. Instead, measure how often each complete sequence of B, I and E tags, and each S tag, is predicted correctly. We don’t really care if an O is tagged wrong, as long as the error doesn’t show up in the extracted phrases.

(B)eginning
(I)nside
(O)utside
(E)nd
(S)ingle
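
A minimal sketch of the difference between per-token accuracy and phrase-level scoring, using the example sentence above. The function names (`extract_spans`, `token_accuracy`, `span_scores`) and the predicted tag sequence are hypothetical and not part of the original notes. Indices in the code are 0-based, while the table above counts from 1.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, inclusive, 0-based


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect well-formed B I* E runs and S singletons from a BIOES tag list."""
    spans: Set[Span] = set()
    start: Optional[int] = None  # index of the currently open B, if any
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i  # a new B replaces any unfinished span
        elif tag == "I":
            pass  # only meaningful inside an open span; nothing to record
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        else:  # "O" (or an unknown tag) discards an unfinished span
            start = None
    return spans


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def span_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold tags for "A perfectly normal looking sentence ."
gold = ["O", "B", "I", "E", "O", "O"]
# Hypothetical model output that misses the end of the phrase
pred = ["O", "B", "I", "O", "O", "O"]

print(token_accuracy(gold, pred))  # 5 of 6 tokens are right
print(span_scores(gold, pred))     # but no phrase was recovered
```

With a single wrong tag, token accuracy stays at 5/6 while the phrase-level scores drop to zero, which is the number that reflects what we actually want to extract.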

[fn:1] Make your own from available sources – the ones that I have found are not satisfactory.

[fn:2] It is important to tailor stopwords to your use-case – sometimes it is better to use word embeddings.

[fn:3] Keras has bad support for text as sequence (accuracy doesn’t mean what you think it means)

[fn:4] Coolest thing since sliced bread.
