@ljos
Last active May 2, 2019 13:50
Introduction to NLP for Norwegian text

Norwegian is slightly strange

  • Two written forms
  • Idiosyncratic capitalization rules
  • Normativish grammar
  • Semantically driven

Recommended tools

Resources

Other references

Important concepts

  • Text as sequences, not windows
  • “Stopwords” can be useful
  • Don’t measure the wrong thing!

How to measure

In a sequence of words

1     2          3       4        5         6
det   adverb     adj     adj      noun      punct
A     perfectly  normal  looking  sentence  .
O     B          I       E        O         O

We want to extract the adjective phrase "perfectly normal looking". The tags come from 5 different categories, BIOES. If we measure the accuracy of each category token by token, it is very difficult to say how well we are doing on phrases, because the O category will dominate. Instead, measure how often each complete sequence (a B, I, E run or a lone S) is predicted correctly. We don't really care whether an individual O is scored right or wrong, as long as those errors do not surface as phrases.
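A minimal sketch of why per-token accuracy misleads, using the example sentence above. The function name `token_accuracy` and the do-nothing "everything is O" prediction are assumptions made for illustration, not part of the original notes.

```python
# Gold tags for: A perfectly normal looking sentence .
gold = ["O", "B", "I", "E", "O", "O"]

# A useless tagger that never finds a phrase (hypothetical baseline).
all_outside = ["O"] * len(gold)


def token_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)


baseline_accuracy = token_accuracy(gold, all_outside)
print(baseline_accuracy)  # half the tokens are O, so this prints 0.5
```

The baseline finds no phrase at all and still gets half the tokens right. In running text, where O is usually far more frequent than in this short sentence, the same baseline would probably score much higher.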

  • (B)eginning
  • (I)nside
  • (O)utside
  • (E)nd
  • (S)ingle
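Measuring whole sequences instead of single tags could be sketched as below. The helpers `extract_spans` and `span_scores` are hypothetical names, and the choice to count a phrase only when both its start and end match exactly is an assumption (one common convention, not necessarily the one intended here).

```python
def extract_spans(tags):
    """Return the set of (start, end) index pairs, end inclusive,
    for every complete B..E run and every lone S."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        elif tag == "O":
            start = None  # an unfinished run is dropped
        # "I" leaves an open run open; an "I" without a "B" is ignored
    return spans


def span_scores(gold_tags, predicted_tags):
    """Precision, recall and F1 over exactly matching spans."""
    gold_spans = extract_spans(gold_tags)
    predicted_spans = extract_spans(predicted_tags)
    correct = len(gold_spans & predicted_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["O", "B", "I", "E", "O", "O"]
near_miss = ["O", "B", "I", "I", "O", "O"]  # 5 of 6 tags right, phrase never closed

print(extract_spans(gold))           # {(1, 3)}
print(span_scores(gold, near_miss))  # (0.0, 0.0, 0.0)
print(span_scores(gold, gold))       # (1.0, 1.0, 1.0)
```

The near miss gets 5 of 6 tags right but never produces the phrase, so it scores zero at the sequence level. O tags never enter the score directly; they only matter when a wrong tag creates or destroys a span.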

[fn:1] Make your own from available sources – the ones that I have found are not satisfactory.

[fn:2] It is important to tailor stopwords to your use-case – sometimes it is better to use word embeddings.

[fn:3] Keras has bad support for text as sequence (accuracy doesn’t mean what you think it means)

[fn:4] Coolest thing since sliced bread.
