@kemingy
Last active January 9, 2020 04:18
Chinese word segmentation algorithms

Chinese Word Segmentation

Ref:

Tagging

  • B: begin
  • M: middle
  • E: end
  • S: single
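
As a minimal sketch of how this tagging scheme is typically used, the snippet below converts a segmented sentence into one tag per character and back again. The function names `words_to_tags` and `tags_to_words` are illustrative, not taken from any particular library.

```python
def words_to_tags(words):
    """Map a list of segmented words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character is B, last is E, everything between is M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(text, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for char, tag in zip(text, tags):
        current += char
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


words = ["研究", "生命", "的", "起源"]
tags = words_to_tags(words)
print(tags)                                  # ['B', 'E', 'B', 'E', 'S', 'B', 'E']
print(tags_to_words("".join(words), tags))   # ['研究', '生命', '的', '起源']
```

Framing segmentation this way turns it into a per-character sequence labelling problem, which is what the models listed below are trained on.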

Machine Learning:

  • Maximum Entropy Markov Model (MEMM)
  • Conditional Random Field (semi-Markov CRF and linear-chain CRF)

Deep Learning:

  • Collobert-style network with character embeddings
  • Max-Margin Tensor Neural Network (MMTNN)
  • Gated Recursive Neural Network (GRNN)
  • Long Short-Term Memory neural network (LSTM)
  • GRNN combined with LSTM

MMSEG

Ref:

Maximum matching algorithm

Simple: at each position, take the longest dictionary word that matches.
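
A minimal sketch of the simple variant, assuming a small toy lexicon and a maximum word length of 4 characters (both are illustrative choices, not values from MMSEG itself):

```python
def simple_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        # try the longest window first and shrink until something matches
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # a single character is always accepted as a fallback
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(simple_max_match("研究生命的起源", LEXICON))
# ['研究生', '命', '的', '起源']
```

The output shows the weakness of the greedy approach: it grabs 研究生 and strands 命, which is the kind of ambiguity the complex variant is meant to resolve.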

Complex: enumerate all three-word chunks starting at the current position, choose the chunk with the greatest total length, and take its first word.

Ambiguity resolution rules (applied in order, each to the chunks left by the previous rule)

  1. maximum matching (59.5%)
  2. largest average word length (30.6%)
  3. smallest variance of word lengths (1%)
  4. largest sum of degree of morphemic freedom of one-character words (9%)
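
The complex variant and the four rules above can be sketched as a chain of filters over candidate chunks. This is an illustrative reading rather than a reference implementation: the toy lexicon, the `max_len` limit, and the `char_freq` table are assumptions, and rule 4 is approximated here as the sum of log frequencies of the one-character words in a chunk.

```python
import math
from fractions import Fraction


def candidate_words(text, start, lexicon, max_len=4):
    """Lexicon words starting at `start`, plus the single character as a fallback."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        if text[start:start + size] in lexicon:
            found.append(text[start:start + size])
    return found


def chunks(text, start, lexicon, depth=3):
    """Enumerate chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, lexicon):
        for rest in chunks(text, start + len(word), lexicon, depth - 1):
            result.append([word] + rest)
    return result


def total_length(chunk):
    return sum(len(w) for w in chunk)


def average_length(chunk):
    return Fraction(total_length(chunk), len(chunk))


def negative_variance(chunk):
    # variance = E[x^2] - E[x]^2, kept exact so ties compare reliably
    n, total = len(chunk), total_length(chunk)
    sum_sq = sum(len(w) ** 2 for w in chunk)
    return -Fraction(n * sum_sq - total * total, n * n)


def keep_best(candidates, score):
    best = max(score(c) for c in candidates)
    return [c for c in candidates if score(c) == best]


def mmseg(text, lexicon, char_freq):
    """Segment `text`; `char_freq` is a hypothetical character-frequency table."""
    def morphemic_freedom(chunk):
        return sum(math.log(char_freq.get(w, 1)) for w in chunk if len(w) == 1)

    rules = [total_length, average_length, negative_variance, morphemic_freedom]
    words, i = [], 0
    while i < len(text):
        candidates = chunks(text, i, lexicon)
        for rule in rules:
            if len(candidates) == 1:
                break  # ambiguity resolved, later rules are not needed
            candidates = keep_best(candidates, rule)
        words.append(candidates[0][0])  # commit only the first word of the chunk
        i += len(words[-1])
    return words


# Toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(mmseg("研究生命的起源", LEXICON, char_freq={}))
# ['研究', '生命', '的', '起源']
```

On this input, rules 1 and 2 tie between 研究生/命/的 and 研究/生命/的, and rule 3 (smallest variance) picks the second chunk, so the first committed word is 研究.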

HMM

Ref:

CRF

Ref:
