Multi-label classification for text: A Quick Start Guide

Multi-label classification assigns one or more labels to each data instance, whereas multi-class classification assigns only the single best label. Common examples of multi-label classification are tagging objects in an image and tagging questions on a question-answering website. This guide covers only text applications.
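The difference shows up directly in how targets are encoded. Here is a minimal sketch in plain Python; the tag vocabulary and the helper `to_indicator` are made up for illustration and do not come from any library.

```python
# Hypothetical tag vocabulary, made up for illustration.
LABELS = ["pandas", "python", "sql"]


def to_indicator(tags, labels=LABELS):
    """Encode a collection of tags as a 0/1 vector, one position per label."""
    tag_set = set(tags)
    return [1 if label in tag_set else 0 for label in labels]


# Multi-class: exactly one label per instance, so a single index is enough.
multiclass_target = LABELS.index("sql")

# Multi-label: any subset of labels, so each instance needs a full indicator vector.
multilabel_target = to_indicator(["python", "pandas"])

print(multiclass_target)  # 2
print(multilabel_target)  # [1, 1, 0]
```

Stacking one indicator vector per instance gives the binary label matrix that most multi-label tools expect as the target.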

Start by reading the Wikipedia article on multi-label classification.

The following readings are suggested to get oriented to the problem definition and possible approaches:

  1. A Review on Multi-Label Learning Algorithms by Zhang and Zhou. Worth a skim.
  2. A Tutorial on Multi-label Classification Techniques by Carvalho and Freitas. A high level introduction.
  3. A Tutorial on Multi-Label Learning by Gibaja and Ventura. An approachable introduction to the algorithms.

In general, multi-label text classification is handled much like multi-class text classification. Start with scikit-multilearn, a package for multi-label classification built on top of the scikit-learn ecosystem. The most notable difference is in the evaluation metrics. Luckily, scikit-learn comes with several options for multi-label evaluation metrics. Check them out here and here.
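To make the similarity concrete, here is a minimal sketch using plain scikit-learn rather than scikit-multilearn. It trains one binary classifier per label (the one-vs-rest approach, which is only one of several possible strategies) and scores the result with two multi-label metrics. The toy corpus and tags are made up for illustration, and the scores are computed on the training data only to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus (made up for illustration): each document carries a set of tags.
texts = [
    "how do I merge two dataframes in pandas",
    "python list comprehension syntax",
    "sql join on multiple columns",
    "read sql query results into a pandas dataframe",
    "group by and aggregate in sql",
    "python decorator with arguments",
]
tags = [
    ["python", "pandas"],
    ["python"],
    ["sql"],
    ["python", "pandas", "sql"],
    ["sql"],
    ["python"],
]

# Turn the tag lists into a binary indicator matrix (rows: documents, columns: labels).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# Same text pipeline as for multi-class; the one-vs-rest wrapper fits
# one binary logistic regression per label.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)
Y_pred = model.predict(texts)

# Hamming loss: fraction of individual label decisions that are wrong.
print("Hamming loss:", hamming_loss(Y, Y_pred))
# Micro-averaged F1: pools true/false positives across all labels.
print("Micro F1:", f1_score(Y, Y_pred, average="micro", zero_division=0))
```

In a real project, hold out a test set and compute the metrics there. Plain accuracy on a label matrix only counts exact matches of the whole label set, which is why per-label metrics such as Hamming loss are usually reported alongside it.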

Worked examples in scikit-learn:

The scikit-multilearn package is designed for the multi-label classification problem.

Handling class imbalance is more complicated than in multi-class classification. The paper Addressing imbalance in multilabel classification: Measures and random resampling algorithms suggests several paths forward.
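Before resampling, it helps to measure how imbalanced the labels actually are. The sketch below computes a per-label imbalance ratio (the count of the most frequent label divided by the count of each label) and its mean. This is in the spirit of the IRLbl and MeanIR measures from that paper, but the function name and toy data are made up here, and the exact definitions should be checked against the paper.

```python
from collections import Counter


def imbalance_ratios(label_sets):
    """Return (per-label imbalance ratio, mean ratio) for a list of label sets.

    A ratio of 1.0 marks the most frequent label; larger values mark rarer labels.
    """
    # Count each label at most once per instance.
    counts = Counter(label for labels in label_sets for label in set(labels))
    most_common = max(counts.values())
    per_label = {label: most_common / n for label, n in counts.items()}
    mean_ratio = sum(per_label.values()) / len(per_label)
    return per_label, mean_ratio


# Toy tag sets, made up for illustration.
tags = [
    ["python", "pandas"],
    ["python"],
    ["sql"],
    ["python", "pandas", "sql"],
    ["sql"],
    ["python"],
]

per_label, mean_ratio = imbalance_ratios(tags)
for label, ratio in sorted(per_label.items()):
    print(f"{label}: {ratio:.2f}")
print(f"mean: {mean_ratio:.2f}")
```

Labels whose ratio sits well above the mean are natural candidates for oversampling. Because instances carry several labels at once, duplicating a rare-label instance also duplicates any frequent labels attached to it, which is part of what makes the multi-label case harder.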

One common business use case is tagging heterogeneous data with the same labels. For example, a user and a piece of content are both associated with the same tag. StarSpace is a promising solution. Warning: StarSpace is rough around the edges and requires lots of data.
