Multi-label classification for text: A Quick Start Guide

Multi-label classification assigns one or more labels to each data instance, whereas multi-class classification assigns only the single best label. Common examples of multi-label classification are tagging objects in an image or tagging questions on a question-answering website. This guide covers only the text applications.
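To make the distinction concrete, here is a minimal sketch of how multi-label targets are usually encoded: one binary indicator column per label, so a row can contain several 1s. The question tags are made up for illustration; `MultiLabelBinarizer` is scikit-learn's encoder for this format.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical question tags: each question can carry several labels.
question_tags = [
    ["python", "pandas"],
    ["python"],
    ["sql", "pandas"],
]

# Encode the tag lists as a binary indicator matrix (rows = questions,
# columns = labels in sorted order).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(question_tags)

print(mlb.classes_)  # column order of the indicator matrix
print(Y)             # one row per question, one column per label
```

In the multi-class setting each row would have exactly one 1; here a row can have any number of them.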

Start by reading the Wikipedia article on multi-label classification.

The following readings are suggested to get oriented to the problem definition and possible approaches:

A Review on Multi-Label Learning Algorithms by Zhang and Zhou. Worth a skim.
A Tutorial on Multi-label Classification Techniques by Carvalho and Freitas. A high-level introduction.
A Tutorial on Multi-Label Learning by Gibaja and Ventura. An approachable introduction to the algorithms.

In general, multi-label text classification is handled in much the same way as multi-class text classification. Start with scikit-multilearn, a package for multi-label classification built on top of the scikit-learn ecosystem. The most notable difference is in the evaluation metrics. Luckily, scikit-learn comes with several options for multi-label evaluation, such as Hamming loss and micro- and macro-averaged F1.
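As a rough illustration of how multi-label evaluation differs, the sketch below scores a toy prediction with a few metrics from `sklearn.metrics`. The indicator matrices are invented for the example; which metric matters most depends on the application.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy indicator matrices (rows = documents, columns = labels).
y_true = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],   # one label missed in the first document
                   [0, 1, 0],
                   [1, 0, 1]])

# On multi-label input, accuracy_score is subset accuracy:
# a document counts only if every one of its labels is correct.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss is the fraction of individual label decisions that are wrong.
h_loss = hamming_loss(y_true, y_pred)

# Micro-averaging pools all label decisions; macro-averaging
# computes F1 per label and then takes the unweighted mean.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(subset_acc, h_loss, micro_f1, macro_f1)
```

Subset accuracy is strict (one missed label fails the whole document), while Hamming loss gives partial credit, so the two can tell quite different stories about the same model.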

Worked examples in scikit-learn:
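A minimal sketch of one common baseline, assuming TF-IDF features and one independent binary classifier per label (often called binary relevance). The texts, tags, and model choice are illustrative assumptions, not a recommendation from any particular tutorial.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: short questions with one or more tags each.
texts = [
    "how to merge two dataframes in pandas",
    "python list comprehension syntax",
    "sql join two tables on a key",
    "read sql query results into a pandas dataframe",
    "python decorators explained",
    "group by and count rows in sql",
]
tags = [
    ["pandas", "python"],
    ["python"],
    ["sql"],
    ["pandas", "sql"],
    ["python"],
    ["sql"],
]

# Turn the tag lists into a binary indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# OneVsRestClassifier fits one binary classifier per label column.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, Y)

# Predict an indicator row, then map it back to tag names.
pred = model.predict(["how do I join two tables in sql"])
predicted_tags = mlb.inverse_transform(pred)
print(predicted_tags)
```

With so little data the predictions are not meaningful; the point is the shape of the workflow, which is nearly identical to a multi-class text pipeline apart from the target encoding.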

The scikit-multilearn package is designed for the multi-label classification problem.

Handling class imbalance is more complicated than in multi-class classification. The paper "Addressing imbalance in multilabel classification: Measures and random resampling algorithms" suggests several paths forward.
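One way to quantify label imbalance is a per-label imbalance ratio (often written IRLbl) and its mean across labels (MeanIR). The sketch below assumes the usual formulation, where a label's ratio is the count of the most frequent label divided by that label's own count; check the definitions against the paper before relying on them. The function name and the sample data are hypothetical.

```python
from collections import Counter


def imbalance_ratios(label_sets):
    """Return (per-label imbalance ratio, mean imbalance ratio).

    Assumed definition: ratio(label) = count of the most frequent
    label / count of this label, so the most frequent label scores 1.0
    and rarer labels score higher.
    """
    # Count how many instances carry each label (once per instance).
    counts = Counter(label for labels in label_sets for label in set(labels))
    max_count = max(counts.values())
    irlbl = {label: max_count / n for label, n in counts.items()}
    mean_ir = sum(irlbl.values()) / len(irlbl)
    return irlbl, mean_ir


# Hypothetical tag sets: "python" dominates, the other tags are rare.
label_sets = [
    ["python", "pandas"],
    ["python"],
    ["python", "sql"],
    ["python"],
]

irlbl, mean_ir = imbalance_ratios(label_sets)
print(irlbl)
print(mean_ir)
```

Because a single instance can carry both a frequent and a rare label, oversampling that instance raises both counts at once, which is part of why resampling is harder here than in the multi-class case.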

One common business use case is tagging heterogeneous data with the same labels. For example, a user and a piece of content can both be associated with the same tag. StarSpace is a promising solution. Warning: StarSpace is rough around the edges and requires lots of data.

brianspiering/multi-label_classification_a_quick_start_guide.md
