Text Document -> Text pre-processing -> Text parsing & Exploratory Data Analysis -> Text Representation & Feature Engineering -> Modeling and/or Pattern Mining -> Evaluation & Deployment
- Machine Translation
- Speech Recognition
- Sentiment Analysis
- Question Answering
- Automatic Summarization
- Chatbots
- Market Intelligence
- Text Classification
- Character Recognition
- Spell Checking
- Product reviews
- User-generated content
- Troubleshooting
A clean dataset allows a model to learn meaningful features rather than overfit on irrelevant noise. Checklist for cleaning data (a short cleaning sketch follows the list):
- Remove all irrelevant characters, such as non-alphanumeric characters.
- Tokenize your text by separating it into individual words.
- Remove irrelevant words, such as "@" Twitter mentions or URLs.
- Convert all characters to lowercase, in order to treat words such as "Yes", "yes" and "YES" the same.
- Consider combining misspelled or alternatively spelled words into a single representation (e.g. "hot"/"hoat"/"hooot").
- Consider lemmatization (reduce words such as "am", "are", and "is" to a common form such as "be").
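A minimal cleaning sketch along these lines (the regex patterns, example sentence, and clean_text helper are illustrative, and the NLTK wordnet data is assumed to be downloadable):

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                                  # lemmatizer dictionary (first run only)
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                                   # treat "Yes"/"yes"/"YES" the same
    text = re.sub(r'http\S+|@\w+', ' ', text)             # drop URLs and "@" mentions
    text = re.sub(r'[^a-z0-9\s]', ' ', text)              # remove remaining non-alphanumeric characters
    tokens = text.split()                                 # tokenize into individual words
    return [lemmatizer.lemmatize(tok) for tok in tokens]  # noun form by default; pass pos='v' for verbs

print(clean_text("LOVED the new phone!!! See https://example.com @someuser"))
# ['loved', 'the', 'new', 'phone', 'see']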
- One-hot encoding (Bag of Words)
- TF-IDF
- Word2Vec
- Using pre-trained words
- Sentence-level representation
- Visualizing the embeddings
- Confusion Matrix
- Explaining and interpreting our model
- The complexity/explainability trade-off
- LIME
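As a rough sketch of the first two representations above, scikit-learn's vectorizers can turn a toy corpus into Bag of Words counts and TF-IDF weights (the corpus is made up, and the scikit-learn 1.x API is assumed):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the screen is great",
    "the battery is terrible",
    "great battery great screen",
]

bow = CountVectorizer()                                   # count-based Bag of Words
counts = bow.fit_transform(corpus)                        # sparse document-term matrix
print(bow.get_feature_names_out())                        # vocabulary learned from the corpus
print(counts.toarray())

tfidf = TfidfVectorizer()                                 # down-weights terms that appear in every document
print(tfidf.fit_transform(corpus).toarray().round(2))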
NLTK has a list of stopwords stored in 16 different languages.
We can remove stopwords while performing the following tasks:
- Text Classification
- Spam Filtering
- Language Classification
- Genre Classification
- Caption Generation
- Auto-Tag Generation
- Machine Translation
- Language Modeling
- Text Summarization
- Question-Answering problems
import nltk
nltk.download('stopwords')                                # fetch the stopword lists on first use
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))              # English stopwords as a set for fast lookup
print(stop_words)
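Continuing with the stop_words set above, a minimal filtering sketch (a plain whitespace split stands in for a real tokenizer, and the sentence is made up):

sentence = "This is a sample sentence showing off stop word filtration"
tokens = sentence.lower().split()                         # naive whitespace tokenization
filtered = [w for w in tokens if w not in stop_words]     # drop common function words
print(filtered)
# ['sample', 'sentence', 'showing', 'stop', 'word', 'filtration']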
NLP Libraries in Python
- Natural Language Toolkit (NLTK): classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries
- TextBlob: part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.
- CoreNLP: takes raw human language text and gives the base forms of words, their parts of speech, and whether they are names of companies, people, etc.; normalizes and interprets dates, times, and numeric quantities; marks up the structure of sentences in terms of phrases or word dependencies; and indicates which noun phrases refer to the same entities.
- Gensim: scalable statistical semantics; analyzes plain-text documents for semantic structure; retrieves semantically similar documents.
- spaCy: industrial-strength NLP with statistical neural network models for tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
- polyglot: a natural language pipeline that supports massive multilingual applications.
- scikit-learn: various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN; designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
- Pattern: a web mining module for Python with tools for data mining, NLP, machine learning, and network analysis.
- scispaCy: a Python package containing spaCy models for processing biomedical, scientific, or clinical text.
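As a quick taste of one of these libraries, a minimal spaCy sketch (assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")                        # small English pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)   # tokens, lemmas, POS tags, dependencies

for ent in doc.ents:
    print(ent.text, ent.label_)                           # named entities, e.g. Apple -> ORG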