Text Document -> Text pre-processing -> Text parsing & Exploratory Data Analysis -> Text Representation & Feature Engineering -> Modeling and/or Pattern Mining -> Evaluation & Deployment
- Machine Translation
- Speech Recognition
- Sentiment Analysis
- Question Answering
- Automatic Summarization
- Chatbots
- Market Intelligence
- Text Classification
- Character Recognition
- Spell Checking
- Product reviews
- User-generated content
- Troubleshooting
A clean dataset allows a model to learn meaningful features rather than overfit on irrelevant noise. Checklist for cleaning data (a short cleaning sketch follows the list):
- Remove all irrelevant characters, such as non-alphanumeric characters.
- Tokenize your text by separating it into individual words.
- Remove irrelevant words, such as "@" Twitter mentions or URLs.
- Convert all characters to lowercase, in order to treat words such as "Yes", "yes" and "YES" the same.
- Consider combining misspelled or alternatively spelled words into a single representation (e.g. "hot"/"hoat"/"hooot").
- Consider lemmatization (reduce words such as "am", "are", and "is" to a common form such as "be").
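A minimal cleaning sketch along these lines (the regex patterns, example sentence, and clean_text helper are illustrative, and the NLTK wordnet data is assumed to be downloadable):

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')                                  # lemmatizer dictionary (first run only)
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                                   # treat "Yes"/"yes"/"YES" the same
    text = re.sub(r'http\S+|@\w+', ' ', text)             # drop URLs and "@" mentions
    text = re.sub(r'[^a-z0-9\s]', ' ', text)              # remove remaining non-alphanumeric characters
    tokens = text.split()                                 # tokenize into individual words
    return [lemmatizer.lemmatize(tok) for tok in tokens]  # noun form by default; pass pos='v' for verbs

print(clean_text("LOVED the new phone!!! See https://example.com @someuser"))
# ['loved', 'the', 'new', 'phone', 'see']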
- One-hot encoding (Bag of Words)
- TF-IDF
- Word2Vec
- Using pre-trained words
- Sentence-level representation
- Visualizing the embeddings
- Confusion Matrix
- Explaining and interpreting our model
- The complexity/explainability trade-off
- LIME
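As a rough sketch of the first two representations above, scikit-learn's vectorizers can turn a toy corpus into Bag of Words counts and TF-IDF weights (the corpus is made up, and the scikit-learn 1.x API is assumed):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the screen is great",
    "the battery is terrible",
    "great battery great screen",
]

bow = CountVectorizer()                                   # count-based Bag of Words
counts = bow.fit_transform(corpus)                        # sparse document-term matrix
print(bow.get_feature_names_out())                        # vocabulary learned from the corpus
print(counts.toarray())

tfidf = TfidfVectorizer()                                 # down-weights terms that appear in every document
print(tfidf.fit_transform(corpus).toarray().round(2))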
NLTK has a list of stopwords stored in 16 different languages.
We can remove stopwords while performing the following tasks:
- Text Classification
- Spam Filtering
- Language Classification
- Genre Classification
- Caption Generation
- Auto-Tag Generation
- Machine Translation
- Language Modeling
- Text Summarization
- Question-Answering problems
import nltk
nltk.download('stopwords')                                # fetch the stopword lists on first use
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))              # English stopwords as a set for fast lookup
print(stop_words)
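Continuing with the stop_words set above, a minimal filtering sketch (a plain whitespace split stands in for a real tokenizer, and the sentence is made up):

sentence = "This is a sample sentence showing off stop word filtration"
tokens = sentence.lower().split()                         # naive whitespace tokenization
filtered = [w for w in tokens if w not in stop_words]     # drop common function words
print(filtered)
# ['sample', 'sentence', 'showing', 'stop', 'word', 'filtration']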
NLP Libraries in Python
- Natural Language Toolkit (NLTK): classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries
- TextBlob: part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.
- CoreNLP: takes raw human language text and gives the base forms of words, their parts of speech, and whether they are names of companies, people, etc.; normalizes and interprets dates, times, and numeric quantities; marks up the structure of sentences in terms of phrases or word dependencies; and indicates which noun phrases refer to the same entities.
- Gensim: scalable statistical semantics; analyzes plain-text documents for semantic structure; retrieves semantically similar documents.
- spaCy: industrial-strength NLP with statistical neural network models for tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
- polyglot: a natural language pipeline that supports massive multilingual applications.
- scikit-learn: various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN; designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
- Pattern: a web mining module for Python with tools for data mining, NLP, machine learning, and network analysis.
- scispaCy: a Python package containing spaCy models for processing biomedical, scientific, or clinical text.
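As a quick taste of one of these libraries, a minimal spaCy sketch (assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")                        # small English pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)   # tokens, lemmas, POS tags, dependencies

for ent in doc.ents:
    print(ent.text, ent.label_)                           # named entities, e.g. Apple -> ORG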