@mepsrajput
Last active March 29, 2020 16:13
Natural Language Processing

A high-level standard workflow for any NLP project

Text Document -> Text pre-processing -> Text parsing & Exploratory Data Analysis -> Text Representation & Feature Engineering -> Modeling and/or Pattern Mining -> Evaluation & Deployment

NLP Uses

  1. Machine Translation
  2. Speech Recognition
  3. Sentiment Analysis
  4. Question Answering
  5. Automatic Summarization
  6. Chatbots
  7. Market Intelligence
  8. Text Classification
  9. Character Recognition
  10. Spell Checking

Step 1: Gather your data

  • Product reviews
  • User-generated content
  • Troubleshooting

Step 2: Clean your data

A clean dataset will allow a model to learn meaningful features and not overfit on irrelevant noise.

Checklist for cleaning data:

  1. Remove all irrelevant characters, such as any non-alphanumeric characters.
  2. Tokenize your text by separating it into individual words.
  3. Remove irrelevant tokens, such as Twitter "@" mentions or URLs.
  4. Convert all characters to lowercase, so that words such as "Yes", "yes" and "YES" are treated the same.
  5. Consider combining misspelled or alternately spelled words into a single representation (e.g. "hot"/"hoat"/"hooot").
  6. Consider lemmatization (reducing words such as "am", "are", and "is" to a common form such as "be").
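
The checklist above can be sketched with the standard library alone. This is a minimal illustration, not a production cleaner: `clean_text`, `SPELLING_VARIANTS` and `LEMMAS` are hypothetical names invented for this example, and the two lookup tables stand in for a real spelling corrector and lemmatizer. The steps run in a slightly different order than the checklist, because URLs and mentions have to be removed before punctuation is stripped.

```python
import re

# Hypothetical lookup tables for illustration only; a real project would
# use a spelling corrector and a proper lemmatizer instead.
SPELLING_VARIANTS = {"hoat": "hot", "hooot": "hot"}
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def clean_text(text):
    """Apply the checklist: strip noise, lowercase, tokenize, normalize."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop Twitter @mentions
    text = text.lower()                         # "Yes"/"YES" -> "yes"
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop non-alphanumerics
    tokens = text.split()                       # whitespace tokenization
    tokens = [SPELLING_VARIANTS.get(t, t) for t in tokens]
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return tokens


print(clean_text("YES!!! The soup is hooot @chef https://example.com/menu"))
# ['yes', 'the', 'soup', 'be', 'hot']
```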

Step 3: Find a good data representation

  • One-hot encoding (Bag of Words)
  • Visualizing the embeddings
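
As a rough sketch of the bag-of-words idea named above: each document becomes a vector of word counts over a shared vocabulary. The helper names `build_vocabulary` and `bag_of_words` and the two sample documents are made up for this example; in practice a library class such as scikit-learn's `CountVectorizer` would usually do this job.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(tokens, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical pre-tokenized documents, made up for illustration.
docs = [["the", "food", "was", "good"], ["the", "service", "was", "bad", "bad"]]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(list(vocabulary))  # ['bad', 'food', 'good', 'service', 'the', 'was']
print(vectors)           # [[0, 1, 1, 0, 1, 1], [2, 0, 0, 1, 1, 1]]
```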

Step 4: Classification

Step 5: Inspection

  • Confusion Matrix
  • Explaining and interpreting our model
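
A confusion matrix simply counts how often each true label was predicted as each label. The sketch below is one way to build it by hand; the labels and predictions are invented for illustration, and the row/column layout (rows are true labels, columns are predicted labels) matches the convention used by scikit-learn's `confusion_matrix`.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


# Hypothetical labels and predictions, made up for illustration.
labels = ["negative", "positive"]
y_true = ["positive", "negative", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")
```

Off-diagonal cells are the errors: here one negative example was predicted positive and one positive example was predicted negative.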

Step 6: Accounting for vocabulary structure

TF-IDF
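
TF-IDF weights a word by how often it appears in a document, discounted by how many documents contain it, so that words appearing everywhere count for little. The sketch below uses the plain textbook form (term frequency times log(N / document frequency)); the function name `tf_idf` and the sample documents are assumptions for this example. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {token: weight} dict per document.

    Uses tf = count / document length and idf = log(N / df), the plain
    textbook form; libraries often apply smoothing on top of this.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


# Hypothetical pre-tokenized documents, made up for illustration.
docs = [["the", "food", "was", "good"], ["the", "service", "was", "bad"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Words that occur in every document ("the", "was") get a weight of zero, while words unique to one document keep a positive weight.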

Step 7: Leveraging semantics

  • Word2Vec
  • Using pre-trained words
  • Sentence level representation
  • The Complexity/Explainability trade-off
  • LIME
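
One common way to get a sentence-level representation from word vectors is to average the vectors of the words in the sentence. The sketch below assumes that approach; the notes above do not spell out which method is meant. The tiny 3-dimensional `WORD_VECTORS` table is invented for illustration and stands in for real pre-trained Word2Vec vectors.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration.
# Real pre-trained Word2Vec vectors typically have hundreds of dimensions.
WORD_VECTORS = {
    "good": [1.0, 0.0, 1.0],
    "food": [0.0, 2.0, 1.0],
    "bad": [-1.0, 0.0, 1.0],
}


def sentence_vector(tokens, word_vectors, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(component) / len(known) for component in zip(*known)]


print(sentence_vector(["good", "food", "today"], WORD_VECTORS))
# [0.5, 1.0, 1.0]
```

Averaging discards word order, which is part of the complexity/explainability trade-off listed above.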

Step 8: Leveraging syntax using end-to-end approaches

Natural Language Toolkit

NLTK has a list of stopwords stored in 16 different languages (the exact number depends on the version of the stopwords corpus installed).

Remove Stopwords

We can remove stopwords while performing the following tasks:

  • Text Classification
  • Spam Filtering
  • Language Classification
  • Genre Classification
  • Caption Generation
  • Auto-Tag Generation

Avoid Stopword Removal

  • Machine Translation
  • Language Modeling
  • Text Summarization
  • Question-Answering problems

View the stopwords in NLTK

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopwords corpus
print(set(stopwords.words('english')))
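
Removing stopwords then comes down to filtering tokens against that set. The sketch below uses a tiny hand-written `STOPWORDS` set (an assumption made so the example runs without any download); with NLTK installed, `set(stopwords.words('english'))` can be passed in instead. The helper name `remove_stopwords` is also made up for this example.

```python
# A tiny hypothetical stopword set so the example runs without downloads.
# With NLTK installed, set(stopwords.words('english')) can be used instead.
STOPWORDS = {"the", "was", "a", "is", "and", "of"}


def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]


print(remove_stopwords(["The", "food", "was", "good"], STOPWORDS))
# ['food', 'good']
```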

NLP Libraries in Python

  1. Natural Language Toolkit (NLTK): classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries.
  2. TextBlob: part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, etc.
  3. CoreNLP: takes raw human-language text and gives the base forms of words and their parts of speech; recognizes names of companies, people, etc.; normalizes and interprets dates, times, and numeric quantities; marks up the structure of sentences in terms of phrases or word dependencies; and indicates which noun phrases refer to the same entities.
  4. Gensim: scalable statistical semantics; analyzes plain-text documents for semantic structure and retrieves semantically similar documents.
  5. spaCy: offers statistical neural network models.
  6. polyglot: a natural language pipeline that supports massive multilingual applications.
  7. scikit-learn: various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN; designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
  8. Pattern: a web mining module for Python, with tools for data mining, NLP, ML, and network analysis.
  9. scispaCy: a Python package containing spaCy models for processing biomedical, scientific, or clinical text.