ingridstevens · November 24, 2022 18:59
diff --git a/translate.ipynb b/translate.ipynb
 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import Libraries\n",
    "import deepl\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the csv file\n",
    "df = pd.read_csv('translate-testdata.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the API key of(deepL Free account)\n",
    "auth_key = \"x-x-x-x-x:fx\" \n",
    "translator = deepl.Translator(auth_key)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function that translates all text in the \"Text\" column of df\n",
    "def translate_text(text):\n",
    "    result = translator.translate_text(text, target_lang=\"EN-US\")\n",
    "    return result.text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Translate the text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the translate_text function to the df \n",
    "df['English'] = df['Text'].apply(translate_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# export the df to a csv file\n",
    "df.to_csv('translate-testdata.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean up so we only have the original text and english translation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reduce the dataframe to only the \"English\" column and \"Text\" column\n",
    "df = df[['Text', 'English']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentiment Analysis   \n",
    "Run sentiment polarity analysis \n",
    "Run sentiment emotion analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import a library for sentiment analysis \n",
    "from textblob import TextBlob\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# apply the sentiment analysis to the \"English\" column\n",
    "df['Sentiment'] = df['English'].apply(lambda x: TextBlob(x).sentiment.polarity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/yl/mwd8tygs7p38z57chhqrlkx80000gn/T/ipykernel_66292/2747275817.py:3: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n",
      "  df = df.append({'Text': 'Finding the right product was difficult'}, ignore_index=True)\n"
     ]
    }
   ],
   "source": [
    "# add three rows to the dataframe column \"Text\" with the text \"I love you\", \"I hate you\", and \"I am neutral\"\n",
    "\n",
    "df = df.append({'Text': 'Finding the right product was difficult'}, ignore_index=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now to try sentiment analysis on the pre-labeled dataset \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test = pd.read_csv('kaggle-test.csv', encoding= 'unicode_escape')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make the \"text\" column a string \n",
    "df_test['text'] = df_test['text'].astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TextBlob: Polarity & Subjectivity\n",
    "\n",
    "The output of TextBlob is polarity and subjectivity. \n",
    "\n",
    "*Polarity* score lies between (-1 to 1) where:\n",
    "* -1 identifies the most negative words such as ‘disgusting’, ‘awful’, ‘pathetic’, \n",
    "* 1 identifies the most positive words like ‘excellent’, ‘best’. \n",
    "\n",
    "*Subjectivity* score lies between (0 and 1), It shows the amount of personal opinion, \n",
    "* If a sentence has high subjectivity i.e. close to 1, It resembles that the text contains more personal opinion than factual information. \n",
    "* Conversely, a 0 would indicate a purely factual statement"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply sentiment analysis to text\n",
    "df_test['polarity_textblob'] = df_test['text'].apply(lambda x: TextBlob(x).sentiment.polarity)\n",
    "df_test['subjectivity_textblob'] = df_test['text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make the df smaller by only keeping the \"text\" and \"IN_Sentiment\" columns\n",
    "df_test = df_test[['text', 'sentiment', 'polarity_textblob', 'subjectivity_textblob']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.1 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
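
The notebook computes `polarity_textblob` for the pre-labeled dataset but stops before comparing it with the `sentiment` column. A possible next step is sketched below. It rests on assumptions the notebook does not confirm: that `sentiment` holds the strings `positive`, `negative`, and `neutral`, and that a neutral band of ±0.05 around zero is a sensible cutoff. The helper names `polarity_to_label` and `label_accuracy` are hypothetical.

```python
from typing import Iterable

# Assumed cutoff: polarity within +/- NEUTRAL_BAND counts as neutral.
# The notebook does not define a threshold; tune this for your data.
NEUTRAL_BAND = 0.05


def polarity_to_label(polarity: float, neutral_band: float = NEUTRAL_BAND) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def label_accuracy(
    polarities: Iterable[float],
    labels: Iterable[str],
    neutral_band: float = NEUTRAL_BAND,
) -> float:
    """Return the share of rows where the derived label matches the given label."""
    pairs = list(zip(polarities, labels))
    if not pairs:
        raise ValueError("no rows to compare")
    hits = sum(
        polarity_to_label(score, neutral_band) == str(label).strip().lower()
        for score, label in pairs
    )
    return hits / len(pairs)
```

With the notebook's dataframe this could be called as `label_accuracy(df_test['polarity_textblob'], df_test['sentiment'])`, since both functions accept any iterable, including pandas Series.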