{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Email Classification Example\n",
"---\n",
"\n",
"This graph dataset contains the email exchanges inside an enterprise. Each node *u* represents an employee's email address and is labeled with the employee's department; each edge *(u,v)* indicates that *u* sent at least one email to *v*.\n",
"\n",
"Our objective is to predict the department in which each employee works."
]
},
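{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before computing anything, it can help to peek at the raw files. The cell below is an optional sanity-check sketch, assuming the dataset lives under *Email_dataset/* with space-separated files *edges.ssv* (one `source target` pair per line) and *labels.ssv* (one `node department` pair per line), as the loading code further down expects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: print the first lines of the edge and label files\n",
"for path in [\"Email_dataset/edges.ssv\", \"Email_dataset/labels.ssv\"]:\n",
"    with open(path) as f:\n",
"        print(path)\n",
"        for _ in range(3):\n",
"            print(\"   \", f.readline().strip())"
]
},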
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Definition of some useful functions\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#basic imports\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import networkx as nx\n",
"from networkx import DiGraph\n",
"from node2vec import Node2Vec\n",
"import telegram\n",
"\n",
"\n",
"#Telegram Bot configurations\n",
"\n",
"my_token = '363083153:AAGFIcfkGdxRmEqYwSrq4h91t83mWMfgNgc' #token generated by @BotFather\n",
"my_chat_id = 169036135 #your Telegram id (can be obtained from @userinfobot)\n",
"\n",
"def send(msg, chat_id=my_chat_id, token=my_token):\n",
"    \"\"\"\n",
"    Send a message to a chat\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    msg : String\n",
"        text that will be sent\n",
"\n",
"    chat_id : int or String\n",
"        id of the chat to which the message will be sent.\n",
"        If it is a group or user, the id MUST be an integer (negative numbers for groups).\n",
"        If it is a channel, the id CAN be the channel's tag.\n",
"\n",
"    token : String\n",
"        Token of the bot that will send the message.\n",
"        When sending a message to a user, the user must have talked\n",
"        to the bot at least once (usually via the /start command).\n",
"        When sending to a group, the bot must be allowed to talk\n",
"        in groups.\n",
"        When sending to a channel, the bot must be an admin.\n",
"    \"\"\"\n",
"\n",
"    bot = telegram.Bot(token=token)\n",
"    bot.sendMessage(chat_id=chat_id, text=msg)\n",
"\n",
"\n",
"#Defining log objects to notify the step updates\n",
"\n",
"class AbstractSimpleLog():\n",
"    def log(self, msg):\n",
"        raise Exception(\"log method must be implemented\")\n",
"\n",
"class PrintLog(AbstractSimpleLog):\n",
"    def log(self, msg):\n",
"        print(msg)\n",
"\n",
"class BotLog(AbstractSimpleLog):\n",
"    def __init__(self, reason=None):\n",
"        if reason:\n",
"            send(\"Bot started to log: {}\".format(reason))\n",
"        else:\n",
"            send(\"Log Started -------------------------\")\n",
"    def log(self, msg):\n",
"        send(msg)\n",
"\n",
"\n",
"#Default name for the embedding file\n",
"EMBEDDING_FILE = \"embeddings.emb\""
]
},
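{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick, hedged usage sketch of the helpers above: *PrintLog* only prints, so it is safe to run locally; *BotLog* and *send* would actually message the configured Telegram chat, so they are left commented out. The name *demo_logger* is illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: exercise the logging abstraction without side effects\n",
"demo_logger = PrintLog()\n",
"demo_logger.log(\"logger is working\")\n",
"\n",
"#To receive the messages on a phone instead (requires a valid token and chat id):\n",
"# demo_logger = BotLog(reason=\"demo run\")\n",
"# send(\"hello from the notebook\")"
]
},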
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing the Embedding\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we define the paths to the dataset and to its edge file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#We can define a log object. For long computations, we use BotLog to keep track\n",
"# of the progress on a cellphone through the Telegram app\n",
"\n",
"logger = PrintLog()\n",
"# logger = BotLog()\n",
"\n",
"dataset = \"Email_dataset/\" #path to the dataset folder\n",
"graph_file = dataset+\"edges.ssv\" #path to the edges file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we compute the graph embedding"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Computing transition probabilities: 0%| | 0/1005 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading graph from Email_dataset/edges.ssv\n",
"Graph loaded\n",
"Computing transition probabilities\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 100%|██████████| 1005/1005 [00:03<00:00, 301.44it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transition probabilities computed\n",
"Starting Node2Vec embedding\n",
"Node2Vec embedding created\n",
"Saving embedding file\n",
"Embedding file saved\n"
]
}
],
"source": [
"logger.log(\"Loading graph from {}\".format(graph_file))\n",
"graph = nx.read_edgelist(graph_file, delimiter=\" \", create_using=DiGraph())\n",
"logger.log(\"Graph loaded\")\n",
"\n",
"logger.log(\"Computing transition probabilities\")\n",
"n2v = Node2Vec(graph, dimensions=128, walk_length=80, num_walks=50, workers=4, p=1, q=1)\n",
"logger.log(\"Transition probabilities computed\")\n",
"\n",
"logger.log(\"Starting Node2Vec embedding\")\n",
"n2v_model = n2v.fit(window=80, min_count=1, batch_words=64)\n",
"logger.log(\"Node2Vec embedding created\")\n",
"\n",
"logger.log(\"Saving embedding file\")\n",
"n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)\n",
"logger.log(\"Embedding file saved\")"
]
},
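{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since *n2v.fit* returns a gensim word2vec model, we can already inspect the learned vectors in memory. The sketch below assumes node \"1\" is present in the graph (node ids are read as strings by *read_edgelist*)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: look at one learned vector and its nearest neighbours in the embedding space\n",
"print(n2v_model.wv[\"1\"].shape) #the 128-dimensional vector of node \"1\"\n",
"n2v_model.wv.most_similar(\"1\", topn=5) #nodes whose embeddings are closest to node \"1\""
]
},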
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to use the generated embedding to predict the employees' departments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Processing\n",
"---\n",
"\n",
"To process the data, we are going to use NumPy's functions"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from numpy import array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we saved the embedding, we can load it as a NumPy matrix"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"vectors = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1)"
]
},
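{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional sanity-check sketch of what was loaded: the word2vec text format stores one node per row, with the node id as the first column followed by the 128 embedding dimensions, so *vectors* should have 129 columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: the first column holds the node id, the remaining 128 columns hold the embedding\n",
"print(vectors.shape)\n",
"print(vectors[:3, 0]) #node ids of the first three rows"
]
},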
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also define a function to get a node's embedding representation"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def to_embedded(n):\n",
"    #return row n of the embedding matrix\n",
"    return vectors[n,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we load the data from the *labels.ssv* file and build the dataset in the form:\n",
"\n",
"NODE_EMBEDDING, DEPARTMENT"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.00000000e+00, -5.52356600e-01, -1.31899000e+00, ...,\n",
" 2.73627400e+00, 3.10003760e-01, 1.00000000e+00],\n",
" [ 1.30000000e+02, 1.60013030e+00, -4.14573250e-01, ...,\n",
" -1.19659230e+00, -4.13204510e-02, 1.00000000e+00],\n",
" [ 5.32000000e+02, -9.19102910e-01, -1.96323200e-01, ...,\n",
" 2.04122850e+00, -2.22895600e+00, 2.10000000e+01],\n",
" ..., \n",
" [ 7.50000000e+02, -2.84327940e-03, -2.77449560e-03, ...,\n",
" 3.72426860e-03, -9.51730650e-04, 1.00000000e+00],\n",
" [ 7.90000000e+02, -5.00332680e-04, -2.21990440e-03, ...,\n",
" -1.75051400e-03, 2.45084990e-03, 6.00000000e+00],\n",
" [ 9.44000000e+02, 6.54737760e-04, -1.85789620e-03, ...,\n",
" -1.21631000e-04, -3.08797140e-03, 2.20000000e+01]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [] #data matrix, initially empty\n",
"\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
"    for line in f:\n",
"        node,department = line.split() #get the node id and its department (class)\n",
"        node_embedded = to_embedded(int(node)) #get the embedded representation of the node\n",
"        data.append(np.append(node_embedded,array([department]))) #append the embedding and the class to the data matrix\n",
"\n",
"data = array(data,dtype=float) #transform the data matrix into a NumPy array\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We shuffle the data and split it into train and test subsets, using the *train_percentage* factor"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"np.random.shuffle(data)\n",
"train_percentage = 0.8\n",
"train_size = int(len(data)*train_percentage)\n",
"\n",
"train_data = array(data[0:train_size])\n",
"test_data = array(data[train_size:])"
]
},
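{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because some departments have very few members, a plain random split may leave them out of the training set entirely. As an alternative sketch (not used below), scikit-learn's *train_test_split* can stratify on the label column; it requires at least two examples per class, and the names *X_train*, *X_test*, *y_train*, *y_test* are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch (not used below): a stratified split that preserves the department proportions\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    data[:, :-1], data[:, -1],\n",
"    test_size=0.2, stratify=data[:, -1], random_state=42)"
]
},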
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predictions\n",
"---\n",
"We are going to use the following models to predict the classes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.linear_model import Perceptron\n",
"\n",
"from sklearn.svm import SVC\n",
"\n",
"from sklearn.neural_network import MLPClassifier\n",
"\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"from sklearn.ensemble import GradientBoostingClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.ensemble import ExtraTreesClassifier\n",
"from sklearn.ensemble import AdaBoostClassifier\n",
"\n",
"from sklearn.gaussian_process import GaussianProcessClassifier\n",
"\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.naive_bayes import GaussianNB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also define a function to train a model and return its score on the test set"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def train_and_eval(model):\n",
"    #fit the model on the train split and return its accuracy on the test split\n",
"    model.fit(train_data[:,0:-1], train_data[:,-1])\n",
"    return model.score(test_data[:,:-1], test_data[:,-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we train and compute the score for each model"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LogisticRegression score: 5.970149253731343%\n",
"SGDClassifier score: 1.4925373134328357%\n",
"Perceptron score: 2.9850746268656714%\n",
"SVC score: 8.955223880597014%\n",
"MLPClassifier score: 7.960199004975125%\n",
"KNeighborsClassifier score: 5.970149253731343%\n",
"GaussianProcessClassifier score: 0.4975124378109453%\n",
"DecisionTreeClassifier score: 4.975124378109453%\n",
"BernoulliNB score: 3.482587064676617%\n",
"GaussianNB score: 6.467661691542288%\n",
"GradientBoostingClassifier score: 5.970149253731343%\n",
"RandomForestClassifier score: 7.960199004975125%\n",
"ExtraTreesClassifier score: 5.472636815920398%\n",
"AdaBoostClassifier score: 9.45273631840796%\n"
]
}
],
"source": [
"score = train_and_eval(LogisticRegression())\n",
"logger.log(\"LogisticRegression score: {}%\".format(score*100))\n",
"score = train_and_eval(SGDClassifier(max_iter=100, tol=0.001))\n",
"logger.log(\"SGDClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(Perceptron(max_iter=100, tol=0.001))\n",
"logger.log(\"Perceptron score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(SVC())\n",
"logger.log(\"SVC score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(MLPClassifier())\n",
"logger.log(\"MLPClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(KNeighborsClassifier())\n",
"logger.log(\"KNeighborsClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GaussianProcessClassifier())\n",
"logger.log(\"GaussianProcessClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(DecisionTreeClassifier())\n",
"logger.log(\"DecisionTreeClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(BernoulliNB())\n",
"logger.log(\"BernoulliNB score: {}%\".format(score*100))\n",
"score = train_and_eval(GaussianNB())\n",
"logger.log(\"GaussianNB score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GradientBoostingClassifier())\n",
"logger.log(\"GradientBoostingClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(RandomForestClassifier())\n",
"logger.log(\"RandomForestClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(ExtraTreesClassifier())\n",
"logger.log(\"ExtraTreesClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(AdaBoostClassifier())\n",
"logger.log(\"AdaBoostClassifier score: {}%\".format(score*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our accuracy is very low for all the models. To understand why, let's analyse the distribution of departments over the data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAHYCAYAAACC36ucAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGvVJREFUeJzt3X2wbWddH/Dvj1xFIQIJuV4iUC61QUQtoFekYgvTODUa\nx2QsMuhoA0ObP6pAqVO9Fp1Yq5I6DtaOQicFMYJgI1oTiyAxvtVWQi4ECSFBEBJezMsVELQ6ysvT\nP/ZK3dk5J9n77LPv+eWcz2dmz157rWev51kve33Xy95r1xgjAEBPD9jrBgAA2xPUANCYoAaAxgQ1\nADQmqAGgMUENAI0JagBoTFADQGOCGgAaO7TXDUiSs846axw9enSvmwEAp8zb3va2PxtjHL6vci2C\n+ujRozlx4sReNwMATpmqunWZck59A0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFAD\nQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQ2KG9bsCio8ffcI9+\nt1x6/h60BAD2niNqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoA\nGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0A\njQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0dp9BXVU/V1V3VtW75vqdWVVXV9V7p+cz5ob9QFW9\nr6reU1XfsKmGA8BBsMwR9c8nOW+h3/Ek14wxzklyzfQ6VfWEJM9O8mXTe15WVaftWmsB4IC5z6Ae\nY/x+ko8t9L4gyeVT9+VJLpzr/0tjjL8ZY3wgyfuSPGWX2goAB85Or1EfGWPcNnXfnuTI1P3IJB+a\nK/fhqd89VNXFVXWiqk6cPHlyh80AgP1t7S+TjTFGkrGD9102xjg2xjh2+PDhdZsBAPvSToP6jqo6\nO0mm5zun/h9J8ui5co+a+gEAO7DToL4qyUVT90VJrpzr/+yqemBVPTbJOUneul4TAeDgOnRfBarq\ndUmekeSsqvpwkkuSXJrkiqp6XpJbkzwrScYYN1bVFUneneTTSb57jPGZDbUdAPa9+wzqMca3bzPo\n3G3K/1iSH1unUQDAjDuTAUBjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAa\nE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCN\nCWoAaOzQXjdgHUePv+Ee/W659Pw9aAkAbIYjagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAx\nQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCY\noAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhM\nUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjawV1Vb2oqm6sqndV1euq6vOq\n6syqurqq3js9n7FbjQWAg2bHQV1Vj0zygiTHxhhfnuS0JM9OcjzJNWOMc5JcM70GAHZg3VPfh5J8\nflUdSvKgJH+a5IIkl0/DL09y4Zp1AMCBteOgHmN8JMlPJvlgktuSfGKM8eYkR8YYt03Fbk9yZKv3\nV9XFVXWiqk6cPHlyp80AgH1tnVPfZ2R29PzYJF+U5MFV9Z3zZcYYI8nY6v1jjMvGGMfGGMcOHz68\n02YAwL62zqnvr0/ygTHGyTHGp5L8apKvTXJHVZ2dJNPznes3EwAOpnWC+oNJnlpVD6qqSnJukpuS\nXJXkoqnMRUmuXK+JAHBwHdrpG8cY11bV65O8Pcmnk1yf5LIkpye5oqqel+TWJM/ajYYCwEG046BO\nkjHGJUkuWej9N5kdXQMAa3JnMgBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhM\nUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQm\nqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT\n1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0J\nagBoTFADQGOCGgAaO7TXDTgVjh5/w5b9b7n0/FPcEgBYjSNqAGhMUANAY4IaABoT1ADQmKAGgMYE\nNQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADR2IP7mchVb/SWmv8MEYK+sdURd\nVQ+rqtdX1c1VdVNV/aOqOrOqrq6q907PZ+xWYwHgoFn31PdPJ3nTGOPxSZ6Y5KYkx5NcM8Y4J8k1\n02sAYAd2HNRV9dAk/yTJK5NkjPG3Y4w/T3JBksunYpcnuXDdRgLAQbXOEfVjk5xM8qqqur6qXlFV\nD05yZIxx21Tm9iRHtnpzVV1cVSeq6sTJkyfXaAYA7F/rBPWhJF+Z5OVjjCcn+b9ZOM09xhhJxlZv\nHmNcNsY4NsY4dvjw4TWaAQD71zpB/eEkHx5jXDu9fn1mwX1HVZ2dJNPznes1EQAOrh0H9Rjj9iQf\nqqovmXqdm+TdSa5KctHU76IkV67VQgA4wNb9HfXzk/xiVX1ukvcneW5m4X9FVT0vya1JnrVmHQBw\nYK0V1GOMdyQ5tsWgc9cZLwAw4xaiANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA\n0JigBoDGBDUANCaoAaAxQQ0Aja37N5cH1tHjb9iy/y2Xnn+KWwLAfuaIGgAaE9QA0JigBoDGBDUA\nNCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCN+ZvLU2Crv8T0d5gA\nLMMRNQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCY31E34zfXAMxzRA0AjQlqAGhMUANAY4Ia\nABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUEN\nAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAG\ngMbWDuqqOq2qrq+q/zm9PrOqrq6q907PZ6zfTAA4mHbjiPqFSW6ae308yTVjjHOSXDO9BgB2YK2g\nrqpHJTk/ySvmel+Q5PKp+/IkF65TBwAcZOseUf/nJN+X5LNz/Y6MMW6bum9PcmSrN1bVxVV1oqpO\nnDx5cs1mAMD+tOOgrqpvTnLnGONt25U
ZY4wkY5thl40xjo0xjh0+fHinzQCAfe3QGu99WpJvqapv\nSvJ5SR5SVa9JckdVnT3GuK2qzk5y5240FAAOoh0fUY8xfmCM8agxxtEkz07y22OM70xyVZKLpmIX\nJbly7VYCwAG1zhH1di5NckVVPS/JrUmetYE6Dryjx9+wZf9bLj3/FLcEgE3alaAeY/xukt+duj+a\n5NzdGC8AHHTuTAYAjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0Bj\nghoAGhPUANDYJv7mkma2+ktMf4cJcP/giBoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFAD\nQGOCGgAaE9QA0JhbiHI3bjcK0IsjagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlq\nAGhMUANAY4IaABpzr292ZKt7gidb3xfc/cMBds4RNQA0JqgBoDFBDQCNuUZNK65nA9ydI2oAaExQ\nA0BjTn1zv+QUOXBQOKIGgMYENQA0JqgBoDHXqNn3XM8G7s8cUQNAY4IaABoT1ADQmGvUMPHXnUBH\njqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI25hShsmNuNAutwRA0A\njQlqAGhMUANAY4IaABrbcVBX1aOr6neq6t1VdWNVvXDqf2ZVXV1V752ez9i95gLAwbLOEfWnk3zv\nGOMJSZ6a5Lur6glJjie5ZoxxTpJrptcAwA7sOKjHGLeNMd4+df9FkpuSPDLJBUkun4pdnuTCdRsJ\nAAfVrlyjrqqjSZ6c5NokR8YYt02Dbk9yZJv3XFxVJ6rqxMmTJ3ejGQCw76wd1FV1epJfSfJvxhif\nnB82xhhJxlbvG2NcNsY4NsY4dvjw4XWbAQD70lpBXVWfk1lI/+IY41en3ndU1dnT8LOT3LleEwHg\n4FrnW9+V5JVJbhpjvHRu0FVJLpq6L0py5c6bBwAH2zr3+n5aku9KckNVvWPq9++TXJrkiqp6XpJb\nkzxrvSbCwbDVPcET9wWHg27HQT3G+IMktc3gc3c6XgDg77gzGQA0JqgBoDH/Rw33Q/7jGg4OR9QA\n0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMbcQhT2Obcbhfs3R9QA0Jig\nBoDGBDUANOYaNZDEtWzoyhE1ADQmqAGgMUENAI25Rg2sbNnr2VuVW6Wsa+TgiBoAWhPUANCYU9/A\n/Y7T5BwkjqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxvyOGti33MKU/cARNQA0JqgBoDFB\nDQCNuUYNsKJN/M0nbMcRNQA0JqgBoDFBDQCNuUYN0IDr3mzHETUANCaoAaAxQQ0AjQlqAGhMUANA\nY4IaABrz8yyAfWqVv+70N599OaIGgMYENQA0JqgBoDHXqAFYmluYnnqOqAGgMUENAI0JagBozDVq\nADbC77h3hyNqAGhMUANAY4IaABpzjRqA+41Vfse9X657O6IGgMYENQA05tQ3AAda99uiOqIGgMYE\nNQA0JqgBoDHXqAFgSXtxW9SNHVFX1XlV9Z6qel9VHd9UPQCwn20kqKvqtCQ/m+QbkzwhybdX1RM2\nURcA7GebOqJ+SpL3jTHeP8b42yS/lOSCDdUFAPtWjTF2f6RVz0xy3hjjX06vvyvJ14wxvmeuzMVJ\nLp5efkmS92wxqrOS/NkSVS5bblNl7y/jPOj178dp2uv69+M07XX9+3GaDnr925V7zBjj8H2+e4yx\n648kz0zyirnX35XkZ3YwnhO7WW5TZe8v4zzo9e/Hadrr+vfjNO11/ftxmg56/auMc6vHpk59fyTJ\no+deP2rqBwCsYFNBfV2Sc6rqsVX1uUmeneSqDdUFAPvWRn5HPcb4dFV9T5LfTHJakp8bY9y4g1Fd\ntsvlNlX2/jLOg17/fpymva5/P07TXte/H6fpoNe/yjjvYSNfJgMAdodbiAJAY4IaABoT1ADQWNs/\n5aiqXxhj/Iu9bsc6quoFSf7HGONDS5R9fGZ3b3vk1OsjSa4aY9y0wSbO1//3k3xrZj+r+0ySP07y\n2jHGJ09B3V+T5KYxxier6vOTHE/ylUneneTHxxif2HQbToWqevgY46Nb9L/rlxF/Osb4rar6jiRf\nm+SmJJeNMT61zfi+LrO7AL5rjPHmDTZ9W1X1hWOMO09hfU9JMsYY1023JT4vyc1jjN9YY5yPz+xz\nd+0Y4y/n+p83xnjTfbx3y2Xa0aleVuyeFkfUVXXVwuPXk3zrXa+XeP8X7qDO5y5Z7o2rjnvOf0xy\nbVX9r6r611W15R1oqur7M7vNaiV56/SoJK87FX9oMu1Q/Nckn5fkq5M8MLPAfktVPWPT9Sf5uSR/\nNXX/dJKHJvlPU79XLbT1EVX18qr62ap6eFX9cFXdUFVXVNXZu9Wgqnr4Fv0eUlUvqapXT2E6P+xl\nC68vraqzpu5jVfX+zNaFW6vq6QujflWS85O8sKpeneTbklyb2bJ4xdw43zrX/a+S/EySL0hyyeJ6\nUlVvr6ofrKovXmJaHzq19+aq+lhVfbSqbpr6PWyu3JkLj4cneWtVnVFVZy6M81hV/U5VvaaqHl1V\nV1fVJ6rquqp68ly58xba8cqqemdVvbaqjiyM85Ik/yXJy6vqJdP0PzjJ8ap68U6mf1r3r0zy/CTv\nqqr5Wx3/+ELZpZfpKuvKvbTtjQuvl5qnU9mlltWK8//0qvqRqrpxqvdkVb2lqp6zUG6p9WnVssta\ntp1LjGdx/p+Sbc+W1rlbym49krw9yWuSPCPJ06fn26bupy+UPXPh8fAktyQ5I8mZK9T5wbnur9zm\n8VVJblt433lz3Q9N8sok70zy2iRHFspen9nO0D+byp1M8qYkFyX5grlyf5zkc7Zo4+cmee9Cv4cm\nuTTJzUk+luSjmR15XZrkYVvM1x9M8sX3MS9uSHLa1P2gJL87df+9JNcvlD09yY8kuTHJJ6ZpekuS\n56y4zN84133TfJsXyr1j4fWbMtuoHp/m+/dntlPx/CRXLpR9RJKXZ/YHMQ9P8sPTtF6R5Oy5cpcm\nOWvqPpbk/Unel+TW+fUvya9MZS/M7L4Av5Lkgdu0+4a57t9J8tVT9+OycJeiJO+cng8luWNuWdRd\nw+5an+a6r0tyeOp+8Hx9U78PJPnJJB/MbMfvRUm+aJtl8ZvTfHzEwrz7/iRvnuv32Wm8849PTc/v\nXxjnWzP7U55vT/KhJM+c+p+b5A+3Wt6Z7ZT8aJLHTO39ta3W02kd/WSSh0z9P39+Pq0y/dM4T5+6\njyY5keSFi/N7B8t0qXUlq217lpqnqyyrFef/lUmek9kNrP5tkh9Kck6SyzM787XS+rSDsg9J8pIk\nr07yHQvDXrZqO3cw/1fZ9hyb1pHXTGWuzmx7eV2SJ9/btnHLz+iqb9jEI7Mwe9E0MU+a+r1/m7Kr
\nbCzeuc3jhiR/M1fuM0l+e5qxi4+/XhjnKiv24sb7c5J8S5LXJTk51//mzO75ujitj0nynjVW7FU2\nVndtRM7I3EYns9OqK39YV/kQJPnlJM+dul+V5NjU/bgk1y2Mcz6sPrgwbEehniU3wFuM/8VJ/ndm\nOwGLy/qmJIem7rcszu+F1+/KbKfsjCR/kWmHM7MzHPM7MX80ldmqvsVQmV9P/3GSlyW5fZq+ixfK\n3m0d225Yku+d5ulXzK9j27zv3pbT9du0c3H+Lr6+fqvubcouNf1Jblx43+nTNL50i3GuskyXWley\n2rZnqXm6yrJacf7/0cLr66bnB2R2+WGl9WkHZZfd+Vmqnbs8/xfn1dI7Vcs8Viq86UdmG/9fzuyU\n1ge3KbPKxuKOJE/KLPDmH0czux54V7l3JTlnm3F8aI0V+/qtxjkNe9Bc93mZHcG9MbMfxl82TeP7\nMncEv4MVe9mN1QszC7L/ltlOw12heTjJ7y+Mc9c/BJmdJfj5JH+S2SnfT2V2VPt7SZ64Xf1JfnRh\n2OLGcqkPVpbcAE/lHrAw/DmZnV24daH/85O8Ock/zexI/qczO0P0H5K8eqHsi6bpvTXJC5JcMy2L\nG5JcMlfulqncB6bns6f+p2+x7t0tyKd+p03r2qsW+r85yfdl7oxQkiOZ7dj81jaf0Zdmdtp9ux3q\nP8zsTNK3TdN14dT/6bn7zs+HM9vh+95pumpu2OJR8rWZPjfzy2FafxZ3XJaa/mn9fNJCuUNJfiHJ\nZ9ZYpkutK1lt27PUPF1lWa04//9Pkq+bur8lyW/ODZvfoVtlfVql7LI7P0u1cwfz/962PYvzaumd\nqmUeKxU+VY/Mrtf9+L0MX3Zj8cq7FtgWw1471/3MJF+yTbkL11ixH7fCND8gyVOT/PPp8dRMp0AX\nyq2yYq+ysf6yaT48/j7auZEPwdTvIUmemNkR95Ft3vcjmU5VLvT/B0lev9BvqVDPkhvgJD+R5Ou3\nqPu8LFyimPo/I8l/z+wSyA1JfiOzf4zb6jLHF2U625HkYdOyeMqS686Dkjx2od8vrbDunZHZdwJu\nTvLxzC6p3DT12/Jy0rTs35Lk9m2GPzGzsz9vTPL4aZ5+PLOgetpcuUsWHnedzn9Ekl9YGOcDt6nr\nrMztuK8y/ZltSx6xzbCnbdFvqWW67LqS1bY9W83TP5/m6dfeyzRuu6xWnP9PzOxI8eNJ/uCudme2\nQ/+Ce1mfPj6tTz+xuD6tsu5l+Z2fxXY+bqt27mD+r7LtWWmn6j7X01Xf0OlxbyvgDsb1+MxOS5y+\n0H/xiHbpFXtD0zy/Yn9sYcU+Y6Hs0hvrFer/h5v4EGxoWa3ywXpGtt4AH1qy7m/cop6n5O9Oo39Z\nZjt337TpdWSb+p+Q2Q7mlvVP0/X1S8zT+XF+RWbfgdhunF+zzPSv0s5NrCebGuey07XicvrSFer/\n/23N7Dr+l2+zTFeZpi9dZj3Z4n2vvpdhS31OssKO8sK6t9I6lXvZjq+wTHe0U7VtvTtdYbs8FlbA\n5+5wHC/I7P+wfy2z04sXzA27x1HpvYxnR/Xv4rxYuv5NtPVU1J/Z0e8pW1bz5VapO7OduLdk9uWk\nl2R2OvuHkvx+khefgnVhsf7f3q7+Zdf/VaZp2bKbmk+7tZ7sZD6tMv93sJxuXrL+ZZfpKuv0UvVn\ndv148fGXd3Vv4nOSu39OV5mni+389RXauu14l23r0u9Zd4PQ6ZFtrmsv8b6lv/m5ifr3Yvo30dZT\nUf+pXla5+68DVvqGcJb8hvKG1oVVviG91HTtYJz3WXZT82m31pOdjnMT07+D+pddprs9zlV+xbMr\nyz/3/JwuO0+v38u2Lvtoe8OT7VTVO7cblNm12p14wJhudDDGuKVmvx1+fVU9Zhrvputf2ir1b6Kt\ne11/NrCsVmjn0nUn+fQY4zNJ/qqq/mRMN44ZY/x1VX12ielc1yr1Lztdq4xz2bKbmk+rLKtNjHMT\n079K/cuW3cQ4j2X2BdUXJ/l3Y4x3VNVfjzF+7x5zdIXpX+Fzuso8/ao9butS7ndBndlEfkNm10jn\nVWZfdNqJO6rqSWOMdyTJGOMvq+qbM7sRx1ecgvpXsUr9m2jrXte/iWW1bLlV6v7bqnrQGOOvMtsY\nzEZY9dDMfmK4aavUv+x0rTLOZctuaj6tsqw2Mc5NTP8q9S9bdtfHOcb4bJKfqqpfnp7vyPZZs8r0\nL/s5XXqcDdq6nFUPwff6kSW/yb3iOJf+5ucm6t/U9G9oXu11/bu+rFYot0rdS39DeUPrySrfkF5q\nulYc51JlNzWfVllWG1r3dn36V6x/2WW66+PcYti2v+JZcfqX/ZzueJ061W1d9uH/qAGgsRb3+gYA\ntiaoAaAxQQ0AjQlqAGhMUANAY/8PsAMboF4aYNEAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1257084a8>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"df = pd.read_csv(\"Email_dataset/labels.ssv\", delimiter=\" \", names=[\"Node\",\"Dep\"])\n", | |
"_ = df[\"Dep\"].value_counts().plot(kind=\"bar\", figsize=(8,8))" | |
] | |
}, | |
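{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bar chart suggests a strongly skewed class distribution. A quick way to put the scores above in context is the accuracy of a trivial classifier that always predicts the most frequent department; this is only an illustrative sketch using the already-loaded *df*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: accuracy of always predicting the most frequent department\n",
"counts = df[\"Dep\"].value_counts()\n",
"print(\"majority department {}: {:.1f}% of the nodes\".format(\n",
"    counts.index[0], 100 * counts.iloc[0] / counts.sum()))"
]
},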
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analysing the plot, we can see that the dataset is imbalanced, which can explain the bad accuracies we got.\n",
"So we are going to take a subgraph that contains only a balanced subset of the departments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the Subgraph\n",
"---\n",
"First, we build a dictionary mapping each node to its department"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"labels_dict = {}\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
"    for line in f:\n",
"        node,department = line.split()\n",
"        labels_dict[node] = department"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we generate a subset of the edges containing only nodes from the departments defined in *labels_filter*"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"labels_filter = [\"4\",\"14\"]\n",
"with open(\"Email_dataset/edges.ssv\",'r') as f_in:\n",
"    with open(\"Email_dataset/edges_filtered.ssv\",'w') as f_out:\n",
"        for line in f_in:\n",
"            src,trg = line.split()\n",
"            if labels_dict[src] in labels_filter and labels_dict[trg] in labels_filter:\n",
"                f_out.write(line)"
]
},
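{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before re-embedding, an optional sketch to check how much of the graph survived the filter (the variable name *filtered_graph* is illustrative only)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: count the nodes and edges that survived the filter\n",
"filtered_graph = nx.read_edgelist(\"Email_dataset/edges_filtered.ssv\", delimiter=\" \", create_using=DiGraph())\n",
"print(filtered_graph.number_of_nodes(), \"nodes and\", filtered_graph.number_of_edges(), \"edges after filtering\")"
]
},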
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we reload the graph and compute its embedding"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 28%|██▊ | 54/194 [00:00<00:00, 537.90it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading graph from Email_dataset/edges_filtered.ssv\n",
"Graph loaded\n",
"Computing transition probabilities\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 100%|██████████| 194/194 [00:00<00:00, 1001.05it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transition probabilities computed\n",
"Starting Node2Vec embedding\n",
"Node2Vec embedding created\n",
"Saving embedding file\n",
"Embedding file saved\n"
]
}
],
"source": [
"logger = PrintLog()\n",
"\n",
"filtered_graph_file = \"Email_dataset/edges_filtered.ssv\"\n",
"\n",
"logger.log(\"Loading graph from {}\".format(filtered_graph_file))\n",
"graph = nx.read_edgelist(filtered_graph_file, delimiter=\" \", create_using=DiGraph())\n",
"logger.log(\"Graph loaded\")\n",
"\n",
"logger.log(\"Computing transition probabilities\")\n",
"n2v = Node2Vec(graph, dimensions=128, walk_length=50, num_walks=30, workers=4, p=1, q=1)\n",
"logger.log(\"Transition probabilities computed\")\n",
"\n",
"logger.log(\"Starting Node2Vec embedding\")\n",
"n2v_model = n2v.fit(window=50, min_count=1, batch_words=64)\n",
"logger.log(\"Node2Vec embedding created\")\n",
"\n",
"logger.log(\"Saving embedding file\")\n",
"n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)\n",
"logger.log(\"Embedding file saved\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we process the data again, keeping only the employees that are in the filtered graph"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"#reload the embedding of the filtered graph\n",
"vectors = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1)\n",
"\n",
"def to_embedded(n):\n",
"    return vectors[n,:]\n",
"\n",
"data = []\n",
"\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
"    for line in f:\n",
"        node,department = line.split()\n",
"        if department in labels_filter:\n",
"            try:\n",
"                #the embedding row is looked up by the node's position in the graph's node list\n",
"                node_embedded = to_embedded(graph.nodes().index(node))\n",
"                data.append(np.append(node_embedded,array([department])))\n",
"            except:\n",
"                #skip labeled nodes that are not present in the filtered graph\n",
"                pass\n",
"\n",
"data = array(data,dtype=float)\n",
"\n",
"np.random.shuffle(data)\n",
"train_percentage = 0.7\n",
"train_size = int(len(data)*train_percentage)\n",
"\n",
"train_data = array(data[0:train_size])\n",
"test_data = array(data[train_size:])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we run the models again"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LogisticRegression score: 55.932203389830505%\n",
"SGDClassifier score: 45.76271186440678%\n",
"Perceptron score: 54.23728813559322%\n",
"SVC score: 42.3728813559322%\n",
"MLPClassifier score: 55.932203389830505%\n",
"KNeighborsClassifier score: 45.76271186440678%\n",
"GaussianProcessClassifier score: 40.67796610169492%\n",
"DecisionTreeClassifier score: 52.54237288135594%\n",
"BernoulliNB score: 45.76271186440678%\n",
"GaussianNB score: 55.932203389830505%\n",
"GradientBoostingClassifier score: 55.932203389830505%\n",
"RandomForestClassifier score: 54.23728813559322%\n",
"ExtraTreesClassifier score: 57.6271186440678%\n",
"AdaBoostClassifier score: 52.54237288135594%\n"
]
}
],
"source": [
"score = train_and_eval(LogisticRegression())\n",
"logger.log(\"LogisticRegression score: {}%\".format(score*100))\n",
"score = train_and_eval(SGDClassifier(max_iter=100, tol=0.001))\n",
"logger.log(\"SGDClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(Perceptron(max_iter=100, tol=0.001))\n",
"logger.log(\"Perceptron score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(SVC())\n",
"logger.log(\"SVC score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(MLPClassifier())\n",
"logger.log(\"MLPClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(KNeighborsClassifier())\n",
"logger.log(\"KNeighborsClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GaussianProcessClassifier())\n",
"logger.log(\"GaussianProcessClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(DecisionTreeClassifier())\n",
"logger.log(\"DecisionTreeClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(BernoulliNB())\n",
"logger.log(\"BernoulliNB score: {}%\".format(score*100))\n",
"score = train_and_eval(GaussianNB())\n",
"logger.log(\"GaussianNB score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GradientBoostingClassifier())\n",
"logger.log(\"GradientBoostingClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(RandomForestClassifier())\n",
"logger.log(\"RandomForestClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(ExtraTreesClassifier())\n",
"logger.log(\"ExtraTreesClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(AdaBoostClassifier())\n",
"logger.log(\"AdaBoostClassifier score: {}%\".format(score*100))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}