ZhuTingxiang/f8dc14555560c47aa66e1973f15fc705
Created November 19, 2016
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"The aim of this project is to analyze LinkedIn data in order to find the **relationships between personal skills and occupations**. We want to understand how certain types of skills contribute to a person’s chance of getting certain types of jobs in a given field. We also use a person’s education background as supporting information, to see whether the college a person attended improves those chances.\n",
"\n",
"The whole project can be divided into five parts:\n",
"* data collection: collect personal skill and position data by scraping LinkedIn profiles\n",
"* pre-processing: clean the raw data we collected\n",
"* model training: use a model similar to TF-IDF to analyze the relationship between skills and occupations\n",
"* model evaluation: test the model after training\n",
"* result visualization: use word clouds and other techniques to present the model results\n",
"\n",
"## Data Collection\n",
"\n",
"First of all, we need people’s skills and their current or past job information. Since LinkedIn does not provide a public API that lets us extract skill and job data directly, we obtain them through screen scraping. We start from a single person’s name and visit his or her profile page through its LinkedIn URL. Meanwhile, we store the “also viewed” profile pages as new entries for scraping. From each profile we fetch all the listed skills and the titles of current and past work experiences. For each skill, we store its endorsement count as a weight. We also fetch the company for each period of work experience and the school (education) for each person, for possible use in later analysis.\n",
"\n",
"For now, we have scraped 7,000 lines of data from LinkedIn web pages and stored them in a CSV file. Loading them into a pandas DataFrame gives a first look at the data."
]
},
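{
"cell_type": "markdown",
"metadata": {},
"source": [
"The crawl described above — start from one profile, harvest its skills and endorsement counts, and queue the “also viewed” profiles — can be sketched roughly as follows. This is a minimal sketch, not our scraper: the page snippets and the `skill` / `also-viewed` class names are made up for illustration (LinkedIn’s real markup differs), and we parse canned strings instead of fetching live pages."
]
},

```python
import re
from collections import deque

# Canned stand-ins for fetched profile pages; real code would download each
# page over HTTP. The class names below are hypothetical, not LinkedIn's.
FAKE_PAGES = {
    "/in/alice": """
        <span class="skill" data-endorsements="5">Python</span>
        <span class="skill" data-endorsements="2">SQL</span>
        <a class="also-viewed" href="/in/bob">Bob</a>
    """,
    "/in/bob": """
        <span class="skill" data-endorsements="8">Java</span>
        <a class="also-viewed" href="/in/alice">Alice</a>
    """,
}

SKILL_RE = re.compile(r'class="skill" data-endorsements="(\d+)">([^<]+)<')
LINK_RE = re.compile(r'class="also-viewed" href="([^"]+)"')

def crawl(start):
    """BFS over profiles, collecting (skill, endorsement count) per profile."""
    queue, seen, profiles = deque([start]), {start}, {}
    while queue:
        url = queue.popleft()
        html = FAKE_PAGES[url]  # real code: fetch the page here
        profiles[url] = [(name, int(n)) for n, name in SKILL_RE.findall(html)]
        for link in LINK_RE.findall(html):  # queue "also viewed" profiles
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return profiles

profiles = crawl("/in/alice")
print(profiles["/in/alice"])  # [('Python', 5), ('SQL', 2)]
```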
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(dataframe.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"fullname object\n",
"locality object\n",
"industry object\n",
"current summary object\n",
"past summary object\n",
"education object\n",
"skills object\n",
"endorsements object\n",
"positions object\n",
"dtype: object"
]
},
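{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every column is stored as a plain `object` string, the skill and endorsement lists still need splitting before analysis. A minimal sketch of that step, assuming the column names from the dtypes listing above — the CSV text here is a tiny illustrative stand-in, not the real 7,000-row file:"
]
},

```python
import io
import pandas as pd

# Tiny stand-in for the scraped CSV (the real file holds ~7,000 rows).
# Column names follow the dtypes listing; the values are illustrative.
csv_text = (
    "fullname,locality,industry,skills,endorsements,positions\n"
    'Jane Doe,"Pittsburgh, Pennsylvania",Internet,'
    '"Java,Linux,SQL","20,14,13,","Software Engineer,Intern,"\n'
)

dataframe = pd.read_csv(io.StringIO(csv_text))

def parse_row(row):
    """Pair each skill with its endorsement count.

    The scraped columns are comma-separated strings with a trailing comma,
    so we split and drop empty tails before zipping them together.
    """
    skills = [s for s in row["skills"].split(",") if s]
    counts = [int(c) for c in row["endorsements"].split(",") if c]
    return list(zip(skills, counts))

print(parse_row(dataframe.iloc[0]))  # [('Java', 20), ('Linux', 14), ('SQL', 13)]
```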
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(dataframe.head(5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fullname locality \\\n",
"0 Tingxiang Zhu (Star) Pittsburgh, Pennsylvania \n",
"1 Jiaming Ni (Oscar) Pittsburgh, Pennsylvania \n",
"2 Jiaming Ni Shanghai City, China \n",
"3 Jiaming Ni China \n",
"4 Yuqi Wang (yuki) Pittsburgh, Pennsylvania \n",
"\n",
" industry current summary \\\n",
"0 Information Technology and Services NaN \n",
"1 Information Technology and Services NaN \n",
"2 Mechanical or Industrial Engineering Honeywell Aerospace \n",
"3 Broadcast Media Shanghai Media Group \n",
"4 Internet NaN \n",
"\n",
" past summary \\\n",
"0 DaoCloud.io, 10years.me, Hand Enterprise Solut... \n",
"1 NetEase \n",
"2 SKF Global Technical Center China, Donghua Uni... \n",
"3 Shanghai Meda Group \n",
"4 Rakuten, Hundsun Technologies Inc. \n",
"\n",
" education \\\n",
"0 Carnegie Mellon University \n",
"1 Carnegie Mellon University \n",
"2 Donghua University \n",
"3 NaN \n",
"4 Carnegie Mellon University - H. John Heinz III... \n",
"\n",
" skills \\\n",
"0 Cloud Computing,Python,Hadoop,SQL,Start-ups,in... \n",
"1 Python,Java,Shell Scripting,MySQL,MapReduce,Ha... \n",
"2 Testing,Engineering,NI LabVIEW,Matlab,Manufact... \n",
"3 NaN \n",
"4 Java,Linux,Microsoft Office,Databases,HTML,Mic... \n",
"\n",
" endorsements \\\n",
"0 5,6,5,5,2,2,2,3,1,2,1,1,1,1, \n",
"1 8,8,6,6,4,3,2,0,0,0,0,0,0,0,0,0,0, \n",
"2 2,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0, \n",
"3 NaN \n",
"4 20,14,13,12,9,6,6,5,4,3,3, \n",
"\n",
" positions \n",
"0 Software Engineer Intern,Co-Founder,Business I... \n",
"1 Software Development Intern, \n",
"2 Advanced Manufacturing Engineer,Hard Machining... \n",
"3 Researcher,Researcher,Researcher,Editor, \n",
"4 Software Engineer,Android Developer Intern, "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With some simple aggregation, we can compute the following statistics from the raw data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of different skills:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9541"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of different job titles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"num of positions: 23994"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most popular positions (TOP 10):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"['Director', 'Manager', 'Software Engineer', 'Consultant', 'Vice President', 'Owner', 'Project Manager', 'President', 'Intern', 'Founder']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Person with the largest number of listed skills:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Christina Quinones"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The skills listed on this profile:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"Oracle Applications,Oracle E-Business Suite,ERP,Oracle,CRM,Business Process,Testing,Business Analysis,Project Management,Visio,Microsoft Excel,Data Analysis,Oracle CRM,Analysis,Leadership,Troubleshooting,Program Management,Management,Financial Modeling,Cloud Computing,Oracle Order Management,Sales,Lean Process/DFSS Green...,MS Access, Excel, Word,Shoretel Administration,Project Management,Strategy,Agile Methodologies,Hedge Funds,Fixed Income,Software Development,Derivatives,SDLC,Management,Equities,SQL,Software Project...,.NET,Asset Managment,Consulting,Bloomberg,C#,Data Warehousing,Software Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most frequently mentioned skills overall:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"['management', 'leadership', 'strategy', 'marketing', 'project management', 'social media', 'business development', 'strategic planning', 'program management', 'sales']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most frequently mentioned skills among people who have held the title “Software Engineer”:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"{'software development': 16, 'xml': 11, 'java': 11, 'javascript': 10, 'sql': 10, 'agile methodologies': 9, 'c#': 9, 'ajax': 8, 'linux': 8, 'scrum': 8}"
]
},
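{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-title skill counts above come down to a simple filtered count. A sketch with `collections.Counter`, on made-up records rather than the real scrape:"
]
},

```python
from collections import Counter

# Illustrative stand-in for the (positions, skills) columns of the scraped
# data; each record is one profile.
records = [
    {"positions": ["Software Engineer", "Intern"], "skills": ["java", "sql", "linux"]},
    {"positions": ["Software Engineer"], "skills": ["java", "xml"]},
    {"positions": ["Manager"], "skills": ["leadership", "management"]},
]

def top_skills(records, title, n=10):
    """Count skill mentions among people who have ever held `title`."""
    counts = Counter()
    for rec in records:
        if title in rec["positions"]:
            counts.update(rec["skills"])
    return counts.most_common(n)

print(top_skills(records, "Software Engineer"))
```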
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Pre-processing\n",
"\n",
"However, the raw data has several problems.\n",
"* Target data problem:\n",
" We target only profile pages written in English, so all non-ASCII lines are removed. Lines with no skill or job position information are also removed. (Done)\n",
"\n",
"* Invalid data problem:\n",
" For example, some position fields are empty, and some users did not fill in any skills. Pre-processing includes removing invalid records, unifying the expression of job positions and skills, and categorizing them. (Done)\n",
"\n",
"* Job title similarity problem:\n",
" For example, some people use “Software Engineer” as their job title while others use “Software Developer”. We do not want so many separate job categories, so we merge similar job titles into one. Likewise, we merge similar skills such as “teamwork” and “teamworking”. School names are mostly in a standard form; we just need to discard the department names appended after them. To unify the expressions, we plan to apply natural language processing techniques such as lemmatization and word-similarity analysis. We expect far fewer distinct skills and job positions to remain after this processing, so we can then categorize them manually. (In progress)\n",
"\n",
"## Analysis Model\n",
"\n",
"We plan to use a feature matrix similar to the TF-IDF matrix used in natural language processing. We construct one matrix row per person per position, and use all the mentioned skills as features (columns). Just as TF-IDF decides a term’s feature value from its frequency in a document and across the whole corpus, we decide a skill’s feature value from its endorsement count and how commonly it is mastered across people. Simply speaking, we use the skill’s endorsement count as tf, and the total number of records divided by the number of records containing that skill as idf.\n",
"\n",
"We believe this feature matrix makes sense: the more endorsements a skill has, the more likely it is actually mastered by that person; and the more universally a skill is mastered (Microsoft Office, for instance), the less likely it is to distinguish a particular type of job position.\n",
"\n",
"## Future Plan\n",
"\n",
"The final output of the project is expected to be a visualization of the relationship between skills (plus education) and job positions, possibly in the form of a word cloud. It may also be useful to apply a Naive Bayes or Logistic Regression model to estimate the probability that a certain set of skills leads to a certain job position.\n",
"\n",
"Remaining steps:\n",
"* Model training\n",
"* Model evaluation\n",
"* Data visualization"
]
},
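{
"cell_type": "markdown",
"metadata": {},
"source": [
"The title/skill merging step described under “Job title similarity problem” can be sketched as below. This is only a rough stand-in: `difflib.SequenceMatcher` substitutes for the lemmatization and word-similarity NLP the plan calls for, and the 0.8 threshold is an arbitrary illustrative value, not a tuned one."
]
},

```python
from difflib import SequenceMatcher

def canonicalize(titles, threshold=0.8):
    """Map each title to a canonical form.

    Lowercase every title, then fold any title whose string similarity to
    an already-kept canonical title meets the threshold into that title.
    """
    canonical = []
    mapping = {}
    for title in titles:
        t = title.strip().lower()
        for c in canonical:
            if SequenceMatcher(None, t, c).ratio() >= threshold:
                mapping[title] = c
                break
        else:
            canonical.append(t)
            mapping[title] = t
    return mapping

# "teamworking" folds into "teamwork" (ratio 16/19 ≈ 0.84).
print(canonicalize(["Teamwork", "teamworking", "Software Engineer"]))
```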
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
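{
"cell_type": "markdown",
"metadata": {},
"source": [
"The proposed feature matrix — one row per (person, position), skills as columns, tf = endorsement count, idf = total records over records containing the skill — can be sketched as follows. The profiles and numbers are illustrative, not taken from the real scrape, and we use log-scaled idf as is conventional for TF-IDF."
]
},

```python
import math

# Illustrative (person, position, {skill: endorsements}) rows.
rows = [
    ("alice", "Software Engineer", {"python": 5, "sql": 2, "management": 1}),
    ("bob",   "Software Engineer", {"java": 8, "sql": 3, "management": 2}),
    ("carol", "Manager",           {"management": 20, "leadership": 9}),
]

def build_matrix(rows):
    """One weight vector per (person, position): tf * log(N / df)."""
    n = len(rows)
    df = {}  # in how many rows does each skill appear?
    for _, _, skills in rows:
        for s in skills:
            df[s] = df.get(s, 0) + 1
    matrix = []
    for person, position, skills in rows:
        vec = {s: tf * math.log(float(n) / df[s]) for s, tf in skills.items()}
        matrix.append((person, position, vec))
    return matrix

matrix = build_matrix(rows)
# "management" appears in every row, so idf = log(3/3) = 0 and it
# contributes nothing — exactly the behaviour the model argues for.
print(round(matrix[0][2]["management"], 6))  # 0.0
```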
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}