@backupbrain
Last active February 10, 2024 07:07
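import re
import nltk
from nltk.corpus import stopwords
import pandas as pd

# Requires the NLTK "stopwords" corpus and the "punkt" tokenizer data,
# e.g. nltk.download("stopwords") and nltk.download("punkt").
stop_words = set(stopwords.words("english"))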
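def get_first_title(title):
    # Strip a "co-" / "co " prefix so that "Co-Founder" and "Co-CEO"
    # become "Founder" and "CEO" instead of being split at the hyphen.
    title = re.sub(r"\b[Cc]o[- ]", "", title)
    # Keep only the first title when several are joined by a separator.
    # \band\b avoids splitting inside words such as "Brand".
    split_titles = re.split(r",|-|\||&|:|/|\band\b", title)
    return split_titles[0].strip()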
def get_title_features(title):
    features = {}
    word_tokens = nltk.word_tokenize(title)
    # Compare in lowercase: the NLTK stop word list is all lowercase.
    filtered_words = [w for w in word_tokens if w.lower() not in stop_words]
    for word in filtered_words:
        features['contains({})'.format(word.lower())] = True
    if len(filtered_words) > 0:
        first_key = 'first({})'.format(filtered_words[0].lower())
        last_key = 'last({})'.format(filtered_words[-1].lower())
        features[first_key] = True
        features[last_key] = True
    return features
## Build feature sets
# raw_job_titles is not defined in this gist. The code below expects an
# iterable of dicts with the keys "title", "responsibility" and "department".
# Responsibilities
responsibilities_features = [
    (
        get_title_features(job_title["title"]),
        job_title["responsibility"]
    )
    for job_title in raw_job_titles
    if job_title["responsibility"] is not None
]
# Departments
departments_features = [
    (
        get_title_features(job_title["title"]),
        job_title["department"]
    )
    for job_title in raw_job_titles
    if job_title["department"] is not None
]
## Train classifier
# The data is split 50/50 in its original order (no shuffling):
# the first half is the test set, the second half is the training set.
# Responsibilities
r_size = int(len(responsibilities_features) * 0.5)
r_train_set = responsibilities_features[r_size:]
r_test_set = responsibilities_features[:r_size]
responsibilities_classifier = nltk.NaiveBayesClassifier.train(
    r_train_set
)
print("Responsibility classification accuracy: {}".format(
    nltk.classify.accuracy(
        responsibilities_classifier,
        r_test_set
    )
))
# Departments
d_size = int(len(departments_features) * 0.5)
d_train_set = departments_features[d_size:]
d_test_set = departments_features[:d_size]
departments_classifier = nltk.NaiveBayesClassifier.train(
    d_train_set
)
print("Department classification accuracy: {}".format(
    nltk.classify.accuracy(
        departments_classifier,
        d_test_set
    )
))
## Test Classifier
title = "Director of Communications"
responsibility = responsibilities_classifier.classify(
    get_title_features(title)
)
department = departments_classifier.classify(
    get_title_features(title)
)
print("Job title: '{}'".format(title))
print("Responsibility: '{}'".format(responsibility))
print("Department: '{}'".format(department))
## Grade Classifier
# Responsibility
responsibility_dist = responsibilities_classifier.prob_classify(
    get_title_features(title)
)
# Probability (as a percentage) of the most likely label
responsibility_probability = 100 * responsibility_dist.prob(
    responsibility_dist.max()
)
print("Responsibility confidence: {}%".format(
    round(responsibility_probability)
))
# Department
department_dist = departments_classifier.prob_classify(
    get_title_features(title)
)
department_probability = 100 * department_dist.prob(
    department_dist.max()
)
print("Department confidence: {}%".format(
    round(department_probability)
))
@danFromTelAviv

Hi guys,
I found a few nice websites and datasets. This dataset is pretty great:
https://www.kaggle.com/estasney/job-title-synonyms

I found that manually scraping a couple of websites should get you a pretty decent list and only takes half an hour...

@nush12

nush12 commented Jul 30, 2019

Thanks @danFromTelAviv this helps!

@m3ck0

m3ck0 commented Apr 28, 2020

Hello, the original post and the dataset by @danFromTelAviv helped a lot. Thank you very much!

From my side, I propose a JSON version of the dataset with small alterations; I hope it helps:

  1. raw titles to normalized title mapping
  2. normalized title to raw title mapping
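
The two mappings listed above are inverses of each other, so either one can be derived from the other. A minimal sketch of that inversion follows; the sample entries and the `invert_mapping` helper are hypothetical illustrations, not taken from the actual JSON files.

```python
from collections import defaultdict

# Hypothetical sample entries, NOT taken from the actual dataset.
raw_to_normalized = {
    "sr. software eng": "software engineer",
    "software developer": "software engineer",
    "vp of sales": "vice president of sales",
}


def invert_mapping(raw_to_normalized):
    """Group raw titles under the normalized title they map to."""
    normalized_to_raw = defaultdict(list)
    for raw, normalized in raw_to_normalized.items():
        normalized_to_raw[normalized].append(raw)
    return dict(normalized_to_raw)


normalized_to_raw = invert_mapping(raw_to_normalized)
```

Several raw titles can share one normalized title, which is why the inverted mapping stores a list per key.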

@asinghal3644

Hi Dan (@danFromTelAviv)! The dataset you shared for raw job titles is not of the form "Title", "Responsibility", "Department". How will this work exactly?

@danFromTelAviv

Hi @asinghal3644 - At least in my case, all I needed was the title. You could try using a language model such as fastText or BERT to find out which responsibility and department each job title is most closely related to.
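
One possible reading of this suggestion is: embed the job title and each candidate label, then pick the label whose vector is closest. The sketch below only shows that matching step. It swaps in a toy character-trigram embedding as a stand-in for fastText or BERT, and `trigram_vector`, `cosine`, and `closest_label` are hypothetical helper names. A real model would replace the `embed` argument.

```python
import math
from collections import Counter


def trigram_vector(text):
    """Toy embedding: counts of character trigrams (a stand-in for a real model)."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_label(title, labels, embed=trigram_vector):
    """Return the label whose embedding is most similar to the title's."""
    title_vec = embed(title)
    return max(labels, key=lambda label: cosine(title_vec, embed(label)))


# Candidate labels here are placeholders chosen for illustration.
department = closest_label(
    "Director of Communications",
    ["engineering", "communications", "finance"],
)
```

The trigram stand-in only captures surface overlap, so it cannot relate a title to a label that shares no characters with it; that is the gap a trained language model is meant to close.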

@vascorsilva

Any chance of making that dataset available?

raw_job_titles
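
The gist never defines `raw_job_titles`. Judging only from how the code indexes it, it appears to be an iterable of dicts with the keys `title`, `responsibility`, and `department`, where a missing label is `None`. The sketch below shows that assumed shape. The rows and label values are made-up placeholders, not the real dataset, and `has_expected_shape` is a hypothetical helper.

```python
# Made-up placeholder rows: the real dataset was never published in this thread.
raw_job_titles = [
    {"title": "Director of Communications", "responsibility": "Strategy", "department": "Marketing"},
    {"title": "Software Engineer", "responsibility": "Execution", "department": "Engineering"},
    # A missing label is None; the gist's list comprehensions skip such rows.
    {"title": "Co-Founder", "responsibility": "Strategy", "department": None},
]

EXPECTED_KEYS = {"title", "responsibility", "department"}


def has_expected_shape(rows):
    """Return True if every row is a dict carrying the three keys the gist reads."""
    return all(isinstance(row, dict) and EXPECTED_KEYS.issubset(row) for row in rows)


# Rows usable for the department classifier (label present).
rows_with_department = [r for r in raw_job_titles if r["department"] is not None]
```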

@rabsher

rabsher commented Apr 7, 2022

If anyone has figured out the dataset, or has built a dataset for this code, please share it with me as well. I would be very thankful.

@Abhirakshak

Abhirakshak commented Jul 16, 2022

I also need this dataset. If anyone has it, could you please send it to me?

@Samielsolh

Same here!

@nogur9

nogur9 commented Aug 2, 2022

I'm working on a similar problem and would like the dataset too.

@akashahmad427

Good work!
