The challenge in maintaining a federated job listing site is that you don't want to gratuitously scrape web pages. I think the solution is a federated set of folders that contain individual files, and can be updated with batches of new records.
The basic element should be batches of individual jobs. Probably easiest to distribute these as JSON records, but I think it's also worth validating with Arrow for ease of processing and to catch schema departures. (December '99).
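To make the folder-of-batches idea concrete, here is a minimal sketch of merging batches from several sources. The layout (one folder per source, each batch file holding a JSON array of records), the file names, and the choice of url as the deduplication key are all assumptions for illustration, not part of the proposal above.

```python
import json
import tempfile
from pathlib import Path


def load_batches(root):
    """Read every *.json batch under root and return records keyed by URL."""
    merged = {}
    # Sort paths so that later batches override earlier ones deterministically.
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            batch = json.load(f)  # assumption: each file is a list of records
        for record in batch:
            merged[record["url"]] = record
    return merged


# Tiny demonstration with two made-up sources that overlap on one ad.
with tempfile.TemporaryDirectory() as tmp:
    sources = {
        "source_a": [
            {"url": "https://example.edu/jobs/1", "title": "Assistant Professor"},
        ],
        "source_b": [
            {"url": "https://example.edu/jobs/1", "title": "Assistant Professor"},
            {"url": "https://example.org/jobs/2", "title": "Postdoctoral Fellow"},
        ],
    }
    for source, jobs in sources.items():
        folder = Path(tmp) / source
        folder.mkdir()
        (folder / "batch-001.json").write_text(json.dumps(jobs), encoding="utf-8")
    records = load_batches(tmp)
```

Keying on url means an ad that appears in two sources is stored once; whether the ad URL is stable enough to serve as an identifier is an open question.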
Proposed schema:
title: Ad title. (string)
url: URL pointed to by ad (institutional site). (string)
text: Full text of ad. (string)
discipline: Discipline/department. (list of strings)
department: Subfields/concentration. (list of strings)
source: Source for ad. URI of the root site, e.g., academicjobs.wikia.org. (string)
listing_date: Date position listed. (ISO-8601)
due_date: Date applications due. (ISO-8601)
start_date: Date of position start. (ISO-8601) (Rare in the job ads on H-Net, but critical for breaking up jobs by year. I've found that using June 1 as a cutoff works reasonably well, although for contingent work some schools will keep listing for the current academic year through December for the spring.)
institution: Name of institution. (string)
institution_url: There are so many goddamn ways to write "UC Berkeley" that it's always easier to work with a domain name.
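As a rough illustration of catching schema departures, the field list above could be checked per record like this. This sketch uses plain Python rather than Arrow, and the SCHEMA table, the validate helper, and the rule that any field may be missing or null are assumptions made for the example.

```python
from datetime import date

# Expected Python type for each core field in the proposed schema.
# `date` marks fields that should hold an ISO-8601 date string.
SCHEMA = {
    "title": str,
    "url": str,
    "text": str,
    "discipline": list,
    "department": list,
    "source": str,
    "listing_date": date,
    "due_date": date,
    "start_date": date,
    "institution": str,
    "institution_url": str,
}


def validate(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"unexpected field: {name}" for name in record if name not in SCHEMA]
    for name, expected in SCHEMA.items():
        value = record.get(name)
        if value is None:
            continue  # assumption: any field may be missing or null
        if expected is date:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{name}: not an ISO-8601 date: {value!r}")
        elif expected is list:
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                problems.append(f"{name}: expected a list of strings")
        elif not isinstance(value, str):
            problems.append(f"{name}: expected a string")
    return problems


good = {"title": "Assistant Professor of History",
        "url": "https://example.edu/jobs/1",
        "discipline": ["History"],
        "listing_date": "2021-09-15"}
bad = {"title": "Lecturer", "discipline": "History",
       "due_date": "09/15/2021", "salary": "competitive"}
```

Here validate(good) returns an empty list, while validate(bad) reports the unknown salary field, the non-list discipline, and the non-ISO due_date. Any further fields a batch is allowed to carry would need to be added to SCHEMA so they are not flagged as unexpected.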
Derived fields. Many of the above fields will be frequently empty, but fillable from title and text. Parsing tools to do this can be shared.
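One small derivation of this kind is the June 1 cutoff mentioned under start_date. The sketch below assumes the cutoff is applied to listing_date and that a hiring cycle is labeled by the calendar year in which it begins; both choices, and the function name hiring_cycle, are assumptions rather than anything fixed by the schema.

```python
from datetime import date


def hiring_cycle(listing_date, cutoff_month=6, cutoff_day=1):
    """Return the year in which the hiring cycle for an ad began.

    Ads listed on or after the cutoff (June 1 by default) count toward the
    cycle starting that year; earlier ads belong to the previous cycle.
    """
    listed = date.fromisoformat(listing_date)
    if (listed.month, listed.day) >= (cutoff_month, cutoff_day):
        return listed.year
    return listed.year - 1


# An ad listed in October 2020 and one listed in February 2021
# both fall in the cycle that began in 2020.
fall_ad = hiring_cycle("2020-10-15")
spring_ad = hiring_cycle("2021-02-01")
```

This does not handle the contingent-work case noted above, where spring positions for the current academic year are still being listed through December.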
There are also fields that will almost never be directly filled out, including:
job_title: Job title. Just what they say in the ad.
job_type: Meta-classification of job title into five types: Tenure Track, Fellowship, Non-Tenure Track Faculty, Professional/Administrator.
region: (zip code, maybe?)
Individual tasks that are relatively straightforward to perform on these upstream fields can be handled later. For instance, here's how I find tenure track jobs based on job title.
from collections import Counter

# Map each individual job-title label to one of the broad job types.
# Assumption: lookups starts out empty; only the update call is shown here.
lookups = {}
lookups.update({
    "Assistant Professor": "Tenure Track",
    "Associate Professor": "Tenure Track",
    "Full Professor": "Tenure Track",
    "Tenure Track Faculty": "Tenure Track",
    "Fellow": "Fellowship",
    "Post-Doctoral Fellow": "Fellowship",
    "Lecturer": "Non-Tenure Track Faculty",
    "Non-Tenure Track Faculty": "Non-Tenure Track Faculty",
    "Visiting Assistant Professor": "Non-Tenure Track Faculty",
    "Visiting Professor": "Non-Tenure Track Faculty",
    "Instructor": "Non-Tenure Track Faculty",
    "Research Professional": "Professional/Administrator",
    "Other Professional": "Professional/Administrator",
    "Doctoral Fellow": "Fellowship",
    "Director": "Professional/Administrator",
    "Curator": "Professional/Administrator",
    "Visiting Scholar": "Fellowship",
    "Other Teaching": "Non-Tenure Track Faculty",
    "Dean": "Professional/Administrator",
    "Librarian": "Professional/Administrator",
    "Administrator": "Professional/Administrator",
    "Editor": "Professional/Administrator",
    "Archivist": "Professional/Administrator",
    "Temporary": "Non-Tenure Track Faculty",
    "Continuing Faculty": "Non-Tenure Track Faculty",
    "Assistant Editor": "Professional/Administrator",
    # Department chairs count as Tenure Track here, although an
    # administrative reading is also plausible.
    "Department Chair": "Tenure Track",
})


def brand(el):
    """Classify a "--"-separated string of job-title labels into a job type."""
    counts = Counter()
    for portion in el.split("--"):
        # Labels with no entry in lookups are skipped rather than raising KeyError.
        if portion in lookups:
            counts[lookups[portion]] += 1
    weights = counts.most_common()
    # All recognized labels agree on a single type.
    if len(weights) == 1:
        return weights[0][0]
    # Otherwise fall back on a few hand-written tie-breaking rules.
    if "Professional" in el:
        return "Professional/Administrator"
    if "--Non-Tenure Track Faculty" in el and "--Tenure Track Faculty" not in el:
        return "Non-Tenure Track Faculty"
    if "Temporary" in el and ("Professor" in el or "Instructor" in el):
        return "Non-Tenure Track Faculty"
    return "Unknown"
Visualization strategies can be shared as well. Ryan uses Python. Ridolfo and Lindgren use D3 in WordPress. Schmidt mostly uses Altair in Python.