- Analysis of Presentations - Google Sheets: this is the original data set (aka raw speaker data or RSD).
- I did some refactoring of this to generate nanog-merge - Google Sheets (RAW tab), renaming a number of the fields along the way. Of note:
  - all fields are SINGLE_NO_SPACE names
  - there's a standalone NANOG DATE field
  - there's a standalone LOCATION field
  - normalized ORIGIN (effectively s/^found on//g)
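The renaming and ORIGIN cleanup can be sketched roughly as below. This is a minimal illustration, not the script that produced the RAW tab: the function names, the sample header, and the sample ORIGIN value are made up, and stripping whitespace or ignoring case around the "found on" prefix is an assumption that goes slightly beyond the plain s/^found on//g substitution.

```python
import re


def to_field_name(header: str) -> str:
    """Turn a free-form column header into a SINGLE_NO_SPACE style name."""
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then trim stray underscores from both ends and upper-case the result.
    return re.sub(r"[^A-Za-z0-9]+", "_", header.strip()).strip("_").upper()


def normalize_origin(origin: str) -> str:
    """Drop a leading 'found on' prefix from an ORIGIN value."""
    # Assumption: the prefix may vary in case and is followed by whitespace.
    return re.sub(r"^found on\s*", "", origin.strip(), flags=re.IGNORECASE)


# Hypothetical inputs, only to show the shape of the transformation.
print(to_field_name("Talk Type"))
print(normalize_origin("found on archive.nanog.org"))
```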
- NANOGs across the data sets are not uniform. There are effectively 3 sets of data:
  - RSD elements that do not exist in the scraped space
  - scraped speaker data (aka SSD), which is the result of scraping https://archive.nanog.org
  - overlapping elements: these exist in both data sets (RSD and SSD) and require merging
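One way to picture the three sets is to partition rows by NANOG meeting number. The sketch below assumes each row is a dict with a NANOG key and that "overlapping" means the meeting number appears in both data sets; the function name and sample rows are hypothetical.

```python
def partition_by_nanog(rsd_rows, ssd_rows):
    """Split rows into RSD-only, SSD-only, and overlapping groups."""
    rsd_meetings = {row["NANOG"] for row in rsd_rows}
    ssd_meetings = {row["NANOG"] for row in ssd_rows}
    shared = rsd_meetings & ssd_meetings  # meetings present in both sets

    rsd_only = [r for r in rsd_rows if r["NANOG"] not in shared]
    ssd_only = [r for r in ssd_rows if r["NANOG"] not in shared]
    # Overlapping rows are kept per source so they can be merged later.
    overlap_rsd = [r for r in rsd_rows if r["NANOG"] in shared]
    overlap_ssd = [r for r in ssd_rows if r["NANOG"] in shared]
    return rsd_only, ssd_only, (overlap_rsd, overlap_ssd)


# Made-up rows, for illustration only.
rsd = [{"NANOG": 10, "TITLE": "a"}, {"NANOG": 50, "TITLE": "b"}]
ssd = [{"NANOG": 50, "TITLE": "b"}, {"NANOG": 60, "TITLE": "c"}]
rsd_only, ssd_only, (overlap_rsd, overlap_ssd) = partition_by_nanog(rsd, ssd)
print(len(rsd_only), len(ssd_only), len(overlap_rsd), len(overlap_ssd))
```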
- the standalone data sets were merged with all of their respective fields intact
- for the overlapping elements, the SSD data was overlaid on the RSD data. An exact regex match on the NANOG, SPEAKER, and TITLE fields was attempted first; if the exact match failed, a fuzzy match was executed on the SPEAKER and TITLE fields. Anything that didn't have a match on these fields was set aside as "unmatched".
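The exact-then-fuzzy flow can be sketched as follows. This is not the code used for the merge: it swaps the exact regex match for a normalized string comparison, uses difflib.SequenceMatcher from the standard library as the fuzzy matcher, and the 0.85 threshold, the function names, and the rule that only non-empty SSD values overwrite RSD values are all assumptions.

```python
from difflib import SequenceMatcher


def norm(text) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(text).lower().split())


def similarity(a, b) -> float:
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def find_match(ssd_row, rsd_rows, threshold=0.85):
    """Return (rsd_row, status) where status is exact, fuzzy, or unmatched."""
    # Pass 1: exact match on NANOG, SPEAKER, and TITLE.
    for rsd_row in rsd_rows:
        if all(norm(ssd_row[f]) == norm(rsd_row[f])
               for f in ("NANOG", "SPEAKER", "TITLE")):
            return rsd_row, "exact"
    # Pass 2: fuzzy match on SPEAKER and TITLE; both must clear the threshold.
    best, best_score = None, 0.0
    for rsd_row in rsd_rows:
        score = min(similarity(ssd_row["SPEAKER"], rsd_row["SPEAKER"]),
                    similarity(ssd_row["TITLE"], rsd_row["TITLE"]))
        if score > best_score:
            best, best_score = rsd_row, score
    if best is not None and best_score >= threshold:
        return best, "fuzzy"
    return None, "unmatched"


def overlay(rsd_row, ssd_row):
    """Lay SSD values over the RSD row, skipping empty SSD values."""
    merged = dict(rsd_row)
    merged.update({k: v for k, v in ssd_row.items() if v not in ("", None)})
    return merged


# Made-up rows, for illustration only.
rsd = [{"NANOG": 50, "SPEAKER": "Jane Doe", "TITLE": "BGP Route Leaks",
        "AFFILIATION": "Example Networks"}]
scraped = {"NANOG": 50, "SPEAKER": "Jane Doe", "TITLE": "BGP Route Leak",
           "AFFILIATION": ""}
match, status = find_match(scraped, rsd)
print(status, overlay(match, scraped) if match else None)
```

Rows that come back as unmatched would be collected separately rather than merged.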
- 20220821-merge - Google Sheets
  - 20220821-merged-entries tab: this is effectively the superset of content that we have across the original (aka raw) data set as well as the scraped elements (up to NANOG 70)
  - 20220821-unmatched entries tab: this is what was scraped but had no reasonable fuzzy match in the original data set. Random spot checks seem to indicate that these items simply do not exist in the RSD data set.
- providing a lookup for location and date for the scraped data seems like a reasonable thing to add
- there's more that can be filtered out in the scraped data sets
- scraping of the data from NANOG 71-76 should be undertaken to provide additional coverage
- we should probably come up with a consistent set of AFFILIATION values for popular companies to make the data set a little cleaner
- for SSD it might be useful to infer the TALK_TYPE from the title in some cases
- for SSD it might also be useful to correlate the number of speakers in the scrape with the number of presentations and split these on a common index. This would seem to do the right thing in the majority of cases.
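Two of the ideas above are easy to prototype. The sketch below is only a guess at what they could look like: the alias table holds placeholder names rather than real data, and it assumes a scraped entry stores multiple speakers and titles as ";"-separated strings, which may not be how the scrape actually represents them.

```python
# Hypothetical alias table: keys and values are placeholders, not real data.
AFFILIATION_ALIASES = {
    "example networks inc": "Example Networks",
    "example networks": "Example Networks",
}


def canonical_affiliation(name: str) -> str:
    """Map known spelling variants of an AFFILIATION to one canonical value."""
    # Ignore case, commas, periods, and repeated whitespace when looking up.
    key = " ".join(name.lower().replace(",", " ").replace(".", " ").split())
    return AFFILIATION_ALIASES.get(key, name.strip())


def split_on_common_index(entry: dict, sep: str = ";") -> list:
    """Split one scraped entry into one row per (speaker, title) pair."""
    speakers = [s.strip() for s in entry["SPEAKER"].split(sep) if s.strip()]
    titles = [t.strip() for t in entry["TITLE"].split(sep) if t.strip()]
    # Only split when the counts line up; otherwise leave the entry alone.
    if len(speakers) > 1 and len(speakers) == len(titles):
        return [{**entry, "SPEAKER": s, "TITLE": t}
                for s, t in zip(speakers, titles)]
    return [entry]


# Made-up entry, for illustration only.
entry = {"NANOG": 60, "SPEAKER": "A One; B Two",
         "TITLE": "First Talk; Second Talk"}
for row in split_on_common_index(entry):
    print(row["SPEAKER"], "->", row["TITLE"])
print(canonical_affiliation("Example Networks, Inc."))
```

Entries where the counts differ are returned unchanged, so the cases this heuristic cannot handle stay visible for manual review.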
fields in order:
- NANOG (int) - converted in merge
- DATE (date) - not converted in merge
- LOCATION (string)
- TALK_ORDER (int) - not converted in merge
- SPEAKER (string)
- AFFILIATION (string)
- TITLE (string)
- TALK_TYPE (string)
- YOUTUBE (string)
- PRESO_FILES (string)
- DURATION_MIN (int) - not converted in merge
- TAGS (string)
- KEYWORDS (string)
- ORIGIN (string)
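For the fields marked "not converted in merge", a typed view of a row could be built along these lines. This is a sketch under assumptions: it treats DATE as an ISO YYYY-MM-DD string (the actual format in the sheet is not stated here), maps blank cells to None, and the sample row values are made up.

```python
from datetime import date

# Field order and intended types, following the list above.
FIELD_TYPES = {
    "NANOG": int,
    "DATE": date.fromisoformat,  # assumption: ISO formatted dates
    "LOCATION": str,
    "TALK_ORDER": int,
    "SPEAKER": str,
    "AFFILIATION": str,
    "TITLE": str,
    "TALK_TYPE": str,
    "YOUTUBE": str,
    "PRESO_FILES": str,
    "DURATION_MIN": int,
    "TAGS": str,
    "KEYWORDS": str,
    "ORIGIN": str,
}


def convert_row(row: dict) -> dict:
    """Return the row with every field converted to its intended type."""
    typed = {}
    for name, convert in FIELD_TYPES.items():
        raw = row.get(name)
        text = "" if raw is None else str(raw).strip()
        # Missing or blank cells become None instead of raising.
        typed[name] = convert(text) if text else None
    return typed


# Made-up values, for illustration only.
sample = {"NANOG": "70", "DATE": "2017-06-05", "TALK_ORDER": "3",
          "SPEAKER": "Jane Doe", "DURATION_MIN": ""}
typed = convert_row(sample)
print(typed["NANOG"], typed["DATE"], typed["DURATION_MIN"])
```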