20220821-NANOG-agenda merge-notes.md

NANOG data analysis fuzzy matching

the NANOG meetings covered by each data set are not uniform. there are effectively 3 sets of data:
- RSD (the original, aka raw, data set) elements that do not exist in the scraped data
- scraped speaker data (aka SSD), which is the result of scraping https://archive.nanog.org
- overlapping elements: these exist in both RSD and SSD and require merging

the standalone data sets were merged with all of their respective fields intact.
for the overlapping elements, the SSD data was overlaid on the RSD data. an exact regex match on the NANOG, SPEAKER, and TITLE fields was attempted first; if the exact match failed, a fuzzy match was run on the SPEAKER and TITLE fields. anything that didn't match on these fields was set aside as "unmatched".
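
a rough sketch of the two-pass match described above, for reference. this is not the script that produced the sheets: it swaps the regex exact match for a normalized string comparison, uses the standard library's `difflib` for the fuzzy pass, and the `FUZZY_THRESHOLD` value, the helper names, and the dict-per-row layout are all assumptions.

```python
from difflib import SequenceMatcher

# Assumed cutoff; the notes do not record the threshold that was actually used.
FUZZY_THRESHOLD = 0.85


def norm(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def find_match(ssd_row, rsd_rows):
    """Return (index into rsd_rows, "exact" | "fuzzy") or (None, "unmatched")."""
    # Pass 1: exact match on NANOG, SPEAKER and TITLE (after normalization).
    fields = ("NANOG", "SPEAKER", "TITLE")
    key = tuple(norm(ssd_row[f]) for f in fields)
    for i, rsd in enumerate(rsd_rows):
        if tuple(norm(rsd[f]) for f in fields) == key:
            return i, "exact"

    # Pass 2: fuzzy match on SPEAKER and TITLE only; both must clear the cutoff.
    best_i, best_score = None, 0.0
    for i, rsd in enumerate(rsd_rows):
        score = min(
            similarity(ssd_row["SPEAKER"], rsd["SPEAKER"]),
            similarity(ssd_row["TITLE"], rsd["TITLE"]),
        )
        if score > best_score:
            best_i, best_score = i, score
    if best_score >= FUZZY_THRESHOLD:
        return best_i, "fuzzy"
    return None, "unmatched"


def merge(rsd_rows, ssd_rows):
    """Overlay non-empty SSD values on the matching RSD row; collect the rest."""
    merged = [dict(r) for r in rsd_rows]
    unmatched = []
    for ssd in ssd_rows:
        i, _kind = find_match(ssd, rsd_rows)
        if i is None:
            unmatched.append(ssd)
        else:
            merged[i].update({k: v for k, v in ssd.items() if v})
    return merged, unmatched
```

taking the lower of the two field scores means a near-identical title cannot rescue a clearly different speaker (or the reverse), which keeps the fuzzy pass conservative.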

20220821-merge - Google Sheets
- 20220821-merged-entries tab: this is effectively the superset of content that we have across the original (aka raw) dataset as well as the scraped elements (up to NANOG 70)
- 20220821-unmatched entries tab: this is what was scraped but had no reasonable fuzzy match in the original data set. random spot checks seem to indicate that these items simply do not exist in the RSD dataset.

- providing a lookup for location and date for the scraped data seems like a reasonable thing to add
- there's more that can be filtered out in the scraped data sets
- scraping of the data from NANOG 71-76 should be undertaken to provide additional coverage
- we should probably come up with a consistent set of AFFILIATION values for popular companies to make the data set a little cleaner
- for SSD it might be useful to infer the TALK_TYPE from the title in some cases.
- for SSD it might also be useful to correlate the number of speakers in the scrape with the number of presentations and split these on a common index. this would seem to do the right thing in the majority of cases.
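
a possible shape for the speaker/presentation split, as a sketch only. it assumes a scraped row packs multiple speakers and multiple titles into the SPEAKER and TITLE fields with a `;` separator; that separator and the `split_entry` name are hypothetical, since the actual SSD layout is not recorded here.

```python
def split_entry(entry, sep=";"):
    """Split one scraped row into one row per (speaker, title) pair.

    The split only happens when the speaker count equals the title count
    and there is more than one of each; otherwise the row is returned as-is.
    """
    speakers = [s.strip() for s in entry["SPEAKER"].split(sep) if s.strip()]
    titles = [t.strip() for t in entry["TITLE"].split(sep) if t.strip()]
    if len(speakers) == len(titles) and len(speakers) > 1:
        # Pair speaker i with title i (the "common index"); keep other fields.
        return [{**entry, "SPEAKER": s, "TITLE": t} for s, t in zip(speakers, titles)]
    return [entry]


rows = split_entry({
    "NANOG": "45",
    "SPEAKER": "Jane Doe; John Smith",
    "TITLE": "BGP Route Leaks; IPv6 Deployment Update",
})
```

rows where the counts differ (for example two speakers sharing one talk) are left untouched, so they can still be reviewed by hand.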

fields in order: