@eliasdabbas
Last active April 13, 2023 00:44
import advertools as adv
import adviz
import pandas as pd

# get URLs of the sitemap index (recursive=False: do not follow sub-sitemaps)
nyt = adv.sitemap_to_df('https://nytimes.com/robots.txt', recursive=False)
# get URLs of the /sitemap.xml.gz sitemap index
nyt_sitemap_index = adv.sitemap_to_df('https://www.nytimes.com/sitemaps/new/sitemap.xml.gz', recursive=False)

nyt_2022 = []
errors = []
# keep only the sitemaps with '2022' in their URL
nyt_2022_urls = nyt_sitemap_index[nyt_sitemap_index['loc'].str.contains('2022')]['loc']
# download each 2022 sitemap; record failures instead of stopping the loop
for sitemapurl in nyt_2022_urls:
    try:
        tempdf = adv.sitemap_to_df(sitemapurl)
        nyt_2022.append(tempdf)
    except Exception as e:
        errors.append((sitemapurl, str(e)))
nyt22 = pd.concat(nyt_2022, ignore_index=True)

# create chart (remove dates in URLs /YYYY/MM/DD to get a better topic overview)
fig = adviz.url_structure(
    nyt22['loc'].str.replace(r'/2022/\d\d/\d\d', '', regex=True),
    items_per_level=30,
    theme='seaborn',
    height=750,
    title='<b>NYTimes.com</b> - 2022 (52,304 URLs)',
    domain='nytimes.com')
fig.layout.margin.l = 0
fig.layout.margin.r = 0
fig.layout.margin.b = 0
fig.layout.margin.t = 100
fig
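
To see what the date-stripping step does before the URLs are charted, here is a minimal sketch that applies the same pattern with the standard-library `re` module instead of pandas. The three URLs are taken from the sitemap sample in this gist; the names `DATE_SEGMENT` and `stripped` are only illustrative and are not part of the script.

```python
import re

# Same pattern as in the script: a literal "/2022" followed by "/MM/DD".
DATE_SEGMENT = re.compile(r'/2022/\d\d/\d\d')

# Three URLs from the sitemap sample, including one under /es/.
urls = [
    'https://www.nytimes.com/2022/03/28/us/politics/budget-biden-politics.html',
    'https://www.nytimes.com/es/2022/04/27/espanol/guerra-rusia-ucrania.html',
    'https://www.nytimes.com/2022/01/11/dining/miami-bakeries.html',
]

# Remove the date segment so that only the topical folders remain.
stripped = [DATE_SEGMENT.sub('', url) for url in urls]

for url in stripped:
    print(url)
# https://www.nytimes.com/us/politics/budget-biden-politics.html
# https://www.nytimes.com/es/espanol/guerra-rusia-ucrania.html
# https://www.nytimes.com/dining/miami-bakeries.html
```

Because the pattern is not anchored to the start of the path, it also removes the date from URLs where a language folder such as `/es/` comes first.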

Sitemap URL sample

| | loc | lastmod | sitemap | etag | sitemap_last_modified | sitemap_size_mb | download_date |
|---|---|---|---|---|---|---|---|
| 17808 | https://www.nytimes.com/2022/03/28/us/politics/budget-biden-politics.html | 2022-03-28 22:44:09+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-03.xml.gz | "a4c28cb77e5eb4bc4a6b73556bb62a20" | 2023-04-12 18:41:07 | 0.718978 | 2023-04-12 23:52:10.667494+00:00 |
| 48636 | https://www.nytimes.com/2022/01/31/todayspaper/quotation-of-the-day-on-star-filled-nets-rookies-not-serving-as-just-understudies.html | 2022-01-31 05:47:27+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-01.xml.gz | "3cfc47f05a5831e7611e37165d6017c3" | 2023-03-30 14:43:44 | 0.624403 | 2023-04-12 23:52:18.760158+00:00 |
| 50350 | https://www.nytimes.com/2022/01/18/arts/design/american-lgbtq-museum-garcia-director.html | 2022-01-18 17:00:06+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-01.xml.gz | "3cfc47f05a5831e7611e37165d6017c3" | 2023-03-30 14:43:44 | 0.624403 | 2023-04-12 23:52:18.760158+00:00 |
| 33842 | https://www.nytimes.com/2022/12/02/nyregion/trump-organization-trial-tax-fraud.html | 2022-12-06 21:11:33+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-12.xml.gz | "17617e0785e82a4fda164b84cf52ba3d" | 2023-04-11 20:08:22 | 0.579037 | 2023-04-12 23:52:13.780930+00:00 |
| 9341 | https://www.nytimes.com/es/2022/04/27/espanol/guerra-rusia-ucrania.html | 2022-04-28 00:26:55+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-04.xml.gz | "0b8d25c59d9f2e075aa330443e27df96" | 2023-04-12 20:08:54 | 0.651532 | 2023-04-12 23:52:08.431692+00:00 |
| 46072 | https://www.nytimes.com/2022/09/16/sports/basketball/wnba-charter-flights-finals.html | 2022-09-17 00:03:57+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-09.xml.gz | "e590c6d126192a67ef3bd734da29d706" | 2023-04-05 19:04:42 | 0.682809 | 2023-04-12 23:52:17.893848+00:00 |
| 34337 | https://www.nytimes.com/2022/12/01/well/holiday-heart-health-risks-drinking.html | 2022-12-02 14:39:11+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-12.xml.gz | "17617e0785e82a4fda164b84cf52ba3d" | 2023-04-11 20:08:22 | 0.579037 | 2023-04-12 23:52:13.780930+00:00 |
| 49036 | https://www.nytimes.com/2022/01/27/us/politics/breyer-resignation-letter-scotus.html | 2022-01-27 17:23:18+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-01.xml.gz | "3cfc47f05a5831e7611e37165d6017c3" | 2023-03-30 14:43:44 | 0.624403 | 2023-04-12 23:52:18.760158+00:00 |
| 11612 | https://www.nytimes.com/2022/04/10/us/politics/a-pandemic-rule-is-among-the-trump-era-immigration-policies-that-has-divided-the-white-house.html | 2022-04-11 00:45:03+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-04.xml.gz | "0b8d25c59d9f2e075aa330443e27df96" | 2023-04-12 20:08:54 | 0.651532 | 2023-04-12 23:52:08.431692+00:00 |
| 51210 | https://www.nytimes.com/2022/01/11/dining/miami-bakeries.html | 2022-01-11 17:22:57+00:00 | https://www.nytimes.com/sitemaps/new/sitemap-2022-01.xml.gz | "3cfc47f05a5831e7611e37165d6017c3" | 2023-03-30 14:43:44 | 0.624403 | 2023-04-12 23:52:18.760158+00:00 |
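
As a rough illustration of the kind of breakdown a URL structure chart summarizes, the sketch below counts the first folder of a few sample URLs once the date segment is removed. It uses only the standard library and is not how `adviz.url_structure` works internally; the helper `path_levels` and the variable names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Five URLs copied from the sample above.
sample_urls = [
    'https://www.nytimes.com/2022/03/28/us/politics/budget-biden-politics.html',
    'https://www.nytimes.com/es/2022/04/27/espanol/guerra-rusia-ucrania.html',
    'https://www.nytimes.com/2022/09/16/sports/basketball/wnba-charter-flights-finals.html',
    'https://www.nytimes.com/2022/01/27/us/politics/breyer-resignation-letter-scotus.html',
    'https://www.nytimes.com/2022/01/11/dining/miami-bakeries.html',
]


def path_levels(url):
    """Return the non-empty path segments of a URL with /2022/MM/DD removed."""
    path = re.sub(r'/2022/\d\d/\d\d', '', urlsplit(url).path)
    return [segment for segment in path.split('/') if segment]


# Count how many sample URLs fall under each first-level folder.
first_level = Counter(path_levels(url)[0] for url in sample_urls)

for folder, count in first_level.most_common():
    print(folder, count)
```

On the full `nyt22['loc']` column the same idea could be applied per level, which is roughly the topic overview the chart is meant to give.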
