@eliasdabbas
Created August 12, 2023 11:31
Get all meta tags of a selected URL (every tag under the <head> section of the page)
import requests
from bs4 import BeautifulSoup
import pandas as pd


def meta_tags(url, get_text=['title']):
    """Get all tags under the <head> of `url` with all attributes and values.

    This is mainly for exploratory purposes, to discover what is available
    and whether there are errors. If you know beforehand which tags/attributes
    you want, you can easily get them with custom extraction (CSS/XPath selectors).

    Parameters
    ----------
    url : str
      A URL to get meta tags from.
    get_text : list
      A list of elements from which to extract the full text. Possible options are:
      'title', 'script', and 'style'. Defaults to ['title'].

    Returns
    -------
    metatags_df : pandas.DataFrame
      A DataFrame of all available attributes and their values. You can filter by
      the column `element`.
    """
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'lxml')
    d = []
    # Each direct child of <head> becomes one row: tag name plus its attributes.
    for child in soup.head.children:
        try:
            tempd = {}
            tempd['element'] = child.name
            if child.name in get_text:
                tempd['text'] = child.get_text()
            tempd.update(child.attrs)
            d.append(tempd)
        except Exception:
            # Non-tag children (e.g. whitespace between tags) have no attrs; skip them.
            continue
    df = pd.DataFrame(d).sort_values('element').reset_index(drop=True)
    return df
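
To see what the resulting DataFrame looks like without making a network request, here is a minimal sketch that runs the same loop on an inline HTML string and then filters by the `element` column, as the docstring suggests. The helper name `head_tags_from_html`, the sample HTML, and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made for illustration; they are not part of the original function.

```python
import pandas as pd
from bs4 import BeautifulSoup
from bs4.element import Tag


def head_tags_from_html(html, get_text=("title",)):
    """Hypothetical offline variant of meta_tags: parses an HTML string
    instead of fetching a URL (assumption, for illustration only)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for child in soup.head.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace strings and comments between tags
        row = {"element": child.name}
        if child.name in get_text:
            row["text"] = child.get_text()
        row.update(child.attrs)
        rows.append(row)
    return pd.DataFrame(rows).sort_values("element").reset_index(drop=True)


# Made-up sample page, used only to demonstrate the output shape.
sample_html = """
<html><head>
<title>Example page</title>
<meta charset="utf-8">
<meta name="description" content="A sample description">
<meta property="og:title" content="Example page">
<link rel="canonical" href="https://example.com/">
</head><body></body></html>
"""

df = head_tags_from_html(sample_html)

# Keep only <meta> rows and drop columns that are empty for them.
meta_only = df[df["element"] == "meta"].dropna(axis=1, how="all")

# Pull a single value: the content of <meta name="description">.
description = df.loc[df["name"] == "description", "content"].iloc[0]
```

Because every attribute becomes its own column, rows for tags that lack a given attribute hold NaN there; dropping all-NaN columns after filtering keeps the view readable.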