Last active
June 6, 2021 22:08
Scrape a YouTube video page by going through the consent screen
from requests import Session
from bs4 import BeautifulSoup

UA = (
    "Mozilla/5.0 (Linux; cli) pyrequests/0.1 "
    "(python, like Gecko, like KHTML, like wget, like CURL) myscrapper/1.0"
)

# One shared session, so cookies set by the consent step are kept.
req = Session()
req.headers.update({"User-Agent": UA})


def get_page_source(url):
    r = req.get(url).text
    # Video pages contain "itemprop"; the consent page does not.
    if "itemprop" in r:
        return r
    # Otherwise, collect the consent form's input fields and submit them.
    post_builder = {}
    soup = BeautifulSoup(r, 'html.parser')
    for i in soup.find_all("input"):
        try:
            post_builder.update({i["name"]: i["value"]})
        except KeyError:
            # Skip inputs that lack a name or a value.
            continue
    return req.post("https://consent.youtube.com/s", data=post_builder).text


print(get_page_source("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
It used to be that setting an appropriate User-Agent was enough to scrape a YouTube page, but now it seems you are always greeted with a page full of legal cookie nonsense. My thing here seems to get through that for now. The `itemprop` part is just an attribute that appears in video pages but not on the consent page.
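To see the two steps without hitting the network, here is a rough offline sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the sample markup, `InputCollector`, `collect_form_fields`, and `looks_like_video_page` are all made up for illustration; the real consent form's fields may well differ.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Like the KeyError skip in the gist: ignore inputs missing name or value.
        if "name" in attributes and "value" in attributes:
            self.fields[attributes["name"]] = attributes["value"] or ""


def collect_form_fields(html):
    """Return a dict of the form data that would be posted back."""
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


def looks_like_video_page(html):
    # Same heuristic as the gist: "itemprop" shows up on video pages only.
    return "itemprop" in html


# Invented markup, only shaped like a consent form.
SAMPLE_CONSENT_HTML = """
<form action="https://consent.youtube.com/s" method="POST">
  <input type="hidden" name="continue" value="https://www.youtube.com/watch?v=dQw4w9WgXcQ">
  <input type="hidden" name="gl" value="DE">
  <input type="submit" value="I agree">
  <input type="hidden" name="orphan">
</form>
"""

if __name__ == "__main__":
    print(looks_like_video_page(SAMPLE_CONSENT_HTML))
    print(collect_form_fields(SAMPLE_CONSENT_HTML))
```

The submit button (no `name`) and the `orphan` field (no `value`) are dropped, which is the same effect the `try`/`except KeyError` has in the gist.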