#
# This small example shows how to access JS-based requests via Selenium.
# This way, one can access raw data for scraping,
# for example on many JS-intensive/React-based websites.
#
import json
from time import sleep

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

# make chrome log requests
capabilities = DesiredCapabilities.CHROME
capabilities["loggingPrefs"] = {"performance": "ALL"}  # newer: goog:loggingPrefs
driver = webdriver.Chrome(
    desired_capabilities=capabilities, executable_path="./chromedriver"
)

# fetch a site that does xhr requests
driver.get("https://sitewithajaxorsomething.com")
sleep(5)  # wait for the requests to take place

# extract requests from logs
logs_raw = driver.get_log("performance")
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]


def log_filter(log_):
    return (
        # is an actual response
        log_["method"] == "Network.responseReceived"
        # and json
        and "json" in log_["params"]["response"]["mimeType"]
    )


for log in filter(log_filter, logs):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    print(f"Caught {resp_url}")
    print(driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}))
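The fixed sleep(5) in the script is a guess at how long the page needs. As a rough sketch, the wait could instead poll the performance log until a matching event shows up. The helper name collect_events and its parameters are hypothetical, and the sketch assumes that each call to driver.get_log("performance") returns only entries that were not returned before, so the entries are accumulated across calls.

```python
import json
import time
from typing import Callable, Dict, List


def collect_events(
    get_log: Callable[[], List[Dict]],
    is_done: Callable[[List[Dict]], bool],
    timeout: float = 10.0,
    interval: float = 0.5,
) -> List[Dict]:
    """Poll raw performance-log entries until is_done(events) is true or the timeout expires."""
    events: List[Dict] = []
    deadline = time.monotonic() + timeout
    while True:
        # Accumulate parsed events (assumption: every call yields only new entries).
        events.extend(json.loads(entry["message"])["message"] for entry in get_log())
        if is_done(events) or time.monotonic() >= deadline:
            return events
        time.sleep(interval)
```

With the gist's driver and log_filter, a call might look like collect_events(lambda: driver.get_log("performance"), lambda evs: any(log_filter(e) for e in evs)). The returned list then takes the place of logs in the loop above.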
Thanks for the kindness everyone. Glad I could help you out. Please feel free to check out my profile with similar tools and libraries at https://github.com/lorey <3
Awesome!!
How can I get XHR requests from a real browser online?
Selenium is using a real browser. If you want to do it manually yourself, check out developer tools (e.g. F12 in Chrome, tab "Network").
@lorey, thanks for sharing.
For Chrome >=75 we have to make a small change.
As specified in the release notes for ChromeDriver 75.0.3770.8, the capability loggingPrefs has been renamed to goog:loggingPrefs.
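If a script has to run against both old and new ChromeDriver builds, the capability name could be picked from the major version. This is only a sketch: the helper name logging_prefs_capability is made up, and the cut-off at version 75 is taken from the comment above rather than checked independently.

```python
def logging_prefs_capability(chromedriver_major: int) -> dict:
    """Return the performance-logging capability under the name the given ChromeDriver expects."""
    # Assumption from the comment above: the rename happened in ChromeDriver 75.
    key = "goog:loggingPrefs" if chromedriver_major >= 75 else "loggingPrefs"
    return {key: {"performance": "ALL"}}
```

The resulting dict can be merged into the capabilities in the gist, for example capabilities.update(logging_prefs_capability(75)).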
Hello, thanks for sharing this gist. Your code is working fine; I just have one little issue and can't get my head around it.
What I'm trying to log is an XHR call made by a web worker,
so getting the performance log on the main thread doesn't list the request I want.
In Chrome, when I select the worker in the Console tab, I can execute "performance.getEntries()"; only then can I get the request I want.
Any idea how to do that in Selenium?
I used this method for a while. After some time during a script run, and without a clear reason, the "driver.execute_cdp_cmd" function throws an error:
'WebDriver' object has no attribute 'execute_cdp_cmd'
I'm looking for an alternative solution; feel free to suggest what could be done...
Hey @milanbog92, how about:
- https://pypi.org/project/mitmproxy/ to catch requests
- a regular browser (e.g. by hotkeys) or maybe playwright with some adaptations to be undetectable
@lorey Thanks for the fast response!
Since I am executing my "python3 script.py" from an external script, it seems that my system had loaded the wrong Python version. I have seen that python3.6 shows the error consistently, while python3.9 works as expected. Hopefully this will help someone...
I went through all the solutions available, and I believe there is no better one: Selenium can't load a Chrome extension that uses the chrome.debugger API, and I have had no luck with hotkeys so far in my complex environment.
@lorey, thanks for your fantastic work, just one more thing.
Is there a way to get only the response data from a specific URL?
Hi,
I use performance_logs instead of logs_raw as the variable name, skip "chrome://favicon2" and search for image_name:
performance_logs = driver.get_log("performance")
for performance_log in performance_logs:
    performance_log_json = json.loads(performance_log["message"])
    if performance_log_json["message"]["method"] == 'Network.responseReceived':
        if performance_log_json["message"]["params"]["response"]["url"].find('chrome://favicon2/') != -1:
            continue
        if performance_log_json["message"]["params"]["response"]["url"].find(image_name) != -1:
            print(performance_log_json["message"]["params"]["response"]["url"])
            print(performance_log_json["message"]["params"]["requestId"])
            print(performance_log_json["message"]["params"]["type"])
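The same idea can be packed into a small function that returns the matches instead of printing them, which makes it easier to pass the request IDs on to Network.getResponseBody. This is a sketch only; find_responses and its parameters are names invented here, and it works on the raw entries as returned by driver.get_log("performance").

```python
import json
from typing import Dict, Iterable, List


def find_responses(raw_entries: Iterable[Dict], url_fragment: str) -> List[Dict]:
    """Return url, request id and resource type of received responses whose URL contains url_fragment."""
    matches = []
    for entry in raw_entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event["params"]
        url = params["response"]["url"]
        # Skip internal browser pages such as chrome://favicon2/.
        if url.startswith("chrome://"):
            continue
        if url_fragment in url:
            matches.append(
                {"url": url, "request_id": params["requestId"], "type": params.get("type")}
            )
    return matches
```

For example, find_responses(driver.get_log("performance"), image_name) would yield the entries that the loop above prints.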
Desired Capabilities is deprecated and can't be used anymore, how can I achieve this without it?
@LiamKrenn take a look at this, I haven't tried but hopefully it works https://stackoverflow.com/questions/76622916/converting-desired-capabilities-to-options-in-selenium-python
Hello guys, for Selenium 4.x use this:
driver.options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver.get(url)
then just follow the steps from line 24+
Works for Selenium 4.13.0
For Selenium 4.15, set the option:
options = webdriver.ChromeOptions()
options.set_capability(
    "goog:loggingPrefs", {"performance": "ALL"}
)
driver = webdriver.Chrome(options=options)
Something I noticed is that you need to filter out Preflight requests.
if event['params']['type'] != 'Preflight':
    . . .
Otherwise, you might get this error:
{"code":-32000,"message":"No resource with given identifier found"}
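As a sketch of how that check could fit into the gist's flow, the function below collects only the request IDs worth passing to Network.getResponseBody. The name fetchable_request_ids is hypothetical; it expects the already parsed events (the logs list from the gist).

```python
from typing import Dict, Iterable, List


def fetchable_request_ids(events: Iterable[Dict]) -> List[str]:
    """Return request ids of received responses, leaving out CORS preflight requests."""
    request_ids = []
    for event in events:
        if event.get("method") != "Network.responseReceived":
            continue
        params = event["params"]
        # Preflight (OPTIONS) requests are skipped, as suggested in the comment above.
        if params.get("type") == "Preflight":
            continue
        request_ids.append(params["requestId"])
    return request_ids
```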
Hello,
Is there a way to use this with Selenium Grid (remote)?
With Selenium Grid I can catch the request but never the response.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-ssl-errors=yes')
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option('w3c', True)
# Try to catch XHR response
options.set_capability(
    "goog:loggingPrefs", {"performance": "ALL"}
)
driver = webdriver.Remote(
    command_executor='http://'+GRID_HOST+'/wd/hub',
    options=options,
)
driver.get("https://www.mywebsite.com/")
searchbox = driver.find_element(By.ID, "searchbox")
searchbox.send_keys("type something in the searchbox")
logs_raw = driver.get_log("performance")
requests = [json.loads(lr["message"])["message"] for lr in logs_raw]  # parsed as in the gist
for log in filter(log_filter, requests):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    if 'aj_recherche' in resp_url:
        response = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
I can then log every request associated with the searchbox (AJAX),
but never the JSON response (I can find it in the Chrome dev console).
Any idea?
@ofostier Yes! It took me some time to figure out, but starting from Selenium 4.16.0 you should replace response = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
with:
response = driver.execute(
    driver_command="executeCdpCommand",
    params={
        "cmd": "Network.getResponseBody",
        "params": {"requestId": request_id},
    },
)
body = response["value"]["body"]
source: get_browser_request_body(driver: WebDriver, request_id: str) answer from Borys Oliinyk.
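Both variants can be wrapped in one helper, so the same script runs against a local Chrome driver and a Remote() instance. This is a hedged sketch, not an official Selenium API: get_response_body is a name made up here, and it assumes the driver either offers execute_cdp_cmd or accepts the "executeCdpCommand" command as described above. It also decodes the body when Chrome reports it as base64-encoded, which Network.getResponseBody signals through its base64Encoded field.

```python
import base64
from typing import Any


def get_response_body(driver: Any, request_id: str) -> bytes:
    """Fetch a response body via CDP, whichever of the two call styles the driver supports."""
    cmd = "Network.getResponseBody"
    args = {"requestId": request_id}
    if hasattr(driver, "execute_cdp_cmd"):
        # Local Chrome/Chromium drivers expose the CDP helper directly.
        result = driver.execute_cdp_cmd(cmd, args)
    else:
        # Fallback from the comment above, e.g. for webdriver.Remote().
        result = driver.execute(
            driver_command="executeCdpCommand",
            params={"cmd": cmd, "params": args},
        )["value"]
    body = result["body"]
    if result.get("base64Encoded"):
        return base64.b64decode(body)
    return body.encode("utf-8")
```

For JSON responses, json.loads(get_response_body(driver, request_id)) then gives the parsed payload.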
Also consider what @nathan-fiscaletti mentioned about avoiding error -32000. In my case it happens for responses captured from previous sessions, where I had cleared session and local storage but was still using the same webdriver Remote() instance.
It's better to listen for the Network.loadingFinished event before calling driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}) to prevent this issue.
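One way to apply this to the performance log, sketched under the assumption that the parsed events are available as a list (like logs in the gist): keep only those responses whose request ID also appears in a Network.loadingFinished event. The function name ready_responses is hypothetical.

```python
from typing import Dict, Iterable, List


def ready_responses(events: Iterable[Dict]) -> List[Dict]:
    """Return responseReceived events whose request has also finished loading."""
    events = list(events)
    # Request ids for which Chrome reported that loading has completed.
    finished = {
        event["params"]["requestId"]
        for event in events
        if event.get("method") == "Network.loadingFinished"
    }
    return [
        event
        for event in events
        if event.get("method") == "Network.responseReceived"
        and event["params"]["requestId"] in finished
    ]
```

Only the events returned here would then be passed on to Network.getResponseBody.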
It works! Thanks :) I was looking for a solution for a long time, and you helped! 👍