#
# This small example shows you how to access JS-based requests via Selenium
# Like this, one can access raw data for scraping,
# for example on many JS-intensive/React-based websites
#
import json
from time import sleep

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

# make chrome log requests
capabilities = DesiredCapabilities.CHROME
capabilities["loggingPrefs"] = {"performance": "ALL"}  # newer: goog:loggingPrefs
driver = webdriver.Chrome(
    desired_capabilities=capabilities, executable_path="./chromedriver"
)

# fetch a site that does xhr requests
driver.get("https://sitewithajaxorsomething.com")
sleep(5)  # wait for the requests to take place

# extract requests from logs
logs_raw = driver.get_log("performance")
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]


def log_filter(log_):
    return (
        # is an actual response
        log_["method"] == "Network.responseReceived"
        # and json
        and "json" in log_["params"]["response"]["mimeType"]
    )


for log in filter(log_filter, logs):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    print(f"Caught {resp_url}")
    print(driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}))
I've been trying to achieve this for at least a week working on it, and for a few months thinking about it. You are great.
This is really great, however at the final step of getting the response body using the requestId I get
self.driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
2021-05-06 14:04:12 jim-ThinkPad-S5-S540 selenium.webdriver.remote.remote_connection[36958] DEBUG POST http://127.0.0.1:42437/session/b29c0918324a3defb5d6d11100dd3bec/goog/cdp/execute {"cmd": "Network.getResponseBody", "params": {"requestId": "37056.284"}}
2021-05-06 14:04:12 jim-ThinkPad-S5-S540 urllib3.connectionpool[36958] DEBUG http://127.0.0.1:42437 "POST /session/b29c0918324a3defb5d6d11100dd3bec/goog/cdp/execute HTTP/1.1" 500 253
2021-05-06 14:04:12 jim-ThinkPad-S5-S540 selenium.webdriver.remote.remote_connection[36958] DEBUG Finished Request
*** selenium.common.exceptions.WebDriverException: Message: unknown error: unhandled inspector error: {"code":-32000,"message":"No resource with given identifier found"}
(Session info: chrome=89.0.4389.114)
Can you please help me out with how to do this in Firefox? I tried a few steps, but it didn't work out.
Sorry, this is not intended for Firefox, @shans0535. Have you tried selenium-wire or just a mitm-proxy instead?
To: lee-hodg
I think this error comes from requesting a resource that is not available.
It works well if you wrap the call in try-except.
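A minimal sketch of the try-except approach mentioned above. The helper name `get_body_or_none` is hypothetical, and the import fallback is only there so the snippet can be read and run without Selenium installed; it assumes the error surfaces as `WebDriverException`, as in the traceback earlier in this thread.

```python
try:
    from selenium.common.exceptions import WebDriverException
except ImportError:  # fallback so the sketch runs without Selenium installed
    class WebDriverException(Exception):
        pass


def get_body_or_none(driver, request_id):
    """Return the getResponseBody result for request_id, or None if the call fails.

    Hypothetical helper: `driver` is expected to offer execute_cdp_cmd,
    as Selenium's Chrome driver does.
    """
    try:
        return driver.execute_cdp_cmd(
            "Network.getResponseBody", {"requestId": request_id}
        )
    except WebDriverException:
        # e.g. "No resource with given identifier found"
        return None
```

In the gist's final loop you could call `get_body_or_none(driver, request_id)` and skip entries where it returns `None`.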
I was working on a way to do this for a week or two before I found your post. Works beautifully for what I needed, thanks a bunch.
It works! Thanks :) I was looking for a solution for a long time, and you helped! 👍
Thanks for the kindness everyone. Glad I could help you out. Please feel free to check out my profile with similar tools and libraries at https://github.com/lorey <3
Awesome!!
How do I get XHR requests from a real browser online?
Selenium is using a real browser. If you want to do it manually yourself, check out developer tools (e.g. F12 in Chrome, tab "Network").
@lorey, thanks for sharing.
For Chrome >=75 we have to make a small change.
As specified in the release notes for ChromeDriver 75.0.3770.8, the capability loggingPrefs has been renamed to goog:loggingPrefs.
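As a rough illustration of the rename, here is a small helper that picks the capability key from the ChromeDriver major version. The function name `logging_prefs_capability` and the idea of passing the major version as an integer are assumptions for this sketch, not part of Selenium.

```python
def logging_prefs_capability(chromedriver_major: int) -> dict:
    """Build the performance-logging capability for a given ChromeDriver major version.

    Hypothetical helper: uses "goog:loggingPrefs" from version 75 on and the
    older "loggingPrefs" key before that, following the rename described above.
    """
    key = "goog:loggingPrefs" if chromedriver_major >= 75 else "loggingPrefs"
    return {key: {"performance": "ALL"}}
```

The returned dict can be merged into the capabilities you pass to the driver.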
Hello, thanks for sharing this gist. Your code works fine, but I have one small issue that I can't get my head around.
I'm trying to log an XHR call made by a web worker.
Getting the performance log on the main thread doesn't list the request I want.
In Chrome, only when I select the worker in the Console tab and execute "performance.getEntries()" do I get the request I want.
Any idea how to do that in Selenium?
I used this method for a while. After some time, during a script run and for no clear reason, the "driver.execute_cdp_cmd" function throws this error:
'WebDriver' object has no attribute 'execute_cdp_cmd'
I'm looking for an alternative solution; feel free to suggest what could be done...
Hey @milanbog92, how about:
- https://pypi.org/project/mitmproxy/ to catch requests
- a regular browser (e.g. by hotkeys) or maybe playwright with some adaptions to be undetectable
@lorey Thanks for the fast response!
Since I am executing "python3 script.py" from an external script, it seems that my system loaded the wrong Python version. Python 3.6 shows the error consistently, while Python 3.9 works as expected. Hopefully this will help someone...
I went through all the available solutions, and I believe there is no better one: Selenium can't load a Chrome extension that uses the chrome.debugger API, and I have had no luck with hotkeys so far in my complex environment.
@lorey, thanks for your fantastic work. Just one more thing:
Is there a way to get only the response data from a specific URL?
Hi,
I use the variable name performance_logs instead of logs_raw, skip "chrome://favicon2" entries, and search for image_name:
performance_logs = driver.get_log("performance")
for performance_log in performance_logs:
    performance_log_json = json.loads(performance_log["message"])
    if performance_log_json["message"]["method"] == 'Network.responseReceived':
        if performance_log_json["message"]["params"]["response"]["url"].find('chrome://favicon2/') != -1:
            continue
        if performance_log_json["message"]["params"]["response"]["url"].find(image_name) != -1:
            print(performance_log_json["message"]["params"]["response"]["url"])
            print(performance_log_json["message"]["params"]["requestId"])
            print(performance_log_json["message"]["params"]["type"])
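The same idea can be wrapped in a reusable function. This is only a sketch: `responses_for_url` and its `url_part` parameter are hypothetical names, and it assumes the entries have the shape returned by `driver.get_log("performance")` in the gist.

```python
import json


def responses_for_url(performance_logs, url_part):
    """Yield (request_id, url) for each received response whose URL contains url_part.

    Hypothetical helper; `performance_logs` is the list of raw entries from
    driver.get_log("performance"), each with a JSON string under "message".
    """
    for entry in performance_logs:
        message = json.loads(entry["message"])["message"]
        if message["method"] != "Network.responseReceived":
            continue
        url = message["params"]["response"]["url"]
        if url_part in url:
            yield message["params"]["requestId"], url
```

Each yielded request ID can then be passed to `Network.getResponseBody` as in the gist.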
Desired Capabilities is deprecated and can't be used anymore, how can I achieve this without it?
@LiamKrenn take a look at this, I haven't tried but hopefully it works https://stackoverflow.com/questions/76622916/converting-desired-capabilities-to-options-in-selenium-python
Hello guys, for Selenium 4.x use this:
driver.options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver.get(url)
then just follow the steps from line 24+
Works for selenium 4.13.0
For Selenium 4.15, set the option:
options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)
Something I noticed is that you need to filter out Preflight requests.
if event['params']['type'] != 'Preflight':
. . .
Otherwise, you might get this error:
{"code":-32000,"message":"No resource with given identifier found"}
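One way to fold the Preflight check into the gist's filter is sketched below. The function name `is_fetchable_json_response` is hypothetical; it assumes each event is an already-parsed performance log message, as produced by the gist's `logs` list.

```python
def is_fetchable_json_response(event):
    """Return True for JSON responses that are not CORS preflight requests.

    Hypothetical variant of the gist's log_filter with the Preflight check added.
    """
    return (
        # is an actual response
        event["method"] == "Network.responseReceived"
        # skip preflight requests, which have no body to fetch
        and event["params"].get("type") != "Preflight"
        # and json
        and "json" in event["params"]["response"]["mimeType"]
    )
```

It can be used in place of `log_filter`, e.g. `filter(is_fetchable_json_response, logs)`.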
Hello,
Is there a way to use this with Selenium Grid (remote)?
With Selenium Grid I can catch the request but never the response.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-ssl-errors=yes')
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option('w3c', True)
# Try to catch XHR response
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Remote(
    command_executor='http://' + GRID_HOST + '/wd/hub',
    options=options,
)
driver.get("https://www.mywebsite.com/")
searchbox = driver.find_element(By.ID, "searchbox")
searchbox.send_keys("type something in the searchbox")
logs_raw = driver.get_log("performance")
# parse the raw entries as in the gist (log_filter is the gist's filter function)
requests = [json.loads(lr["message"])["message"] for lr in logs_raw]
for log in filter(log_filter, requests):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    if 'aj_recherche' in resp_url:
        response = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
I can then log every request associated with the searchbox (AJAX),
but never the JSON response (which I can find in the Chrome dev console).
Any idea?
@ofostier Yes! Took me some time to figure out but starting from Selenium 4.16.0 you should replace
response = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
with:
response = driver.execute(
    driver_command="executeCdpCommand",
    params={
        "cmd": "Network.getResponseBody",
        "params": {"requestId": request_id},
    },
)
body = response["value"]["body"]
source: get_browser_request_body(driver: WebDriver, request_id: str) answer from Borys Oliinyk.
Also consider what @nathan-fiscaletti mentioned about avoiding error -32000. In my case it happens for responses captured from previous sessions, where I cleared session and local storage but was still using the same webdriver Remote() instance.
It's better to listen for the Network.loadingFinished event before calling driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}) to prevent this issue.
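With the performance-log approach from the gist, one way to approximate this is to keep only responses whose request ID also has a Network.loadingFinished entry in the same log. This is a sketch under that assumption; both function names are hypothetical, and the events are already-parsed log messages.

```python
def finished_request_ids(events):
    """Collect the request IDs that have a Network.loadingFinished event."""
    return {
        event["params"]["requestId"]
        for event in events
        if event["method"] == "Network.loadingFinished"
    }


def ready_responses(events):
    """Return responseReceived events whose request has finished loading."""
    finished = finished_request_ids(events)
    return [
        event
        for event in events
        if event["method"] == "Network.responseReceived"
        and event["params"]["requestId"] in finished
    ]
```

Only the events returned by `ready_responses(logs)` would then be passed on to `Network.getResponseBody`.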
Further reading: https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/
Alternative: Library to access requests directly https://pypi.org/project/selenium-wire/
Alternative method: Using a mitm proxy and accessing it via python https://stackoverflow.com/a/36769922/1275778 https://wiki.wireshark.org/Python