
@svschannak
Last active March 15, 2021 04:35
Multiprocessing with Selenium and Python for website crawling
from multiprocessing import Pool, cpu_count


def run_parallel_selenium_processes(datalist, selenium_func):
    # Max number of parallel processes: leave one core free for the OS
    ITERATION_COUNT = cpu_count() - 1
    pool = Pool(processes=ITERATION_COUNT)
    count_per_iteration = len(datalist) / float(ITERATION_COUNT)
    for i in range(ITERATION_COUNT):
        # Slice out the chunk of the data this worker is responsible for
        list_start = int(count_per_iteration * i)
        list_end = int(count_per_iteration * (i + 1))
        pool.apply_async(selenium_func, [datalist[list_start:list_end]])
    # Prevent further submissions, then wait for all workers to finish
    pool.close()
    pool.join()
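
For context, a minimal usage sketch might look like the following. The crawl_titles helper and the example URLs are hypothetical, assuming a local Chrome/chromedriver setup; each worker process creates its own headless driver, since WebDriver instances cannot be shared across processes.

from selenium import webdriver

def crawl_titles(urls):
    # Hypothetical worker: each process starts its own headless Chrome
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    titles = []
    try:
        for url in urls:
            driver.get(url)
            titles.append(driver.title)
    finally:
        driver.quit()
    return titles

if __name__ == "__main__":
    urls = ["https://example.com", "https://example.org", "https://example.net"]
    run_parallel_selenium_processes(urls, crawl_titles)

The __main__ guard matters here: with the spawn start method (the default on Windows and macOS), multiprocessing re-imports the module in each worker, so the entry point must be protected and the worker function must live at module top level.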

Jiseong-Michael-Yang commented Mar 1, 2021

Thank you for the example.

I have a question: how do we get the results back? When I call the get() method directly on pool.apply_async, the web drivers start in sequence rather than simultaneously, because get() blocks until that task finishes before the next one is submitted.

Could you help me with this problem, please?

I solved this by initializing results = [] before the loop and collecting the AsyncResult handles inside the loop (append, not extend, since apply_async returns a single handle):
results.append(pool.apply_async(selenium_func, [datalist[list_start:list_end]]))

and by resolving them all just before the return:
results = [r.get() for r in results]
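
Putting both pieces together, a sketch of the full adjusted function (reusing the gist's names; the deferred get() calls are what keep the workers concurrent):

from multiprocessing import Pool, cpu_count

def run_parallel_selenium_processes(datalist, selenium_func):
    ITERATION_COUNT = cpu_count() - 1
    pool = Pool(processes=ITERATION_COUNT)
    count_per_iteration = len(datalist) / float(ITERATION_COUNT)
    results = []
    for i in range(ITERATION_COUNT):
        list_start = int(count_per_iteration * i)
        list_end = int(count_per_iteration * (i + 1))
        # Submit without blocking and keep the AsyncResult handle
        results.append(pool.apply_async(selenium_func, [datalist[list_start:list_end]]))
    pool.close()
    pool.join()
    # Block on each handle only after all workers have been started
    return [r.get() for r in results]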

Thank you
