- Use concurrency for I/O-bound calls (e.g. REST API calls)
- Use parallelism for CPU-bound calls
You're working on a Python 3.x codebase. You have a service that has 1,000 TODO IDs. For each TODO ID, you want to get the TODO details via https://jsonplaceholder.typicode.com/todos/{todo_id}. How should you fan out the 1,000 calls to the TODO details web API? Which one of these techniques should you use? Either:
- parallelism using the `multiprocessing` library, or
- concurrency using the `asyncio` tasks and `gather` technique
Note: Assume your machine has 4 vCPUs with 2 cores each.
Both parallelism and concurrency can be used to speed up the execution of your code, but they are used in different scenarios and have different trade-offs.
Parallelism involves running multiple processes at the same time. This is useful when you have CPU-bound tasks, i.e., tasks that require heavy computation and spend most of their time using the CPU.
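As a minimal sketch of that CPU-bound case, the snippet below spreads a deliberately naive prime count across worker processes with `multiprocessing.Pool`. The function names `count_primes` and `count_primes_parallel` and the input sizes are made up for illustration; they are not part of the TODO scenario above.

```python
import multiprocessing


def count_primes(limit):
    """Count primes below `limit` by trial division (CPU-heavy on purpose)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, processes=None):
    # processes=None lets Pool start os.cpu_count() worker processes.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # Hypothetical workload: four independent CPU-bound jobs.
    print(count_primes_parallel([20_000, 30_000, 40_000, 50_000]))
```

Each job is pure computation, so running the jobs in separate processes lets them occupy separate cores at the same time.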
However, making HTTP requests to an API is an I/O-bound task: it spends most of its time waiting for input/output operations (like network or disk operations) to complete. While these operations are in progress, the CPU is mostly idle.
Concurrency allows you to start a new task before the previous one has completed, thus making better use of your CPU. While one task is waiting for an I/O operation to complete, another task can use the CPU. This approach allows you to start multiple requests at the same time and then wait for all of them to complete, which should be faster than making the requests one by one.
import asyncio

import aiohttp


async def get_todo(session, todo_id):
    # Fetch one TODO; the shared session reuses connections across requests.
    url = f'https://jsonplaceholder.typicode.com/todos/{todo_id}'
    async with session.get(url) as response:
        return await response.text()


async def main():
    todo_ids = range(1, 1001)  # replace with your list of TODO IDs
    # One session for all requests instead of a new session per call.
    async with aiohttp.ClientSession() as session:
        tasks = [get_todo(session, todo_id) for todo_id in todo_ids]
        responses = await asyncio.gather(*tasks)
    for todo_details in responses:
        print(todo_details)


if __name__ == "__main__":
    asyncio.run(main())
This code creates an `asyncio` task for each TODO ID, starts all of them at the same time, and then waits for all of them to complete. The responses from the API are stored in the `responses` list.
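Starting 1,000 requests at the same instant may run into server or connection limits, so you might want to cap how many are in flight. The sketch below shows one way to do that with `asyncio.Semaphore`. To stay runnable without network access it uses a stand-in `fake_fetch` coroutine built on `asyncio.sleep`; that function, the helper name `fetch_all`, and the `max_in_flight` default of 50 are assumptions for illustration, not part of the code above.

```python
import asyncio


async def fake_fetch(todo_id):
    # Stand-in for an HTTP call: waits briefly, then returns a fake payload.
    await asyncio.sleep(0.01)
    return {"id": todo_id}


async def fetch_all(todo_ids, fetch=fake_fetch, max_in_flight=50):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(todo_id):
        # At most `max_in_flight` coroutines hold the semaphore at a time.
        async with semaphore:
            return await fetch(todo_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(t) for t in todo_ids))


if __name__ == "__main__":
    results = asyncio.run(fetch_all(range(1, 1001)))
    print(len(results), results[0], results[-1])
```

To use this against the real API, you would pass a coroutine that performs the HTTP request as `fetch` in place of `fake_fetch`.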
Here are some reasons why `asyncio` is a better choice for this task:
- `asyncio` tasks are lightweight coroutines that run concurrently on a single thread, while each `multiprocessing` process is a separate Python interpreter with its own memory and startup cost.
- `asyncio` does not need multiple cores here. The requests spend almost all of their time waiting on the network, so one event loop on one core can keep many requests in flight at once.
- `asyncio` is more efficient in terms of memory usage.
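The machine's core count (4 vCPUs with 2 cores each) does not limit how many `asyncio` tasks can be waiting on I/O at once: all 1,000 requests can be in flight on a single event loop. With the Python `multiprocessing` library, a pool is usually sized to the number of cores and each worker process handles one blocking request at a time, so only a handful of requests would be in flight at any moment.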
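One way to check how many `asyncio` tasks can wait concurrently is to simulate it. The sketch below replaces the HTTP call with `asyncio.sleep` (an assumption, so it runs without a network), starts 1,000 tasks on one event loop, and records which threads they ran on and how long the batch took. The names `simulated_request` and `run_batch` are illustrative.

```python
import asyncio
import threading
import time


async def simulated_request(todo_id, delay):
    # asyncio.sleep stands in for waiting on the network.
    await asyncio.sleep(delay)
    return todo_id, threading.get_ident()


async def run_batch(count=1000, delay=0.1):
    start = time.perf_counter()
    results = await asyncio.gather(
        *(simulated_request(i, delay) for i in range(count))
    )
    elapsed = time.perf_counter() - start
    thread_ids = {thread_id for _, thread_id in results}
    return len(results), len(thread_ids), elapsed


if __name__ == "__main__":
    completed, threads_used, elapsed = asyncio.run(run_batch())
    # All tasks share one thread, and the waits overlap instead of adding up.
    print(f"{completed} tasks, {threads_used} thread(s), {elapsed:.2f}s")
```

Run one after another, 1,000 waits of 0.1 s would take about 100 s. Here the waits overlap on a single thread, so the batch should finish in a small fraction of that, regardless of how many cores the machine has.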
Generally, any task that involves a lot of mathematical computations and doesn't need to wait for data to be read from or written to the disk or the network is likely to be CPU-bound.
- Image Processing: Tasks like resizing images, applying filters, or converting image formats are CPU-bound because they involve a lot of computations for each pixel in the image.
- Machine Learning: Training a machine learning model involves a lot of matrix multiplications and other mathematical computations, which are CPU-intensive. Similarly, making predictions with a trained model can also be a CPU-bound task, especially if you're making a lot of predictions at once.
- Data Analysis: Tasks like sorting large arrays, computing statistical measures (like the mean or standard deviation) over large datasets, or performing complex queries on a database can be CPU-bound if the data is already in memory.
- Video Encoding/Decoding: Converting a video from one format to another, compressing a video, or extracting frames from a video are all CPU-bound tasks because they involve a lot of computations for each frame in the video.
- Cryptographic Computations: Tasks like encrypting/decrypting data, generating hashes, or verifying digital signatures are CPU-bound because they involve complex mathematical computations.
- Physics Simulations: Simulating the motion of a system of particles, predicting the weather, or modeling the behavior of a fluid are all CPU-bound tasks because they involve solving complex mathematical equations for each point in the system.
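To see why work like this gains nothing from `asyncio` on its own, the sketch below runs two CPU-bound coroutines through `asyncio.gather` and logs when each starts and finishes. Because neither coroutine ever awaits, the event loop cannot switch between them, so they run one after the other. The function `busy_sum` and the `events` log are hypothetical names used only for this illustration.

```python
import asyncio


async def busy_sum(name, n, events):
    events.append(f"{name} start")
    total = 0
    for i in range(n):  # pure computation, no await: the event loop is blocked
        total += i * i
    events.append(f"{name} end")
    return total


async def main():
    events = []
    totals = await asyncio.gather(
        busy_sum("a", 200_000, events),
        busy_sum("b", 200_000, events),
    )
    return events, totals


if __name__ == "__main__":
    events, totals = asyncio.run(main())
    print(events)  # ['a start', 'a end', 'b start', 'b end']
```

The log shows no interleaving: task "b" only starts once task "a" has finished, which is why CPU-bound work is better served by separate processes.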