Predictibly, most of the time was spent in redis
90k tasks, nothing to collect
[2021-12-29 00:31:53,585][INFO] Tasks: Get all tasks: 4.495668649673462
[2021-12-29 00:31:53,586][INFO] Tasks: Tasks from queue: 0.00021696090698242188
[2021-12-29 00:31:53,675][INFO] Tasks: Big loop: 0.08784604072570801
[2021-12-29 00:31:53,675][INFO] Tasks: Smol loop: 1.0728836059570312e-05
[2021-12-29 00:31:53,812][INFO] Resources: List objects: 0.003173351287841797
[2021-12-29 00:31:58,107][INFO] Resources: Get all tasks: 4.294421672821045
[2021-12-29 00:31:58,268][INFO] Resources: For loop: 0.15995168685913086
[2021-12-29 00:31:58,268][INFO] Resources: Remove loop: 1.1205673217773438e-05
[2021-12-29 00:31:58,407][INFO] GC done in 10.318462610244751 seconds
90k tasks including 60k to collect
[2021-12-29 00:36:48,376][INFO] Tasks: Get all tasks: 4.544401407241821
[2021-12-29 00:36:48,377][INFO] Tasks: Tasks from queue: 0.00017404556274414062
[2021-12-29 00:37:01,953][INFO] Tasks: Big loop: 13.575872898101807
[2021-12-29 00:37:01,954][INFO] Tasks: Smol loop: 9.775161743164062e-06
[2021-12-29 00:37:02,108][INFO] Resources: List objects: 0.005395174026489258
[2021-12-29 00:37:03,493][INFO] Resources: Get all tasks: 1.3837084770202637
[2021-12-29 00:37:03,545][INFO] Resources: For loop: 0.05156588554382324
[2021-12-29 00:37:03,546][INFO] Resources: Remove loop: 8.821487426757812e-06
[2021-12-29 00:37:03,593][INFO] GC done in 20.767701625823975 seconds
60k tasks including 60k to collect
[2021-12-29 00:49:17,727][INFO] Tasks: Get all tasks: 2.212865114212036
[2021-12-29 00:49:17,728][INFO] Tasks: Tasks from queue: 0.00018286705017089844
[2021-12-29 00:49:26,743][INFO] Tasks: Big loop: 9.015160083770752
[2021-12-29 00:49:26,744][INFO] Tasks: Smol loop: 1.2874603271484375e-05
[2021-12-29 00:49:27,397][INFO] Resources: List objects: 0.5849082469940186
[2021-12-29 00:49:27,399][INFO] Resources: Get all tasks: 0.0008819103240966797
[2021-12-29 00:49:27,400][INFO] Resources: For loop: 3.337860107421875e-05
[2021-12-29 00:49:37,279][INFO] Resources: Remove loop: 9.878673791885376
[2021-12-29 00:49:37,280][INFO] GC done in 22.77230477333069 seconds
Standard batch from now on: 20k tasks including 20k to collect:
[2021-12-29 01:04:37,191][INFO] Tasks: Get all tasks: 1.994135856628418
[2021-12-29 01:04:37,192][INFO] Tasks: Tasks from queue: 0.00023102760314941406
[2021-12-29 01:04:46,747][INFO] Tasks: Big loop (record): 9.555027723312378
[2021-12-29 01:04:46,748][INFO] Tasks: Smol loop: 1.8596649169921875e-05
[2021-12-29 01:04:46,841][INFO] Resources: List objects: 0.005769491195678711
[2021-12-29 01:04:46,843][INFO] Resources: Get all tasks: 0.001276254653930664
[2021-12-29 01:04:46,843][INFO] Resources: For loop: 3.981590270996094e-05
[2021-12-29 01:04:46,846][INFO] Resources: Remove loop: 0.002631664276123047
[2021-12-29 01:04:46,847][INFO] GC done in 12.65667200088501 seconds
Get all tasks
is (unsuprisingly) slow.
Main loop of gc_collect_tasks
is very slow when there are resources to remove
def delete_tasks(self, tasks: List[Task]) -> None:
"""
Remove multiple tasks from Redis
:param tasks: list of Task objects
"""
keys = [f"{KARTON_TASK_NAMESPACE}:{task.uid}" for task in tasks]
self.redis.delete(*keys)
result:
[2021-12-29 01:09:28,965][INFO] Tasks: Get all tasks: 1.9590599536895752
[2021-12-29 01:09:28,967][INFO] Tasks: Tasks from queue: 0.00023603439331054688
[2021-12-29 01:09:29,052][INFO] Tasks: Big loop (record): 0.08496618270874023
[2021-12-29 01:09:29,053][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:09:29,216][INFO] Tasks: Big loop (delete): 0.16314005851745605
[2021-12-29 01:09:33,712][INFO] Tasks: Big loop (metrics): 4.494740009307861
[2021-12-29 01:09:33,713][INFO] Tasks: Smol loop: 1.5020370483398438e-05
[2021-12-29 01:09:33,776][INFO] Resources: List objects: 0.004618406295776367
[2021-12-29 01:09:33,778][INFO] Resources: Get all tasks: 0.0007326602935791016
[2021-12-29 01:09:33,778][INFO] Resources: For loop: 2.384185791015625e-05
[2021-12-29 01:09:33,780][INFO] Resources: Remove loop: 0.001951456069946289
[2021-12-29 01:09:33,781][INFO] GC done in 7.78088116645813 seconds
Wow! We slashed time it took to delete 40k tasks from 9.5 seconds to 0.16 seconds.
Well, almost - incrementing all metrics still takes 4.5 seconds. One idea is to delete the metrics (system-side metrics are a bit redundant with service-side metrics. But still, it's a nice feature to have). But the fact that this is half of the time, suggests that network latency plays a huge role here. Let's try to speed it up with pipelining:
def increment_metrics_list(self, metric: KartonMetrics, identities: List[str]) -> None:
p = self.redis.pipeline()
for identity in identities:
p.hincrby(metric.value, identity, 1)
p.execute()
Result:
[2021-12-29 01:15:55,718][INFO] Tasks: Get all tasks: 2.1441361904144287
[2021-12-29 01:15:55,719][INFO] Tasks: Tasks from queue: 0.0002739429473876953
[2021-12-29 01:15:55,805][INFO] Tasks: Big loop (record): 0.08557796478271484
[2021-12-29 01:15:55,806][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:15:55,965][INFO] Tasks: Big loop (delete): 0.15918326377868652
[2021-12-29 01:15:56,424][INFO] Tasks: Big loop (metrics): 0.4580235481262207
[2021-12-29 01:15:56,424][INFO] Tasks: Smol loop: 2.2649765014648438e-05
[2021-12-29 01:15:56,494][INFO] Resources: List objects: 0.005167245864868164
[2021-12-29 01:15:56,495][INFO] Resources: Get all tasks: 0.0008485317230224609
[2021-12-29 01:15:56,496][INFO] Resources: For loop: 2.9325485229492188e-05
[2021-12-29 01:15:56,498][INFO] Resources: Remove loop: 0.002457857131958008
[2021-12-29 01:15:56,499][INFO] GC done in 3.9317240715026855 seconds
Well, there goes another 4.5 seconds. Task GC time improved from ~12 seconds to ~3.5 seconds. Nice.
Can we do better? Well, it depends on why get all tasks
is slow. There are three separate operations:
keys
mget
- deserialisation
I've added logging for them all. A bit suprisingly, deserialisation is by far the slowest part:
[2021-12-29 01:26:56,155][INFO] Keys: 0.18313837051391602
[2021-12-29 01:26:56,583][INFO] Mget: 0.4270799160003662
[2021-12-29 01:26:58,366][INFO] Deserialise: 1.7819278240203857
[2021-12-29 01:26:58,372][INFO] Tasks: Get all tasks: 2.40012788772583
This is good news. This means we can optimise it in the future, without redesigning everything.
But for now, let's focus on resources. I've changed the producer a bit, to produce a different resource every time:
import sys
import os
import logging
from karton.core import Config, Producer, Task, Resource
config = Config("karton.ini")
producer = Producer(config)
howmany = int(sys.argv[1])
for i in range(howmany):
if i % 1000 == 0:
print(i)
contents = str(i).encode()
resource = Resource(str(i), contents)
task = Task({"type": "sample", "kind": "raw"})
logging.info('pushing file to karton %s, task %s' % (i, task))
task.add_resource("sample", resource)
task.add_payload("tags", ["simple_producer"])
task.add_payload("additional_info", ["This sample has been added by simple producer example"])
producer.send_task(task)
New times:
[2021-12-29 02:24:10,000][INFO] Tasks: Get all tasks: 0.8783807754516602
[2021-12-29 02:24:10,048][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:24:10,532][INFO] Tasks: Big loop: 0.48422670364379883
[2021-12-29 02:24:11,426][INFO] Resources: List objects: 0.8508834838867188
[2021-12-29 02:24:11,427][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:24:11,428][INFO] Resources: Get all tasks: 0.00037384033203125
[2021-12-29 02:24:11,428][INFO] Resources: For loop: 7.62939453125e-06
[2021-12-29 02:24:11,429][INFO] Resources: Removing 10000 resources
[2021-12-29 02:24:26,314][INFO] Resources: Remove loop: 14.885408878326416
[2021-12-29 02:24:26,315][INFO] GC done in 18.199631452560425 seconds
Let's use basically the same technique, and pipeline object removal:
[2021-12-29 02:21:48,814][INFO] Tasks: Get all tasks: 0.8932852745056152
[2021-12-29 02:21:48,871][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:21:49,267][INFO] Tasks: Big loop: 0.39539289474487305
[2021-12-29 02:21:50,224][INFO] Resources: List objects: 0.9117233753204346
[2021-12-29 02:21:50,225][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:21:50,226][INFO] Resources: Get all tasks: 0.0004830360412597656
[2021-12-29 02:21:50,226][INFO] Resources: For loop: 6.198883056640625e-06
[2021-12-29 02:21:50,226][INFO] Resources: Removing 10000 resources
[2021-12-29 02:21:52,237][INFO] Resources: Remove loop: 2.0103540420532227
[2021-12-29 02:21:52,239][INFO] GC done in 5.3230650424957275 seconds
Batching wins, again. Decreased the remove time from 14.88s to 2.01s.
With these changes, GC should be fast enough. There is one optimisation pending (don't call get_all_tasks
twice), but maybe it's time to focus on routing.
The ways to go from here are:
- improve routing performnace (by using pipelining in the
process_task
, for example). - reverse-engineer why routing is not thread-safe and fix it so we can scale routers.
- implement batched routing? (route multiple tasks at once to reduce roundtrips).
Sadly it's gettint a bit late here, so I won't be able to look at it.
wow <3