
@msm-code
Created December 29, 2021 02:41

Karton benchmark notes

Predictably, most of the time was spent in Redis.

90k tasks, nothing to collect

[2021-12-29 00:31:53,585][INFO] Tasks: Get all tasks: 4.495668649673462
[2021-12-29 00:31:53,586][INFO] Tasks: Tasks from queue: 0.00021696090698242188
[2021-12-29 00:31:53,675][INFO] Tasks: Big loop: 0.08784604072570801
[2021-12-29 00:31:53,675][INFO] Tasks: Smol loop: 1.0728836059570312e-05
[2021-12-29 00:31:53,812][INFO] Resources: List objects: 0.003173351287841797
[2021-12-29 00:31:58,107][INFO] Resources: Get all tasks: 4.294421672821045
[2021-12-29 00:31:58,268][INFO] Resources: For loop: 0.15995168685913086
[2021-12-29 00:31:58,268][INFO] Resources: Remove loop: 1.1205673217773438e-05
[2021-12-29 00:31:58,407][INFO] GC done in 10.318462610244751 seconds

90k tasks including 60k to collect

[2021-12-29 00:36:48,376][INFO] Tasks: Get all tasks: 4.544401407241821
[2021-12-29 00:36:48,377][INFO] Tasks: Tasks from queue: 0.00017404556274414062
[2021-12-29 00:37:01,953][INFO] Tasks: Big loop: 13.575872898101807
[2021-12-29 00:37:01,954][INFO] Tasks: Smol loop: 9.775161743164062e-06
[2021-12-29 00:37:02,108][INFO] Resources: List objects: 0.005395174026489258
[2021-12-29 00:37:03,493][INFO] Resources: Get all tasks: 1.3837084770202637
[2021-12-29 00:37:03,545][INFO] Resources: For loop: 0.05156588554382324
[2021-12-29 00:37:03,546][INFO] Resources: Remove loop: 8.821487426757812e-06
[2021-12-29 00:37:03,593][INFO] GC done in 20.767701625823975 seconds

60k tasks including 60k to collect

[2021-12-29 00:49:17,727][INFO] Tasks: Get all tasks: 2.212865114212036
[2021-12-29 00:49:17,728][INFO] Tasks: Tasks from queue: 0.00018286705017089844
[2021-12-29 00:49:26,743][INFO] Tasks: Big loop: 9.015160083770752
[2021-12-29 00:49:26,744][INFO] Tasks: Smol loop: 1.2874603271484375e-05
[2021-12-29 00:49:27,397][INFO] Resources: List objects: 0.5849082469940186
[2021-12-29 00:49:27,399][INFO] Resources: Get all tasks: 0.0008819103240966797
[2021-12-29 00:49:27,400][INFO] Resources: For loop: 3.337860107421875e-05
[2021-12-29 00:49:37,279][INFO] Resources: Remove loop: 9.878673791885376
[2021-12-29 00:49:37,280][INFO] GC done in 22.77230477333069 seconds

Standard batch from now on: 20k tasks including 20k to collect:

[2021-12-29 01:04:37,191][INFO] Tasks: Get all tasks: 1.994135856628418
[2021-12-29 01:04:37,192][INFO] Tasks: Tasks from queue: 0.00023102760314941406
[2021-12-29 01:04:46,747][INFO] Tasks: Big loop (record): 9.555027723312378
[2021-12-29 01:04:46,748][INFO] Tasks: Smol loop: 1.8596649169921875e-05
[2021-12-29 01:04:46,841][INFO] Resources: List objects: 0.005769491195678711
[2021-12-29 01:04:46,843][INFO] Resources: Get all tasks: 0.001276254653930664
[2021-12-29 01:04:46,843][INFO] Resources: For loop: 3.981590270996094e-05
[2021-12-29 01:04:46,846][INFO] Resources: Remove loop: 0.002631664276123047
[2021-12-29 01:04:46,847][INFO] GC done in 12.65667200088501 seconds

Part 1: Tasks

Get all tasks is (unsurprisingly) slow, and the main loop of gc_collect_tasks is very slow when there are resources to remove. Let's delete the tasks in one batched call instead of one by one:

    def delete_tasks(self, tasks: List[Task]) -> None:
        """
        Remove multiple tasks from Redis

        :param tasks: list of Task objects
        """
        keys = [f"{KARTON_TASK_NAMESPACE}:{task.uid}" for task in tasks]
        self.redis.delete(*keys)

result:

[2021-12-29 01:09:28,965][INFO] Tasks: Get all tasks: 1.9590599536895752
[2021-12-29 01:09:28,967][INFO] Tasks: Tasks from queue: 0.00023603439331054688
[2021-12-29 01:09:29,052][INFO] Tasks: Big loop (record): 0.08496618270874023
[2021-12-29 01:09:29,053][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:09:29,216][INFO] Tasks: Big loop (delete): 0.16314005851745605
[2021-12-29 01:09:33,712][INFO] Tasks: Big loop (metrics): 4.494740009307861
[2021-12-29 01:09:33,713][INFO] Tasks: Smol loop: 1.5020370483398438e-05
[2021-12-29 01:09:33,776][INFO] Resources: List objects: 0.004618406295776367
[2021-12-29 01:09:33,778][INFO] Resources: Get all tasks: 0.0007326602935791016
[2021-12-29 01:09:33,778][INFO] Resources: For loop: 2.384185791015625e-05
[2021-12-29 01:09:33,780][INFO] Resources: Remove loop: 0.001951456069946289
[2021-12-29 01:09:33,781][INFO] GC done in 7.78088116645813 seconds

Wow! We slashed the time it took to delete 40k tasks from 9.5 seconds to 0.16 seconds.
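As an aside, a single DELETE carrying tens of thousands of keys makes one very large command; a possible refinement (not part of the patch above) is to chunk the keys, trading a few extra round-trips for smaller individual commands. This is only a sketch: `FakeRedis` stands in for a real redis.Redis client, and the `karton.task:` prefix and chunk size are illustrative.

```python
# Sketch only: FakeRedis stands in for a real redis.Redis client,
# and Task here is just any object with a .uid attribute.
class FakeRedis:
    def __init__(self):
        self.calls = []  # records the key tuple of each DELETE command

    def delete(self, *keys):
        self.calls.append(keys)

class Task:
    def __init__(self, uid):
        self.uid = uid

def delete_tasks_chunked(redis, tasks, chunk_size=10000):
    """Delete tasks in batches so no single DELETE command gets huge."""
    keys = [f"karton.task:{task.uid}" for task in tasks]
    for i in range(0, len(keys), chunk_size):
        redis.delete(*keys[i:i + chunk_size])

r = FakeRedis()
delete_tasks_chunked(r, [Task(str(i)) for i in range(25000)], chunk_size=10000)
print([len(c) for c in r.calls])  # three DELETE commands: [10000, 10000, 5000]
```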

Well, almost: incrementing all the metrics still takes 4.5 seconds. One idea is to drop the metrics entirely (system-side metrics are a bit redundant with service-side metrics, but they're still a nice feature to have). The fact that this accounts for half of the time suggests that network latency plays a huge role here, so let's try to speed it up with pipelining:

    def increment_metrics_list(self, metric: KartonMetrics, identities: List[str]) -> None:
        p = self.redis.pipeline()
        for identity in identities:
            p.hincrby(metric.value, identity, 1)
        p.execute()

Result:

[2021-12-29 01:15:55,718][INFO] Tasks: Get all tasks: 2.1441361904144287
[2021-12-29 01:15:55,719][INFO] Tasks: Tasks from queue: 0.0002739429473876953
[2021-12-29 01:15:55,805][INFO] Tasks: Big loop (record): 0.08557796478271484
[2021-12-29 01:15:55,806][INFO] Tasks: Deleting 40001 tasks
[2021-12-29 01:15:55,965][INFO] Tasks: Big loop (delete): 0.15918326377868652
[2021-12-29 01:15:56,424][INFO] Tasks: Big loop (metrics): 0.4580235481262207
[2021-12-29 01:15:56,424][INFO] Tasks: Smol loop: 2.2649765014648438e-05
[2021-12-29 01:15:56,494][INFO] Resources: List objects: 0.005167245864868164
[2021-12-29 01:15:56,495][INFO] Resources: Get all tasks: 0.0008485317230224609
[2021-12-29 01:15:56,496][INFO] Resources: For loop: 2.9325485229492188e-05
[2021-12-29 01:15:56,498][INFO] Resources: Remove loop: 0.002457857131958008
[2021-12-29 01:15:56,499][INFO] GC done in 3.9317240715026855 seconds

Well, there goes another 4 seconds. Task GC time improved from ~12.7 seconds to ~3.9 seconds. Nice.

Can we do better? Well, it depends on why get all tasks is slow. There are three separate operations:

  • keys
  • mget
  • deserialisation

I've added logging for them all. A bit surprisingly, deserialisation is by far the slowest part:

[2021-12-29 01:26:56,155][INFO] Keys: 0.18313837051391602
[2021-12-29 01:26:56,583][INFO] Mget: 0.4270799160003662
[2021-12-29 01:26:58,366][INFO] Deserialise: 1.7819278240203857
[2021-12-29 01:26:58,372][INFO] Tasks: Get all tasks: 2.40012788772583

This is good news. This means we can optimise it in the future, without redesigning everything.
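For reference, the per-stage numbers above come from straightforward wall-clock instrumentation. A self-contained sketch of the same idea, with an in-memory dict standing in for Redis and JSON assumed as the task serialisation format:

```python
import json
import time

def timed(label, fn):
    """Run fn(), print how long it took, and return its result."""
    start = time.time()
    result = fn()
    print(f"[INFO] {label}: {time.time() - start}")
    return result

# In-memory stand-in for Redis: task key -> serialised task blob.
store = {f"karton.task:{i}": json.dumps({"uid": str(i), "status": "finished"})
         for i in range(1000)}

keys = timed("Keys", lambda: list(store))                         # KEYS karton.task:*
blobs = timed("Mget", lambda: [store[k] for k in keys])           # MGET key1 key2 ...
tasks = timed("Deserialise", lambda: [json.loads(b) for b in blobs])
```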

Part 2: Resources

But for now, let's focus on resources. I've changed the producer a bit to produce a different resource every time:

import sys
import logging
from karton.core import Config, Producer, Task, Resource

config = Config("karton.ini")
producer = Producer(config)

howmany = int(sys.argv[1])
for i in range(howmany):
    if i % 1000 == 0:
        print(i)

    contents = str(i).encode()
    resource = Resource(str(i), contents)

    task = Task({"type": "sample", "kind": "raw"})
    logging.info("pushing file to karton %s, task %s", i, task)

    task.add_resource("sample", resource)
    task.add_payload("tags", ["simple_producer"])
    task.add_payload("additional_info", ["This sample has been added by simple producer example"])
    producer.send_task(task)

New times:

[2021-12-29 02:24:10,000][INFO] Tasks: Get all tasks: 0.8783807754516602
[2021-12-29 02:24:10,048][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:24:10,532][INFO] Tasks: Big loop: 0.48422670364379883
[2021-12-29 02:24:11,426][INFO] Resources: List objects: 0.8508834838867188
[2021-12-29 02:24:11,427][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:24:11,428][INFO] Resources: Get all tasks: 0.00037384033203125
[2021-12-29 02:24:11,428][INFO] Resources: For loop: 7.62939453125e-06
[2021-12-29 02:24:11,429][INFO] Resources: Removing 10000 resources
[2021-12-29 02:24:26,314][INFO] Resources: Remove loop: 14.885408878326416
[2021-12-29 02:24:26,315][INFO] GC done in 18.199631452560425 seconds

Let's use basically the same technique and pipeline the object removal:

[2021-12-29 02:21:48,814][INFO] Tasks: Get all tasks: 0.8932852745056152
[2021-12-29 02:21:48,871][INFO] Tasks: Deleting 20000 tasks
[2021-12-29 02:21:49,267][INFO] Tasks: Big loop: 0.39539289474487305
[2021-12-29 02:21:50,224][INFO] Resources: List objects: 0.9117233753204346
[2021-12-29 02:21:50,225][INFO] Resources: List objects: 10000 objects
[2021-12-29 02:21:50,226][INFO] Resources: Get all tasks: 0.0004830360412597656
[2021-12-29 02:21:50,226][INFO] Resources: For loop: 6.198883056640625e-06
[2021-12-29 02:21:50,226][INFO] Resources: Removing 10000 resources
[2021-12-29 02:21:52,237][INFO] Resources: Remove loop: 2.0103540420532227
[2021-12-29 02:21:52,239][INFO] GC done in 5.3230650424957275 seconds

Batching wins again, cutting the remove time from 14.89s to 2.01s.
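The pipelined removal itself isn't shown above, but the pattern is generic: group the object names into batches and hand each batch to something that issues a single round-trip (a Redis pipeline's execute, or an S3-style bulk delete). A sketch under that assumption, where `remove_batch` is a hypothetical callable:

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of up to n items."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

def remove_objects_batched(remove_batch, object_names, batch_size=1000):
    # remove_batch is a hypothetical callable taking a list of names;
    # each call is assumed to cost one network round-trip.
    removed = 0
    for chunk in batched(object_names, batch_size):
        remove_batch(chunk)
        removed += len(chunk)
    return removed

calls = []
n = remove_objects_batched(calls.append, [f"res-{i}" for i in range(2500)],
                           batch_size=1000)
print(n, [len(c) for c in calls])  # 2500 [1000, 1000, 500]
```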

Part 3: Routing

With these changes, GC should be fast enough. There is one optimisation pending (don't call get_all_tasks twice), but maybe it's time to focus on routing.

The ways to go from here are:

  • improve routing performance (by using pipelining in process_task, for example),
  • reverse-engineer why routing is not thread-safe and fix it, so we can scale routers,
  • implement batched routing (route multiple tasks at once to reduce round-trips).
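The batched-routing idea from the last bullet could look roughly like the sketch below: accumulate incoming tasks and flush them in one round-trip once a batch fills up. Everything here (the `route_batch` callable, the batch size) is hypothetical, not karton's actual API:

```python
class BatchRouter:
    """Hypothetical accumulator: route_batch is called with a list of
    tasks and is expected to route them all in a single round-trip."""

    def __init__(self, route_batch, max_batch=100):
        self.route_batch = route_batch
        self.max_batch = max_batch
        self.pending = []

    def submit(self, task):
        self.pending.append(task)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.pending:
            self.route_batch(self.pending)
            self.pending = []

batches = []
router = BatchRouter(lambda tasks: batches.append(list(tasks)), max_batch=100)
for i in range(250):
    router.submit(i)
router.flush()  # route whatever is left over
print([len(b) for b in batches])  # [100, 100, 50]
```

The trade-off is latency: a task may sit in `pending` until the batch fills, so a real implementation would also flush on a timer.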

Sadly it's getting a bit late here, so I won't be able to look at it today.
