Klipper Timer Too Close (aka TTC)

Note

Update

A PR addressing this cause of Timer Too Close has been merged. I will leave this gist here for reference, but this should no longer be an issue.

Note

tl;dr: Look at klippy's memory usage. Higher memory usage (yes, 140MB is high in this context) can cause garbage collection to run slowly, and if it runs during a homing move, it is likely to trigger a TTC. The [shaketune] module seems to increase memory usage by about 70MB, so try turning that off.

There is a relatively common and dreaded error in the klipper community known as TTC, short for Timer Too Close.

This error may be caused by many things, from improper CANBUS wiring to an overloaded host system. Understanding what the error means requires some knowledge of what klipper is doing. Essentially, klipper needs to precisely time various actions (movements, sensor reads, etc.) between the host computer and the MCU(s). For this to work it schedules different events for specific times, and some of those events have a very small window in which they are valid. When you get the TTC, it basically means that a scheduled event was received by the MCU too late and the scheduled time had already passed. Again, this could be because klipper had to resubmit the event due to communication failures, because the host system was too slow, etc.
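
To make that concrete, here is a toy sketch of the basic rule that produces the error. This is not klipper's actual code (the real check lives in the MCU firmware's C scheduler); it just illustrates that every queued event carries a scheduled time, and if the event is only processed after that time has passed, the MCU has no choice but to abort.

import time

class TimerTooClose(Exception):
    """Illustrative stand-in for the MCU's 'Timer too close' shutdown."""

def dispatch(scheduled_time, action):
    # The event is only valid if its scheduled time is still in the future
    # when it is processed; otherwise the MCU must abort.
    now = time.monotonic()
    if scheduled_time <= now:
        raise TimerTooClose("event scheduled for %.6f processed at %.6f"
                            % (scheduled_time, now))
    time.sleep(scheduled_time - now)
    action()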

Background

On my new Voron 2.4, I opted NOT to use the BTT CB1 that was provided; instead I used an Orange Pi CM4 w/ 4GB RAM that I had available, which is similar to the BTT CB2 but with twice the RAM. It's not quite as good as a full-blown Raspberry Pi CM4, but I thought it would be decent enough. Unfortunately, I have been plagued by the TTC.

My relevant Voron hardware is configured as follows:

  • Manta M8P as the main MCU and USB2CAN bridge. Termination resistor enabled on this board.
  • MOAR CAN as a CAN hub
    • AFC Lite v1.0, with CAN in and out data lines going all the way to the BoxTurtle, where the in and out lines are crimped together to keep the stub/spur as short as possible.
    • EBB SB2209 RP2040 toolhead board connected, using just the BTT CAN cable. Termination resistor enabled on this board.
      • Cartographer via CAN

Software wise I have the following extra modules:

  • gcode_shell_command
  • scanner (from cartographer)
  • Happy Hare
  • shaketune

My test print for triggering the TTC errors is my modified BoxTurtle "Cereal" print (w/ a hare on the bottom). This is a small print with ~60 color changes over the 100-minute print. Each color change performs a few "homing" movements (which will be important later in the analysis).

Analysis

In all the testing I've done, the stats that klipper tracks (bytes retransmitted and bytes invalid) have always looked pretty solid. From personal experience, if you have a high number of these, your issues are more likely related to your CANBUS wiring. So given that my numbers looked good, I don't think my issue is related to my CANBUS setup, and Kevin also agreed that my issue does not appear to be CANBUS related.

One thing Kevin did point out, though, is that garbage collection had run at the time the Timer Too Close error occurred. This was evident from the log line Reactor garbage collection: (3156.352535264, 3129.683607084, 3158.637710431), which records the event time at which each generation was last collected by the garbage collector. So these times are 3156 for generation 0, 3129 for generation 1, and 3158 for generation 2. This shows us that the generation 2 bucket was the most recently collected, and its event time is in the same window of time as the last messages in the buffer (e.g. Sent 99 3158.931549 3158.931355 13: seq: 1a, queue_step oid=5 interval=31572 count=634 add=0). From here I educated myself on Python's garbage collection and learned that generation 0 holds the newest objects and is collected most often; once collected, anything still remaining is promoted to generation 1, and eventually a threshold is hit, generation 1 gets collected, and its survivors are promoted to generation 2. Generation 2 is collected least often.
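
For reference, Python's gc module exposes all of these pieces directly; a quick sketch of the bits referenced above:

import gc

# The collector is generational: gen 0 holds the newest objects and is
# collected most often; survivors are promoted to gen 1 and then gen 2,
# which is collected least often (and is the most expensive to scan).
print(gc.get_threshold())   # defaults to (700, 10, 10)
print(gc.get_count())       # pending counts checked against those thresholds

# Collecting a generation also collects every younger generation below it.
collected = gc.collect(2)   # a full, gen 2 collection
print("unreachable objects collected:", collected)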

Then I started looking through many of my logs where I had triggered Timer Too Close and found that all of them followed the same scenario: a generation 2 garbage collection followed by the TTC. To debug further, I modified the klipper code around where it does manual garbage collection (the standard automatic Python gc is disabled, presumably to have more control over when gc is allowed to run) to log when each generation is collected and how long those collections take to complete.
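
A standalone sketch of the kind of instrumentation I added is below (the actual patch against klippy/reactor.py is in the comments further down):

import gc, logging, time

def timed_collect(generation):
    # Time a manual collection of the given generation and log how long it
    # blocked, since the pause duration is what matters to the reactor.
    start = time.monotonic()
    collected = gc.collect(generation)
    elapsed_ms = (time.monotonic() - start) * 1000.
    logging.info("gc gen %d: collected %d objects in %.2fms",
                 generation, collected, elapsed_ms)
    return elapsed_ms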

With those additional logs in place I was able to confirm that the TTC does NOT happen after EVERY gen 2 collection. BUT if gen 2 was collected during certain homing movements (again, my test print performs many filament changes, each with a few homing movements), then I was basically guaranteed to get a TTC. So ultimately it comes down to whether or not gen 2 garbage collection happens at the wrong time. The other thing Kevin mentioned is that homing typically requires much lower latency (<100ms) with the MCUs, which explains why gen 2 collection is problematic. With my additional logs, I observed that gen 2 collection was taking ~300ms. And garbage collection essentially pauses the rest of the app (even if you try to run it in a separate thread, because of the global interpreter lock (GIL)).

Grasping at straws, I decided to disable the [shaketune] module. Surprisingly, this allowed me to finally complete my first BoxTurtle cereal on this printer. Looking at my stats, gen 2 collection time had dropped to ~160ms. In addition, I noticed that the klippy process used ~140MB of RAM with [shaketune] enabled and only ~70MB without it. I'm not sure what the module is doing that doubles the memory usage and doubles the gen 2 collection time. While it is still possible to trigger the TTC if gen 2 collection happens at the wrong time, it is certainly less likely. So disabling this module is so far the biggest win against the TTC.

Where to go from here?

With the way the Python gc runs, there's no real way for us to delegate it to run in parallel with the rest of the application's timing. So what options do we have?

  • Try to optimize the problematic modules to reduce memory usage, and manage cleanup manually so garbage collection basically never needs to run. Klipper mainline code is pretty well optimized, but does still periodically require garbage collection. Honestly, this is probably a large task across klipper, scanner, Happy Hare, and shaketune.
  • Disable gen 2 collection during printing. This would likely mean memory usage slowly grows over the duration of the print. In a test where I completely disabled gen 2 collection, I saw around a 5MB increase in RAM over the duration of the 100 minute print.
  • Raise the thresholds so gc doesn't run as frequently. I did test raising the gen 0 threshold from 700 to 5000, which did help, but it also increased gen 0 collection time from ~1.5ms to ~18ms.
  • Perhaps gc can be scheduled to run between gcode commands? (A rough sketch follows this list.)
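
For that last option, here is an untested sketch of what a tiny klipper extras module could look like. It registers a made-up COLLECT_GARBAGE gcode command that a macro (e.g. around a filament change) or a slicer post-processor could call, so a full collection happens at a known-safe point instead of mid-homing. Only the config/gcode plumbing follows klipper's usual extras conventions; the module and command names are hypothetical.

# Hypothetical klippy/extras/gc_collect.py -- untested sketch
import gc

class GcCollect:
    def __init__(self, config):
        self.printer = config.get_printer()
        gcode = self.printer.lookup_object('gcode')
        # COLLECT_GARBAGE is a made-up command name for illustration
        gcode.register_command('COLLECT_GARBAGE', self.cmd_COLLECT_GARBAGE,
                               desc="Run a full (gen 2) garbage collection now")

    def cmd_COLLECT_GARBAGE(self, gcmd):
        collected = gc.collect(2)
        gcmd.respond_info("Collected %d objects" % (collected,))

def load_config(config):
    return GcCollect(config)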

Also, I just got a Raspberry Pi CM4, and I am curious to see whether garbage collection simply runs faster on the Raspberry Pi vs the Orange Pi. The memory speed of the Raspberry Pi is 3200MHz; I would guess the Orange Pi is probably 2400MHz.

ammmze commented Jan 24, 2025

So far, tests with gen 2 gc disabled have looked good. I haven't had a single TTC since disabling it. I currently have a print running all night with gen 2 gc completely disabled and about 300 color swaps. Next I plan to try something like the following (which I have not tested at all): the theory is to track the worst garbage collection time for each generation and then only run collection if there is enough time before the next timer. I don't yet know whether a 300ms gen 2 time would essentially prevent it from ever running; I'm not sure how much time there normally is between timers when the printer is basically idle. Will see tomorrow after the current print finishes.

diff --git a/klippy/reactor.py b/klippy/reactor.py
index 412d53edf..041ef6a64 100644
--- a/klippy/reactor.py
+++ b/klippy/reactor.py
@@ -114,6 +114,12 @@ class SelectReactor:
         self._g_dispatch = None
         self._greenlets = []
         self._all_greenlets = []
+        # track max time collecting each gc generation
+        # so we can only run collection if the next timer
+        # is further out than the worst known gc times.
+        # because gen 2 is known to be slower than others
+        # let's start it off with a fairly high max
+        self._gc_max = [0., 0., 0.3]
     def get_gc_stats(self):
         return tuple(self._last_gc_times)
     # Timers
@@ -138,15 +144,20 @@ class SelectReactor:
                 return 0.
             if self._check_gc:
                 gi = gc.get_count()
-                if gi[0] >= 700:
+                if gi[0] >= 700 and (eventtime + self._gc_max[0]) < self._next_timer:
                     # Reactor looks idle and gc is due - run it
                     gc_level = 0
-                    if gi[1] >= 10:
+                    if gi[1] >= 10 and (eventtime + self._gc_max[1]) < self._next_timer:
                         gc_level = 1
-                        if gi[2] >= 10:
+                        if gi[2] >= 10 and (eventtime + self._gc_max[2]) < self._next_timer:
                             gc_level = 2
                     self._last_gc_times[gc_level] = eventtime
-                    gc.collect(gc_level)
+                    start = time.time_ns() / 1_000_000_000
+                    logging.info("Running garbage collection at level %s. Counts %s", gc_level, gi)
+                    count = gc.collect(gc_level)
+                    end = time.time_ns() / 1_000_000_000
+                    self._gc_max[gc_level] = max(self._gc_max[gc_level], end-start)
+                    logging.info("Finished garbage collection of %s items in %.2fms", count, (end-start) * 1000)
                     return 0.
             return min(1., max(.001, self._next_timer - eventtime))
         self._next_timer = self.NEVER

ammmze commented Jan 24, 2025

Just tested the above code on a separate machine, and it looks like while idle the largest gap between eventtime and _next_timer is about 0.25s... so my default of 0.3s basically means gen 2 would never actually get collected.
