The timing issue with clips on Twitch is interesting.
Here is an example clip with the described issue: https://www.twitch.tv/patty/clip/ThoughtfulMoldyGarlicWutFace-BeJzyXTm1kfwGKPK
You will notice that the video starts at different times (compare the on-screen timer in the bottom left of the stream) on Firefox and Chrome/Chromium.
Preface: Video codecs usually don't save a "full" frame for every frame. Instead they mostly save only the changes, and only every now and then insert a "keyframe" or "I" frame that is a full frame, which the following frames base their changes on (these are called "B" and "P" frames).
From what I could tell, nothing dictates that an MPEG container must start with a keyframe.
But realistically, what is a decoder supposed to draw when it receives an intermediate (B/P) frame first? Green background with the changes on top? Looks shit. Most decoders will simply skip until the next keyframe.
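As a rough illustration of that "skip until the next keyframe" behaviour, here is a minimal Python sketch. The function name and the single-letter frame labels are made up for this example; a real decoder works on an actual bitstream, not on a list of letters.

```python
def drop_until_first_keyframe(frame_types):
    """Return the frames from the first keyframe ("I") onwards.

    Everything before the first keyframe is discarded, because those
    frames only describe changes to a picture that was never received.
    """
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I":
            return frame_types[index:]
    return []  # no keyframe at all: nothing can be drawn


# A stream that was cut in the middle of a group of frames:
print(drop_until_first_keyframe(["P", "B", "P", "I", "B", "P"]))
```

The three leading frames are thrown away here, which is exactly the loss a clip cut between keyframes has to avoid.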
Now, MPEG, among tons of other metadata, also stores timestamps for every single frame in its container.
These are used to get the decoding order and the presentation order right, as the two are not necessarily the same (a B frame may need a P frame that comes after it in time to be decoded first before the B frame itself becomes drawable).
They are also useful to synchronize different streams in a container, like audio and video.
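To make the difference between decoding order and presentation order concrete, here is a small Python sketch. The two kinds of timestamp are commonly called DTS (decode timestamp) and PTS (presentation timestamp); the `Frame` class and the tick values below are invented for the example and not taken from any real clip.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str  # "I", "P" or "B"
    dts: int   # decode timestamp: when the decoder needs the frame
    pts: int   # presentation timestamp: when the frame is shown


# Made-up timestamps in arbitrary ticks. The two B frames are shown
# between the I and the P frame, but need both of them decoded first.
FRAMES = [
    Frame("I", dts=0, pts=1),
    Frame("P", dts=1, pts=4),
    Frame("B", dts=2, pts=2),
    Frame("B", dts=3, pts=3),
]


def decode_order(frames):
    """Frame kinds in the order the decoder has to process them."""
    return "".join(f.kind for f in sorted(frames, key=lambda f: f.dts))


def presentation_order(frames):
    """Frame kinds in the order the viewer gets to see them."""
    return "".join(f.kind for f in sorted(frames, key=lambda f: f.pts))


print(decode_order(FRAMES))        # IPBB
print(presentation_order(FRAMES))  # IBBP
```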
Now comes the problem of creating clips. In this example, the video contains a keyframe every 2 seconds (or 60 frames). But the clip editor lets you be much more granular with your clip timing than that.
The most robust way to solve this would be to simply re-encode the video. Start blank from the drawn frame that the clip starts at and generate a new sequence of I/B/P frames following it.
This is likely what Twitch did before the changes. The big downside is the processing power it requires, and consequently also the processing time.
With the new changes, Twitch did something technically rather clever.
They take a carbon copy of the streamed video, starting at the closest keyframe before the clip's starting time, and cut it at the clip's end time.
As is, this would result in clips that "snap" to the closest previous keyframe.
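The "snap" is easy to compute if you assume perfectly regular keyframes. The sketch below assumes a keyframe at second 0 and then one every 2 seconds, as in the example clip; the function name is made up, and a real stream would need a lookup in the actual list of keyframe positions.

```python
import math

KEYFRAME_INTERVAL = 2.0  # seconds, as observed in the example clip


def previous_keyframe_time(clip_start, interval=KEYFRAME_INTERVAL):
    """Time of the closest keyframe at or before clip_start (seconds).

    Assumes keyframes sit exactly at 0, interval, 2 * interval, ...
    """
    return math.floor(clip_start / interval) * interval


# A clip that should start 7.5 s into the stream snaps back to 6.0 s.
print(previous_keyframe_time(7.5))
```

So a plain copy would start up to one full keyframe interval earlier than the clipper asked for.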
To fix that, they set the timestamps of all frames before the clip's starting time to negative times, effectively offsetting the whole clip into the past. They basically say "these frames should be (decoded &) presented in the past (from the point where you press play)".
This solves the keyframe issue (the keyframe is present and has been decoded) and still keeps the sub-keyframe granular timing, all without ever having to re-encode the video file. The only cost is a very slightly bigger video file than technically required. But all those extra frames are intermediate frames anyway.
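The timestamp shift itself is just a subtraction. Here is a minimal sketch, assuming timestamps in milliseconds and an unrealistically coarse one frame per 500 ms to keep the numbers readable. The function name is hypothetical, and how the container actually stores the offset is not something I looked into.

```python
def shift_timestamps(frame_times_ms, clip_start_ms):
    """Move the time origin so that the clip's chosen start becomes 0.

    Frames that lie before the chosen start end up with negative
    timestamps; nothing is dropped and nothing is re-encoded.
    """
    return [t - clip_start_ms for t in frame_times_ms]


# Copied from the keyframe at 6000 ms, clip should start at 7500 ms.
copied = [6000, 6500, 7000, 7500, 8000]
print(shift_timestamps(copied, 7500))  # [-1500, -1000, -500, 0, 500]
```

The first three frames are still in the file and still get decoded, they are just scheduled before the moment you press play.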
But alas we have inconsistent decoders ruining the fun:
Most decoders (Chrome/Chromium, a bunch of video players I tested) handle this gracefully by simply not showing the negative frames, but using them to properly draw the first visible, non-keyframe frame. This is arguably the most correct interpretation/implementation.
Firefox respects the order of the timestamps, but says "I care not for your 0" and simply starts at the very first (key?) frame in the container, which is technically present, but is temporally placed before the clipper's chosen starting time for the clip.
VLC renders the first frame it encounters, then realizes "wait this is in the past, skip ahead, quickly!" and produces artifacts while skipping (since the intermediate frames aren't decoded).
Vegas Pro 19 has a broken decoder implementation that has no clue what to do and somehow chooses to synchronize the start of the audio stream with the absolute 0 timestamp of the video stream (which is at ~1.5s relative to the container's start) and cuts off the negative video frames.
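The difference between the first two behaviours can be shown with a toy model. This is only a sketch of what I observed from the outside, not how either browser is actually implemented; the function names and frame data are made up, and VLC and Vegas are left out because their behaviour does not reduce to a one-liner.

```python
# (kind, presentation timestamp in ms), already shifted so that the
# clipper's chosen start is 0. The values are invented for the example.
FRAMES = [("I", -1500), ("P", -1000), ("P", -500), ("P", 0), ("P", 500)]


def first_shown_hiding_negative(frames):
    """Chrome/Chromium-like: decode everything, show only from 0 on."""
    return next(pts for _kind, pts in frames if pts >= 0)


def first_shown_from_container_start(frames):
    """Firefox-like: start showing at the very first frame present."""
    return frames[0][1]


print(first_shown_hiding_negative(FRAMES))       # 0
print(first_shown_from_container_start(FRAMES))  # -1500
```

In this model the second player starts 1.5 seconds "too early", which matches the offset you see when comparing the on-screen timer in both browsers.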
Bonus Points: The reason that Firefox has about 0.5 seconds muted at the start of the clip is that the audio track simply does not start until then. Since it is outside the clipping range anyway, Twitch didn't bother to include that much audio upfront.
TL;DR:
Cutting a video at an arbitrary point between keyframes normally requires re-encoding, which is computationally heavy.
Twitch uses timestamp trickery to copy-paste and cut at specific frames without re-encoding, by placing the part of the copy that precedes the clip's start into the temporal past.
Chrome/Chromium interpret this arguably correctly and start at temporal 0 (the earlier frames are decoded because they are required, but not shown because they are unwanted).
Firefox interprets this as-is and starts at frame 0 (which is in temporal past).
VLC is confused but gets there.
Vegas 19 is broken.