Hi Christian,
I hope all is well!
I’m reaching out because we noticed a spike in Apollo's request rate on December 2nd. From approximately 14:17 to 14:23 UTC, requests to /messages/inbox went up by around 35% before returning to baseline. We are hoping you could help us understand what might have happened.
The source IPs were in AWS us-west-2 (redacted, redacted, and redacted), and the requests carried the Apollo UA server:apollo-backend:v1.0 (by /u/iamthatis) contact [email protected]
Thank you!
On it, will have an answer for you shortly.
My partner on the server side of things is at his day job at the moment and will check it out when he’s home. I’ll CC him in.
Thanks again!
Hi all,
I'm looking at the graphs and I'm noticing a spike in error responses from Reddit on the 2nd that coincides with that UTC time window.
Was this Reddit trying to rate limit us? Or was there an outage?
Hi André,
This is what I received before: During that time, they would have received back a bunch of 5xx’s because the spike in traffic was more than we could handle that fast.
Does that help?
Probably not the answer you want to hear but I'm... not sure?
Looking at the graphs here. It all seems "normal" (other than the spike in error responses.)
What I can tell you is that we have mechanisms in place to avoid hammering Reddit wherever we can. We have locks to ensure we don't unleash a thundering herd if response times ever grow on your end, and we only fire off jobs within a pre-determined period of time too.
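For illustration only, here is a minimal Python sketch of the kind of safeguard described above: a cap on concurrent upstream calls plus a pause when responses get slow. This is not Apollo's actual code. The class name UpstreamGuard and the parameters max_in_flight, slow_after, and cooldown are hypothetical and chosen just for the example.

```python
import threading
import time


class UpstreamGuard:
    """Hypothetical guard: caps concurrent calls and pauses when the upstream is slow."""

    def __init__(self, max_in_flight=4, slow_after=2.0, cooldown=30.0,
                 clock=time.monotonic):
        # All names and default values here are assumptions for the sketch.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._state = threading.Lock()
        self._slow_after = slow_after    # seconds before a call counts as slow
        self._cooldown = cooldown        # seconds to pause after a slow call
        self._clock = clock
        self._paused_until = 0.0

    def call(self, fn):
        """Run fn() unless paused or at capacity. Returns (ran, result)."""
        with self._state:
            if self._clock() < self._paused_until:
                return False, None       # still cooling down, skip this call
        if not self._slots.acquire(blocking=False):
            return False, None           # too many calls already in flight
        try:
            started = self._clock()
            result = fn()
            if self._clock() - started > self._slow_after:
                # The upstream looks slow, so back off instead of piling on.
                with self._state:
                    self._paused_until = self._clock() + self._cooldown
            return True, result
        finally:
            self._slots.release()
```

A caller would wrap each outbound request in guard.call(...) and simply retry later whenever it gets back (False, None).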
This one is a real head scratcher, but I'd love to get to the bottom of it. Would it help if I hopped on a call with one of your SREs to dive a bit deeper? I'd like to figure out what I did wrong so I can rectify it.
Actually I see it now.
Let me do some digging on my end to figure out what happened. This is bizarre. Is there a deadline by which you need a definitive answer?
Never mind, I got it now.
From what I gather, we had a huge influx of new accounts (via API calls) starting at that time. Effectively, every time an account gets UPSERT'ed, we check /message/inbox to grab the last message ID the user has in order to pre-populate that on our end.
Typically this isn't a huge issue, because we don't see spikes like these, but I'm going to dig more into why this happened, and most importantly, ways to prevent it from happening again.
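One possible way to prevent a repeat, sketched in Python under stated assumptions, is to queue the inbox lookup on each upsert and release lookups through a token bucket, so a burst of new accounts turns into a steady trickle of requests. This is not the fix Apollo actually shipped. The names InboxCheckThrottle, on_upsert, and drain, and the rate and burst values, are hypothetical.

```python
import time
from collections import deque


class InboxCheckThrottle:
    """Hypothetical token bucket that smooths inbox lookups triggered by upserts."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        # rate_per_sec and burst are assumed tuning knobs, not known values.
        self._rate = rate_per_sec
        self._burst = burst
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()
        self._pending = deque()

    def _refill(self):
        # Add tokens for the time elapsed, never exceeding the burst size.
        now = self._clock()
        self._tokens = min(self._burst,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now

    def on_upsert(self, account_id):
        """Queue the lookup instead of calling the API inline."""
        self._pending.append(account_id)

    def drain(self):
        """Return the account IDs whose inbox may be checked right now."""
        self._refill()
        ready = []
        while self._pending and self._tokens >= 1.0:
            self._tokens -= 1.0
            ready.append(self._pending.popleft())
        return ready
```

A background worker would call drain() on a timer and issue one inbox request per returned account ID, so the request rate stays bounded no matter how many accounts are upserted at once.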
Super sorry about this. I owe you (and your engineers) a beer.
Hi André,
I appreciate you investigating it so quickly. I've passed along the information, and I'll let you know if our team has any questions or suggestions.
Thanks!
😭