Downtime 7/25/2014

As we are a service provider, I want to try and be as transparent as possible when we have outages.

Our Slow Response
Earlier today we had sporadic outages from ~9:30 AM EST to 1:30 PM EST, which our internal monitoring didn’t pick up. Our internal monitoring watches the physical servers and the processes running on them, not whether the actual site is accessible. As a result we were not notified of the downtime, because nothing was technically “down” according to the status checker.

We also have external monitoring that we’ve only recently begun to use (it was only sending emails, nothing more), and it tells a different story. It monitors whether or not the actual site is online, which fills a gap in our internal monitoring for cases like this one. It currently reports a total of just under 4 hours of downtime today. Ouch.
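
To illustrate the difference between the two kinds of monitoring, here is a minimal sketch of an "is the site actually reachable?" check, written in Python. It is not our monitoring code: the function name `site_is_up`, the 5 second timeout, and the injectable `fetch` parameter are all assumptions made for the example.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Return True only if the URL answers with a non-error HTTP status.

    Hypothetical helper for illustration. `fetch` defaults to
    urllib.request.urlopen and can be swapped out for testing.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            # urlopen raises HTTPError for 4xx/5xx responses, so reaching
            # this line normally means a 2xx response (redirects are followed).
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

A process-level check can keep reporting that everything is running while a check like this one fails, which is the situation we were in today.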

As a result of this downtime we’ve already changed notification preferences on the external monitoring until we can get some webhooks integrated into our existing monitoring system.

What Went Wrong
We utilize(d) an open source project called Faye for connecting desktop applications to Streamtip for real-time alerts. Earlier this week one of our updates changed Streamtip’s backend to run as multiple processes, load balanced by our front-end web server, nginx. This was great for performance and drastically increased the amount of traffic we can handle… until we noticed 2 days ago that Faye was becoming unstable and more resource intensive. After looking into why, it became clear that, because of how Faye manages clients in the Redis-backed database it uses, it would not be a suitable long-term solution for sending real-time alerts. Unfortunately, because of the large number of people already using Faye, we decided to let it keep running and phase it out as soon as possible.

That turned out to be a bad decision, as multiple times today the web services were swamped with Faye requests. There are multiple reasons for this, as there are problems in both the server-side and client-side code Faye provides. On the server side, Faye was losing track of client IDs very frequently, causing it to disconnect the client. On the client side, Faye reconnects after a bad client ID and gets a new client ID, but the client-side code also continues to poll the server with the old client ID. In essence, the clients were creating an unintentional DDoS attack.
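
To show why this adds up so quickly, here is a toy model of the behavior in Python. It is only a sketch and not Faye's code: the function name `simulate_polling` is made up, and it assumes the worst case where the server forgets every client's current ID once per round.

```python
def simulate_polling(clients, rounds, buggy):
    """Return the number of poll requests the server sees in each round.

    Toy model (assumption): every round the server loses track of each
    client's current ID, so each client reconnects and gets a new one.
    A buggy client also keeps polling with every old ID it ever had.
    """
    ids_per_client = [1] * clients  # how many IDs each client polls with
    requests_per_round = []
    for _ in range(rounds):
        # One poll request per ID that a client is still using.
        requests_per_round.append(sum(ids_per_client))
        if buggy:
            # New ID acquired, old IDs never dropped.
            ids_per_client = [count + 1 for count in ids_per_client]
        # A well-behaved client swaps old for new, so its count stays at 1.
    return requests_per_round


print(simulate_polling(100, 3, buggy=True))   # [100, 200, 300]
print(simulate_polling(100, 3, buggy=False))  # [100, 100, 100]
```

In this model the request volume grows with every lost client ID, even though the number of real clients never changes.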

Due to the server load we’ve had to disable Faye. Earlier today at around 2 AM EST a new Streamtip Alerter application was posted on the downloads page that uses a new real-time alerter backend (actually the same one that we used to use for Stream Donations, Socket.io, though we’re now on their drastically changed 1.0 version).

We feel pretty bad about having to cut off Faye on short notice, but it was something that had to be done. We do not have updated documentation at this time for connecting via Socket.io. Due to Socket.io’s drastic changes in version 1.0, it really isn’t suitable for any application not built in JavaScript. We are evaluating the possibility of a raw WebSocket API to accompany Socket.io, to make it easier for developers to integrate Streamtip into applications written in other programming languages.

Is it still down? I went to change my donation song and tried the &preview=true, also tried sending a test donation, and nothing will go through. I tried re-installing the CLR Browser, but can’t figure out how to uninstall it in the first place. It would be great to know that this isn’t something based on my doing. Please let me know!

Thanks,
Steph

The tip alerter was updated a few minutes ago to support the new backend. You will need to refresh the source in OBS or the page if you use Chroma key.
