Decommissioning a free public API
Frederic Cambus November 20, 2015 [Miscellaneous]
This is a sequel to my "Adventures in running a free public API" post, which you should read for some background information about the reasons behind terminating the free public instance of Telize.
I stopped the service as planned, on November 15th, after two weeks' notice. I should probably mention beforehand that things were more complicated than they could have been, due to poor initial planning on my part. When I launched the public instance, it was meant as a way to demonstrate the open source project, and to be honest I didn't really anticipate the amount of traffic it would receive. So I didn't bother configuring a subdomain to host the endpoints, and hosted them on the main domain. If I had used a subdomain, I could simply have removed its DNS records, basically sink-holing it, and been done with it. As the HTML pages needed to stay accessible, my only option was to block the endpoints by returning an HTTP error code, so I configured Nginx to return a 403 Forbidden HTTP status code. This should have been it, right?
In an ideal world, maybe. In practice, it triggered a huge amount of retries on failed requests from poorly coded scripts, effectively creating a DDoS cannon and causing CPU and I/O usage (both network and disk, think of logging) to skyrocket. Now, on a busy day, Telize received more than 130M requests, coming from 10M unique IP addresses, as one can see in this report generated by Logswan. We were now talking about almost 800,000 requests per minute, as shown in this other report generated from sampling one minute of traffic. At this point I had to switch to using Nginx's 444 No Response HTTP status code and just close the connection in order to save bandwidth.
On the day following the API termination, I received more than 1TB of incoming requests, which represents a huge amount of traffic, and of course bandwidth costs money. I can perfectly handle this kind of traffic from a technical standpoint, but from a financial one, there is simply no way I can sustain it, and it doesn't make sense for me to keep paying for data transfer overcharges. So I had to move the site in an emergency to a static hosting platform (GitHub Pages in this case), which then returned a 404 error for all three endpoints: 'ip', 'jsonip', and 'geoip'. The last one being by far the most used, it probably triggered some alert mechanism and caused the appearance of a Varnish rule simply closing the connection on this endpoint, returning an empty response.
The situation was now under control, and I was finally able to relax; this was Sunday evening, and I decided to call it a day and get some much needed sleep. I'm extremely grateful to GitHub for saving the day, and in fact decided to subscribe to a paid plan as a way to show my appreciation. So, problem solved?
Not so fast! The next morning (on Monday), I was in for a surprise as my mailbox started to fill up with inquiries regarding the termination of Telize. Basically, some code was crashing at random locations because people relying on a free service for their businesses were not careful enough to implement correct error checking. One of those mails came from a set-top box manufacturer, stating that thousands of customers were unable to watch TV because their boxes crashed when Telize didn't return any data, and demanding that I return an empty JSON object for a two-week period. I answered as fast as I could, explaining the situation and that I had no control over this, as the endpoint had a Varnish rule closing the connection without sending any data. Realistically, had I chosen to fulfill the request, I would have had to go back to handling the traffic myself on my own servers. It didn't end there, as the person wouldn't take no for an answer, and came up with the brilliant idea of asking me to redirect the endpoint to a server they would host themselves, this time not just to serve empty data, but to restore service entirely. I had to re-read the mail a couple of times to ensure my brain was not tricking me. On what grounds would I allow a third party to serve content on one of my domains? How on earth did I dare not to act on this unreasonable request in a timely manner, warranting further annoyance both by mail and on Twitter, lasting for two days?
So basically, what's the moral here? Some people just expect you to invest your own time and money to solve their problem, and to do it straight away, when it's convenient for them, and of course without being compensated for it. No matter that they used a free service to begin with, without giving any notice beforehand, or that you have a day job and other commitments. I would not have it so.
Hopefully, this whole story will at least teach some people that relying on a free service you have no control over means adding a single point of failure run on a volunteer basis. It's perfectly fine for a side project, but for a business? I would think at least twice before making this kind of decision.
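For what it's worth, here is a minimal sketch of what a better-behaved client could look like, in Python. It is not taken from any real Telize client: the `lookup_or_default` function, the `Fetcher` type, and all parameter names are hypothetical, and the code assumes the caller wraps its HTTP library of choice in a function returning a status code and a body. The idea is simply to stop immediately on 4xx answers such as 403 or 404, to retry dropped connections only a few times with increasing delays, and to fall back to a default value instead of crashing.

```python
import json
import time
from typing import Callable, Optional, Tuple

# Hypothetical type: a fetcher performs one request and returns
# (HTTP status code, response body). It may raise OSError on network failure.
Fetcher = Callable[[], Tuple[int, str]]


def lookup_or_default(
    fetch: Fetcher,
    default: dict,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Return the JSON object from `fetch`, or `default` if the service is unusable."""
    for attempt in range(max_attempts):
        status: Optional[int]
        try:
            status, body = fetch()
        except OSError:
            # Connection refused or closed without a reply: may be transient.
            status, body = None, ""

        if status is not None and 400 <= status < 500:
            # In this sketch any 4xx (403, 404...) is treated as permanent:
            # retrying would only add load on the server.
            return default

        if status == 200:
            try:
                data = json.loads(body)
            except ValueError:
                # Empty or malformed body: do not crash, use the fallback.
                return default
            return data if isinstance(data, dict) else default

        # Anything else (5xx, dropped connection): back off exponentially,
        # but only if another attempt is left.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)

    return default
```

With the defaults above, a client hitting a closed connection gives up after three attempts, having waited one and then two seconds, and carries on with whatever `default` it was given. The design choice that matters is that the third-party lookup is treated as optional: the calling code always gets a dictionary back, whether the service is up, gone, or returning nothing at all.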