The web request is a scary place

Sometime in the middle of 2013 I was working on a project that began having server CPU spiking issues. After several failed attempts to decipher what was going on I began pulling my hair out (it's been falling out ever since). In the process of looking for reasons that might have caused this issue, I finally stumbled onto a great conference talk given at PyCon 2013: 'Messaging at Scale at Instagram'. Yeah, let that sink in a bit.

Instagram

Right, half the time you see someone under 30 years old on their phone they are hitting that site. Ok, maybe not… but seriously, they serve up to a million page-hits a day. So we should probably listen and learn from their experiences. I knew that they had a similar web stack to what we were using, so my ears perked up and I got really excited. The talk was given by Rick Branson, and it was a good one. I watched it several times, making notes along the way. Each time I watched I picked up an additional tidbit of information and did some digging on the topic. I wrote this entry so I would not lose the link to this presentation, which is absolutely jam-packed with information. This talk directly resulted in me completely changing the architecture of one of the projects I work on, and saving a bit of precious hair in the process.

This quote from the talk really resonated with me:

“The web request is a scary place, you want to get in and out as quick as you can” - Rick Branson

This resonated with me because, yes, it is scary… and at the time the web request was killing me and my server. For the life of me, I could not get out of the request fast enough. This quote became my mantra as I struggled with my project's scaling issues. I went after all of the typical software engineering low-hanging fruit: profiling specific areas of standalone code outside of the web request. Time it, make it faster, and time it again. Go after all of the slow dbms readers and writers and make them DRY and faster. I also used NewRelic to do the same in a production environment. It is a highly recommended tool; I consider it a way to verify your findings in production, because when going after performance you can spend a lot of time chasing the incorrect areas. After all of my efforts I was only able to eke out a bit more performance, allowing a few additional clients to connect to the server before sending it into a spiral of cpu-spinning death.
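For anyone wondering what I mean by "time it", nothing fancier than the standard library's timeit module is required. Here is a minimal sketch; slow_report_query is a hypothetical stand-in for whatever function you suspect is slow.

```python
import timeit

def slow_report_query():
    # hypothetical stand-in for the code path you suspect is slow
    return sum(i * i for i in range(100000))

# run it a fixed number of times, make a change, then run it again and compare
elapsed = timeit.timeit(slow_report_query, number=100)
print("100 runs took %.3f seconds" % elapsed)
```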

The basic architecture of the previous system was multiple REST clients connected to an Apache stack. Each client connection would cause many reads and writes on the dbms. The writes would fire additional database triggers, sending me into a spiral of doom.

[Diagram: asynch_tasks_one — the original synchronous architecture]

So after falling in love with the quote 'the web request is a scary place, you need to get in and out of there as quick as you can', I wrestled with the question:

how quick is that?

How long can I stay in the web request without causing scaling problems? Rick actually addresses this in his talk: he states you want to get out of that web request in 100 ms to 500 ms.
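One way to check whether you actually stay inside that 100 ms to 500 ms window is to wrap the WSGI application in a tiny timing middleware. This is just a sketch of that idea (not something from Rick's talk), and the 500 ms threshold is simply his upper bound:

```python
import time
import logging

logger = logging.getLogger("request_timing")

class TimingMiddleware(object):
    """WSGI middleware that logs any request taking longer than the threshold."""

    def __init__(self, app, threshold=0.5):
        self.app = app
        self.threshold = threshold

    def __call__(self, environ, start_response):
        start = time.time()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = time.time() - start
            if elapsed > self.threshold:
                logger.warning("slow request %s took %.3fs",
                               environ.get("PATH_INFO"), elapsed)

# application = TimingMiddleware(application)  # wrap your existing WSGI app
```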

So ok, great… but how do I do that?

Turns out I had several problems:

Before listening to Rick’s talk I was completely dialed into the poorly performing code part of the problem. I could still be working on that today and still have my cpu spiking issues under relatively low load.

I actually had a hunch that I had misconfigured mod_wsgi. I cover my apache/mod_wsgi configuration issues in another entry, ‘Stop miss-configuring Apache and mod_wsgi’, which includes lots of great links on the topic, including another solid-gold PyCon 2013 talk by Graham Dumpleton, the author of mod_wsgi.

After dealing with the first two issues I had a much more stable server. However, it still was not capable of scaling out.

Now for the juggernaut problem: the architecture issue.

Basically, what I learned from Rick’s talk was that I needed to build an asynchronous tasking/queuing system. This system would give me the ability to run long-running tasks such as transcoding. I could now pull my time-consuming logic out of the web request and execute it in a task, giving me the performance increase in the web request that I was after: get into the web request, hand the time-consuming logic off to a worker, and get out, letting the web stack service these requests in a timely manner.

Building an Asynchronous Tasking/Queuing System

Enter Celery

Rick describes it as one of the best Python projects he has seen.
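To make the hand-off concrete, here is roughly what it looks like with Celery. The task, the view, and the helper functions are hypothetical names of mine; the point is that the request only calls .delay() and returns, while the worker does the heavy lifting:

```python
# tasks.py -- the time consuming logic lives in a Celery task, not in the request
from celery import Celery

app = Celery("myproj")  # the broker URL gets configured in the next step

@app.task
def process_upload(upload_id):
    # hypothetical long-running work: transcoding, heavy dbms writes, etc.
    pass

# views.py -- the web request only enqueues the work and gets out
def upload_view(request):
    upload_id = save_raw_upload(request)   # hypothetical quick write
    process_upload.delay(upload_id)        # hand off to a worker, returns immediately
    return respond_accepted(upload_id)     # hypothetical "we got it" response
```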

Next: Choose an AMQP broker

Enter RabbitMQ

Celery basically steers you toward RabbitMQ. Once you choose it, you are going to need to learn to configure it; I highly recommend this book on the subject.
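Configuration is a topic of its own, but for a sense of scale, pointing Celery at a RabbitMQ broker is just a matter of a broker URL. The vhost, user and password below are placeholders, and the lowercase setting names assume a recent Celery release:

```python
# celeryconfig.py -- a minimal sketch of pointing Celery at RabbitMQ,
# loaded with app.config_from_object("celeryconfig")
broker_url = "amqp://myuser:mypassword@localhost:5672/myvhost"

# two settings worth reading up on for a task system like this one:
task_acks_late = True           # a task is re-queued if the worker dies mid-run
worker_prefetch_multiplier = 1  # keep a single worker from hoarding queued tasks
```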

Next: You will need to talk to this broker with Python

Enter Pika
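Celery talks to the broker on its own, but Pika is handy when you want to publish or consume messages directly. A minimal publish looks something like this; the queue name and message body are placeholders:

```python
import pika

# open a connection to the local RabbitMQ broker and publish one message
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="status_updates", durable=True)
channel.basic_publish(exchange="",
                      routing_key="status_updates",
                      body="job 42 finished")
connection.close()
```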

Next: Pick a caching solution

Enter Memcached

Picking Memcached was a bit of a shoo-in for us because it was already in the infrastructure. You should probably weigh the pros and cons of the options yourself.
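Talking to Memcached from Python is a one-liner affair whichever client you pick. This sketch uses pymemcache (python-memcached is another option); the key and the five-minute TTL are arbitrary placeholders:

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
cache.set("report:42", "rendered-report-body", expire=300)  # cache for 5 minutes
cached = cache.get("report:42")  # returns bytes, or None on a miss
```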

Now that the stack is roughed in, it's time to get down to the business of wiring all of this up. But before we do, let's ponder a few more quotes related to building a task management system.

“There is no such thing as running tasks exactly-once.” - Rick Branson

Just to point out that Instagram did not invent these ideas or concepts, let's lean on the pre-web history a bit.

“… it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly.”

Impossibility of Distributed Consensus with One Faulty Process
M. Fischer, N. Lynch, M. Paterson, 1985
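The practical upshot of both quotes is that a task has to tolerate being delivered more than once, so tasks should be written to be idempotent. Here is a sketch of that idea (my example, not Rick's); the order helpers are hypothetical, and the app object is the one from the Celery sketch above:

```python
@app.task(bind=True, max_retries=3)
def mark_paid(self, order_id):
    order = load_order(order_id)      # hypothetical dbms read
    if order.status == "paid":
        return                        # already done, so a duplicate delivery is a no-op
    charge_customer(order)            # hypothetical side effect we must not repeat
    order.status = "paid"
    save_order(order)                 # hypothetical dbms write
```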

So after all of that, here is what the architecture looks like now.

[Diagram: asynch_tasks_two — the new asynchronous architecture]

The ‘Broker’ and ‘Queues’ are part of the RabbitMQ setup, the ‘Workers’ come from Celery, and the new cache is an instance of Memcached.
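Wired together, the request path and the worker path end up looking roughly like this. Every name here is hypothetical, and cache and app are the objects from the earlier sketches; the shape is what matters: the request reads the cache or enqueues work, and the worker fills the cache:

```python
# web request: serve from the cache if we can, otherwise enqueue and get out
def report_view(request, report_id):
    cached = cache.get("report:%s" % report_id)
    if cached is not None:
        return respond_ok(cached)           # hypothetical "here it is" response
    build_report.delay(report_id)           # hand the heavy work to a Celery worker
    return respond_accepted(report_id)      # hypothetical "come back later" response

# worker: runs outside the web request and writes its result into Memcached
@app.task
def build_report(report_id):
    body = run_expensive_queries(report_id) # the slow dbms reads and writes
    cache.set("report:%s" % report_id, body, expire=300)
```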

For the final issue: don't run the dbms and web server on the same machine.

Just don’t do it! Under load you will create I/O blocking conditions that will cause your cpu to churn. See section 2, ‘Separate Database Server’, in Digital Ocean’s documentation.

For reference, my server spike issues are gone now and I sleep better at night. As an added benefit I can now scale horizontally quite nicely.

References:

As a follow-up to this story I would like to point out another scaling story: on ‘Black Friday the WalMart servers didn't go over 1% CPU utilisation and the team did a deploy with 200,000,000 users online.’ I am currently kicking the tires of node.js and may write a bit more about this in the future.