Sometime in the middle of 2013 I was working on a project that began having server CPU spiking issues. After several failed attempts to decipher what was going on I began pulling my hair out; it's been falling out ever since. While digging for possible causes of the issue… finally… I stumbled onto a great conference talk given at PyCon 2013. The title of the talk was ‘Messaging at Scale at Instagram‘. Yeah, let that sink in a bit.
… Instagram …
Right, half the time you see someone under 30 on their phone they are hitting that site. Ok, maybe not… but seriously, they serve up to a million page-hits a day. So we should probably listen and learn from their experiences. I knew they had a web stack similar to what we were using, so my ears perked up and I got really excited. The talk was given by Rick Branson, and it was a good one. I watched it several times, making notes along the way. Each time I watched I picked up an additional tidbit of information and did some digging on the topic. I wrote this entry so I would not lose the link to this presentation, which is absolutely jam-packed with information. This talk directly led to me completely changing the architecture of one of the projects I work on, saving a bit of precious hair in the process.
This quote from the talk really resonated with me:
“The web request is a scary place, you want to get in and out as quick as you can” - Rick Branson
This resonated with me because yeah… it is scary… and at the time the web request was killing me, and my server. For the life of me, I could not get out of the request fast enough. This quote became my mantra as I struggled with my project's scaling issues. I went after all of the typical software-engineering low-hanging fruit: profiling specific areas of standalone code, outside of the web request. Time it, make it faster, and time it again. Go after all of the slow dbms reads and writes and make them DRY and faster. I also used NewRelic to do the same in a production environment. It is a highly recommended tool; I consider it a way to verify your findings in production. When going after performance you can spend a lot of time chasing the wrong areas. After all of my efforts I was only able to eke out a bit more performance, allowing a few additional clients to connect to the server before sending it into a spiral of cpu-spinning death.
The basic architecture of the previous system was multiple REST clients connected to an apache stack. Each client connection would cause many reads and writes of the dbms. The writes would cause additional database triggers to fire, thus sending me into a spiral of doom.
So after falling in love with the quote ‘the web request is a scary place, you need to get in and out of there as quick as you can’, I wrestled with the question:
how quick is that?
How long can I stay in the web request without causing scaling-out problems? Rick actually addresses this in his talk: he states you want to get out of that web request in 100 ms to 500 ms.
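To keep myself honest about that budget, something like this little timing decorator can flag handlers that run long. This is just an illustrative sketch (the names and the 500 ms cutoff are mine, picked from the top end of the range in the talk):

```python
import time
import warnings

# Illustrative budget: the top of the 100-500 ms range from the talk.
REQUEST_BUDGET_SECONDS = 0.5

def enforce_budget(handler):
    """Wrap a request handler and warn when it blows the time budget."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        result = handler(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if elapsed > REQUEST_BUDGET_SECONDS:
            warnings.warn(f"{handler.__name__} took {elapsed * 1000:.0f} ms")
        return result
    return wrapped

@enforce_budget
def fast_view():
    return "ok"   # well inside the budget, so no warning is raised
```

Wrapping the suspect views this way made it obvious, in the logs, which requests were the ones overstaying their welcome.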
So ok, great… but how do I do that?
Turns out I had several problems:
- apache/mod_wsgi configuration issues
- poor performing code && dbms queries
- architecture issue
- running dbms/web server on the same machine
Before listening to Rick’s talk I was completely dialed into the poor-performing-code part of the problem. I could still be working on that today, and still be having cpu spikes under relatively low load.
I actually had a hunch that I had misconfigured mod_wsgi. I cover my apache/mod_wsgi configuration issues in another entry, ‘ Stop miss-configuring Apache and mod_wsgi ‘, which includes lots of great links on the topic, including another solid-gold PyCon 2013 talk by Graham Dumpleton, the author of mod_wsgi.
After dealing with the first two issues I had a much more stable server. However, it still was not capable of scaling out.
Now for the Juggernaut problem: Architecture Issue
Basically, what I learned from Rick’s talk was that I needed to build an asynchronous tasking/queuing system. This system would give me the ability to run ‘long running tasks’ such as transcoding. I could now pull my time-consuming logic out of the web request and execute it in a task, giving me the ample performance increase in the web request that I was after. Get into the web request, hand off the time-consuming logic to a worker, and get out, letting the web stack service these requests in a timely manner.
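The hand-off pattern looks roughly like this. Here is a toy sketch using a plain `queue.Queue` and a thread standing in for the real broker and worker, just to show the shape of it (the names and the doubling "work" are made up):

```python
import queue
import threading

# Toy stand-ins for a broker + worker: the request handler only enqueues
# work and returns immediately; a background worker does the slow part.
tasks = queue.Queue()
results = []

def worker():
    while True:
        job = tasks.get()
        if job is None:          # sentinel: shut the worker down
            break
        results.append(job * 2)  # pretend this is slow transcoding
        tasks.task_done()

def handle_request(payload):
    tasks.put(payload)           # hand off the heavy work...
    return "accepted"            # ...and get out of the request

t = threading.Thread(target=worker)
t.start()
print(handle_request(21))        # → accepted
tasks.join()                     # wait for the worker to finish the job
tasks.put(None)
t.join()
print(results)                   # → [42]
```

In the real system the queue is RabbitMQ and the worker is a Celery worker process on another machine, but the request-side contract is the same: enqueue and return.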
Building an Asynchronous Tasking/Queuing System
Celery basically steers you toward RabbitMQ. Once you choose it, you are going to need to learn to configure it; I highly recommend this book on the subject.
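Wiring Celery to RabbitMQ comes down to a broker URL. A minimal sketch, assuming RabbitMQ is running locally with its default guest credentials (the project name `myproject` and the `transcode` task are hypothetical):

```python
from celery import Celery

app = Celery(
    "myproject",                                   # hypothetical project name
    broker="amqp://guest:guest@localhost:5672//",  # RabbitMQ as the broker
)

@app.task
def transcode(media_id):
    # the slow work that used to live inside the web request
    ...
```

From inside the web request you would call `transcode.delay(media_id)`, which returns as soon as the message is queued, and a worker started with `celery -A myproject worker` picks it up.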
Next: you will need to talk to this broker with Python.
The ‘Broker’ and ‘Queues’ are part of the RabbitMQ setup. The ‘Workers’ come from Celery, and the new cache is an instance of Memcached.
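To make the flow between those pieces concrete, here is a toy sketch with a plain dict standing in for Memcached and a direct function call standing in for the queue/worker hop (all names are made up):

```python
# Toy model of the moving parts: a "cache" the worker fills and the
# web request reads, so the request never waits on the slow work.
cache = {}   # stand-in for Memcached

def slow_task(key, value):
    # in the real system this runs on a Celery worker,
    # fed jobs from a RabbitMQ queue
    cache[key] = value.upper()   # pretend this is expensive

def handle_request(key):
    # the web request only does a cheap cache lookup
    return cache.get(key, "pending")

print(handle_request("greeting"))   # → pending (worker has not run yet)
slow_task("greeting", "hello")      # worker processes the queued job
print(handle_request("greeting"))   # → HELLO
```

The point of the cache is that the web request only ever does a fast read; the workers are the only ones paying for the slow computation.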
For the final issue: don’t run the dbms and web server on the same machine.
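In Django terms this just means pointing the DATABASES setting at a separate host instead of localhost. A sketch, assuming a PostgreSQL backend (the host and database names are made up):

```python
# settings.py sketch: point Django at a dbms on its own machine.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",  # assumed backend
        "NAME": "myapp",                # hypothetical database name
        "HOST": "db.internal.example",  # separate database host, not localhost
        "PORT": "5432",
    }
}
```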
For reference, my server spike issues are gone now and I sleep better at night. As an added benefit, I can now scale horizontally quite nicely.
- Stop making apache suck
- Scaling Python Django application with apache and mod_wsgi
- Common Setups for your web application
- Django performance patterns
- does django scale
- AMQP ‘Advanced Message Queuing Protocol’
As a follow-up to this story I would like to point out a related scaling story: ‘ Black Friday the WalMart servers didn’t go over 1% CPU utilisation and the team did a deploy with 200,000,000 users online. ‘ I am currently kicking the tires on node.js and may write a bit more about this in the future.