As usage grew from just a few thousand activities per day to hundreds of millions, Instagram needed a reliable, scalable messaging infrastructure to distribute work and messages. In this talk, I'll move from a crash course in the abstract concepts of queueing to the implementation details and hard-earned know-how that come from building massive-scale Python-based systems.
It all started with Agner Krarup Erlang in the early 20th century, when telephone networks were extremely expensive to operate and notoriously unreliable because the people building them had very little idea how to plan their construction. Some exchanges had far more capacity than they needed, while others barely had enough to service calls during the wee hours of the night. Erlang dove through the manholes below the streets of Copenhagen, taking measurements and developing his own brand of mathematics, using science to solve the problems that plagued callers. Thus queueing theory was born.
Fast forward a hundred years, and queues are now a part of everyday life, and not just on the road, at the grocery store, or at the Department of Bureaucracy. In computer systems, queueing allows us to smooth out capacity issues, handle failures gracefully, and provide higher levels of availability.
At Instagram, messaging is essential to our most basic functionality. Feed distribution, social sharing, spam and malicious user detection, search indexing, and more depend on our core messaging infrastructure to distribute work out to hundreds of workers.
The very first incarnation of messaging was based on Gearman for asynchronous execution of jobs, and was used to perform fan-out distribution of new photos across the user graph. Eventually different kinds of tasks were added to Gearman, but it lacked a solid framework, and had operational issues around performance & availability. As traffic grew from a few jobs per minute to thousands of tasks every second, and engineers needed to be able to quickly deploy new types of asynchronous tasks, a new solution was needed.
Thus, enter Celery and RabbitMQ. Celery provides a comprehensive Python-based framework for building an infrastructure of reliable asynchronous task workers. The yin to its yang, RabbitMQ is an AMQP message broker that provides a reliable, highly available fabric for ensuring delivery of messages to workers.
So we'll just port everything to Celery, run a bunch of worker processes, and throw up a RabbitMQ cluster, right? Not so easy, unfortunately. Performance issues were had, bugs were encountered, and a decent understanding of the concepts around queueing was learned.
Of course, the devil is in the details. How many worker processes should be run? Should the process, threading, or evented concurrency model be used? How much concurrency can be sustained? Should disk-based persistence of messages be used, or is in-memory OK? How can very long-running tasks be kept from clogging up the pipes? How can the queues and the messages flowing through them be monitored? How do we scale the broker infrastructure for future growth?
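As a rough illustration of how queueing theory bears on the "how many workers" question, here is a minimal sketch using the classic Erlang C model. It assumes Poisson arrivals and exponentially distributed task durations (an M/M/c queue), which real task workloads may not follow. The function names and the example numbers are hypothetical and are not taken from Instagram's systems.

```python
import math


def offered_load(arrival_rate, mean_service_time):
    """Offered load in erlangs: the average number of busy workers needed."""
    return arrival_rate * mean_service_time


def erlang_c(workers, load):
    """Probability that an arriving task must wait in an M/M/c queue."""
    if workers <= load:
        # The queue is unstable: every task eventually waits.
        return 1.0
    # Build the Erlang B blocking probability with the standard recursion.
    blocking = 1.0
    for k in range(1, workers + 1):
        blocking = load * blocking / (k + load * blocking)
    utilization = load / workers
    # Convert Erlang B to Erlang C.
    return blocking / (1.0 - utilization * (1.0 - blocking))


def min_workers(load, max_wait_probability):
    """Smallest worker count whose wait probability meets the target."""
    workers = math.floor(load) + 1
    while erlang_c(workers, load) > max_wait_probability:
        workers += 1
    return workers


# Hypothetical workload: 1000 tasks per second, 50 ms per task.
load = offered_load(1000, 0.05)
needed = min_workers(load, 0.1)
print(f"offered load: {load:.1f} erlangs, workers needed: {needed}")
```

The offered load only gives a floor on the worker count. The extra headroom above it is what keeps tasks from piling up during bursts, and a model like this is one way to estimate how much headroom a given wait target requires.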
Is this stack always the right solution, though? It's great when you need a quick way to develop a very reliable asynchronous task, but its overhead is harder to justify when delivery guarantees can be relaxed. We've also developed alternative systems for delivering a much higher volume of messages to external systems with best-effort guarantees.
Further, how do we abstract these questions to make it easy for developers to make the right choices?
I. Concepts
II. Workers at Instagram
III. Alternatives