The juggernaut of cloud computing has taken enterprise computing by storm. Public clouds have caught the imagination of developers, CIOs and CFOs alike, and rightfully so. But public clouds ain’t necessarily cheap at scale; neither is running your own datacenters. I am very excited to share our story of grappling with these questions and launching Snapdeal Cirrus, our new hybrid cloud, as the future of our computing platform. It has been an amazingly exciting journey and we have learned tremendously from this experience. Looking back, we sometimes wonder what we were thinking when we took up an ambitious project like this 🙂
For those who may not know about us, Snapdeal is India’s largest e-commerce marketplace. With millions of users and more than 300,000 sellers, Snapdeal is the shopping destination for Internet users across India, delivering to 6000+ cities and towns. We have been growing very aggressively and adding new capabilities to our e-commerce platform at an unprecedented speed.
Mid last year, we started pondering the idea of building a private cloud, mainly because our public cloud bills had become quite significant at our scale, large enough to fund a small startup. On top of that, we wanted more performance from our infrastructure and saw data security and sovereignty becoming a big factor in our payments business.
After further research, it was evident to us that public clouds stop being cost-effective beyond a certain scale, especially when the workloads are relatively constant. But building and operating a private datacenter has its own challenges. The key was to build a private cloud with the efficiencies and operating model of a public cloud, using entirely open source and homegrown solutions. In other words, the private cloud had to be elastic, scalable and completely automated to build, manage and use.
Meet Snapdeal Cirrus
Cirrus is a true-blue hybrid cloud. It is built from the ground up on the principle that applications running on it need to be abstracted from the underlying infrastructure, so they can be dynamically assigned and reassigned to run in different parts of the cloud. The whole infrastructure is built and managed as code. It spans 3 datacenter regions connected by redundant, high-speed, secured leased lines. Each datacenter has a highly dense, energy-efficient rack architecture with a core density of over 3500 cores.
We have built over 16 PB of object and block storage, using open source software on commodity off-the-shelf servers with a large number of disks. There are four types of storage available to the applications: a multi-PB magnetic and SSD storage cluster using Ceph, direct-attached SSD storage for applications that need raw disks like Aerospike, and a data lake for big data analytics and data warehousing. Each storage tier is horizontally scalable and offers a different level of QoS, performance and redundancy.
For networking, we used a Clos architecture with a two-tier 100G spine-and-leaf fabric. Keeping multi-tenancy, scalability and performance needs in mind, we adopted VXLAN for scale, with MP-BGP, ECMP and anycast routing for traffic engineering. This allows servers to move freely anywhere within the data center, utilizing the complete functionality of a true SDN network. The fabric is built with fully redundant ToR switches and segregates the data center into PODs, each of which marks a failure domain in which every component is placed with redundancy in mind. We have implemented QoS and rate-limiting on the virtual interface ports of the tenant instances using Open vSwitch.
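To make the ECMP idea concrete, here is a minimal sketch of how a leaf switch might spread flows across equal-cost spine uplinks while keeping every packet of a given flow on the same path. This is an illustration only, not our switch configuration: real switches do this in hardware with their own hash functions, and the names `FlowKey` and `ecmp_next_hop` as well as the spine names are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

# Hypothetical flow key: (src_ip, dst_ip, protocol, src_port, dst_port).
FlowKey = Tuple[str, str, int, int, int]


def ecmp_next_hop(flow: FlowKey, next_hops: Sequence[str]) -> str:
    """Pick one of several equal-cost next hops for a flow.

    The choice depends only on the flow's 5-tuple, so all packets of
    one flow follow the same path (no reordering), while different
    flows are spread across the available paths.
    """
    if not next_hops:
        raise ValueError("at least one next hop is required")
    # SHA-256 is used here only because it is deterministic across runs;
    # it stands in for whatever hash the switch hardware implements.
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    spines = ["spine-1", "spine-2", "spine-3", "spine-4"]  # assumed names
    flow = ("10.0.1.5", "10.0.9.7", 6, 49152, 443)
    print(ecmp_next_hop(flow, spines))
```

Hashing on the 5-tuple rather than picking a path per packet is the usual trade-off: it gives up perfectly even load in exchange for in-order delivery within each flow.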
The compute cloud is made up of over 100,000 cores, managed by OpenStack Nova. We have modified the OpenStack scheduler to be host-, rack- and pod-aware so that it can enforce anti-affinity in VM placement for clustered applications. Our OpenStack control plane is also fully redundant and is able to tolerate multiple degrees of failure.
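The placement idea can be sketched in a few lines. The snippet below is not our Nova scheduler patch and does not use any OpenStack API; it is a standalone illustration, with hypothetical names (`Host`, `pick_host`), of how topology-aware anti-affinity might rank candidate hosts for the next member of a clustered application: never reuse a host, and prefer a pod, then a rack, that the cluster is not already using.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    name: str
    rack: str
    pod: str
    free_cores: int


def pick_host(hosts: List[Host], placed: List[Host],
              cores_needed: int) -> Optional[Host]:
    """Choose a host for the next VM of a clustered application.

    `placed` holds the hosts already running members of the same cluster.
    Host-level anti-affinity is a hard rule; pod- and rack-level
    anti-affinity are preferences, applied in that order.
    """
    used_hosts = {h.name for h in placed}
    used_pods = {h.pod for h in placed}
    used_racks = {(h.pod, h.rack) for h in placed}

    candidates = [
        h for h in hosts
        if h.free_cores >= cores_needed and h.name not in used_hosts
    ]
    if not candidates:
        return None

    def score(h: Host):
        # False sorts before True, so unused pods and racks win;
        # ties go to the host with the most free cores, then by name.
        return (h.pod in used_pods, (h.pod, h.rack) in used_racks,
                -h.free_cores, h.name)

    return min(candidates, key=score)
```

Treating rack and pod spreading as preferences rather than hard rules is an assumption of this sketch: it lets a cluster larger than the number of pods still be placed, at the cost of weaker isolation for the later members.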
We will share more on each of these in subsequent blog posts.
Building a cloud in a hyper-growth company like Snapdeal is a goldmine for an infrastructure team, because you have almost every major tech stack, from big data analytics, machine learning and data warehousing to NoSQL/SQL servers and message queues, in production at a large scale. Different technologies have different infrastructure requirements, and it was fundamental for us to find a common solution that avoids silos of infrastructure and still offers consistent performance, resiliency and scale. With a solid IaaS as the foundation, we are building several platform services, e.g. load balancing and service discovery using SmartStack, Database-as-a-Service (DBaaS) with MySQL, a key-value database service, a message-queue service using Kafka, etc. We are also building a transparent DRaaS in the public cloud, to offer data protection and quick RTO/RPO in case of an entire region failure.
We are continuously evolving and adding capabilities to our platform. So stay tuned for more!
VP Engineering – Cloud Platform and Services