SPODSKAK Architecture: A real-time analytics architecture that supports AI/ML and can scale.

May 13, 2021

Recently, I was designing a real-time analytics stack as part of my work. As I researched the topic, I saw an interesting stack emerge from a number of different examples at Airbnb, Uber, Lyft, Task Human, Redbus and other places.

The core advantages of this stack are:

Open source with Apache or similar licenses
Real-Time analytics: The data is available in the warehouse pretty much instantaneously after the transaction is done.
Massively Scalable via Containerization. Can support hundreds of terabytes to petabytes.
Machine Learning Capable

I call this the ‘SpodSkak’ stack :-) and it consists of the following:

SpodSkak = Superset + Presto + Druid + Spark + Kafka + Airflow + Kubernetes

The idea is to use Kafka for streaming and Spark Streaming for transformations before getting the data into Druid. I have also seen Flink used in these stream processing applications instead of Spark Streaming.
Use Druid with Presto on top of it to do table joins over Druid tables, and use the whole setup as the data warehouse
Use Spark as the machine learning system and deploy ML models created in Spark via Flask
Use Apache Superset as the BI layer
Use Airflow for all the scheduling and Kubernetes to containerize and manage all these applications
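
To make the first hop of the list above a little more concrete, here is a minimal sketch in plain Python. It is not taken from any of the companies mentioned. The event fields (`created_at_ms`, `order`), the topic name `orders_clean`, the datasource name `orders` and the broker address are all hypothetical. The `transform_event` function only stands in for the per-record logic a Spark Streaming or Flink job would apply, and the second function builds what I understand to be the general shape of a Druid Kafka ingestion (supervisor) spec, assuming the cleaned events are written back to a Kafka topic that Druid reads from.

```python
import json


def transform_event(raw: bytes) -> dict:
    """Flatten one raw Kafka message into a flat row for Druid.

    Stand-in for the per-record transformation a Spark Streaming or
    Flink job would run; the field names are hypothetical.
    """
    event = json.loads(raw)
    order = event["order"]
    return {
        "timestamp": event["created_at_ms"],   # epoch milliseconds
        "order_id": order["id"],
        "city": order["city"].strip().lower(),  # normalize the dimension
        "amount": float(order["amount"]),
    }


def druid_kafka_supervisor_spec(topic: str, datasource: str, brokers: str) -> dict:
    """Build a Druid Kafka supervisor spec for the flattened rows above.

    Assumption: the transformed events land on `topic` as JSON, and Druid
    ingests them directly from Kafka.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "millis"},
                "dimensionsSpec": {
                    "dimensions": [
                        "order_id",
                        "city",
                        {"type": "double", "name": "amount"},
                    ]
                },
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": brokers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


if __name__ == "__main__":
    raw = b'{"created_at_ms": 1620864000000, "order": {"id": "o-1", "city": " Bengaluru ", "amount": "12.5"}}'
    print(transform_event(raw))
    spec = druid_kafka_supervisor_spec("orders_clean", "orders", "kafka:9092")
    print(json.dumps(spec, indent=2))
```

In a real deployment the spec JSON would be submitted to Druid's supervisor endpoint (`/druid/indexer/v1/supervisor`), and Airflow or a Kubernetes job could own that submission step. Segment granularity, rollup and the dimension list are choices you would tune for your own data.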

I was partly inspired to come up with this stack after looking at Airbnb’s architecture and seeing different versions of it across the web. I would love to hear your thoughts and comments about the architecture. If you are an engineer who is excited to work on this, please reach out to me or Sandipto Banerjee at Saviynt.

Here are some of the data architectures that I found across the web.