Visualising data helps build a much deeper understanding of it and speeds up analysis. There are several mature paid products available in the market. Recently, I explored an open-source product named Apache Superset, which I found to be a very promising product in this space. Some prominent features of Superset are:
- A rich set of data visualisations
- An easy-to-use interface for exploring and visualising data
- Create and share dashboards
After reading about Superset, I wanted to try it. As Superset is a Python-based project, it can easily be installed using pip, but I decided to set it up as a container based on Docker. The Apache Superset GitHub repo contains code for building and running Superset as a container. Since I wanted to run Superset in a completely distributed manner and, in my opinion, the existing code allows little customisation, I decided to modify the code so that it could run in multiple different modes.
Below is a list of the specific changes and enhancements made to the code:
- Different versions of the Superset image can be built from the same code.
- The Superset configuration can be easily edited and mounted into the container, with no need to rebuild the image.
- Asynchronous query execution through a Celery-based executor, managed through the Flower UI.
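To illustrate the last two points, a mounted configuration file that enables Celery and a Redis cache might look roughly like the sketch below. The `REDIS_URL` environment variable, its default value and the timeout are assumptions made for illustration and are not taken from the repository. The setting names follow the pattern described in the Superset and Flask-Caching documentation, so check them against the version you run.

```python
import os

# Assumption: the Redis location arrives through an environment variable.
# The variable name REDIS_URL and the default below are placeholders.
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")


class CeleryConfig:
    """Celery settings read by the Superset server and worker processes."""

    broker_url = REDIS_URL            # queue that carries the SQL Lab tasks
    result_backend = REDIS_URL        # where Celery keeps task state
    imports = ("superset.sql_lab",)   # module that defines the query task
    worker_prefetch_multiplier = 1    # each worker process reserves one task at a time
    task_acks_late = False


# Superset looks up this name in its config module.
CELERY_CONFIG = CeleryConfig

# Flask-Caching style settings for Superset's cache.
# Older Flask-Caching releases spell the type "redis" instead of "RedisCache".
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,  # seconds; an arbitrary example value
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": REDIS_URL,
}
```

Storing async query results additionally needs a `RESULTS_BACKEND` object (for example a Redis-backed cache from the `cachelib` package), which is left out here to keep the sketch free of third-party imports.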
Development mode is an excellent choice for exploring a project. However, it is better if the initial exploration covers all the features; in the case of Superset, that means running queries in async mode and storing the results in a cache. You can explore Superset this way with the commands below.
- First, pull a docker-superset image from Docker Hub:
docker pull abhioncbr/docker-superset:<tag>
- Get docker-compose.yml and superset-config.py from the code base and keep the same directory structure.
- Lastly, start a Superset image as a container in local or prod mode using docker-compose:
cd docker-files/ && SUPERSET_ENV=<local | prod> SUPERSET_VERSION=<tag> docker-compose up -d
As I understand it, a Superset setup that serves thousands of end users in production should be distributed in nature and easy to scale as requirements grow. The image below depicts such a setup.
The published Docker image of Superset can be used to build the setup depicted above:
- A load balancer in front, routing each client request to one of the server containers.
- Multiple containers in server mode for serving the UI of Superset. Starting a server container using docker run can be done as:
docker run -p 8088:8088 -v config:/home/superset/config/ abhioncbr/docker-superset:<tag> cluster server <db_url> <redis_url>
- Multiple containers in worker mode for executing the SQL queries in async mode using the Celery executor. Starting a worker container using docker run can be done as:
docker run -p 5555:5555 -v config:/home/superset/config/ abhioncbr/docker-superset:<tag> cluster worker <db_url> <redis_url>
- A centralised Redis container or Redis cluster serving as the cache layer and as the Celery task queue for the workers.
- A centralised Superset metadata database.
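The container layout in the list above can be written down as a small script that prints the `docker run` commands for a given number of servers and workers. This is only a sketch: the function name `cluster_commands`, the example URLs and the choice to give every container its own host port are assumptions, while the image name, container ports, volume mount and `cluster server` / `cluster worker` arguments are copied from the commands shown earlier.

```python
import shlex

IMAGE = "abhioncbr/docker-superset"
CONFIG_MOUNT = "config:/home/superset/config/"


def cluster_commands(tag, db_url, redis_url, servers=2, workers=2):
    """Return one `docker run` command string per server and worker container."""
    commands = []
    roles = (("server", 8088, servers), ("worker", 5555, workers))
    for role, container_port, count in roles:
        for i in range(count):
            # Assumption: the containers share one host, so each one
            # is published on its own host port (8088, 8089, ...).
            host_port = container_port + i
            args = [
                "docker", "run",
                "-p", f"{host_port}:{container_port}",
                "-v", CONFIG_MOUNT,
                f"{IMAGE}:{tag}",
                "cluster", role, db_url, redis_url,
            ]
            # shlex.join quotes any argument that is not shell-safe.
            commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Placeholder tag and URLs; substitute real values before running the output.
    for cmd in cluster_commands(
        "<tag>", "mysql://user:pass@db:3306/superset", "redis://redis:6379/0"
    ):
        print(cmd)
```

With this layout, the load balancer would be pointed at the host ports of the server containers, and each worker's published port (5555 is Flower's default) would give access to its Flower UI.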
I found that setting up Superset as a Docker container is quite easy, and the same image can be used for different environments. You can explore Superset in a similar way.