Usage Analytics for Superset on Superset
While I am not oblivious to the irony of building dashboards on top of dashboards, hear me out; this one exists to inform decisions about making smaller dashboards for performance gains.
I have previously explored other performance-boost strategies for Superset dashboard load times, including proactively warming up the cache with every data edit, but there are only so many gains to be made by optimizing server behavior. With the cache optimization in place, I am now looking at trimming dashboards as another way to cut dashboard loading times.
Prerequisites / Assumptions
Before going on: this article assumes some beginner PostgreSQL knowledge and some familiarity with Apache Superset.
Smaller dashboards? How? Why?
10 dashboards later, and most of these have averaged 40 charts and 5 tabs. That is a lot of interaction and navigation for users and, to be frank, a lot of time to set up and maintain. This is my selfish motivation. Imagine the time and energy savings I could reap if I could trim them down to a maximum of 2 tabs per dashboard and a cap of 10 really important charts. Would this also make more people use these dashboards that I spend so much energy creating? Well, I won’t know if I do not try.
To make an informed decision, I formulate 3 main questions:
- Which tabs are loaded more often than others?
- How often are charts exported and could this be relied on as a measure of data use?
- Who is looking at what and doing all of the above?
The answers to these questions will influence the decision on which charts to trim.
A bonus, good-to-have question is:
- Are there some consistently applied filters that could be used to further improve the cache warmup and therefore server side retrieval time?
Show me the data!
To answer these questions, I need to get my hands on some telemetry data.
For a self-hosted instance of Superset, this information can be deduced from the metadata database that comes out of the box regardless of the kind of Superset deployment. This is the database pointed to by the SQLALCHEMY_DATABASE_URI setting in the superset_config.py file.
A lion’s share of the telemetry can be found in the table named logs in the public schema of the superset database. This table has multiple columns, but for this exploration the columns of interest are the json column as well as user_id.
Inside the json column are the properties event_id/trigger_event¹, source, path and impression_id, which we will be extracting.
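Before extracting anything, a quick exploratory query along these lines (a sketch against the columns described above; the 7-day window is purely illustrative, there to keep the scan small) shows which event names the json payload actually carries:
-- Count the event names captured in the json payload of recent log rows.
-- The 7-day window is illustrative; widen or narrow it to taste.
SELECT json::json ->> 'event_name' AS event_name,
       COUNT(*)                    AS occurrences
FROM logs
WHERE action = 'log'
  AND dttm >= now() - interval '7 days'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;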
The supporting tables include the dashboards table, also in the public schema. This relation has multiple columns, but for this exploration the columns of interest are dashboard_title and id, the latter used for joining to pick up dashboard properties. Similar to the logs table, the meat of this table is in the json column named position_json. This column holds a JSON document with a parent root element and a nested structure for every element in the dashboard.
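Since the tab structure lives inside position_json, something like the following sketch can pull out the tab titles per dashboard. Treat it as a sketch rather than gospel: the exact layout of position_json varies a little between Superset versions, but tab elements typically carry a type of TAB and their title under meta -> text.
-- List the tab titles defined in each dashboard's position_json.
SELECT d.id,
       d.dashboard_title,
       elem.value -> 'meta' ->> 'text' AS tab_title
FROM dashboards d,
     json_each(d.position_json::json) AS elem
WHERE json_typeof(elem.value) = 'object'
  AND elem.value ->> 'type' = 'TAB';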
One other consideration before jumping into querying the data is the shape and size of these tables. The logs table in particular is both wide and long. As such, any queries run on the logs table need a limit to prevent long-running queries. For the sake of this exploration, a time limit mechanism fits the purpose.
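Concretely, every extraction below carries a lower bound on dttm; '{last_dttm}' is a placeholder that the orchestration layer fills in with the previous run time, so only fresh rows are scanned. In its simplest form:
-- Only pull log rows newer than the last successful extraction.
-- '{last_dttm}' is templated in by the scheduler at run time.
SELECT id, action, user_id, dttm
FROM logs
WHERE action = 'log'
  AND dttm >= '{last_dttm}'
ORDER BY dttm;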
Getting my answers:
Q1: Which tabs are loaded more often than others?
To answer this, and to see how the load time of a tab with 10 charts compares with that of a tab with, say, 15, I need a tab-level dataset with information such as the user, the tab name, the time the tab was clicked, the load duration and the dashboard it belongs to:
SELECT c.username,
       b.dttm,
       -- the json 'ts' property is an epoch timestamp in milliseconds
       to_timestamp((b.json::json ->> 'ts')::bigint / 1000) AS log_time,
       b.json::json ->> 'impression_id' AS impression_id,
       b.json::json ->> 'duration'      AS duration,
       b.json::json ->> 'target_name'   AS tab_text,
       a.dashboard_title
FROM dashboards a
JOIN logs b
  ON a.id::text = b.json::json ->> 'source_id'
JOIN ab_user c
  ON c.id = b.user_id
WHERE b.action = 'log'
  AND b.json::json ->> 'source' != 'sqlLab'
  AND b.json::json ->> 'event_name' IN ('select_dashboard_tab', 'mount_dashboard')
  AND b.dttm >= '{last_dttm}'
Q2: Are there some consistently applied filters that could be used to further improve the cache warmup and therefore server side retrieval time?
Filters are a critical selling point of dashboards, whatever the platform, as they allow users to retrieve aggregations in a snap. From the load_chart events, I can unpack which columns were used to filter each chart:
-- One row per applied filter: unpack the applied_filters array carried by load_chart events
SELECT p.id,
       p.json_slice_id,
       q.value::json ->> 'column' AS filter_column,
       p.dttm
FROM (
    SELECT a.id,
           a.parsed_json ->> 'slice_id' AS json_slice_id,
           a.dttm,
           (a.parsed_json ->> 'applied_filters')::json AS applied_filters
    FROM (
        SELECT *, json::json AS parsed_json
        FROM logs
        WHERE action = 'log'
          AND extract(year FROM dttm) >= 2023
          AND dttm >= '{last_dttm}'
    ) a
    WHERE a.parsed_json ->> 'event_name' = 'load_chart'
) AS p,
json_array_elements_text(p.applied_filters) AS q;
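As an example of how this output can be used, assuming the query above is saved as a view (chart_filter_log is purely an illustrative name), the most frequently applied filter columns fall out of a simple GROUP BY:
-- Rank filter columns by how often they are applied.
-- chart_filter_log is a hypothetical view over the query above.
SELECT filter_column,
       COUNT(*) AS times_applied
FROM chart_filter_log
GROUP BY filter_column
ORDER BY times_applied DESC
LIMIT 20;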
Q3: How often are charts exported and could this be relied on as a measure of data use?
-- Chart-level event log: one row per load_chart event,
-- enriched with the dashboard title and slice name
SELECT a.id,
       a.action,
       a.user_id,
       a.dttm,
       a.dashboard_id,
       b.dashboard_title,
       a.slice_id,
       a.duration_ms,
       a.referrer,
       a.parsed_json ->> 'source'        AS source,
       a.parsed_json ->> 'source_id'     AS source_id,
       a.parsed_json ->> 'impression_id' AS impression_id,
       a.parsed_json ->> 'version'       AS version,
       a.parsed_json ->> 'ts'            AS ts,
       a.parsed_json ->> 'event_name'    AS event_name,
       (a.parsed_json ->> 'slice_id')::integer AS json_slice_id,
       c.slice_name,
       (a.parsed_json ->> 'is_cached')::boolean     AS is_cached,
       (a.parsed_json ->> 'force_refresh')::boolean AS force_refresh,
       a.parsed_json ->> 'row_count'    AS row_count,
       a.parsed_json ->> 'datasource'   AS datasource,
       a.parsed_json ->> 'start_offset' AS start_offset,
       a.parsed_json ->> 'duration'     AS duration,
       (a.parsed_json ->> 'has_extra_filters')::boolean AS has_extra_filters,
       a.parsed_json ->> 'viz_type'      AS viz_type,
       a.parsed_json ->> 'data_age'      AS data_age,
       a.parsed_json ->> 'event_type'    AS event_type,
       a.parsed_json ->> 'trigger_event' AS trigger_event,
       '{dag_run_time}' AS create_time
FROM (
    SELECT *, json::json AS parsed_json
    FROM logs
    WHERE action = 'log'
      AND extract(year FROM dttm) >= 2023
) a
LEFT JOIN dashboards b ON a.dashboard_id = b.id
LEFT JOIN slices c ON CAST(a.parsed_json ->> 'slice_id' AS INT) = c.id
WHERE a.parsed_json ->> 'event_name' = 'load_chart'
  AND a.dttm >= '{last_dttm}'
Conclusion
As you can see, I chose to snoop on who is looking at what just so I could stratify the user types and their respective data needs. The updates to this data are orchestrated with Apache Airflow, and on top of these views I built a dashboard which is, once again, underutilized and honestly pretty low on the aesthetics scale, thus no screen-grabs for now :)
I would like to hear of any other questions people are answering / looking to answer regarding their dashboards — be it on Superset or otherwise!
Footnote:
- The event_id and the trigger_event are related such that the event_id property appears only if the trigger_event property does not appear in the json, and the trigger_event will reference a former event_id. Whether a row will contain event_id or trigger_event as a key in the JSON field is dependent on the type of event, a value contained in the event_name property of the same JSON column.
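For the curious, this relationship can be used to stitch a follow-up event to the event that triggered it, along these lines (a sketch only; the self-join over logs is expensive, so keep the time window tight):
-- Pair each event carrying a trigger_event with the earlier event whose
-- event_id it references. Keep the dttm bounds tight; this self-join is costly.
SELECT child.id AS log_id,
       child.json::json ->> 'event_name'  AS event_name,
       parent.id AS triggering_log_id,
       parent.json::json ->> 'event_name' AS triggering_event_name
FROM logs child
JOIN logs parent
  ON child.json::json ->> 'trigger_event' = parent.json::json ->> 'event_id'
WHERE child.action = 'log'
  AND parent.action = 'log'
  AND child.dttm >= '{last_dttm}'
  AND parent.dttm >= '{last_dttm}';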