Usage Analytics for Superset on Superset
While I am not oblivious to the irony of building dashboards on top of dashboards, hear me out; this one exists to inform decisions about making smaller dashboards for performance gains.
I have previously explored other performance-boost strategies for Superset dashboard load times, including proactively warming up the cache with every data edit, but there are only so many gains to be made by optimizing server behavior. With the cache optimization in place, I am now looking at trimming dashboards as another way to cut dashboard loading times.
Prerequisites / Assumptions
Before going on: this article assumes some beginner PostgreSQL knowledge and some familiarity with Apache Superset.
Smaller dashboards? How? Why?
10 dashboards later, and most of these have averaged 40 charts and 5 tabs. That is a lot of interaction and navigation for users and, to be frank, a lot of time to set up and maintain. This is my selfish motivation. Imagine the time and energy savings I could reap if I could trim them down to a maximum of 2 tabs per dashboard and a cap of 10 really important charts. Would this also make more people use these dashboards that I spend so much energy creating? Well, I won’t know if I do not try.
To make an informed decision, I formulate 3 main questions:
- Which tabs are loaded more often than others?
- How often are charts exported and could this be relied on as a measure of data use?
- Who is looking at what and doing all of the above?
The answers to these questions will influence the decision on which charts to trim.
A bonus, good-to-have question is:
- Are there some consistently applied filters that could be used to further improve the cache warmup and therefore server side retrieval time?
Show me the data!
To answer these questions, I need to get my hands on some telemetry data.
For a self-hosted instance of Superset, this information can be deduced from the metadata database that comes out of the box regardless of the kind of Superset deployment. This is the database pointed to by the SQLALCHEMY_DATABASE_URI setting in the superset_config.py file.
A lion’s share of the telemetry can be found in the table named logs in the public schema of the superset database. This table has multiple columns, but for this exploration the columns of interest are the json column as well as user_id.
Inside the json column are the properties event_id/trigger_event¹, source, path and impression_id, which we will be extracting.
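Before extracting anything, a quick exploratory query along these lines (a sketch against the columns described above; the 7-day window is purely illustrative, there to keep the scan small) shows which event names the json payload actually carries:
-- Count the event names captured in the json payload of recent log rows.
-- The 7-day window is illustrative; widen or narrow it to taste.
SELECT json::json ->> 'event_name' AS event_name,
       COUNT(*)                    AS occurrences
FROM logs
WHERE action = 'log'
  AND dttm >= now() - interval '7 days'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;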
The supporting tables include the dashboards table, also in the public schema. This relation has multiple columns, but for this exploration the columns of interest are dashboard_title and id, the latter used for joining to pick up dashboard properties. Similar to the logs table, the meat of this table is in the json column named position_json. This column holds a JSON document with a parent root element and a nested structure for every element in the dashboard.
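Since the tab structure lives inside position_json, something like the following sketch can pull out the tab titles per dashboard. Treat it as a sketch rather than gospel: the exact layout of position_json varies a little between Superset versions, but tab elements typically carry a type of TAB and their title under meta -> text.
-- List the tab titles defined in each dashboard's position_json.
SELECT d.id,
       d.dashboard_title,
       elem.value -> 'meta' ->> 'text' AS tab_title
FROM dashboards d,
     json_each(d.position_json::json) AS elem
WHERE json_typeof(elem.value) = 'object'
  AND elem.value ->> 'type' = 'TAB';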
One other consideration before jumping into querying the data is the shape and size of these tables. The logs table in particular is both wide and long. As such, any queries run on the logs table need a limit to prevent long-running queries. For the sake of this exploration, a time limit mechanism fits the purpose.
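Concretely, every extraction below carries a lower bound on dttm; '{last_dttm}' is a placeholder that the orchestration layer fills in with the previous run time, so only fresh rows are scanned. In its simplest form:
-- Only pull log rows newer than the last successful extraction.
-- '{last_dttm}' is templated in by the scheduler at run time.
SELECT id, action, user_id, dttm
FROM logs
WHERE action = 'log'
  AND dttm >= '{last_dttm}'
ORDER BY dttm;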
Getting my answers:
Q1: Which tabs are loaded more often than others?
To answer this, and to see how the load time of a tab with 10 charts compares with that of a tab with, say, 15, I need a tab-level dataset with information such as the user, the tab name, the time the tab was clicked, the load duration and the dashboard it belongs to:
SELECT c.username,
       b.dttm,
       -- the json 'ts' property is an epoch timestamp in milliseconds
       to_timestamp((b.json::json ->> 'ts')::bigint / 1000) AS log_time,
       b.json::json ->> 'impression_id' AS impression_id,
       b.json::json ->> 'duration'      AS duration,
       b.json::json ->> 'target_name'   AS tab_text,
       a.dashboard_title
FROM dashboards a
JOIN logs b
  ON a.id::text = b.json::json ->> 'source_id'
JOIN ab_user c
  ON c.id = b.user_id
WHERE b.action = 'log'
  AND b.json::json ->> 'source' != 'sqlLab'
  AND b.json::json ->> 'event_name' IN ('select_dashboard_tab', 'mount_dashboard')
  AND b.dttm >= '{last_dttm}'
Q2: Are there some consistently applied filters that could be used to further improve the cache warmup and therefore server side retrieval time?
Filters are a critical selling point of dashboards, whatever the platform, as they allow users to retrieve aggregations in a snap. From the load_chart events, I can unpack which columns were used to filter each chart:
-- One row per applied filter: unpack the applied_filters array carried by load_chart events
SELECT p.id,
       p.json_slice_id,
       q.value::json ->> 'column' AS filter_column,
       p.dttm
FROM (
    SELECT a.id,
           a.parsed_json ->> 'slice_id' AS json_slice_id,
           a.dttm,
           (a.parsed_json ->> 'applied_filters')::json AS applied_filters
    FROM (
        SELECT *, json::json AS parsed_json
        FROM logs
        WHERE action = 'log'
          AND extract(year FROM dttm) >= 2023
          AND dttm >= '{last_dttm}'
    ) a
    WHERE a.parsed_json ->> 'event_name' = 'load_chart'
) AS p,
json_array_elements_text(p.applied_filters) AS q;
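As an example of how this output can be used, assuming the query above is saved as a view (chart_filter_log is purely an illustrative name), the most frequently applied filter columns fall out of a simple GROUP BY:
-- Rank filter columns by how often they are applied.
-- chart_filter_log is a hypothetical view over the query above.
SELECT filter_column,
       COUNT(*) AS times_applied
FROM chart_filter_log
GROUP BY filter_column
ORDER BY times_applied DESC
LIMIT 20;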
Q3: How often are charts exported and could this be relied on as a measure of data use?
-- Chart-level event log: one row per load_chart event,
-- enriched with the dashboard title and slice name
SELECT a.id,
       a.action,
       a.user_id,
       a.dttm,
       a.dashboard_id,
       b.dashboard_title,
       a.slice_id,
       a.duration_ms,
       a.referrer,
       a.parsed_json ->> 'source'        AS source,
       a.parsed_json ->> 'source_id'     AS source_id,
       a.parsed_json ->> 'impression_id' AS impression_id,
       a.parsed_json ->> 'version'       AS version,
       a.parsed_json ->> 'ts'            AS ts,
       a.parsed_json ->> 'event_name'    AS event_name,
       (a.parsed_json ->> 'slice_id')::integer AS json_slice_id,
       c.slice_name,
       (a.parsed_json ->> 'is_cached')::boolean     AS is_cached,
       (a.parsed_json ->> 'force_refresh')::boolean AS force_refresh,
       a.parsed_json ->> 'row_count'    AS row_count,
       a.parsed_json ->> 'datasource'   AS datasource,
       a.parsed_json ->> 'start_offset' AS start_offset,
       a.parsed_json ->> 'duration'     AS duration,
       (a.parsed_json ->> 'has_extra_filters')::boolean AS has_extra_filters,
       a.parsed_json ->> 'viz_type'      AS viz_type,
       a.parsed_json ->> 'data_age'      AS data_age,
       a.parsed_json ->> 'event_type'    AS event_type,
       a.parsed_json ->> 'trigger_event' AS trigger_event,
       '{dag_run_time}' AS create_time
FROM (
    SELECT *, json::json AS parsed_json
    FROM logs
    WHERE action = 'log'
      AND extract(year FROM dttm) >= 2023
) a
LEFT JOIN dashboards b ON a.dashboard_id = b.id
LEFT JOIN slices c ON CAST(a.parsed_json ->> 'slice_id' AS INT) = c.id
WHERE a.parsed_json ->> 'event_name' = 'load_chart'
  AND a.dttm >= '{last_dttm}'
Conclusion
As you can see, I chose to snoop on who is looking at what just so I could stratify the user types and their respective data needs. The updates to this data are orchestrated with Apache Airflow, and on top of these views I built a dashboard which is, once again, underutilized and honestly pretty low on the aesthetics scale, thus no screen-grabs for now :)
I would like to hear of any other questions people are answering / looking to answer regarding their dashboards — be it on Superset or otherwise!
Footnote:
- The event_id and the trigger_event are related such that the event_id property appears only if the trigger_event property does not appear in the json, and the trigger_event will reference a former event_id. Whether a row will contain event_id or trigger_event as a key in the JSON field is dependent on the type of event, a value contained in the event_name property of the same JSON column.
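For the curious, this relationship can be used to stitch a follow-up event to the event that triggered it, along these lines (a sketch only; the self-join over logs is expensive, so keep the time window tight):
-- Pair each event carrying a trigger_event with the earlier event whose
-- event_id it references. Keep the dttm bounds tight; this self-join is costly.
SELECT child.id AS log_id,
       child.json::json ->> 'event_name'  AS event_name,
       parent.id AS triggering_log_id,
       parent.json::json ->> 'event_name' AS triggering_event_name
FROM logs child
JOIN logs parent
  ON child.json::json ->> 'trigger_event' = parent.json::json ->> 'event_id'
WHERE child.action = 'log'
  AND parent.action = 'log'
  AND child.dttm >= '{last_dttm}'
  AND parent.dttm >= '{last_dttm}';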