I would appreciate a tool that improves my productivity of data analysis using SQL.
Most of the time, a plain .sql
file is all I get to work with, so I came up with my own ways to track the analysis and record my thoughts.
Table of Contents
Context
I haven’t come across any solid SQL IDE’s (apart from JetBrains’ DataGrip, which is paid) and usually have to work with either a plain SQL spreadsheet, or a SQL notebook, e.g. Mode, or Count.
These are usually great for when you already the close-to-final SQL scripts written and need a platform to showcase your queries and build your dashboards.
However, as a machine learning engineer I find myself exploring data most of the time and moving onto feature engineering/modelling as opposed to building dashboards.
EDA Workflows
Below I outline my SQL workflow and how I organise my worksheets so that I can keep track of my work.
I used to do everything in a single spreadsheet but doing data exploration is like breadth first search - it quickly becomes bloated and way too many lines of code to track.
My SQL Worksheet Setup
I write my .sql
drafts in the Snowflake interface. Sometimes the auto-saved fails, and so after a few times I manually copy-paste the most important worksheets to a github repo.
When doing a large piece of analysis, I like to have the following sections:
- Analysis overview
- To provide context of the analysis
- Include links to relevant information
- And anything else that is important
- TODO
- Here I have TODOs that is then tagged in the SQL code beneath so it can be
ctrl-f
’d easily, e.g.[Refunds?]
- I found that its very difficult to write notes directly next to the code because it consumes space and makes the SQL hard to read.
- Here I have TODOs that is then tagged in the SQL code beneath so it can be
- Variables
- In Snowflake, we’re able to set variables - so I use that to my advantage.
- I like to keep them in one place. You can set the variables at the point where you need them as well. It’s a matter of preference.
- I like putting things at the top because I can see at a glance every single table that I’m using - and often I use like 5+ tables so its neater that way.
After these 3 sections I’ll then start writing all the SQL code for the individual problems that I want to tackle.
Depending on how rough your code is, I also include a Final Table
section that contains the “production” code that others will want to run to see the final results.
-- project-topic-01-summary.sql
/*
ANALYSIS OVERVIEW
-------------------
-------------------
Summarise the purpose and general approach to the problem here.
LINKS
-----
PULL REQUEST:
ORIGINAL QUERY:
JIRA CARD:
CONFLUENCE:
*/
/*
@TODOs
------
CTRL-F the first bracket to find the queries associated with the task.
[REF_ONE]
[ ] Investigate table A
[REF_TWO]
[ ] Investigate table B
[Refunds?]
[ ] Join refunds and customers complaints to store sales CTE
to be more accurate in calculations.
*/
/*
SET VARIABLES
-------------
*/
set dstn = '<database>.<schema>.<table_name>';
set num_days = 14;
set sales_table = 'fact_sales';
set start_time = '2022-01-01';
set end_time = '2022-01-08';
/*
CALCULATE NEW YEAR WEEK PROFITS
-------------------------------
Final table to calculate total profit during first week
of New Year's for each store.
*/
with store_agg as (
select
country,
region,
store,
sum(goods_cost) as total_cogs,
sum(labour_cost) as total_labour,
sum(sales_price) as total_sales,
from identifier($sales_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
group by 1, 2, 3
)
-- [Refunds?]
select *
from store_agg
/*
FIND MOST POPULAR ITEMS
-----------------------
Final table to list top 50 most popular items during first week
of New Year's for each store.
*/
with store_agg as (
select
country,
region,
store,
item_id,
count(item_id) as num_items_sold,
from identifier($sales_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
group by 1, 2, 3, 4
order by num_items_sold desc
/*
Snowflake allows us to use "qualify" to insert a window function for filtering.
Very convenient as we don't need to write an extra CTE.
Here we filter only the top 50 most popular items for every store.
*/
qualify
row_number() over (
partition by
order by num_items_sold desc)
) <= 50
)
NOTE: This is using Snowflake syntax!
I breakdown the name of the file by <project> --> <topic> --> <number> --> <name>.sql
If I have multiple worksheets, I’ll usually copy and paste the set variables that I need and add in any extras.
If there are TODOs I will add in the TODOs section.
-- project-topic-02-quick-all-tables.eda.sql
/*
SET VARIABLES
-------------
*/
set items_table = 'dim_items';
set users_table = 'dim_users';
set purchases_table = 'fact_purchases';
set refunds_table = 'fact_refunds';
set start_time = '2022-01-01';
set end_time = '2022-01-08';
/*
DIM TABLES
----------
*/
select *
from identifier($items_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
limit 100
;
select *
from identifier($users_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
limit 100
;
/*
DIM TABLES
----------
*/
select *
from identifier($purchases_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
limit 100
;
select *
from identifier($refunds_table)
where event_time >= $start_time::timestamp()
and event_time >= $end_time::timestamp()
limit 100
;
...
This is all manual, and so it gets tedious to make sure that all the SQL sheets are synced and working.
However, if you do it correctly, the final output of the analysis will have a summary worksheet at the end with context and proper TODO and variables set, along with a set of tables that the user can run to see the output.
It’s worked for me and allowed others to take over my work whenever I needed to hand things off to the next person. 😁
One more thing - I always forget how I format things. So I have a reference sheet with all the formatting decisions I’ve made. Very rarely do I change this.
Here is a snippet of what I have.
/*
FORMATTING FAQ
--------------
*/
-- NOTE: multiline 1 multiline 1 multiline 1 multiline 1
/*
NOTE: multiline 1
multiline 2
multiline 3
*/
-- URL: https://example-url.com
/*
I like to have 2 lines for each "section"
*/
That’s it really! Short and sweet! 😇
Would love to hear any feedback + how you guys set up your .sql
files to be more productive!