Overview
As organizations grow, their data stacks will grow with them. There will be more data available, more transformations to keep data organized and accessible, and more people accessing data. Naturally, the costs associated with querying and writing data to the warehouse will go up as well.
Unfortunately, this also means that unnecessary costs, such as those associated with unpartitioned large tables or stray ORDER BY clauses in SQL statements, will increase as well.
How can we identify unnecessary costs? What are the solutions? To answer these questions we will:
- Set up a model that includes log data about table creation and table access
- Identify costs and cost reduction solutions
- Set up a dashboard to monitor the impact of these solutions
This article will go through an example using GCP, but the general principles can be applied to any data warehouse. In this example we are assuming you are using dbt to organize and automate your warehouse, and that you have a base/prep/staging layer (models/transformations where data is being prepared and is not yet ready for use) and a marts layer (the tables that should be accessed when querying data in the warehouse). For more information on warehouse structure, see the dbt website’s overview.
Setup
Setting up audit logs
Google Cloud gives you access to Cloud Logging, a service that lets you export Google Cloud logging data and events. Most other platforms, such as Snowflake and AWS, expose access history as well. There is a lot of information available through Cloud Logging, and we will only use a small portion of it to build a dashboard for monitoring GCP costs. If you want more detail on the available data, take a look at the docs.
First we will create a sink to route these logs to BigQuery, as detailed in this doc. If you use the filter specified in the doc, then once your logs are exported you should have a table in your warehouse under an audit_logs schema called cloudaudit_googleapis_com_data_access that contains access history: who has accessed or written data to which tables, when, and how many bytes were processed. Once this log sink is set up, we can pull the relevant data into our marts layer.
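Before building models on top of the export, it can help to confirm that data is landing as expected. Below is a minimal sanity-check query, assuming the sink was created with partitioned tables and writes to an audit_logs dataset in a project called my-project (both names are placeholders):
-- who generated data access log entries yesterday, and through which API method
select
    timestamp,
    protopayload_auditlog.authenticationInfo.principalEmail as user_id,
    protopayload_auditlog.methodName as method_name
from `my-project.audit_logs.cloudaudit_googleapis_com_data_access`
where date(timestamp) = date_sub(current_date(), interval 1 day)
limit 100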
Setting up warehouse tables
In our prep layer we're going to make a table called prep_gcp_query_jobs:
{#
Insert IDs should be unique, but there are duplicates
in the source table. We are deduplicating based on the
recommendation here https://cloud.google.com/logging/docs/export/troubleshoot
Jobs with the same insert IDs have the same job IDs. job_id is being used
as the primary key for this table.
#}
with audit_logs as (

    select
        protopayload_auditlog as audit_log,
        protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent as jobs,
        insertId
    from {{ source('audit_logs', 'cloudaudit_googleapis_com_data_access') }}

)

select
    jobs.job.jobName.jobId as job_id,
    jobs.job.jobStatistics.createTime as job_created_at,
    audit_log.authenticationInfo.principalEmail as user_id,
    jobs.job.jobConfiguration.query.query as query,
    jobs.job.jobConfiguration.query.destinationTable.projectId as destination_table_project_id,
    jobs.job.jobConfiguration.query.destinationTable.datasetId as destination_table_dataset_id,
    jobs.job.jobConfiguration.query.destinationTable.tableId as destination_table_id,
    jobs.job.jobStatistics.startTime as job_started_at,
    {#
        Job start time can be a few seconds behind the timestamp the job was created at:
        https://cloud.google.com/logging/docs/reference/audit/bigquery/rest/Shared.Types/AuditData
    #}
    jobs.job.jobStatistics.endTime as job_ended_at,
    jobs.job.jobStatistics.totalBilledBytes as total_billed_bytes,
    jobs.job.jobStatistics.totalBilledBytes / power(2, 40) as total_billed_terabytes
from audit_logs
where audit_log.serviceName = 'bigquery.googleapis.com'
    and audit_log.methodName = 'jobservice.jobcompleted'
    and jobs.eventName = 'query_job_completed'
qualify row_number() over (partition by insertId) = 1
{#
A sample cost query with a where clause like
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
is included here https://cloud.google.com/bigquery/docs/reference/auditlogs/migration
#}
Now that we have our prep layer table, we can make a fct_gcp_query_jobs table in our marts layer. The only fields we'll add to the selected fields are an estimated cost (based on the $5-per-terabyte on-demand estimate), the dbt target name (develop, personal, etc.), and the dbt node name (the model or test run); the last two are pulled out of the query field when the query was run by dbt.
select
    job_id,
    job_created_at,
    user_id,
    query,
    destination_table_project_id,
    destination_table_dataset_id,
    destination_table_id,
    job_started_at,
    job_ended_at,
    total_billed_bytes,
    total_billed_terabytes,
    5.0 * total_billed_bytes / power(2, 40) as estimated_cost,
    {#
        Cost estimation based on BigQuery's on-demand cost structure:
        https://cloud.google.com/bigquery/pricing
    #}
    regexp_extract(query, r'target_name":\s"([^",")]*)') as dbt_target_name,
    regexp_extract(query, r'node_id":\s"([^",")]*)') as dbt_node_name
    {#
        dbt queries have an identifiable structure, which we use here to pull out
        the target name (personal, develop, etc.) and the node name (generally the
        table created or the test run) for models run by dbt
    #}
from {{ ref('prep_gcp_query_jobs') }}
We can use this data to calculate and monitor the on-demand analysis portion of our total costs. It will not cover storage costs, but you can estimate those separately from the size of your tables and how long data is retained, using BigQuery's storage pricing.
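If you do want a rough sense of storage costs, table sizes are available in BigQuery's per-dataset __TABLES__ metadata view. A sketch, assuming an active-storage rate of roughly $0.02 per GB per month (check the pricing page for your region and for long-term storage rates); my-project and my_dataset are placeholders:
select
    dataset_id,
    table_id,
    size_bytes / power(2, 30) as size_gb,
    -- rough monthly estimate at ~$0.02 per GB per month of active storage
    0.02 * size_bytes / power(2, 30) as estimated_monthly_storage_cost
from `my-project.my_dataset.__TABLES__`
order by size_bytes desc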
Exploration
Exploring your access logs
Once you have your marts layer table, you can do some exploration to identify cost reduction opportunities.
Here are some ideas for query expenses to explore:
- dbt vs non-dbt runs
- Tables (e.g., the cost of building or querying each table; see the example query after this list)
- Environment (develop, staging, prod)
- Individual users/service accounts
- BI tools and dashboards
- Destination tables that source data is loaded into
- Users querying source data vs prep data vs marts data
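For example, a quick look at which destination tables have been the most expensive to build over the last week could look something like this (a sketch against the fct_gcp_query_jobs model above):
select
    destination_table_dataset_id,
    destination_table_id,
    count(*) as job_count,
    sum(estimated_cost) as estimated_cost
from fct_gcp_query_jobs
where job_started_at >= timestamp_sub(current_timestamp(), interval 7 day)
group by destination_table_dataset_id, destination_table_id
order by estimated_cost desc
limit 20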
Collaborate with stakeholders to make sure you know what to monitor and that you understand their needs and use cases. After a first pass at figuring out where changes can be implemented, share this information with stakeholders to help identify where else data is being used and what cost questions they have that you could look into.
Question: What is the breakdown of the cost for the BigQuery service account running dbt by target for July, 2022?
Query:
with dbt_target as (

    select estimated_cost, query, dbt_target_name
    from fct_gcp_query_jobs
    where job_started_at >= '2022-07-01'
        and job_started_at < '2022-08-01'
        and user_id like '%your-dbt-service-account%'
        and query like '%/* {"app": "dbt", "dbt_version"%'

)

select dbt_target_name, sum(estimated_cost) as estimated_cost
from dbt_target
group by dbt_target_name
Making changes
It may be useful to think through how much time and effort any given change will take and weed out high effort/low savings changes; however, stakeholders may have specific changes they care about even if they are high effort/low savings. For example, there may be changes considered ‘best practices’ that should be implemented even if the cost savings are relatively low.
Once you’ve identified some problem areas, you can look at potential solutions. Identify with your stakeholders which solutions should be implemented in which order, make a monitoring dashboard, and monitor the effects of your changes as they go into the warehouse.
Common cost reduction opportunities
If dbt runs are fairly expensive:
- Only run your develop environment if changes were made to the code
- Utilize incremental runs
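For instance, a large model that is rebuilt from scratch on every run can often be switched to an incremental materialization so that only new rows are scanned and inserted. A minimal sketch, with hypothetical model and column names:
{{ config(materialized = 'incremental', unique_key = 'event_id') }}

select event_id, event_created_at, payload
from {{ ref('prep_events') }}

{% if is_incremental() %}
    -- on incremental runs, only scan rows newer than what is already in this table
    where event_created_at > (select max(event_created_at) from {{ this }})
{% endif %}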
If there are specific tables that are consistently expensive:
- Require partitions and/or clusters on very large tables (see the config sketch after this list)
- Find ways to remove ORDER BY or SELECT DISTINCT clauses
- Rework models so joins are done early on in the process, or on as little data as possible
- Make sure JSON fields are flattened or removed as early as possible in the warehouse
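On BigQuery, the partition and cluster requirements can be declared directly in a dbt model config. A sketch with placeholder column and model names; require_partition_filter makes BigQuery reject queries that do not filter on the partition column:
{{ config(
    materialized = 'table',
    partition_by = {'field': 'created_date', 'data_type': 'date'},
    cluster_by = ['customer_id'],
    require_partition_filter = true
) }}

select created_date, customer_id, order_id, order_total
from {{ ref('prep_orders') }}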
If there are significant costs associated with querying the source data, base layer, or prep layer:
- Make it easier for users to utilize marts layer models, and limit access to source data or prep layer models
- Remind users to use marts layer data for dashboards and queries wherever possible
- Meet with users querying source/base data regularly and see if there are opportunities to move their queries to the marts layer
- Partition the destination tables that source data is loaded into where possible; for example, Fivetran destination tables can be partitioned by day, by integer primary keys, or by specific columns
If there are significant costs associated with dashboards:
- Help users to switch their dashboard queries to the marts layer
- Find and remove dashboards that are no longer in use
- Set up a monthly meeting/reminder to go through and remove unused dashboards
If there are common expensive queries that are regularly run by users:
- Remind data users of the kinds of queries that will be expensive, and give alternatives for commonly used large queries
- Discuss how often the queries need to be run
Monitoring
Making a dashboard
Once you’ve aligned on what changes to implement in what order, you can put together a report to monitor how the changes affect the cost over time.
The tables you decide to include in your dashboard will depend on what you choose to monitor, but it’s a good idea to include some general monitoring information on the first page, such as total cost over the last six months, and cost per environment/target/node/user for the last seven days.
If you want to crack down on dashboards querying source data, or on unused dashboards, you can inspect the query field to determine which dashboard is running a query (e.g., Looker dashboard queries contain select statements that select values such as "clmn_") and monitor how those expenses change over time. Regularly check the monitoring dashboard to identify new opportunities for cost reduction.
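A sketch of that kind of check, assuming (as above) that Looker-generated SQL can be recognized by values containing "clmn_"; adjust the pattern to whatever your BI tool produces:
select
    date_trunc(date(job_started_at), month) as cost_month,
    -- estimated spend from queries that look like they were generated by Looker
    sum(estimated_cost) as dashboard_estimated_cost
from fct_gcp_query_jobs
where regexp_contains(query, r'clmn_')
group by cost_month
order by cost_month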
Although warehouse costs will keep rising as your company and warehouse grow, using techniques like these to stay on top of your warehouse activity will help keep those costs to a minimum.
If you have any questions around exploring your access logs, making decisions about what to change, or monitoring your changes, Data CRT is happy to help.
What’s important to you? Did we miss anything?
Send us a message at hi(at)datacrt.com!