Partition Pruning
Your WHERE clause isn’t helping—here’s why Snowflake scans everything anyway
I watched a query scan 500GB of data when it should have touched less than 5GB. Same WHERE clause. Same filter. The problem? Snowflake couldn’t prune partitions effectively.
This cost us hours of runtime and thousands in credits—until we understood what was actually happening.
The micro-partition problem:
Snowflake automatically divides tables into micro-partitions (typically 50-500MB of uncompressed data each). Each micro-partition stores metadata about the min/max values it contains for every column.
When you run a query with a WHERE clause, Snowflake checks this metadata to skip partitions that couldn’t possibly contain your data. This is partition pruning—and when it works, it’s magic.
But here’s the catch: pruning only works if your data is naturally ordered in a way that aligns with your filters.
When I’ve seen this break:
Working with large subscription and order tables, we’d filter by order_date constantly. Sounds perfect for pruning, right?
Except our data wasn’t loaded chronologically. Orders came in from multiple sources, backfills happened, late-arriving data got appended. The result? Every micro-partition contained a mix of dates spanning months.
Query: WHERE order_date = '2024-01-15'
Snowflake’s response: “Well, that date MIGHT be in any of these 10,000 partitions, so I’ll scan them all.”
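Before reaching for a fix, you can see how scattered a column's values actually are across micro-partitions. A minimal check, assuming the same orders table, using Snowflake's built-in clustering metadata function:

-- High average_overlaps / average_depth in the returned JSON means many
-- micro-partitions cover the same order_date ranges, so pruning can't skip much
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');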
The clustering key solution:
We added a clustering key on order_date for our largest tables:
ALTER TABLE orders CLUSTER BY (order_date);
Snowflake reorganizes data so that rows with similar values are stored together. Now each micro-partition contains a narrow date range, and pruning actually works.
Same query. 5GB scanned instead of 500GB. A 99% reduction in data scanned.
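One caveat: automatic clustering runs in the background, so the reorganization isn't instant on a large table. A quick way to watch it converge, assuming the same table and column:

-- Average clustering depth on order_date: lower is better, and it should
-- drop over time as the background reclustering service reorganizes the table
SELECT SYSTEM$CLUSTERING_DEPTH('orders', '(order_date)');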
How to check if you’re pruning effectively:
Run your query and check the query profile. Look for “Partitions scanned” vs “Partitions total”:
-- Your actual query
SELECT *
FROM orders
WHERE order_date = '2024-01-15';

-- Then, in the same session, check the query profile or run:
SELECT
    query_id,
    partitions_scanned,
    partitions_total,
    bytes_scanned,
    (partitions_scanned::float / partitions_total) * 100 AS scan_percentage
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = LAST_QUERY_ID();
What to look for:
Scanning >25% of partitions? Probably not pruning well
Scanning <10%? Good pruning
Scanning 100%? No pruning at all
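You can also preview pruning without spending credits on the full scan. EXPLAIN returns the compiler's per-scan partition estimates (the partitionsTotal and partitionsAssigned columns in the plan output), shown here on the same example query:

-- Compile-only: compare partitionsAssigned vs partitionsTotal for the
-- TableScan on ORDERS without actually executing the query
EXPLAIN
SELECT *
FROM orders
WHERE order_date = '2024-01-15';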
Common culprits:
→ Filtering on columns with random distribution (UUIDs, hashed values)
→ Using functions in WHERE clauses: WHERE DATE(timestamp_col) = … prevents pruning (a range rewrite is sketched after this list)
→ Data loaded out of order without clustering
→ OR conditions across multiple high-cardinality columns
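For the function-in-the-WHERE-clause culprit, the usual fix is to express the filter as a range on the raw column, so Snowflake can still compare it against the partition min/max metadata. A sketch, assuming a hypothetical TIMESTAMP column called created_at:

-- Before: wrapping the column in a function hides it from partition metadata
-- SELECT * FROM orders WHERE DATE(created_at) = '2024-01-15';

-- After: the same filter as a half-open range on the raw column, which can prune
SELECT *
FROM orders
WHERE created_at >= '2024-01-15'
  AND created_at < '2024-01-16';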
My decision framework for clustering:
Cluster when:
Table is large (multi-TB)
You filter/join on specific columns repeatedly
Query profiles show poor pruning
The column has natural ordering (dates, sequential IDs)
Don’t cluster when:
Table is small (<100GB; a quick size check is sketched after this list)
Query patterns vary wildly
Clustering would cost more than it saves (maintenance overhead)
High-cardinality columns with random distribution
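If you're not sure which side of the size threshold a table falls on, the table metadata view has the answer. A quick check, assuming the orders table lives in the PUBLIC schema:

-- Current stored size and row count for a candidate table
SELECT table_name,
       row_count,
       ROUND(bytes / POWER(1024, 3), 1) AS size_gb
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC'   -- adjust to your schema
  AND table_name = 'ORDERS';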
The real cost of bad pruning:
It’s not just slower queries. You’re paying to scan data you’ll immediately discard: the extra scan time keeps the warehouse running longer, and warehouse runtime is what burns credits, even when every one of those rows gets filtered out.
For our daily reporting jobs on clustered tables, we saw 60-70% reductions in both runtime and credit consumption. The clustering maintenance cost? Negligible compared to the savings.
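If you want to verify that claim in your own account rather than take my word for it, the ACCOUNT_USAGE share breaks out automatic clustering credits per table (note the view can lag by a few hours):

-- Credits spent on automatic clustering per table over the last 30 days
SELECT table_name,
       SUM(credits_used) AS clustering_credits,
       SUM(num_bytes_reclustered) AS bytes_reclustered
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY clustering_credits DESC;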
Quick win you can try today:
Check your largest, most-queried tables. Run your common WHERE clause patterns. Look at partition scan ratios.
If you’re scanning >50% of partitions on filtered queries, you’ve found your optimization opportunity.
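If you'd rather let query history surface the candidates for you, a sweep like this works (assuming ACCOUNT_USAGE access; the thresholds are arbitrary starting points, not magic numbers):

-- Recent heavy queries that scanned more than half the partitions of a large table
SELECT query_id,
       LEFT(query_text, 80) AS query_snippet,
       partitions_scanned,
       partitions_total,
       ROUND(partitions_scanned / NULLIF(partitions_total, 0) * 100, 1) AS scan_pct,
       ROUND(bytes_scanned / POWER(1024, 3), 1) AS gb_scanned
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND partitions_total > 1000
  AND partitions_scanned / NULLIF(partitions_total, 0) > 0.5
ORDER BY bytes_scanned DESC
LIMIT 20;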
What’s been your experience with partition pruning? Have you seen dramatic improvements from clustering?
#Snowflake #DataEngineering #PerformanceOptimization #CostOptimization