SQL

Traders and quantitative analysts often view data warehouses as slow, static repositories—places where time-series tick data goes to be archived rather than analyzed in real-time. Historically, if you wanted to scan millions of rows of market ticks to find complex structural patterns like a “V-shape capitulation” or a futures curve shifting from contango to backwardation, you had two choices:
1. Pull massive datasets out of the warehouse into memory via Python/Pandas.
2. Write horrific, unmaintainable multi-way self-joins in SQL that choke your warehouse cluster.
Enter **MATCH_RECOGNIZE**. Introduced in the ISO SQL:2016 standard and natively optimized in modern cloud data platforms like Snowflake, this clause brings the declarative power of Regular Expressions directly to sequential rows.
By executing a deterministic finite automaton (DFA) state machine natively within the database engine, it scans streaming or historical tick data sequentially within localized partitions. No self-joins. No massive memory overhead. Just pure, execution-optimized pattern matching.
Let’s look at three production-ready patterns tailored specifically for stock and futures markets.
## Pattern 1: Spotting Intraday “V-Shape” Capitulations
For index futures (like the E-mini S&P 500) or high-beta equities, an aggressive intraday washout followed by an immediate volume-backed reversal is a classic institutional liquidity hunt.
To catch this, our SQL pattern needs to find a series of lower lows, identify the exact tick where the trend breaks, and confirm a multi-step aggressive recovery.
“`sql
SELECT
    ticker,
    match_start_time,
    capitulation_low_price,
    recovery_end_time,
    total_drop_pct,
    total_recovery_pct
FROM futures_ticks
MATCH_RECOGNIZE (
    PARTITION BY ticker
    ORDER BY tick_timestamp
    MEASURES
        FIRST(down.tick_timestamp) AS match_start_time,
        LAST(down.price) AS capitulation_low_price,
        LAST(up.tick_timestamp) AS recovery_end_time,
        ((FIRST(down.price) – LAST(down.price)) / FIRST(down.price)) * 100 AS total_drop_pct,
        ((LAST(up.price) – LAST(down.price)) / LAST(down.price)) * 100 AS total_recovery_pct
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST up
    PATTERN (base down+ up+)
    DEFINE
        down AS price < PREV(price),
        up   AS price > PREV(price)
);

“`
### Anatomy of the State Machine
* **PATTERN (base down+ up+)**: The engine seeks an initial row (base), requires at least one or more consecutive rows where the price trends lower (down+), and immediately transitions to requiring one or more rows where the price trends higher (up+).
* **AFTER MATCH SKIP TO LAST up**: Crucial for optimization. Once a V-shape is confirmed, the engine resets its pointer to the absolute end of the recovery leg, preventing repetitive, overlapping partial matches on the same dataset.
## Pattern 2: Tracking Futures Term Structure Shifts (Contango to Backwardation)
In commodities and futures trading, the shape of the forward curve tells the true story of supply and demand. Normally, markets are in *contango* (longer-dated contracts cost more than near-month contracts due to storage costs). When a sudden supply shock hits, the curve flips into *backwardation* (near-month contracts trade at a massive premium).
Tracking this structural pivot across daily bars requires a state machine that ignores minor daily noise and only triggers when a sustained trend shift occurs.
“`sql
SELECT
    commodity_group,
    shift_date,
    initial_spread,
    final_spread
FROM futures_curve_daily
MATCH_RECOGNIZE (
    PARTITION BY commodity_group
    ORDER BY trading_date
    MEASURES
        FIRST(contango.trading_date) AS shift_date,
        FIRST(contango.cl2_minus_cl1) AS initial_spread,
        LAST(backwardation.cl2_minus_cl1) AS final_spread
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (contango{3,} backwardation{2,})
    DEFINE
        contango AS cl2_minus_cl1 > 0,
        backwardation AS cl2_minus_cl1 <= 0
);

“`
### Anatomy of the State Machine
* **PATTERN (contango{3,} backwardation{2,})**: Here we leverage **greedy quantifiers**. The engine demands at least 3 consecutive trading days of verified contango to establish a stable regime baseline, immediately followed by at least 2 consecutive days of backwardation to validate that a genuine macro shift is underway.
## Pattern 3: Detecting Double Tops with Volatility Tolerances
The “Double Top” is a classic technical reversal pattern, but coding it programmatically is notoriously difficult because market peaks are rarely identical to the penny. We need to introduce mathematical tolerances.
Furthermore, instead of just returning a single summary row, we want to output *every single tick* involved in the pattern so we can feed it into a downstream visualization layer or BI dashboard.
“`sql
SELECT *
FROM stock_bars
MATCH_RECOGNIZE (
    PARTITION BY ticker
    ORDER BY bar_timestamp
    MEASURES
        FIRST(peak1.bar_timestamp) AS first_top_time,
        peak1.high_price AS first_top_price,
        valley.low_price AS valley_floor_price,
        peak2.high_price AS second_top_price,
        LAST(drop.bar_timestamp) AS confirmation_time
    ALL ROWS PER MATCH
    AFTER MATCH SKIP TO LAST drop
    PATTERN (start peak1 valley+ peak2 drop+)
    DEFINE
        peak1 AS high_price > PREV(high_price),
        valley AS low_price < PREV(low_price) AND low_price < peak1.high_price * 0.95,
        peak2 AS high_price > PREV(high_price)
                  AND high_price >= peak1.high_price * 0.98
                  AND high_price <= peak1.high_price * 1.02,
        drop   AS close_price < PREV(close_price) AND close_price < valley.low_price
);

“`
### Anatomy of the State Machine
* **Cross-Variable Referencing**: Notice how valley and peak2 evaluate conditions dynamically based on values captured during peak1 (e.g., peak1.high_price * 0.98). The engine natively holds the variable state across rows within the partition window.
* **ALL ROWS PER MATCH**: Instead of aggregating the match into a single row, this tells the engine to stream out every row that successfully qualified for the pattern, appending the corresponding pattern variables to each row for easy charting.
## Architectural Principles for Production Scaling
If you are running these patterns over massive datasets (hundreds of millions of rows of market data), keep these three optimization principles in mind:
1. **Align Partitions with Physical Clustering**: Ensure your PARTITION BY columns match the underlying physical sorting or micro-partition keys of your cloud data warehouse (e.g., partitioning by ticker or root_symbol). This allows the query coordinator to prune irrelevant data blocks entirely.
2. **Enforce Deterministic Ordering**: The ORDER BY clause inside MATCH_RECOGNIZE must be completely unambiguous. If multiple ticks share the exact same timestamp, append a unique sequence ID or sequence number (ORDER BY tick_timestamp, sequence_id) to ensure the state machine evaluates the rows consistently.
3. **Avoid Wide-Open Quantifiers**: Using lazy or unbounded wildcards (like * or + without constraints) over highly volatile time-series forces the engine to buffer deep histories in memory to check for potential matches. Where possible, use bounded quantifiers (like {1,5}) to limit memory footprint.
By utilizing MATCH_RECOGNIZE, data teams can keep complex financial analytics where they belong—close to the data, running at bare-metal speed, using standard, declarative SQL.

The Warehouse Sizing Paradox: Why I Sometimes Choose XL Over Small
“Always use the smallest warehouse possible to save money.”
I heard this advice constantly when I started with Snowflake. It sounds logical—smaller warehouses cost less per hour, so naturally they should be cheaper, right?
Except the math doesn’t always work that way.
Here’s what I’ve observed:
Snowflake charges by the second with a 60-second minimum. The cost difference between warehouse sizes is linear, but the performance difference often isn’t.
The actual formula is simple:
Total Cost = Credits per Second × Runtime in Seconds
A Small warehouse might be 1/4 the cost per second of an XL, but if it takes 5x longer to complete the same query, you’re paying more overall.
When I’ve seen this matter most:
Working with subscription and order data, certain query patterns consistently benefit from larger warehouses:
→ Customer lifetime value calculations across millions of subscribers
→ Daily cohort analysis with complex retention logic
→ Product affinity analysis joining order details with high SKU cardinality
→ Aggregating subscription events over multi-year periods
These workloads benefit dramatically from parallelization. An XL warehouse has 8x the compute resources of an XS, and for the right queries, it can complete them in less than 1/8th the time.
A simple experiment you can run:
— Test with Small warehouse
USE WAREHOUSE small_wh;
SELECT SYSTEM$START_QUERY_TIMER();

SELECT
    subscription_plan,
    customer_segment,
    COUNT(DISTINCT customer_id) as subscribers,
    SUM(order_value) as total_revenue,
    AVG(order_value) as avg_order_value,
    COUNT(DISTINCT order_id) as total_orders
FROM orders
WHERE order_date >= ‘2023-01-01’
GROUP BY subscription_plan, customer_segment
HAVING COUNT(DISTINCT customer_id) > 100;

— Note the execution time and credits used in query profile

— Now test with XL warehouse
USE WAREHOUSE xl_wh;
— Run the same query
Check the query profile for each:
Execution time
Credits consumed (Execution Time × Warehouse Size Credits/Second)
Total cost
The decision framework I use:
Size up when:
Query runtime > 2 minutes on current warehouse
Query profile shows high parallelization potential
You’re running the query repeatedly (daily pipelines)
Spillage to remote disk is occurring
Stay small when:
Queries are simple lookups or filters
Runtime is already under 30 seconds
Workload is highly sequential (limited parallelization)
It’s truly ad-hoc, one-time analysis
The nuance that surprised me:
It’s not just about individual query cost—it’s about total warehouse utilization. If your Small warehouse runs 10 queries in 100 minutes, but an XL runs them in 20 minutes, you’re paying for 80 fewer minutes of warehouse time. That matters when you’re paying for auto-suspend delays, concurrent users, or just opportunity cost.
My practical approach:
I start with Medium for most workloads. Then I profile:
Queries consistently taking 3+ minutes → test on Large or XL
Queries under 1 minute → consider downsizing to Small
Monitor credit consumption patterns weekly
The goal isn’t to find the “right” size—it’s to match warehouse size to workload characteristics.
Want to test this yourself?
Here’s a quick query to see your warehouse credit consumption:
SELECT
    warehouse_name,
    SUM(credits_used) as total_credits,
    COUNT(*) as query_count,
    AVG(execution_time)/1000 as avg_seconds,
    SUM(credits_used)/COUNT(*) as credits_per_query
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    AND warehouse_name IS NOT NULL
GROUP BY warehouse_name
ORDER BY total_credits DESC;
This shows you which warehouses are consuming credits and whether you might benefit from right-sizing.
The counterintuitive truth:
The cheapest warehouse per hour isn’t always the cheapest warehouse per result. Sometimes spending more per second means spending less overall.
What’s been your experience with warehouse sizing? Have you found scenarios where bigger was actually cheaper?
#Snowflake #DataEngineering #CostOptimization #CloudDataWarehouse

Thomas [W] Marshall

Database and Artificial Intelligence

Mastering Financial Pattern Matching: Native Time-Series Analytics with SQL MATCH_RECOGNIZE

The Warehouse Sizing Paradox