In Part 3, we manually uploaded a CSV and queried some JSON. That’s fine for a one-off project, but in a real business, data doesn’t arrive once—it flows in constantly. You don’t want to be the person clicking “Upload” at 8:00 AM every morning.
Today, we’re going to look at how Snowflake automates the “Transform” part of your data pipeline using two powerful features: Streams (the “eyes”) and Tasks (the “muscles”).
1. The Stream: Knowing What’s New
If you have a table with 10 million rows and 100 new rows arrive, you don’t want to re-process the whole 10 million. A Stream is an object that sits on top of a table and tracks exactly what has changed since the last time you looked.
Think of it like a bookmark. One subtlety: a plain SELECT against the stream just peeks at the changes. The bookmark only moves to the end when you consume the stream in a DML statement (like an INSERT that reads from it), which is exactly what our task will do below.
-- Create a stream to watch our raw USERS table
CREATE OR REPLACE STREAM MY_FIRST_DB.STAGING.USERS_STREAM
  ON TABLE MY_FIRST_DB.STAGING.USERS;
-- To see what's "new," you just query the stream like a table
SELECT * FROM MY_FIRST_DB.STAGING.USERS_STREAM;
Note: If the stream is empty, the query returns nothing. As soon as you INSERT or UPDATE the USERS table, the stream “fills up” with those changes.
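Here's a quick sketch of what that looks like in practice. (This assumes the STAGING.USERS table from Part 3 has USER_ID, FIRST_NAME, LAST_NAME, and EMAIL columns; adjust to match your own table.)

```sql
-- Insert a test row into the base table
INSERT INTO MY_FIRST_DB.STAGING.USERS (USER_ID, FIRST_NAME, LAST_NAME, EMAIL)
VALUES (101, 'Ada', 'Lovelace', 'ada@example.com');

-- The stream now shows that row, plus metadata columns describing the change:
-- METADATA$ACTION ('INSERT' or 'DELETE') and METADATA$ISUPDATE (TRUE for updates)
SELECT USER_ID, METADATA$ACTION, METADATA$ISUPDATE
FROM MY_FIRST_DB.STAGING.USERS_STREAM;
```

Those METADATA$ columns are how downstream logic tells an insert from an update or a delete.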
2. The Task: Putting it on Autopilot
A Task is simply a SQL statement that runs on a schedule. It uses your virtual warehouse to do the heavy lifting.
Let’s say we want to take new users from our STAGING table and move them into a clean ANALYTICS table every 5 minutes.
-- First, create our "Clean" destination table
CREATE OR REPLACE TABLE MY_FIRST_DB.ANALYTICS.DIM_USERS (
USER_ID NUMBER,
FULL_NAME STRING,
EMAIL_DOMAIN STRING,
LOAD_TIME TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Now, create the muscle (The Task)
CREATE OR REPLACE TASK REFRESH_USERS_TASK
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('MY_FIRST_DB.STAGING.USERS_STREAM')
AS
INSERT INTO MY_FIRST_DB.ANALYTICS.DIM_USERS (USER_ID, FULL_NAME, EMAIL_DOMAIN)
SELECT
  USER_ID,
  FIRST_NAME || ' ' || LAST_NAME,  -- Combining names
  SPLIT_PART(EMAIL, '@', 2)        -- Extracting just the domain
FROM MY_FIRST_DB.STAGING.USERS_STREAM;
3. The “Secret Sauce”: Conditional Execution
Notice the line WHEN SYSTEM$STREAM_HAS_DATA(…). This is one of Snowflake’s best cost-saving features.
If no new data has arrived in the last 5 minutes, the task won't run. This means you don't waste credits spinning up a warehouse just to realize there's nothing to do. The WHEN check happens in Snowflake's cloud services layer, not on your warehouse, so it's effectively free.
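You can run the same check yourself in a worksheet; it returns a simple boolean:

```sql
-- TRUE if the stream has unconsumed changes, FALSE otherwise
SELECT SYSTEM$STREAM_HAS_DATA('MY_FIRST_DB.STAGING.USERS_STREAM');
```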
4. Turning it On
All tasks are created in a “Started” state… just kidding! They are actually created Suspended by default to prevent accidental credit spend. You have to manually flip the switch:
-- Start the automation
ALTER TASK REFRESH_USERS_TASK RESUME;
-- If you need to stop it:
-- ALTER TASK REFRESH_USERS_TASK SUSPEND;
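Once it's running, you'll want to check on it. A rough sketch of two handy tools: EXECUTE TASK triggers a single run on demand (no waiting 5 minutes while testing), and the TASK_HISTORY table function shows recent runs:

```sql
-- Trigger one run immediately, useful for testing
EXECUTE TASK REFRESH_USERS_TASK;

-- Review recent runs: state, timing, and any error messages
SELECT NAME, STATE, SCHEDULED_TIME, ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'REFRESH_USERS_TASK'))
ORDER BY SCHEDULED_TIME DESC;
```

A task that hits an error will show it in ERROR_MESSAGE rather than failing loudly, so it's worth a peek after the first few runs.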
What You Now Know
* Streams track “delta” changes (Inserts, Updates, Deletes) so you only process what’s new.
* Tasks allow you to schedule SQL logic directly inside Snowflake.
* Efficiency: By combining them, you create a “Change Data Capture” (CDC) pipeline that only costs money when there is actually work to be done.
What’s Next?
In Part 5, we’ll wrap up the series by looking at Time Travel. Yes, really. Snowflake allows you to query data as it existed up to 90 days in the past—perfect for when a “muscle-bound” task accidentally deletes something it shouldn’t have!