Snowflake for Beginners Part 4: Building an Automated Pipeline

Posted by thomaswmarshall

In Part 3, we manually uploaded a CSV and queried some JSON. That’s fine for a one-off project, but in a real business, data doesn’t arrive once—it flows in constantly. You don’t want to be the person clicking “Upload” at 8:00 AM every morning.
Today, we’re going to look at how Snowflake automates the “Transform” part of your data pipeline using two powerful features: Streams (the “eyes”) and Tasks (the “muscles”).
1. The Stream: Knowing What’s New
If you have a table with 10 million rows and 100 new rows arrive, you don’t want to re-process the whole 10 million. A Stream is an object that sits on top of a table and tracks exactly what has changed since the last time you looked.
Think of it like a bookmark. When you “read” the new data, the bookmark moves to the end.
— Create a stream to watch our raw USERS table
CREATE OR REPLACE STREAM USERS_STREAM ON TABLE MY_FIRST_DB.STAGING.USERS;

— To see what’s “new,” you just query the stream like a table
SELECT * FROM USERS_STREAM;

Note: If the stream is empty, the query returns nothing. As soon as you INSERT or UPDATE the USERS table, the stream “fills up” with those changes.
2. The Task: Putting it on Autopilot
A Task is simply a SQL statement that runs on a schedule. It uses your virtual warehouse to do the heavy lifting.
Let’s say we want to take new users from our STAGING table and move them into a clean ANALYTICS table every 5 minutes.
— First, create our “Clean” destination table
CREATE OR REPLACE TABLE MY_FIRST_DB.ANALYTICS.DIM_USERS (
    USER_ID NUMBER,
    FULL_NAME STRING,
    EMAIL_DOMAIN STRING,
    LOAD_TIME TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

— Now, create the muscle (The Task)
CREATE OR REPLACE TASK REFRESH_USERS_TASK
    WAREHOUSE = COMPUTE_WH
    SCHEDULE = ‘5 MINUTE’
    WHEN SYSTEM$STREAM_HAS_DATA(‘MY_FIRST_DB.STAGING.USERS_STREAM’)
AS
    INSERT INTO MY_FIRST_DB.ANALYTICS.DIM_USERS (USER_ID, FULL_NAME, EMAIL_DOMAIN)
    SELECT
        USER_ID,
        FIRST_NAME || ‘ ‘ || LAST_NAME, — Combining names
        SPLIT_PART(EMAIL, ‘@’, 2)       — Extracting just the domain
    FROM MY_FIRST_DB.STAGING.USERS_STREAM;

3. The “Secret Sauce”: Conditional Execution
Notice the line WHEN SYSTEM$STREAM_HAS_DATA(…). This is one of Snowflake’s best cost-saving features.
If no new data has arrived in the last 5 minutes, the task won’t run. This means you don’t waste credits spinning up a warehouse just to realize there’s nothing to do. Snowflake checks the stream metadata for free.
4. Turning it On
All tasks are created in a “Started” state… just kidding! They are actually created Suspended by default to prevent accidental credit spend. You have to manually flip the switch:
— Start the automation
ALTER TASK REFRESH_USERS_TASK RESUME;

— If you need to stop it:
— ALTER TASK REFRESH_USERS_TASK SUSPEND;

What You Now Know
* Streams track “delta” changes (Inserts, Updates, Deletes) so you only process what’s new.
* Tasks allow you to schedule SQL logic directly inside Snowflake.
* Efficiency: By combining them, you create a “Change Data Capture” (CDC) pipeline that only costs money when there is actually work to be done.
What’s Next?
In Part 5, we’ll wrap up the series by looking at Time Travel. Yes, really. Snowflake allows you to query data as it existed up to 90 days in the past—perfect for when a “muscle-bound” task accidentally deletes something it shouldn’t have!

Snowflake for Beginners Part 3: Creating, Loading, and The Magic of Semi-Structured Data

Posted by thomaswmarshall

In Part 2, we explored Snowflake’s built-in sample data and learned how virtual warehouses power our queries. But in the real world, you aren’t just querying someone else’s data—you’re bringing in your own.
Today, we move from “read-only” to “read-write.” We will create our own storage hierarchy, upload a CSV file directly through the browser, and then tackle Snowflake’s “superpower”: the ability to query JSON data as easily as a standard table.
1. Building the Foundation
Before we can load anything, we need a container. Run these in a new worksheet to create your first custom database and schema. Note that we are using the SYSADMIN role here, which is the standard role for creating objects.
USE ROLE SYSADMIN;

CREATE DATABASE MY_FIRST_DB;
CREATE SCHEMA MY_FIRST_DB.STAGING;

— Verify it worked
SHOW SCHEMAS IN DATABASE MY_FIRST_DB;

2. Loading a CSV (The Easy Way)
For files under 250MB, you don’t need a complex pipeline; you can use the Snowsight UI.
First, let’s create a destination table for some imaginary “App Users”:
CREATE OR REPLACE TABLE MY_FIRST_DB.STAGING.USERS (
    USER_ID NUMBER,
    FIRST_NAME STRING,
    LAST_NAME STRING,
    EMAIL STRING,
    SIGNUP_DATE DATE
);

To load your data:
* Navigate to Data → Databases in the left sidebar.
* Drill down: MY_FIRST_DB → STAGING → Tables → USERS.
* In the top right, click Load Data.
* Drag and drop your CSV file.
* When prompted for a File Format, you can create a “Default CSV” format in the wizard—just ensure “Header lines to skip” is set to 1 if your CSV has column names.
3. The VARIANT Superpower: Querying JSON
Traditional databases often require you to “flatten” JSON into rows and columns before you can even see it. Snowflake handles this differently using the VARIANT data type, which stores JSON exactly as it is but allows you to query it with SQL speed.
Let’s create a table for raw JSON data:
CREATE OR REPLACE TABLE MY_FIRST_DB.STAGING.RAW_LOGS (
    EVENT_DATA VARIANT
);

If you have a JSON record like {“device”: “mobile”, “event”: “login”, “user_info”: {“id”: 101, “location”: “NY”}}, you can query it directly using dot notation:
SELECT
    EVENT_DATA:device::STRING AS device_type,
    EVENT_DATA:user_info.location::STRING AS city
FROM RAW_LOGS;

The : grabs the key, and the :: casts it to a specific data type.
4. Zero-Copy Cloning: The “Save Game” Feature
One of the most useful features for beginners is Cloning. Suppose you are about to run a massive update on your USERS table and want a safety net. You can create an instant “Clone” of the entire table:
CREATE TABLE USERS_BACKUP CLONE USERS;

This doesn’t copy the data (so it doesn’t cost extra storage initially); it just creates a new pointer to the existing data. It’s a perfect “Save Game” point for your data.
What You Now Know
* Object Creation: You know how to build your own Databases and Schemas.
* Web Loading: You can move data from your local machine into the cloud without external ETL tools.
* Semi-Structured Power: You’ve seen how the VARIANT type allows Snowflake to act like a NoSQL database when needed.
* Cloning: You understand how to protect your work with instant, metadata-only clones.
What’s Next?
Now that you have data in Snowflake, how do you keep it clean automatically? In Part 4, we will dive into Tasks and Streams—Snowflake’s built-in automation tools that allow you to transform data the second it arrives.

Snowflake for Beginners, Part 2: Your First Queries and Understanding Warehouses

Posted by thomaswmarshall

In Part 1, we signed up for a Snowflake account and took a tour of the Snowsight interface. In this post, we’ll write our first real SQL queries against Snowflake’s built-in sample data, and we’ll pull back the curtain on how virtual warehouses work — including why they matter for your credit balance.
The Three-Tier Data Model
Before we write a single line of SQL, it’s worth spending a moment on how Snowflake organizes data. Everything lives inside a three-level hierarchy:
Database → Schema → Table
A database is the top-level container. Think of it as a folder for an entire project or domain — you might have a database called MARKETING, another called FINANCE, and so on.
A schema is a namespace inside a database. It groups related objects — tables, views, functions — that belong together. A common pattern is to have a RAW schema for unprocessed incoming data and a ANALYTICS schema for the cleaned, transformed version. Schemas also serve an access control purpose: you can grant users permission to one schema without exposing another.
A table is what you’d expect: rows and columns. Nothing exotic here.
When you reference an object in SQL, you can use its fully-qualified name in the form DATABASE.SCHEMA.TABLE, like this:
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;
Or, if you’ve already told Snowflake which database and schema you’re working in, you can refer to the table by name alone. You tell Snowflake your context using USE statements:
USE DATABASE SNOWFLAKE_SAMPLE_DATA;
USE SCHEMA TPCH_SF1;

— Now this works without the full path:
SELECT * FROM ORDERS;
Both approaches work. Fully-qualified names are more explicit and less likely to surprise you; USE statements are more convenient for interactive exploration. For now, we’ll use fully-qualified names so it’s always obvious what we’re querying.
The Sample Data: What’s Already There
Every new Snowflake trial account comes pre-loaded with a shared database called SNOWFLAKE_SAMPLE_DATA. This database doesn’t consume any of your storage quota — it’s a live read-only connection to data hosted by Snowflake itself. You can run queries on it, but you can’t insert, update, or delete rows (which is fine for learning).
To explore what’s inside, navigate to Data → Databases in the left navigation bar. Click on SNOWFLAKE_SAMPLE_DATA to expand it, and you’ll see several schemas. The one we’ll use today is TPCH_SF1.
TPC-H is an industry-standard benchmark dataset that models a wholesale supplier’s business — orders, customers, products, suppliers, and shipping. It’s been used to benchmark databases for decades, which means it’s realistic enough to practice real analytical thinking on. The SF1 suffix stands for “scale factor 1,” meaning it contains a base-sized dataset with several million rows — large enough to be interesting, small enough to query without burning through credits.
The TPCH_SF1 schema contains eight tables:
Table
What it represents
ORDERS
Order headers — one row per order
LINEITEM
Individual line items within orders — the largest table at ~6 million rows
CUSTOMER
Customer records
SUPPLIER
Supplier records
PART
Product/part catalog
PARTSUPP
Which suppliers carry which parts, and at what price
NATION
Countries (25 rows)
REGION
Geographic regions (5 rows)
Opening a Worksheet
Navigate to Projects → Worksheets in the left nav and click the + button in the top-right corner to create a new worksheet. A blank SQL editor opens.
You’ll notice two dropdowns near the top of the editor: one for the active database and one for the active schema. These are the worksheet-level equivalents of USE DATABASE and USE SCHEMA. Setting them here means every query in this worksheet defaults to that context without needing fully-qualified names.
For now, leave them as-is and we’ll use fully-qualified names in our queries. This keeps things unambiguous.
There’s also a warehouse selector — a dropdown that shows which virtual warehouse will execute your queries. If your account came with a default warehouse (often named COMPUTE_WH), it will be pre-selected. If the dropdown is empty, we’ll address that in the warehouse section below.
Running Your First Query
Type the following into the worksheet editor:
SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM;
To run it, look for the Run button — it appears as a blue play button (▶) in the top-right area of the editor. Click it.
Important note about running queries: The Run button (or its equivalent button in the toolbar) executes only the SQL statement your cursor is currently positioned within — not everything in the worksheet. If you have five statements in a worksheet and want to run all of them in sequence, look for a small dropdown arrow next to the Run button and choose Run All. This distinction trips up nearly everyone at least once, so it’s worth knowing before you spend time wondering why only one thing happened.
After a moment, the results panel will appear below the editor showing a single number: around 6,001,215. That’s how many line items exist in this table. Not bad for a first query.
A Few More Queries to Try
Let’s do something a bit more interesting. Paste each of these in and run them one at a time — position your cursor inside the statement you want to run, then click Run.
How many orders were placed by customers in each region?
SELECT
    r.R_NAME        AS region,
    COUNT(o.O_ORDERKEY) AS total_orders
FROM
    SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS      o
    JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER c ON o.O_CUSTKEY = c.C_CUSTKEY
    JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION    n ON c.C_NATIONKEY = n.N_NATIONKEY
    JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.REGION    r ON n.N_REGIONKEY = r.R_REGIONKEY
GROUP BY
    r.R_NAME
ORDER BY
    total_orders DESC;
This query joins five tables together, walking from individual orders all the way up through the geographic hierarchy to regions. You’ll get five rows back — one per region.
What’s the total revenue per order status?
SELECT
    O_ORDERSTATUS                           AS order_status,
    SUM(O_TOTALPRICE)                       AS total_revenue,
    COUNT(*)                                AS order_count,
    ROUND(AVG(O_TOTALPRICE), 2)             AS avg_order_value
FROM
    SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
GROUP BY
    O_ORDERSTATUS
ORDER BY
    total_revenue DESC;
Notice how fast these queries return. That’s not magic — it’s the combination of Snowflake’s columnar storage format and the micro-partitioning architecture that automatically indexes data behind the scenes. You didn’t create a single index. Snowflake handled that.
Previewing a Table Without Writing SQL
One of Snowsight’s small conveniences: you don’t always need to write SQL to see what’s in a table. In the left navigation, go to Data → Databases, drill down to SNOWFLAKE_SAMPLE_DATA → TPCH_SF1, and click on any table name — say, CUSTOMER. A panel opens showing the table’s column definitions, their data types, and a Data Preview tab that shows you the first rows without needing a query at all.
This is useful when you’re exploring an unfamiliar schema and want to understand what a table contains before writing against it.
Understanding Virtual Warehouses
Now that you’ve run a few queries, let’s talk about what was actually doing the work — the virtual warehouse.
In Snowflake, a virtual warehouse is a cluster of compute resources (CPU, memory, and local SSD cache) that executes your SQL. It’s completely separate from storage. Storage holds your data permanently; the warehouse is a temporary, rentable engine that reads from that storage and processes your queries.
This separation is why Snowflake can do something traditional databases can’t: you can have multiple independent warehouses all querying the same data at the same time, with no interference between them. One warehouse running an ETL job doesn’t slow down another warehouse running your BI dashboards.
Sizes and Credits
Warehouses come in T-shirt sizes. Each step up doubles the compute resources — and doubles the credit consumption rate:
Size
Credits per hour
X-Small
1
Small
2
Medium
4
Large
8
X-Large
16
2X-Large
32
… and up
…
For the queries in this post — counting rows, joining the TPC-H tables — an X-Small warehouse is perfectly sufficient. You’d only need to scale up when you’re regularly querying billions of rows or running complex transformations on large datasets. Starting small and scaling up when queries feel slow is the right approach.
Credits themselves are Snowflake’s billing currency. On a trial account, you start with $400 worth. On a paid account, credits cost roughly $2–$4 each depending on your cloud provider and whether you’re on on-demand or pre-purchased pricing. An X-Small warehouse burning 1 credit per hour therefore costs about $2–$4 per hour of active use.
The 60-Second Minimum — and Why It Matters
Snowflake bills per second, but with a minimum charge of 60 seconds each time a warehouse resumes from a suspended state. This means if you run a query that takes 3 seconds and then the warehouse sits idle long enough to auto-suspend, you’ll be billed for a full 60 seconds of that size warehouse.
For an X-Small warehouse at 1 credit/hour, 60 seconds of use costs 1/60th of a credit — roughly 3 cents at on-demand pricing. Barely noticeable for learning. But this minimum applies at every size, so a 2X-Large warehouse that resumes, runs a 5-second query, and then suspends has still billed you for a full minute at 32 credits/hour. That’s about $1 per query. Worth keeping in mind as you scale up.
Auto-Suspend and Auto-Resume
Two warehouse settings deserve your attention early:
Auto-Suspend tells Snowflake to automatically shut down the warehouse after a period of inactivity. The default is 10 minutes — meaning if you run a query and then walk away, the warehouse will run idle for 10 minutes before Snowflake turns it off. For a learning account, that’s fine. For production, it’s often set to 1–5 minutes to avoid paying for idle compute.
Auto-Resume is the opposite: when a query hits a suspended warehouse, Snowflake automatically starts it back up without any manual intervention. There’s typically a 2–5 second delay while the warehouse provisions — you’ll see a spinner in the worksheet during this time. Auto-resume is enabled by default and you should leave it on.
You can inspect and change these settings by going to Compute → Warehouses in the nav bar, finding your warehouse, and clicking into it. The settings panel will show you the current auto-suspend timeout, let you change the size, and show a credit usage chart for the warehouse.
Checking Your Credit Usage
To see how many credits you’ve consumed so far, navigate to Admin → Cost Management (the exact label may vary depending on your account configuration, but it’s in the Admin section). You’ll find a breakdown of credit consumption by warehouse, date, and service type. On a fresh trial account after running the queries in this post, you should have used a fraction of a credit — less than a nickel.
A Note on the USE Commands
We used fully-qualified names throughout this post, but you’ll see USE DATABASE and USE SCHEMA constantly in Snowflake documentation and tutorials. For reference, here’s how they work:
— Set the active database for this session
USE DATABASE SNOWFLAKE_SAMPLE_DATA;

— Set the active schema within that database
USE SCHEMA TPCH_SF1;

— Set the active warehouse
USE WAREHOUSE COMPUTE_WH;

— Now this works without any path at all:
SELECT * FROM ORDERS LIMIT 10;
These context-setting commands are session-scoped — they apply only to your current connection. If you open a new worksheet tab, you’d need to run them again (or use the dropdowns in the worksheet header, which do the same thing). They don’t change anything permanent in your account.
What You Now Know
You’ve queried a multi-million-row dataset, joined five tables together, and watched Snowflake return results in seconds. You also understand the virtual warehouse — that it’s the engine doing the work, that it’s separately billed from storage, and that the auto-suspend and auto-resume settings govern both cost and availability.
A quick recap of the key mental models from this post:
Data lives in a Database → Schema → Table hierarchy. Reference objects with their full path or set context with USE statements.
The virtual warehouse is compute, separate from storage. You pay for it only while it’s running.
Warehouses bill per second with a 60-second minimum each time they resume.
Auto-suspend stops an idle warehouse. Auto-resume starts it back up when a query arrives. Leave both enabled.
When in doubt, start with an X-Small warehouse and resize only if queries are genuinely slow.
What’s Next
In Part 3, we’ll create our own database and tables, load some real data into Snowflake using the web interface’s file upload feature, and write queries against data we own. We’ll also look at Snowflake’s VARIANT column type — the gateway to querying JSON and other semi-structured data without needing to flatten it first.
This series is designed to be hands-on. If you ran into anything unexpected — a warehouse that wouldn’t start, an error message you didn’t recognize — drop it in the comments. Real errors make the best teaching moments.

Snowflake for Beginners, Part 1: Signing Up and Finding Your Way Around

Posted by thomaswmarshall

This is the first post in a series designed to take you from zero to confident with Snowflake — the cloud data platform that’s changed how companies store, query, and share data. No prior Snowflake experience required.

—

## What Is Snowflake, and Why Should You Care?

Before you click “Sign Up,” it’s worth spending sixty seconds on *why* Snowflake exists. Traditional databases make you choose between storage and compute — you buy a server, it handles both, and when you need more power you buy a bigger server. Snowflake separates those two concerns entirely. Storage lives in cloud object stores (Amazon S3, Azure Blob, or Google Cloud Storage depending on which cloud you choose). Compute — the engines that actually run your queries — is rented on demand and scales independently. This means you pay only for what you use, and you never have to think about hardware.

That architecture is why Snowflake has become the go-to platform for data teams at companies of every size. You’ll hear it described as a “data warehouse,” but it’s grown into something closer to a full data platform: you can query structured tables, semi-structured JSON, load files, build machine learning pipelines, and even publish datasets for other organizations to consume — all from the same interface.

That interface is called **Snowsight**, and it’s where you’ll spend most of your time. Let’s get you into it.

—

## Step 1: Creating a Trial Account

Snowflake offers a **30-day free trial** with $400 worth of free credits — more than enough to learn, experiment, and build something real.

Head to [snowflake.com](https://www.snowflake.com) and click **Start for Free**. You’ll be asked for:

– **Your name and email address** — used to create your Snowflake user profile. Use a real email; you’ll need to verify it.
– **Your company name** — required, but don’t overthink it. If you’re experimenting solo, your own name works fine.
– **A password** — Snowflake enforces a reasonably strong password policy. At least 8 characters, a mix of cases, numbers, and symbols.

### Choosing a Cloud Provider and Region

This is the first decision that trips up new users, and it feels more consequential than it is. You’ll be asked to select a **cloud provider** (AWS, Azure, or Google Cloud) and a **region** (e.g., US East, EU West, etc.).

For learning purposes, any of the three cloud providers will give you an identical Snowflake experience — Snowflake abstracts away the underlying cloud almost completely. The only practical consideration: if you already use AWS, Azure, or GCP for other workloads and eventually plan to connect Snowflake to data already living in one of those clouds, pick the same provider. Data transfer within a provider’s network is faster and cheaper than crossing providers.

For region, choose the one geographically closest to you. Latency differences are minimal for a trial, but it’s a good habit.

### Choosing an Edition

The trial defaults to **Enterprise Edition**, which is exactly what you want. It unlocks multi-cluster warehouses, time travel up to 90 days, and other features you’ll want to explore. You won’t be charged automatically when the trial ends — Snowflake asks you to explicitly enter payment information to continue.

After submitting the form, check your inbox for an activation email. Click the link, set your login, and you’re in.

—

## Step 2: Your First Look at Snowsight

When you land inside Snowflake for the first time, you’re looking at **Snowsight** — the web interface that replaced Snowflake’s original “Classic Console” in 2022. As of 2025, Snowsight is the only interface available to new accounts.

The layout is divided into two primary areas: a **left navigation bar** that runs the full height of the screen, and a **main content area** that changes depending on what you select. Let’s walk through each section of that left nav from top to bottom.

—

## The Navigation Bar, Explained

### 🏠 Home

The house icon at the top of the nav bar takes you to the **Home page** — your dashboard and jumping-off point. Here you’ll find:

– **Recently updated worksheets**, so you can quickly get back to whatever you were working on last.
– **Quick Actions**, a set of prominent buttons tailored to your current role. These shortcuts change depending on your permissions, but for a trial admin account you’ll typically see options to create a worksheet, upload a file, or explore sample data.
– A **Search bar** at the top that lets you search across database objects, worksheets, dashboards, and more.

The Home page is intentionally minimal — it’s a launchpad, not a dashboard crammed with metrics.

### 📊 Projects (Worksheets, Dashboards, Notebooks)

This section is where you’ll do the bulk of your actual work. It contains three sub-items worth understanding:

**Worksheets** are Snowflake’s code editor. Open one and you get a SQL editor on the right and an object explorer on the left — similar in feel to a lightweight version of SQL Server Management Studio or DBeaver. Worksheets support both SQL and Snowpark Python. One subtlety worth knowing early: by default, clicking the **Run** button (or hitting Cmd/Ctrl + Enter) executes only the statement your cursor is currently on, not the entire file. If you want to run everything, use the **Run All** option from the dropdown next to the button. This surprises nearly every newcomer.

**Dashboards** let you combine multiple query results into a visual report using charts — bar, line, scatter, and more. They’re built on top of worksheets, so you’ll create the SQL first, then pin the results to a dashboard tile.

**Notebooks** are a newer addition, offering an experience similar to Jupyter notebooks. You write code in cells, mix SQL and Python, and see results inline. They’re particularly useful for exploratory data analysis and machine learning workflows.

### 🗃️ Data (Databases, Marketplace, Private Sharing)

This is where you manage the *stuff* Snowflake stores and serves. Three main areas:

**Databases** opens the object explorer — a tree view of every database, schema, table, view, and function your account contains. You can click into a table to preview its columns and even sample its rows without writing a single line of SQL. For a fresh trial account, you’ll find some pre-loaded sample datasets here (look for a database called `SNOWFLAKE_SAMPLE_DATA`), which are great for practicing queries without needing to load your own data yet.

**Marketplace** is one of Snowflake’s most distinctive features. It’s a data exchange where thousands of third-party datasets are available — weather data, financial data, demographic data, and more. Many are free. When you access a Marketplace dataset, the data doesn’t copy into your account; instead, Snowflake gives you a live, read-only connection to the provider’s data. This is the platform’s “data sharing” architecture in action.

**Private Sharing** is the enterprise counterpart to the Marketplace — it’s how you share your own data with specific other Snowflake accounts, or receive shared data from partners, without any file transfers or ETL pipelines.

### ⚡ Activity (Query History, Copy History, Task History)

**Query History** is invaluable for debugging and auditing. Every query that runs against your account — from every user, every tool, every automated job — is logged here with execution time, bytes scanned, warehouse used, and the full SQL text. If a query is running slow, this is your first stop. For trial accounts, history retention is 14 days.

**Copy History** shows the status of bulk data loads performed via Snowflake’s `COPY INTO` command, which is the primary way to load files from cloud storage into tables.

**Task History** tracks the execution runs of **Tasks** — Snowflake’s scheduler for automating SQL or procedural logic on a cron schedule.

### 🖥️ Compute (Warehouses, Resource Monitors, Budgets)

**Warehouses** is a section you’ll visit early and often. In Snowflake, a “warehouse” isn’t a building — it’s a cluster of compute that executes your queries. Think of it as the engine you rent by the minute. Warehouses come in T-shirt sizes: XS, S, M, L, XL, and up to 6XL. An XS warehouse is sufficient for most learning exercises and costs about 1 credit per hour. Warehouses auto-suspend when idle (you set the timeout — 5 minutes is a sensible default) and auto-resume the moment a query hits them. You will almost never need to manually start or stop a warehouse.

**Resource Monitors** let you cap credit consumption on a warehouse — set a monthly limit and choose whether Snowflake should notify you, suspend the warehouse, or suspend and notify when the limit is hit. Essential for production accounts; good to understand even on a trial.

**Budgets** provide a higher-level spending view across your whole account, letting you set targets and track consumption by category.

### 👤 Admin (Users, Roles, Warehouses, Accounts)

The Admin section is where account administrators manage users, roles, and security policies. As the owner of a trial account, you’re automatically given the **ACCOUNTADMIN** role — the most powerful role in Snowflake. It can see and do everything.

This is a good moment to understand Snowflake’s **role-based access control** system, because it will come up constantly. Rather than granting permissions directly to users, Snowflake grants permissions to *roles*, and roles are then assigned to users. The built-in hierarchy looks like this:

– **ACCOUNTADMIN** — full account control. Should be used sparingly in production.
– **SYSADMIN** — can create and manage warehouses and databases.
– **USERADMIN** — can create users and roles.
– **PUBLIC** — automatically granted to every user; represents a baseline of no real permissions.

You can create custom roles — and you should, in any real deployment. For learning, ACCOUNTADMIN is fine.

—

## Step 3: The Bottom of the Navigation Bar

At the very bottom of the left nav you’ll find your **username**. Clicking it opens a small menu with a few important options:

**Settings** takes you to your personal profile, where you can change your password, set a default warehouse and role (so Snowsight always loads in a sane state), configure your display theme (light or dark), and enroll in **Multi-Factor Authentication (MFA)**. Snowflake strongly encourages MFA, and you’ll see a prompt every three days until you enable it. Take five minutes and set it up — it’s standard practice and Snowflake makes it easy.

**Role Switcher** lets you switch between any role your user has been granted. The currently active role is displayed next to your username. Roles are central to how Snowflake determines what you can see and do, so being aware of which role you’re operating under is a habit worth building from day one.

—

## What You Now Know

You’ve created a Snowflake account, made your first infrastructure decision (cloud provider and region), and taken a tour of every major area of the Snowsight interface. Here’s the one-line summary of each section for quick reference:

| Section | What it’s for |
|—|—|
| Home | Your launchpad — recent work and quick actions |
| Worksheets | Write and run SQL or Python |
| Dashboards | Visualize query results as charts |
| Notebooks | Interactive, cell-based development |
| Data / Databases | Browse and manage your data objects |
| Marketplace | Access third-party datasets |
| Query History | Debug, audit, and review past queries |
| Warehouses | Manage your compute resources |
| Admin | Users, roles, and account settings |

—

## What’s Next

In **Part 2**, we’ll write our first real SQL queries against Snowflake’s sample data, understand how warehouses start and stop (and why it matters for your bill), and learn the difference between a database, a schema, and a table in Snowflake’s three-tier data model.

For now, poke around. Click into the sample database, open a worksheet, and see what’s there. The worst thing that can happen on a trial account is spending a few credits — and you have $400 worth of them.

—

*Questions or stuck on something? Drop a comment below — this series is designed to be practical, and your questions shape what we cover next.*

AWS Hardening

Posted by thomaswmarshall

Cloud Maturity Series — Runbook for Posts 4–6

This runbook operationalizes Posts 4–6:
Preventive guardrails, multi-account scaling, and velocity-safe delivery.

Post 4 — Enforce Preventive Guardrails (SCPs)

1. Enable AWS Organizations.
2. Create Organizational Units (OUs): Management, Shared, Dev, Prod.
3. Create baseline SCP denying:
   – CloudTrail disable/delete
   – AWS Config disable
   – GuardDuty disable
4. Attach baseline SCP to all OUs.
5. Add stricter SCPs to Prod OU only.

Post 5 — Multi-Account Design

1. Create separate AWS accounts for Shared Services, Dev, and Prod.
2. Move accounts into appropriate OUs.
3. Centralize logging and security tooling.
4. Configure cross-account CI/CD roles.
5. Restrict Prod access to pipelines only.

Post 6 — Velocity-Safe Delivery

1. Standardize permission sets via IAM Identity Center.
2. Deploy infrastructure only through IaC.
3. Automate new-account baselines.
4. Monitor drift with AWS Config.
5. Review guardrails quarterly, not per-deploy.

AWS Hardening

Posted by thomaswmarshall

Cloud Maturity Series — Posts 1–3 (LinkedIn Drafts)

Post 1: Start With the Account, Not the App

Most cloud problems don’t start in code.

They start in the AWS account itself.

Before VPCs.

Before CI/CD.

Before a single workload is deployed…

I lock down the account foundation:

• MFA on the root account

• Root access keys removed

• Break-glass strategy defined

• IAM users eliminated in favor of roles

• Clear ownership and access boundaries

Cloud maturity starts at the control plane, not the application layer.

#AWS #CloudArchitecture #IAM

#CloudSecurity #PlatformEngineering

#Infrastructure #OpenToWork

Post 2: Identity Is the Real Perimeter

Firewalls don’t protect cloud environments.

Identity does.

Once the account is secured, the next layer is IAM done right:

• Role-based access instead of long-lived users

• Least-privilege policies mapped to job function

• Explicit separation between human and workload identities

• MFA everywhere it makes sense

In AWS, IAM is the blast radius.

If identity is loose,

no amount of network segmentation will save you.

Get this layer right and everything downstream becomes safer:

deployments, automation, audits, and incident response.

Cloud security isn’t a tool problem.

It’s an identity design problem.

#AWSIAM #ZeroTrust #CloudSecurity

#DevSecOps #PlatformEngineering

#InfrastructureAsCode #OpenToWork

Post 3: Auditability Before Availability

Most cloud failures aren’t outages.

They’re untraceable changes.

Before scaling workloads or optimizing cost,

I establish control-plane observability.

That means always being able to answer:

• Who changed this?

• What changed?

• When did it happen?

• Can we prove it?

This layer is built with:

• CloudTrail — immutable API history

• AWS Config — resource state and drift detection

• GuardDuty — security signal, not just logs

Compliance benefits (SOC 2, ISO, PCI) are a side effect.

The real value is engineering control.

If you can’t reconstruct an incident,

you don’t truly control your environment.

Observability isn’t something you add later.

It’s something you start with.

#AWS #CloudEngineering #AuditLogging

#IncidentResponse #DevSecOps

#InfrastructureAsCode #OpenToWork

AWS Day 2

Posted by thomaswmarshall

Cloud Maturity Series – Post 2: Identity Is the Real Perimeter

Identity Is the Real Perimeter

When people talk about cloud security, they often focus on firewalls, VPCs, or encryption. But in reality, the biggest security boundary is identity.

Before any workloads exist, the way humans and services access AWS determines how safe, auditable, and scalable your environment will be.

Key Actions Taken

Role-Based Access Control (RBAC)
– All administrative access flows through IAM roles, not IAM users.
– Eliminated standing privileges to prevent credential misuse.
– Temporary credentials ensure that even if a role is compromised, exposure is time-limited.

Centralized Authentication with AWS IAM Identity Center (SSO)
– Users authenticate through a single source of truth.
– Permission sets replace ad-hoc IAM policies, reducing configuration drift.
– MFA enforced for every human identity.

Benefits
– Human access is predictable, auditable, and easily revoked.
– Simplifies future multi-account AWS Organizations setups.
– Reduces risk of accidental privilege escalation and lateral movement.

Account Application

In practice, this was applied across logical account layers:
– Management/Shared Services Account: SSO, permission sets, and role templates centrally managed.
– Dev Account: Developers use temporary roles scoped to non-production resources.
– Prod Account: Minimal direct human access; all actions require role assumption via SSO.

This pattern ensures least-privilege access everywhere, even before any workloads are deployed.

Keywords
– AWS IAM, RBAC, IAM Identity Center, SSO
– Least-privilege, temporary credentials, enterprise access control
– Multi-account readiness, production security baseline

Bottom line:
Cloud security isn’t about locking down VPCs first — it’s about locking down who can get in and what they can do. Identity is the real perimeter.

AWS Day 1

Posted by thomaswmarshall

Cloud work doesn’t start with services. It starts with the account.
Before deploying any workloads, I focused on establishing a secure, enterprise-ready AWS foundation aligned with real-world production standards.
Key areas addressed:
 AWS Account Security & Governance
Root account lockdown (MFA, no access keys, billing controls)
Cost monitoring and budget alerts from day one
 IAM & Access Management (Best Practices)
Role-based access control (RBAC) using IAM roles
Elimination of standing admin privileges
MFA enforced for all human access
 AWS IAM Identity Center (SSO)
Centralized identity and authentication
Permission sets instead of ad-hoc IAM policies
Temporary credentials aligned with AWS Organizations and SCP-ready patterns
茶 Auditability, Compliance & Threat Detection
CloudTrail enabled across all regions for API auditing
AWS Config for configuration and change tracking
GuardDuty for continuous threat detection
Outcome:
An AWS account that is secure, observable, auditable, and scalable before any workloads are introduced — the same baseline expected in regulated and production environments.
This kind of foundation reduces security risk, accelerates future delivery, and prevents painful rework later.
Cloud maturity isn’t about spinning up resources fast.
It’s about governance, security, and intent from day one.

Partition Pruning

Posted by thomaswmarshall

Partition Pruning
Your WHERE clause isn’t helping—here’s why Snowflake scans everything anyway
I watched a query scan 500GB of data when it should have touched less than 5GB. Same WHERE clause. Same filter. The problem? Snowflake couldn’t prune partitions effectively.
This cost us hours of runtime and thousands in credits—until we understood what was actually happening.
The micro-partition problem:
Snowflake automatically divides tables into micro-partitions (typically 50-500MB compressed). Each micro-partition stores metadata about the min/max values it contains for every column.
When you run a query with a WHERE clause, Snowflake checks this metadata to skip partitions that couldn’t possibly contain your data. This is partition pruning—and when it works, it’s magic.
But here’s the catch: pruning only works if your data is naturally ordered in a way that aligns with your filters.
When I’ve seen this break:
Working with large subscription and order tables, we’d filter by order_date constantly. Sounds perfect for pruning, right?
Except our data wasn’t loaded chronologically. Orders came in from multiple sources, backfills happened, late-arriving data got appended. The result? Every micro-partition contained a mix of dates spanning months.
Query: WHERE order_date = ‘2024-01-15’
Snowflake’s response: “Well, that date MIGHT be in any of these 10,000 partitions, so I’ll scan them all.”
The clustering key solution:
We added a clustering key on order_date for our largest tables:
ALTER TABLE orders CLUSTER BY (order_date);
Snowflake reorganizes data so that rows with similar values are stored together. Now each micro-partition contains a narrow date range, and pruning actually works.
Same query. 5GB scanned instead of 500GB. 95% improvement.
How to check if you’re pruning effectively:
Run your query and check the query profile. Look for “Partitions scanned” vs “Partitions total”:
— Your actual query
SELECT *
FROM orders
WHERE order_date = ‘2024-01-15’;

— Then check the profile or run:
SELECT
    query_id,
    partitions_scanned,
    partitions_total,
    bytes_scanned,
    (partitions_scanned::float / partitions_total) * 100 as scan_percentage
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = LAST_QUERY_ID();
What to look for:
Scanning >25% of partitions? Probably not pruning well
Scanning <10%? Good pruning
Scanning 100%? No pruning at all
Common culprits:
→ Filtering on columns with random distribution (UUIDs, hashed values)
→ Using functions in WHERE clauses: WHERE DATE(timestamp_col) = … prevents pruning
→ Data loaded out of order without clustering
→ OR conditions across multiple high-cardinality columns
My decision framework for clustering:
Cluster when:
Table is large (multi-TB)
You filter/join on specific columns repeatedly
Query profiles show poor pruning
The column has natural ordering (dates, sequential IDs)
Don’t cluster when:
Table is small (<100GB)
Query patterns vary wildly
Clustering would cost more than it saves (maintenance overhead)
High-cardinality columns with random distribution
The real cost of bad pruning:
It’s not just slower queries. You’re paying to scan data you’ll immediately discard. Every GB scanned consumes credits, even if filtered out.
For our daily reporting jobs on clustered tables, we saw 60-70% reductions in both runtime and credit consumption. The clustering maintenance cost? Negligible compared to the savings.
Quick win you can try today:
Check your largest, most-queried tables. Run your common WHERE clause patterns. Look at partition scan ratios.
If you’re scanning >50% of partitions on filtered queries, you’ve found your optimization opportunity.
What’s been your experience with partition pruning? Have you seen dramatic improvements from clustering?
#Snowflake #DataEngineering #PerformanceOptimization #CostOptimization

The Warehouse Sizing Paradox

Posted by thomaswmarshall

The Warehouse Sizing Paradox: Why I Sometimes Choose XL Over Small
“Always use the smallest warehouse possible to save money.”
I heard this advice constantly when I started with Snowflake. It sounds logical—smaller warehouses cost less per hour, so naturally they should be cheaper, right?
Except the math doesn’t always work that way.
Here’s what I’ve observed:
Snowflake charges by the second with a 60-second minimum. The cost difference between warehouse sizes is linear, but the performance difference often isn’t.
The actual formula is simple:
Total Cost = Credits per Second × Runtime in Seconds
A Small warehouse might be 1/4 the cost per second of an XL, but if it takes 5x longer to complete the same query, you’re paying more overall.
When I’ve seen this matter most:
Working with subscription and order data, certain query patterns consistently benefit from larger warehouses:
→ Customer lifetime value calculations across millions of subscribers
→ Daily cohort analysis with complex retention logic
→ Product affinity analysis joining order details with high SKU cardinality
→ Aggregating subscription events over multi-year periods
These workloads benefit dramatically from parallelization. An XL warehouse has 8x the compute resources of an XS, and for the right queries, it can complete them in less than 1/8th the time.
A simple experiment you can run:
— Test with Small warehouse
USE WAREHOUSE small_wh;
SELECT SYSTEM$START_QUERY_TIMER();

SELECT
    subscription_plan,
    customer_segment,
    COUNT(DISTINCT customer_id) as subscribers,
    SUM(order_value) as total_revenue,
    AVG(order_value) as avg_order_value,
    COUNT(DISTINCT order_id) as total_orders
FROM orders
WHERE order_date >= ‘2023-01-01’
GROUP BY subscription_plan, customer_segment
HAVING COUNT(DISTINCT customer_id) > 100;

— Note the execution time and credits used in query profile

— Now test with XL warehouse
USE WAREHOUSE xl_wh;
— Run the same query
Check the query profile for each:
Execution time
Credits consumed (Execution Time × Warehouse Size Credits/Second)
Total cost
The decision framework I use:
Size up when:
Query runtime > 2 minutes on current warehouse
Query profile shows high parallelization potential
You’re running the query repeatedly (daily pipelines)
Spillage to remote disk is occurring
Stay small when:
Queries are simple lookups or filters
Runtime is already under 30 seconds
Workload is highly sequential (limited parallelization)
It’s truly ad-hoc, one-time analysis
The nuance that surprised me:
It’s not just about individual query cost—it’s about total warehouse utilization. If your Small warehouse runs 10 queries in 100 minutes, but an XL runs them in 20 minutes, you’re paying for 80 fewer minutes of warehouse time. That matters when you’re paying for auto-suspend delays, concurrent users, or just opportunity cost.
My practical approach:
I start with Medium for most workloads. Then I profile:
Queries consistently taking 3+ minutes → test on Large or XL
Queries under 1 minute → consider downsizing to Small
Monitor credit consumption patterns weekly
The goal isn’t to find the “right” size—it’s to match warehouse size to workload characteristics.
Want to test this yourself?
Here’s a quick query to see your warehouse credit consumption:
SELECT
    warehouse_name,
    SUM(credits_used) as total_credits,
    COUNT(*) as query_count,
    AVG(execution_time)/1000 as avg_seconds,
    SUM(credits_used)/COUNT(*) as credits_per_query
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
    AND warehouse_name IS NOT NULL
GROUP BY warehouse_name
ORDER BY total_credits DESC;
This shows you which warehouses are consuming credits and whether you might benefit from right-sizing.
The counterintuitive truth:
The cheapest warehouse per hour isn’t always the cheapest warehouse per result. Sometimes spending more per second means spending less overall.
What’s been your experience with warehouse sizing? Have you found scenarios where bigger was actually cheaper?
#Snowflake #DataEngineering #CostOptimization #CloudDataWarehouse

Thomas [W] Marshall

Database and Artificial Intelligence

Uncategorized

Snowflake for Beginners Part 4: Building an Automated Pipeline

Snowflake for Beginners Part 3: Creating, Loading, and The Magic of Semi-Structured Data

Snowflake for Beginners, Part 2: Your First Queries and Understanding Warehouses

Snowflake for Beginners, Part 1: Signing Up and Finding Your Way Around

AWS Hardening

AWS Hardening

Cloud Maturity Series — Posts 1–3 (LinkedIn Drafts)

Post 1: Start With the Account, Not the App

Post 2: Identity Is the Real Perimeter

Post 3: Auditability Before Availability

AWS Day 2

Cloud Maturity Series – Post 2: Identity Is the Real Perimeter

AWS Day 1

Partition Pruning

The Warehouse Sizing Paradox