This resource demystifies the difference between operational databases and analytical data warehouses. You’ll learn their purposes, architectures, workflows, and how to decide which one your project needs. Hands-on SQL examples and Galaxy tips included.
A database is an organized collection of data designed for real-time transaction processing. Think of the tables powering your SaaS application—users, orders, payments, feature flags. These tables are created, read, updated, and deleted (the famous CRUD operations) thousands of times per minute.
Databases such as PostgreSQL, MySQL, and SQL Server are optimized for OLTP (Online Transaction Processing). Core characteristics include:

- Row-oriented storage tuned for point lookups and short range scans
- Highly normalized schemas (3NF) that avoid duplicated data
- ACID transactions that keep concurrent writes consistent
- Sub-100 ms latency targets across thousands of short queries per second
Consider an e-commerce platform. When Alice checks out, her purchase triggers several quick database writes:
BEGIN;
-- Create the order; assume the new row receives id 456
INSERT INTO orders (user_id, total) VALUES (123, 49.99);
-- Attach line items to the order created above
INSERT INTO order_items (order_id, sku, qty) VALUES (456, 'TSHIRT', 2);
COMMIT; -- total latency target < 100 ms
The design priorities are low latency and high transactional throughput.
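The same priorities show up on the read path. A minimal sketch of the point-lookup pattern OLTP engines are built for, assuming a created_at column on orders and a hypothetical index name:

-- Hypothetical index so the lookup below touches only a handful of rows
CREATE INDEX IF NOT EXISTS idx_orders_user_id ON orders (user_id);

-- Point lookup: short, selective, and index-backed
SELECT id, total
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;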
A data warehouse is a central store of consolidated historical data built for large-scale analytics, reporting, and machine learning. Technologies include Snowflake, Amazon Redshift, Google BigQuery, and ClickHouse.
Warehouses excel at OLAP (Online Analytical Processing):

- Column-based storage that reads only the fields a query touches
- Denormalized star/snowflake schemas built for joins and aggregates
- Large scans parallelized across many nodes
- Dozens of heavy queries per minute rather than thousands per second
Jane in business intelligence runs a monthly revenue cohort analysis:
SELECT
  DATE_TRUNC('month', order_date) AS cohort_month,
  -- Snowflake lets later expressions reuse an earlier SELECT alias
  DATEDIFF('month', cohort_month, CURRENT_DATE) AS months_since_first,
  SUM(total) AS revenue
FROM warehouse.fact_orders
GROUP BY 1, 2
ORDER BY 1, 2;
This query might scan billions of rows, but the warehouse returns results in seconds by leveraging column pruning, clustering, and parallelism.
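Clustering is something you opt into. A minimal sketch in Snowflake syntax, assuming fact_orders carries the order_date column used above:

-- Declare a clustering key so scans over date ranges prune micro-partitions
ALTER TABLE warehouse.fact_orders CLUSTER BY (order_date);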
| Criterion | Database (OLTP) | Data Warehouse (OLAP) |
|---|---|---|
| Primary Purpose | Real-time transactions | Historical analytics |
| Typical Users | Application code, microservices | Analysts, data scientists, execs |
| Schema Style | Highly normalized (3NF) | Denormalized (star/snowflake) |
| Storage Orientation | Row-based | Column-based |
| Query Patterns | Point lookups, short range scans | Large scans, aggregates, joins |
| Concurrency | Thousands of short queries/sec | Dozens of heavy queries/min |
| Latency Goals | < 100 ms | Seconds to minutes |
Modern organizations rarely choose one over the other—they run both:
- An ELT pipeline lands application data in a raw_ schema in the warehouse.
- Transformations clean it into stg_ tables, then dim_ensions and fact_ tables (a sketch of that modeling step follows the walkthrough below).

Goal: Move a users table from Postgres to Snowflake and count sign-ups by day.
-- 1. Stage the CSV export (PUT runs from the SnowSQL client; assumes your connection is configured)
CREATE OR REPLACE STAGE postgres_stage;
PUT file://users_2023_10.csv @postgres_stage;
-- 2. Landing table with the columns the staging view needs
CREATE OR REPLACE TABLE raw_users (id INTEGER, email STRING, signup_ts TIMESTAMP, deleted_at TIMESTAMP);
-- 3. Load the staged file
COPY INTO raw_users
FROM @postgres_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"');
-- 4. Staging view: cast to a date and drop soft-deleted rows
CREATE OR REPLACE VIEW stg_users AS
SELECT id, email, signup_ts::DATE AS signup_date
FROM raw_users
WHERE deleted_at IS NULL;
-- 5. Daily sign-up counts
SELECT signup_date, COUNT(*) AS signups
FROM stg_users
GROUP BY signup_date
ORDER BY signup_date;
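The walkthrough stops at the staging layer. As a sketch of the modeling step described in the pipeline list above, here is one way a dim_users dimension could be derived from stg_users (the table name and key choice are hypothetical):

-- Sketch only: a minimal type-1 dimension built from the staging view
CREATE OR REPLACE TABLE dim_users AS
SELECT
  id AS user_key,  -- hypothetical key choice for illustration
  email,
  signup_date
FROM stg_users;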
Tip: Paste each statement into Galaxy, link your Postgres and Snowflake connections, and use the split-pane editor to compare results side-by-side.
In Galaxy you can open multiple connections—say, prod-postgres and analytics-snowflake. Tab groups let you:
- Run a count on the Postgres side (SELECT COUNT(*) FROM orders;)
- Run the matching count on the Snowflake side (SELECT COUNT(*) FROM fact_orders;)

Side-by-side results help validate ETL correctness.
Type a prompt like “show me how to join stg_users to fact_orders by user”. Galaxy’s AI copilot reads table metadata from both systems and proposes a correct query—no docs hunting required.
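The exact SQL depends on your schemas, but assuming fact_orders exposes user_id and order_id columns (hypothetical here), the proposed query might take this shape:

-- Illustrative join the copilot might suggest (column names assumed)
SELECT u.signup_date, COUNT(o.order_id) AS orders
FROM stg_users AS u
JOIN fact_orders AS o
  ON o.user_id = u.id
GROUP BY u.signup_date
ORDER BY u.signup_date;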
Create a “Warehouse Quality Checks” Collection. Save queries that reconcile DB and warehouse counts, then endorse them so teammates can rerun on demand.
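A reconciliation query in such a Collection might look like this (a sketch, assuming the warehouse keeps a raw_orders landing table alongside fact_orders):

-- Sketch: landing vs. modeled row counts should match after each load
SELECT
  (SELECT COUNT(*) FROM raw_orders)  AS raw_rows,
  (SELECT COUNT(*) FROM fact_orders) AS fact_rows,
  (SELECT COUNT(*) FROM raw_orders) - (SELECT COUNT(*) FROM fact_orders) AS diff;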
Best practices and exercises:

- Partition fact tables by date and cluster by high-cardinality columns (e.g., user_id).
- Generate model documentation (dbt docs generate) and surface it via Galaxy's metadata panes.
- Watch query_history (Snowflake) or STL_QUERY (Redshift) for slow, costly queries.
- Exercise: given the orders, users, and products tables in your OLTP DB, design dimension and fact tables for the warehouse.
- Exercise: write a freshness check that alerts when max(load_ts) is > 2 hours old.
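For that last exercise, one possible shape of the check (a sketch in Snowflake syntax, assuming a load_ts column on raw_users):

-- Returns a row only when the latest load is more than 2 hours old
SELECT
  MAX(load_ts) AS last_load,
  DATEDIFF('hour', MAX(load_ts), CURRENT_TIMESTAMP()) AS hours_stale
FROM raw_users
HAVING DATEDIFF('hour', MAX(load_ts), CURRENT_TIMESTAMP()) > 2;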