Module 4 · Curate Phase

Transform Your Ingested Data for Analytics

Modules 2 and 3 landed three classes of data into Fabric - mirrored operational tables, an uploaded facility catalog, and live KQL streams. None of it is yet shaped for the dashboards, semantic models, and agents you'll build in the next four modules.

Fabric gives you several ways to shape data once it has landed. This module presents four independent transformation paths. Pick whichever path you prefer - you only need one to produce curated output. The paths are fully optional and interchangeable; choose based on your team's tooling.

⏱ ~20 minutes (one path)

🎯 Goal: curated, analytics-ready data

🛠 Choose: Notebook · SQL view · Dataflow Gen2 · Materialized Lake View

Learning Objectives

Inspect what landed in lh_urbanpulse_bronze from M2 (mirrors + uploaded catalog).
Pick any one of the four optional paths and run it end-to-end:
- Path A - Spark Notebook
- Path B - SQL Endpoint
- Path C - Dataflow Gen2
- Path D - Materialized Lake View
Understand when each path is the better fit (and that they can be combined in real projects).

Transformation Paths in Fabric

There is no single required method for transformation in Fabric. Each path below is optional and self-contained - pick the one that matches your team's skills. You only need one to complete this module. Running more than one is purely for comparison.

Path A - Spark Notebook: best for complex joins, type cleanup, and multi-step logic.
Path B - SQL Endpoint: best for SQL-first teams creating serving views and aggregates.
Path C - Dataflow Gen2: best for visual/no-code transformations.
Path D - Materialized Lake View: best for persisted SQL aggregates with managed refresh.

ⓘ

Choose one path. Step 1 (inspect Bronze) applies to everyone. After that, jump to whichever path you like - they are independent and you don't need to do them in order or do all of them.

Why a Transform Step?

What landed in M2 is operational shape: each mirrored table looks like the source system, and the uploaded facility catalog is denormalized for storage, not analysis. That's fine for an audit trail, but it's not what a Power BI report or a Data Agent should see. Operators ask questions like "how many beds are available in Region North?" - which spans Hospitals, Wards, and Regions, and requires consistent column names and types.

Transforming the Bronze layer into a curated Silver layer (clean, joined, typed) plus a Gold layer of BI-ready aggregates is what makes the rest of the lab possible. The medallion split also lets you evolve raw schemas without breaking downstream consumers.

Approach	Best for	You'll use it in
Notebook (Spark Python / Spark SQL)	Joins, type cleanup, derived columns, ML prep, complex logic	Path A
SQL view on the SQL endpoint	Lightweight aggregates and friendly names for BI / agents	Path B
Materialized Lake View	SQL transforms you want persisted as Delta with managed refresh	Path D
Dataflow Gen2 (Power Query)	Visual, no-code prep - filtering, type changes, joins	Path C

ⓘ

Pick the approach that fits your scenario. Fabric does not force one path. All four paths below land analytics-ready output - the right choice depends on your team's preferred tooling and your downstream needs. The references at the bottom of this page link the official docs for each.

Step 1 - Inspect What Landed in Bronze LAKEHOUSE

Open the Bronze Lakehouse
From your workspace, open lh_urbanpulse_bronze.

In the left-hand explorer you should see the tables you created in M2:
- Hospitals, Wards, Staff — mirrored from Azure SQL.
- regions, trainRoutes — mirrored from Cosmos DB for NoSQL.
- facility_catalog — uploaded Parquet (or shortcut, depending on your M2 path).
Click any table and use the Preview pane to confirm rows are present.
Preview the source tables in Lakehouse view

Stay in Lakehouse view and use table Preview to check data quality before transforming. For example, verify that Hospitals.hospitalId lines up with Wards.hospitalId, and that regions.regionId has expected values.

After this pre-check, choose one path below and execute only that path (or multiple paths for comparison).

Path A: Spark Notebook Transform OPTIONAL NOTEBOOK

Now you'll create a curated silver_hospital_facility Delta table that joins Hospitals, Wards, and regions into a single analytics-friendly shape. A Spark notebook is the right tool because the join spans tables originating from two different mirrors (Azure SQL + Cosmos DB) and we want consistent typing on the way out.

Create a new notebook bound to the Bronze Lakehouse

From the workspace, click + New item → Notebook.

Name it nb_urbanpulse_silver.

In the left explorer of the notebook, click + Lakehouse and pick lh_urbanpulse_bronze as the default Lakehouse. This makes spark.sql("SELECT * FROM Hospitals") resolve without any extra path.

Write the Silver transform (Spark SQL)

Paste the following into the first cell. This produces one wide row per ward, with the hospital and region attributes joined in:

silver_df = spark.sql("""
    SELECT
        w.wardId             AS ward_id,
        w.name               AS ward_name,
        w.capacity           AS beds,
        h.hospitalId         AS hospital_id,
        h.name               AS hospital_name,
        h.address            AS hospital_address,
        h.facilityType       AS hospital_type,
        r.regionId           AS region_id,
        r.name               AS region_name,
        CURRENT_TIMESTAMP()  AS curated_at
    FROM Wards     w
    JOIN Hospitals h ON h.hospitalId = w.hospitalId
    LEFT JOIN regions r ON r.regionId = h.regionId
""")

silver_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("silver_hospital_facility")

Run the cell. The output should report a successful Delta write with a row count.

Validate the Silver table
Add a second cell:
```
display(spark.sql("""
    SELECT region_name, COUNT(*) AS wards, SUM(beds) AS total_beds
    FROM silver_hospital_facility
    GROUP BY region_name
    ORDER BY total_beds DESC
"""))
```
You should see one row per region with the total number of wards and beds. This is the shape every downstream consumer (BI report, ontology, hospital agent) will read from.

💡

Why save as a managed Delta table? Writing with saveAsTable registers the table in the Lakehouse metastore so it shows up automatically in the SQL endpoint, in Power BI's Direct Lake mode, and in Fabric Data Agents. No extra plumbing.

Path B: SQL Endpoint Transform OPTIONAL SQL ENDPOINT

This path is SQL-first and independent of the notebook path. You'll create serving views directly from Bronze tables on the Lakehouse SQL analytics endpoint.

Open the SQL analytics endpoint of the Bronze Lakehouse

From the workspace, open lh_urbanpulse_bronze and switch the upper-right toggle to SQL analytics endpoint.

Click New SQL query.

Create a region-level capacity view

Paste and run:

CREATE OR ALTER VIEW gold_region_capacity_sqlpath AS
SELECT
    r.regionId                   AS region_id,
    r.name                       AS region_name,
    COUNT(DISTINCT h.hospitalId) AS hospitals,
    COUNT(*)                     AS wards,
    SUM(w.capacity)              AS total_beds
FROM Wards w
JOIN Hospitals h ON h.hospitalId = w.hospitalId
LEFT JOIN regions r ON r.regionId = h.regionId
GROUP BY r.regionId, r.name;

Create a transit lines summary view
The transit topology lives in trainRoutes from the Cosmos mirror. Add a lightweight Gold view for it too:
```
CREATE OR ALTER VIEW gold_transit_lines_sqlpath AS
SELECT
    routeId   AS line,
    color     AS line_color,
    stopCount AS station_count
FROM trainRoutes;
```

Sanity-check both views

SELECT * FROM gold_region_capacity_sqlpath ORDER BY total_beds DESC;
SELECT * FROM gold_transit_lines_sqlpath  ORDER BY line;

Both should return rows. This confirms the SQL endpoint path works without running Path A.

Path C: Dataflow Gen2 Transform OPTIONAL DATAFLOW GEN2

This path is visual/no-code. Use it when transformation owners are analysts or low-code teams.

Create a Dataflow Gen2

From your workspace, click + New item → Dataflow Gen2.

Name it df_urbanpulse_silver.
Add source tables from Lakehouse

Add queries for Hospitals, Wards, and regions from lh_urbanpulse_bronze.
Apply joins in Power Query

Merge Wards with Hospitals on hospitalId, then merge with regions on regionId = regionId.

Expand columns and rename fields to a serving shape (for example: capacity → beds, name → hospital_name, region name → region_name).
Publish output table

Set destination to Lakehouse table silver_hospital_facility_df and publish the dataflow.

Run the dataflow once and confirm row output in the Lakehouse.

Path D: Materialized Lake View Transform OPTIONAL MATERIALIZED VIEW

This path is SQL-driven with persisted output and Fabric-managed refresh. Materialized Lake Views are defined with Spark SQL inside a notebook bound to the Lakehouse (they are not created on the read-only SQL analytics endpoint).

⚠

Prerequisite: Materialized Lake Views require a schema-enabled Lakehouse running on Fabric Runtime 1.3. The classic lh_urbanpulse_bronze from Module 2 has Lakehouse schemas turned off, so it cannot host an MLV directly. This path provisions a dedicated schema-enabled gold Lakehouse and shortcuts the bronze tables into it (zero copy).

Create a schema-enabled gold Lakehouse

In your workspace, choose New item → Lakehouse. Name it lh_urbanpulse_gold and check Lakehouse schemas (preview) before creating it.

Confirm the workspace (and the notebook you use below) runs on Fabric Runtime 1.3, which is required for Materialized Lake Views.
Shortcut the bronze source tables

In lh_urbanpulse_gold, expand Tables, then under the dbo schema choose New shortcut → Microsoft OneLake.

Select lh_urbanpulse_bronze as the source and add shortcuts for Hospitals, Wards, and regions. They land under dbo with no copy.

Create the materialized lake view in a notebook

Open a notebook with lh_urbanpulse_gold as the default Lakehouse, then run this in a Spark SQL cell (or a Python cell wrapped in spark.sql(...)):

CREATE SCHEMA IF NOT EXISTS gold;

CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS gold.region_capacity AS
SELECT
    r.regionId                   AS region_id,
    r.name                       AS region_name,
    COUNT(DISTINCT h.hospitalId) AS hospitals,
    COUNT(*)                     AS wards,
    SUM(w.capacity)              AS total_beds
FROM dbo.Wards w
JOIN dbo.Hospitals h ON h.hospitalId = w.hospitalId
LEFT JOIN dbo.regions r ON r.regionId = h.regionId
GROUP BY r.regionId, r.name;

Fabric manages refresh as underlying data changes. Validate with:

SELECT * FROM gold.region_capacity ORDER BY total_beds DESC;

Path References

Use these official docs if you want deeper implementation patterns for each path.

Dataflow Gen2 (Power Query): visual, no-code data prep.
Tutorial - Copy & transform with Dataflow
Pipelines / Data Factory: orchestrate notebooks, dataflows, and copy activities on a schedule. Layer this on top of any of the above when you need automation.
T-SQL in a Warehouse: if your downstream pattern is Warehouse-first (multi-table writes, transactions), do the same shaping with T-SQL stored procedures inside a Fabric Warehouse rather than views on the Lakehouse SQL endpoint.

Validation

You only need the output from the one path you chose to pass this module:

Path A: silver_hospital_facility exists as a managed Delta table in the Bronze Lakehouse.
Path B: gold_region_capacity_sqlpath and gold_transit_lines_sqlpath views return rows from the SQL endpoint.
Path C: silver_hospital_facility_df exists from the Dataflow Gen2 run.
Path D: gold.region_capacity returns rows from the Materialized Lake View, managed by lh_urbanpulse_gold.

What You Just Built

One or more curated outputs using independently documented transformation paths.
Path-specific serving objects for BI and agents (Notebook, SQL Endpoint, Dataflow Gen2, or Materialized Lake View).

💡

Real-world implementation note: you can combine these paths based on the use case. For example, teams often use Dataflow Gen2 for simple prep, Spark notebooks for advanced logic, and SQL endpoint views for serving.

Module 5 reads these tables and views directly to build the live KQL dashboard, and Module 6 layers a Direct Lake semantic model on top.

References

Lakehouse in Microsoft Fabric - Spark notebooks, Delta tables, and the SQL endpoint.

Transform Your Ingested Data for Analytics

Learning Objectives

Transformation Paths in Fabric

Why a Transform Step?

Step 1 - Inspect What Landed in Bronze LAKEHOUSE

Open the Bronze Lakehouse

Preview the source tables in Lakehouse view

Path A: Spark Notebook Transform OPTIONAL NOTEBOOK

Create a new notebook bound to the Bronze Lakehouse

Write the Silver transform (Spark SQL)

Validate the Silver table

Path B: SQL Endpoint Transform OPTIONAL SQL ENDPOINT

Open the SQL analytics endpoint of the Bronze Lakehouse

Create a region-level capacity view

Create a transit lines summary view

Sanity-check both views

Path C: Dataflow Gen2 Transform OPTIONAL DATAFLOW GEN2

Create a Dataflow Gen2

Add source tables from Lakehouse

Apply joins in Power Query

Publish output table

Path D: Materialized Lake View Transform OPTIONAL MATERIALIZED VIEW

Create a schema-enabled gold Lakehouse

Shortcut the bronze source tables

Create the materialized lake view in a notebook

Path References

Validation

What You Just Built

References