Module 4 · Curate Phase

Transform Your Ingested Data for Analytics

Modules 2 and 3 landed three classes of data into Fabric - mirrored operational tables, an uploaded facility catalog, and live KQL streams. None of it is yet shaped for the dashboards, semantic models, and agents you'll build in the next four modules.

Fabric gives you several ways to shape data once it has landed. This module presents four independent transformation paths. Pick whichever path you prefer - you only need one to produce curated output. The paths are fully optional and interchangeable; choose based on your team's tooling.

~20 minutes (one path)
🎯 Goal: curated, analytics-ready data
🛠 Choose: Notebook · SQL view · Dataflow Gen2 · Materialized Lake View

Learning Objectives

Transformation Paths in Fabric

There is no single required method for transformation in Fabric. Each path below is optional and self-contained - pick the one that matches your team's skills. You only need one to complete this module. Running more than one is purely for comparison.

Choose one path. Step 1 (inspect Bronze) applies to everyone. After that, jump to whichever path you like - they are independent and you don't need to do them in order or do all of them.

Why a Transform Step?

What landed in M2 is operational shape: each mirrored table looks like the source system, and the uploaded facility catalog is denormalized for storage, not analysis. That's fine for an audit trail, but it's not what a Power BI report or a Data Agent should see. Operators ask questions like "how many beds are available in Region North?" - which spans Hospitals, Wards, and Regions, and requires consistent column names and types.

Transforming the Bronze layer into a curated Silver layer (clean, joined, typed) plus a Gold layer of BI-ready aggregates is what makes the rest of the lab possible. The medallion split also lets you evolve raw schemas without breaking downstream consumers.

Approach Best for You'll use it in
Notebook (Spark Python / Spark SQL) Joins, type cleanup, derived columns, ML prep, complex logic Path A
SQL view on the SQL endpoint Lightweight aggregates and friendly names for BI / agents Path B
Materialized Lake View SQL transforms you want persisted as Delta with managed refresh Path D
Dataflow Gen2 (Power Query) Visual, no-code prep - filtering, type changes, joins Path C
Pick the approach that fits your scenario. Fabric does not force one path. All four paths below land analytics-ready output - the right choice depends on your team's preferred tooling and your downstream needs. The references at the bottom of this page link the official docs for each.

Step 1 - Inspect What Landed in Bronze LAKEHOUSE

  1. Open the Bronze Lakehouse

    From your workspace, open lh_urbanpulse_bronze.

    In the left-hand explorer you should see the tables you created in M2:

    • Hospitals, Wards, Staff — mirrored from Azure SQL.
    • regions, trainRoutes — mirrored from Cosmos DB for NoSQL.
    • facility_catalog — uploaded Parquet (or shortcut, depending on your M2 path).

    Click any table and use the Preview pane to confirm rows are present.

  2. Preview the source tables in Lakehouse view

    Stay in Lakehouse view and use table Preview to check data quality before transforming. For example, verify that Hospitals.hospitalId lines up with Wards.hospitalId, and that regions.regionId has expected values.

    After this pre-check, choose one path below and execute only that path (or multiple paths for comparison).

Path A: Spark Notebook Transform OPTIONAL NOTEBOOK

Now you'll create a curated silver_hospital_facility Delta table that joins Hospitals, Wards, and regions into a single analytics-friendly shape. A Spark notebook is the right tool because the join spans tables originating from two different mirrors (Azure SQL + Cosmos DB) and we want consistent typing on the way out.

  1. Create a new notebook bound to the Bronze Lakehouse

    From the workspace, click + New itemNotebook.

    Name it nb_urbanpulse_silver.

    In the left explorer of the notebook, click + Lakehouse and pick lh_urbanpulse_bronze as the default Lakehouse. This makes spark.sql("SELECT * FROM Hospitals") resolve without any extra path.

  2. Write the Silver transform (Spark SQL)

    Paste the following into the first cell. This produces one wide row per ward, with the hospital and region attributes joined in:

    silver_df = spark.sql("""
        SELECT
            w.wardId             AS ward_id,
            w.name               AS ward_name,
            w.capacity           AS beds,
            h.hospitalId         AS hospital_id,
            h.name               AS hospital_name,
            h.address            AS hospital_address,
            h.facilityType       AS hospital_type,
            r.regionId           AS region_id,
            r.name               AS region_name,
            CURRENT_TIMESTAMP()  AS curated_at
        FROM Wards     w
        JOIN Hospitals h ON h.hospitalId = w.hospitalId
        LEFT JOIN regions r ON r.regionId = h.regionId
    """)
    
    silver_df.write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .saveAsTable("silver_hospital_facility")

    Run the cell. The output should report a successful Delta write with a row count.

  3. Validate the Silver table

    Add a second cell:

    display(spark.sql("""
        SELECT region_name, COUNT(*) AS wards, SUM(beds) AS total_beds
        FROM silver_hospital_facility
        GROUP BY region_name
        ORDER BY total_beds DESC
    """))

    You should see one row per region with the total number of wards and beds. This is the shape every downstream consumer (BI report, ontology, hospital agent) will read from.

💡
Why save as a managed Delta table? Writing with saveAsTable registers the table in the Lakehouse metastore so it shows up automatically in the SQL endpoint, in Power BI's Direct Lake mode, and in Fabric Data Agents. No extra plumbing.

Path B: SQL Endpoint Transform OPTIONAL SQL ENDPOINT

This path is SQL-first and independent of the notebook path. You'll create serving views directly from Bronze tables on the Lakehouse SQL analytics endpoint.

  1. Open the SQL analytics endpoint of the Bronze Lakehouse

    From the workspace, open lh_urbanpulse_bronze and switch the upper-right toggle to SQL analytics endpoint.

    Click New SQL query.

  2. Create a region-level capacity view

    Paste and run:

    CREATE OR ALTER VIEW gold_region_capacity_sqlpath AS
    SELECT
        r.regionId                   AS region_id,
        r.name                       AS region_name,
        COUNT(DISTINCT h.hospitalId) AS hospitals,
        COUNT(*)                     AS wards,
        SUM(w.capacity)              AS total_beds
    FROM Wards w
    JOIN Hospitals h ON h.hospitalId = w.hospitalId
    LEFT JOIN regions r ON r.regionId = h.regionId
    GROUP BY r.regionId, r.name;
  3. Create a transit lines summary view

    The transit topology lives in trainRoutes from the Cosmos mirror. Add a lightweight Gold view for it too:

    CREATE OR ALTER VIEW gold_transit_lines_sqlpath AS
    SELECT
        routeId   AS line,
        color     AS line_color,
        stopCount AS station_count
    FROM trainRoutes;
  4. Sanity-check both views

    SELECT * FROM gold_region_capacity_sqlpath ORDER BY total_beds DESC;
    SELECT * FROM gold_transit_lines_sqlpath  ORDER BY line;

    Both should return rows. This confirms the SQL endpoint path works without running Path A.

Path C: Dataflow Gen2 Transform OPTIONAL DATAFLOW GEN2

This path is visual/no-code. Use it when transformation owners are analysts or low-code teams.

  1. Create a Dataflow Gen2

    From your workspace, click + New itemDataflow Gen2.

    Name it df_urbanpulse_silver.

  2. Add source tables from Lakehouse

    Add queries for Hospitals, Wards, and regions from lh_urbanpulse_bronze.

  3. Apply joins in Power Query

    Merge Wards with Hospitals on hospitalId, then merge with regions on regionId = regionId.

    Expand columns and rename fields to a serving shape (for example: capacitybeds, namehospital_name, region nameregion_name).

  4. Publish output table

    Set destination to Lakehouse table silver_hospital_facility_df and publish the dataflow.

    Run the dataflow once and confirm row output in the Lakehouse.

Path D: Materialized Lake View Transform OPTIONAL MATERIALIZED VIEW

This path is SQL-driven with persisted output and Fabric-managed refresh. Materialized Lake Views are defined with Spark SQL inside a notebook bound to the Lakehouse (they are not created on the read-only SQL analytics endpoint).

Prerequisite: Materialized Lake Views require a schema-enabled Lakehouse running on Fabric Runtime 1.3. The classic lh_urbanpulse_bronze from Module 2 has Lakehouse schemas turned off, so it cannot host an MLV directly. This path provisions a dedicated schema-enabled gold Lakehouse and shortcuts the bronze tables into it (zero copy).
  1. Create a schema-enabled gold Lakehouse

    In your workspace, choose New item → Lakehouse. Name it lh_urbanpulse_gold and check Lakehouse schemas (preview) before creating it.

    Confirm the workspace (and the notebook you use below) runs on Fabric Runtime 1.3, which is required for Materialized Lake Views.

  2. Shortcut the bronze source tables

    In lh_urbanpulse_gold, expand Tables, then under the dbo schema choose New shortcut → Microsoft OneLake.

    Select lh_urbanpulse_bronze as the source and add shortcuts for Hospitals, Wards, and regions. They land under dbo with no copy.

  3. Create the materialized lake view in a notebook

    Open a notebook with lh_urbanpulse_gold as the default Lakehouse, then run this in a Spark SQL cell (or a Python cell wrapped in spark.sql(...)):

    CREATE SCHEMA IF NOT EXISTS gold;
    
    CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS gold.region_capacity AS
    SELECT
        r.regionId                   AS region_id,
        r.name                       AS region_name,
        COUNT(DISTINCT h.hospitalId) AS hospitals,
        COUNT(*)                     AS wards,
        SUM(w.capacity)              AS total_beds
    FROM dbo.Wards w
    JOIN dbo.Hospitals h ON h.hospitalId = w.hospitalId
    LEFT JOIN dbo.regions r ON r.regionId = h.regionId
    GROUP BY r.regionId, r.name;

    Fabric manages refresh as underlying data changes. Validate with:

    SELECT * FROM gold.region_capacity ORDER BY total_beds DESC;

Path References

Use these official docs if you want deeper implementation patterns for each path.

Validation

You only need the output from the one path you chose to pass this module:

  • Path A: silver_hospital_facility exists as a managed Delta table in the Bronze Lakehouse.
  • Path B: gold_region_capacity_sqlpath and gold_transit_lines_sqlpath views return rows from the SQL endpoint.
  • Path C: silver_hospital_facility_df exists from the Dataflow Gen2 run.
  • Path D: gold.region_capacity returns rows from the Materialized Lake View, managed by lh_urbanpulse_gold.

What You Just Built

💡
Real-world implementation note: you can combine these paths based on the use case. For example, teams often use Dataflow Gen2 for simple prep, Spark notebooks for advanced logic, and SQL endpoint views for serving.

Module 5 reads these tables and views directly to build the live KQL dashboard, and Module 6 layers a Direct Lake semantic model on top.

References