Introduction
This notebook documents the data transformation and
processing stage of the housing affordability analysis
pipeline. Its purpose is to convert raw, heterogeneous source data into
a clean, consistent, and analysis-ready format suitable for exploratory
analysis and modeling.
The transformation process begins with the raw datasets generated
during the ingestion stage, including Zillow Home Value Index (ZHVI)
time series, macroeconomic indicators from the Federal Reserve Economic
Data (FRED), and median household income estimates from the American
Community Survey (ACS). These sources differ in temporal resolution,
geographic definitions, and data structure, requiring systematic
alignment and standardization before analysis.
Key transformation steps performed in this notebook include reshaping
wide time-series data into tidy long formats, harmonizing date
conventions across sources, standardizing metropolitan region
identifiers, adjusting prices for inflation, and reconciling annual
income data with monthly housing and financial indicators. Where
necessary, transparent imputation rules are applied to ensure temporal
continuity while preserving the underlying structure of the data.
This notebook is deliberately procedural rather than
analytical. No exploratory data analysis or modeling is
conducted here. Instead, each step focuses on producing well-defined
intermediate datasets and a final processed panel that serves as the
sole input to downstream analysis. Validation checks are included to
confirm expected coverage, missingness, and variable ranges prior to
saving outputs.
All processed datasets produced by this notebook are written to the
data/processed/ directory and are tracked in version
control. By separating ingestion, transformation, and analysis into
distinct stages, the pipeline improves reproducibility, clarity, and
maintainability, allowing analytical results to be traced back to their
original data sources with minimal ambiguity.
Load Datasets and Frequency Harmonization
library(tidyverse)
library(lubridate)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
#---Mortgage rate: Week -> monthly average ---
mortgage_m <- read_csv("./data/raw/fred_MORTGAGE30US_20000101.csv") %>%
mutate(date = floor_date(date,"month")) %>%
group_by(date) %>%
summarise(mortgage_30y = mean(MORTGAGE30US, NA.RM = TRUE), .groups = "drop")
# --- CPI already monthly ---
cpi_m <- read_csv("./data/raw/fred_CPIAUCSL_20000101.csv") %>%
rename(cpi = CPIAUCSL)
#---Save ---
write_csv(mortgage_m, "./data/processed/mortgage_monthly.csv")
write_csv(cpi_m, "./data/processed/cpi_monthly.csv")
Income -> Monthly Panel
library(tidyverse)
library(lubridate)
# Load annual income (2010–2023)
income_annual <- read_csv(
"./data/raw/acs5_median_hh_income_metro_2010_2023.csv",
show_col_types = FALSE
)
# Load zhvi to get the monthly date span
zhvi <- read_csv(
"./data/processed/zhvvi_metro_monthly.csv",
show_col_types = FALSE
)
# Define the global monthly date sequence (month-start)
date_seq <- seq.Date(
from = min(zhvi$date),
to = max(zhvi$date),
by = "month"
)
# Normalizer so Census & Zillow names can match later
normalize_region <- function(x) {
x %>%
stringr::str_remove(" Metropolitan Statistical Area$") %>%
stringr::str_remove(" Metro Area$") %>%
stringr::str_remove(" Micro Area$") %>%
stringr::str_trim()
}
# Convert annual -> monthly grid, fill down
income_monthly <- income_annual %>%
mutate(
region_norm = normalize_region(region),
date = ymd(paste0(year, "-01-01"))
) %>%
select(region_norm, date, income_median_hh) %>%
group_by(region_norm) %>%
tidyr::complete(date = date_seq) %>%
tidyr::fill(income_median_hh, .direction = "down") %>%
ungroup() %>%
arrange(region_norm, date)
# Verify
income_monthly %>%
summarise(
n_regions = n_distinct(region_norm),
start_date = min(date),
end_date = max(date),
pct_missing = mean(is.na(income_median_hh))
) %>%
print()
# A tibble: 1 × 4
n_regions start_date end_date pct_missing
<int> <date> <date> <dbl>
1 1218 2000-01-01 2025-12-01 0.444
# Save
#dir.create("./data/processed", recursive = TRUE, showWarnings = FALSE)
readr::write_csv(income_monthly, "./data/processed/income_metro_monthly.csv")
Zillow Data processing
library(tidyverse)
library(lubridate)
zhvi_raw <- read_csv("./data/raw/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
#Columns are RegionID, SizeRank, RegionName, RegionType, StateName
date_cols <- names(zhvi_raw) %>%
str_subset("^\\d{4}-\\d{2}-\\d{2}$")
#print(date_cols)
stopifnot(length(date_cols) > 0)
#---Wide -> Long ---
zhvi_metro <- zhvi_raw %>%
pivot_longer(
cols = all_of(date_cols),
names_to = "date",
values_to = "zhvi",
values_drop_na = TRUE
) %>%
mutate(
date = ymd(date),
region = str_trim(RegionName)
) %>%
select(
region,
date,
zhvi
) %>%
arrange(region, date)
#head(zhvi_metro,100)
#---Sanity checks ---
stopifnot(all(!is.na(zhvi_metro$region)))
stopifnot(all(!is.na(zhvi_metro$date)))
stopifnot(all(!is.na(zhvi_metro$zhvi)))
zhvi_metro <- zhvi_metro %>%
mutate(date = floor_date(date, "month"))
zhvi_metro %>%
summarise(
n_regions = n_distinct(region),
start_date = min(date),
end_date = max(date)
) %>%
print()
# A tibble: 1 × 3
n_regions start_date end_date
<int> <date> <date>
1 895 2000-01-01 2025-12-01
write_csv(
zhvi_metro, "./data/processed/zhvvi_metro_monthly.csv"
)
CSUSHPISA dataset
cs_m <- readr::read_csv("./data/raw/fred_CSUSHPISA_20000101.csv") %>%
mutate(date = ymd(date)) %>%
rename(case_shiller_us = CSUSHPISA) %>%
arrange(date)
readr::write_csv(cs_m, "./data/processed/case_shiller_us_monthly.csv")
Panel Construction
library(tidyverse)
library(lubridate)
# --- Load processed datasets ---
zhvi <- read_csv("./data/processed/zhvvi_metro_monthly.csv", show_col_types = FALSE)
mortgage <- read_csv("./data/processed/mortgage_monthly.csv", show_col_types = FALSE)
cpi <- read_csv("./data/processed/cpi_monthly.csv", show_col_types = FALSE)
income <- read_csv("./data/processed/income_metro_annual.csv", show_col_types = FALSE)
#---Define global monthly date sequence --
date_seq <- seq.Date(
from = min(zhvi$date),
to = max(zhvi$date),
by = "month"
)
#--- Expand annual income to monthly (forward-fill) ---
income_m <- income %>%
mutate(date = floor_date(date, "year")) %>%
group_by(region) %>%
tidyr::complete(date = date_seq) %>%
fill(income_median_hh, .direction = "updown") %>%
ungroup()
#---Build unified metro-month panel ---
housing_panel <- zhvi %>%
left_join(mortgage, by = "date") %>%
left_join(cpi, by = "date") %>%
left_join(income_m, by = c("region", "date"))
#---checks ---
housing_panel %>%
summarise(
n_regions = n_distinct(region),
start_date = min(date),
end_date = max(date),
pct_missing_zhvi = mean(is.na(zhvi)),
pct_missing_income = mean(is.na(income_median_hh))
) %>%
print()
# A tibble: 1 × 5
n_regions start_date end_date pct_missing_zhvi pct_missing_income
<int> <date> <date> <dbl> <dbl>
1 895 2000-01-01 2025-12-01 0 1
write_csv(
housing_panel,
"./data/processed/housing_panel_metro_monthly.csv"
)
Panel Integrity Check
housing_panel %>%
summarise(
n_regions = n_distinct(region),
n_rows = n(),
start_date = min(date),
end_date = max(date),
pct_missing_zhvi = mean(is.na(zhvi)),
pct_missing_income = mean(is.na(income_median_hh)),
pct_missing_mortgage = mean(is.na(mortgage_30y)),
pct_missing_cpi = mean(is.na(cpi))
) %>%
print()
# A tibble: 1 × 8
n_regions n_rows start_date end_date pct_missing_zhvi pct_missing_income pct_missing_mortgage pct_missing_cpi
<int> <int> <date> <date> <dbl> <dbl> <dbl> <dbl>
1 895 229974 2000-01-01 2025-12-01 0 1 0 0.00389
Normalization function for region
normalize_region <- function(x){
x %>%
stringr::str_remove(" Metropolitain Statistical Area$") %>%
stringr::str_remove(" Metro Area$") %>%
stringr::str_remove(" Micro Area$") %>%
stringr::str_trim()
}
Apply normalization
zhvi <- zhvi %>%
mutate(region_norm = normalize_region(region))
income_m <- income_m %>%
mutate(region_norm = normalize_region(region))
Rebuild the panel
housing_panel <- zhvi %>%
left_join(mortgage, by = "date") %>%
left_join(cpi, by = "date") %>%
left_join(
income_m,
by = c("region_norm" = "region_norm", "date"="date")
)
Check
housing_panel %>%
summarise(
pct_missing_income = mean(is.na(income_median_hh))
) %>% print()
# A tibble: 1 × 1
pct_missing_income
<dbl>
1 0.207
head(housing_panel, 200)
Save the full panel
write_csv(
housing_panel,
"./data/processed/housing_panel_metro_monthly.csv"
)
Filter to metros with income coverage
housing_panel_model <- housing_panel %>%
group_by(region_norm) %>%
filter(any(!is.na(income_median_hh))) %>%
ungroup()
Recheck
housing_panel_model %>%
summarise(pct_missing_income = mean(is.na(income_median_hh)))
Approximately 20% of Zillow-defined metro areas do not have
corresponding ACS income estimates due to geographic definition
differences. I restrict income-based analyses to metros with observed
ACS coverage.
Save model ready panel
readr::write_csv(
housing_panel_model,
"./data/processed/housing_panel_model_metro_monthly.csv"
)
message("Saved: data/processed/housing_panel_model_metro_monthly.csv")
---
title: "R Notebook"
output: html_notebook
---

# Introduction

This notebook documents the **data transformation and processing stage** of the housing affordability analysis pipeline. Its purpose is to convert raw, heterogeneous source data into a clean, consistent, and analysis-ready format suitable for exploratory analysis and modeling.

The transformation process begins with the raw datasets generated during the ingestion stage, including Zillow Home Value Index (ZHVI) time series, macroeconomic indicators from the Federal Reserve Economic Data (FRED), and median household income estimates from the American Community Survey (ACS). These sources differ in temporal resolution, geographic definitions, and data structure, requiring systematic alignment and standardization before analysis.

Key transformation steps performed in this notebook include reshaping wide time-series data into tidy long formats, harmonizing date conventions across sources, standardizing metropolitan region identifiers, adjusting prices for inflation, and reconciling annual income data with monthly housing and financial indicators. Where necessary, transparent imputation rules are applied to ensure temporal continuity while preserving the underlying structure of the data.

This notebook is deliberately **procedural rather than analytical**. No exploratory data analysis or modeling is conducted here. Instead, each step focuses on producing well-defined intermediate datasets and a final processed panel that serves as the sole input to downstream analysis. Validation checks are included to confirm expected coverage, missingness, and variable ranges prior to saving outputs.

All processed datasets produced by this notebook are written to the `data/processed/` directory and are tracked in version control. By separating ingestion, transformation, and analysis into distinct stages, the pipeline improves reproducibility, clarity, and maintainability, allowing analytical results to be traced back to their original data sources with minimal ambiguity.

## Load Datasets and Frequency Harmonization

```{r}
library(tidyverse)
library(lubridate)

dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)

#---Mortgage rate: Week -> monthly average ---

mortgage_m <- read_csv("./data/raw/fred_MORTGAGE30US_20000101.csv") %>%
  mutate(date = floor_date(date,"month")) %>%
  group_by(date) %>%
  summarise(mortgage_30y = mean(MORTGAGE30US, NA.RM = TRUE), .groups = "drop")

# --- CPI already monthly ---

cpi_m <- read_csv("./data/raw/fred_CPIAUCSL_20000101.csv") %>%
  rename(cpi = CPIAUCSL)

#---Save ---
write_csv(mortgage_m, "./data/processed/mortgage_monthly.csv")
write_csv(cpi_m, "./data/processed/cpi_monthly.csv")
```

## Income -\> Monthly Panel

```{r}
library(tidyverse)
library(lubridate)

# Load annual income (2010–2023)
income_annual <- read_csv(
  "./data/raw/acs5_median_hh_income_metro_2010_2023.csv",
  show_col_types = FALSE
)

# Load zhvi to get the monthly date span
zhvi <- read_csv(
  "./data/processed/zhvvi_metro_monthly.csv",
  show_col_types = FALSE
)

# Define the global monthly date sequence (month-start)
date_seq <- seq.Date(
  from = min(zhvi$date),
  to   = max(zhvi$date),
  by   = "month"
)

# Normalizer so Census & Zillow names can match later
normalize_region <- function(x) {
  x %>%
    stringr::str_remove(" Metropolitan Statistical Area$") %>%
    stringr::str_remove(" Metro Area$") %>%
    stringr::str_remove(" Micro Area$") %>%
    stringr::str_trim()
}

# Convert annual -> monthly grid, fill down
income_monthly <- income_annual %>%
  mutate(
    region_norm = normalize_region(region),
    date = ymd(paste0(year, "-01-01"))
  ) %>%
  select(region_norm, date, income_median_hh) %>%
  group_by(region_norm) %>%
  tidyr::complete(date = date_seq) %>%
  tidyr::fill(income_median_hh, .direction = "down") %>%
  ungroup() %>%
  arrange(region_norm, date)

# Verify
income_monthly %>%
  summarise(
    n_regions = n_distinct(region_norm),
    start_date = min(date),
    end_date = max(date),
    pct_missing = mean(is.na(income_median_hh))
  ) %>%
  print()

# Save
#dir.create("./data/processed", recursive = TRUE, showWarnings = FALSE)
readr::write_csv(income_monthly, "./data/processed/income_metro_monthly.csv")


```

## Zillow Data processing

```{r}
library(tidyverse)
library(lubridate)

zhvi_raw <- read_csv("./data/raw/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")

#Columns are RegionID, SizeRank, RegionName, RegionType, StateName
date_cols <- names(zhvi_raw) %>%
  str_subset("^\\d{4}-\\d{2}-\\d{2}$")
#print(date_cols)
stopifnot(length(date_cols) > 0)

#---Wide -> Long ---

zhvi_metro <- zhvi_raw %>%
  pivot_longer(
    cols      =  all_of(date_cols),
    names_to  =  "date",
    values_to =  "zhvi",
    values_drop_na = TRUE
  ) %>%
  mutate(
    date = ymd(date),
    region = str_trim(RegionName)
  ) %>%
  select(
    region,
    date,
    zhvi
  ) %>%
  arrange(region, date)
#head(zhvi_metro,100)
#---Sanity checks ---
stopifnot(all(!is.na(zhvi_metro$region)))
stopifnot(all(!is.na(zhvi_metro$date)))
stopifnot(all(!is.na(zhvi_metro$zhvi)))

zhvi_metro <- zhvi_metro %>%
  mutate(date = floor_date(date, "month"))

zhvi_metro %>%
  summarise(
    n_regions = n_distinct(region),
    start_date = min(date),
    end_date = max(date)
  ) %>%
  print()

write_csv(
  zhvi_metro, "./data/processed/zhvvi_metro_monthly.csv"
)
```

### CSUSHPISA dataset

```{r}
cs_m <- readr::read_csv("./data/raw/fred_CSUSHPISA_20000101.csv") %>%
  mutate(date = ymd(date)) %>%
  rename(case_shiller_us = CSUSHPISA) %>%
  arrange(date)

readr::write_csv(cs_m, "./data/processed/case_shiller_us_monthly.csv")
```

## Panel Construction

```{r}
library(tidyverse)
library(lubridate)

# --- Load processed datasets ---

zhvi <- read_csv("./data/processed/zhvvi_metro_monthly.csv", show_col_types = FALSE)
mortgage <- read_csv("./data/processed/mortgage_monthly.csv", show_col_types = FALSE)
cpi <- read_csv("./data/processed/cpi_monthly.csv", show_col_types = FALSE)
income <- read_csv("./data/processed/income_metro_annual.csv", show_col_types = FALSE)

#---Define global monthly date sequence --
date_seq <- seq.Date(
  from = min(zhvi$date),
  to = max(zhvi$date),
  by = "month"
)

#--- Expand annual income to monthly (forward-fill) ---

income_m <- income %>%
  mutate(date = floor_date(date, "year")) %>%
  group_by(region) %>%
  tidyr::complete(date = date_seq) %>%
  fill(income_median_hh, .direction = "updown") %>%
  ungroup()


#---Build unified metro-month panel ---

housing_panel <- zhvi %>%
  left_join(mortgage, by = "date") %>%
  left_join(cpi, by = "date") %>%
  left_join(income_m, by = c("region", "date"))

#---checks ---

housing_panel %>%
  summarise(
    n_regions = n_distinct(region),
    start_date = min(date),
    end_date = max(date),
    pct_missing_zhvi = mean(is.na(zhvi)),
    pct_missing_income = mean(is.na(income_median_hh))
  ) %>%
  print()


```

## Panel Integrity Check

```{r}
housing_panel %>%
  summarise(
    n_regions = n_distinct(region),
    n_rows = n(),
    start_date = min(date),
    end_date = max(date),
    pct_missing_zhvi = mean(is.na(zhvi)),
    pct_missing_income = mean(is.na(income_median_hh)),
    pct_missing_mortgage = mean(is.na(mortgage_30y)),
    pct_missing_cpi = mean(is.na(cpi))
  ) %>%
  print()

```

### Normalization function for region

```{r}
normalize_region <- function(x){
  x %>%
    stringr::str_remove(" Metropolitain Statistical Area$") %>%
    stringr::str_remove(" Metro Area$") %>%
    stringr::str_remove(" Micro Area$") %>%
    stringr::str_trim()
}
```

### Apply normalization

```{r}
zhvi <- zhvi %>%
  mutate(region_norm = normalize_region(region))

income_m <- income_m %>%
  mutate(region_norm = normalize_region(region))


```

### Rebuild the panel

```{r}
housing_panel <- zhvi %>%
  left_join(mortgage, by = "date") %>%
  left_join(cpi, by = "date") %>%
  left_join(
    income_m,
    by = c("region_norm" = "region_norm", "date"="date")
  )
```

### Check

```{r}
housing_panel %>%
  summarise(
    pct_missing_income = mean(is.na(income_median_hh))
  ) %>% print()
```

```{r}
head(housing_panel, 200)
```

## Save the full panel

```{r}
write_csv(
  housing_panel,
  "./data/processed/housing_panel_metro_monthly.csv"
)
```

## Filter to metros with income coverage

```{r}
housing_panel_model <- housing_panel %>%
  group_by(region_norm) %>%
  filter(any(!is.na(income_median_hh))) %>%
  ungroup()
```

### Recheck

```{r}
housing_panel_model %>%
  summarise(pct_missing_income = mean(is.na(income_median_hh)))
```

Approximately 20% of Zillow-defined metro areas do not have corresponding ACS income estimates due to geographic definition differences. I restrict income-based analyses to metros with observed ACS coverage.

### Save model ready panel

```{r}
readr::write_csv(
  housing_panel_model,
  "./data/processed/housing_panel_model_metro_monthly.csv"
)

message("Saved: data/processed/housing_panel_model_metro_monthly.csv")
```
