In my last project analyzing household wealth using the Survey of Consumer Finances (Cohen 2017), I found that cross-sectional regression analyses predict age to be the strongest predictor of wealth. The relationship makes sense. People save money over time and are more likely to inherit familial property later in life.
So what are the typical differences between in household finance across age groups? What counts as “rich”, “middling”, or “poor” for a twenty-something versus a fifty-something? And what do these differences tell us about household finances? In this analysis, I examine differences in net worth across age groups in the 2022 data. It looks at the size of wealth differences across age groups and the incidence of low net worth over the life course.
Data and Methods
I use data from the Survey of Consumer Finances, a triennial survey of several thousand U.S. households (with a high-wealth oversample) (Federal Reserve 2023). In this analysis, we will examine the distribution. Readers can see the details of my acquisition and wrangling of this data in the folds of this document or by consulting its associated R Markdown file. I discuss the data and its analysis at greater length in this post.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 |
# Generic session set up # First, start your session with an empty memory rm(list=ls()) gc() # Set your working directory directory <- "D:/Dropbox/Research/Household Finance/Net Worth - Age" setwd(directory) # Load the libraries used in this session library(httr) library(jsonlite) library(survey) library(mitools) library(haven) library(scales) library(ggplot2) library(kableExtra) library(knitr) library(tidyverse) # Set the random seed set.seed(123) # Turn off scientific notation options(scipen=999) # PART ONE: DOWNLOAD DATA # Download Main 2022 Files response_0 <- GET("https://www.federalreserve.gov/econres/files/scf2022s.zip", write_disk("scf2022s.zip", overwrite = TRUE)) response_1 <- GET("https://www.federalreserve.gov/econres/files/scfp2022s.zip", write_disk("scfp2022s.zip", overwrite = TRUE)) response_2 <- GET("https://www.federalreserve.gov/econres/files/scf2022rw1s.zip", write_disk("scf2022rw.zip", overwrite = TRUE)) rm(list=ls(pattern = "response")) # Clean up objects # PART TWO: UNZIP DATA unzip("scf2022s.zip") unzip("scfp2022s.zip") unzip("scf2022rw.zip") file.remove("scf2022s.zip") # Erase the zip files because I don't need them and they take up space. file.remove("scfp2022s.zip") file.remove("scf2022rw.zip") # PART THREE: CONVERT DATA TO R FORMAT # The data are distributed in Stata format. Below, I import data from its Stata format. scf2022 <- read_dta("p22i6.dta") scf2022s <- read_dta("rscfp2022.dta") scf2022rw <- read_dta("p22_rw1.dta") # PART FOUR: DATA QUALITY CHECK # Data quality check: Ensure all three files have corresponding rows stopifnot( nrow( scf2022 ) == nrow( scf2022rw ) * 5 ) # One RW score per household stopifnot( nrow( scf2022 ) == nrow( scf2022s ) ) #Confirm only the primary economic unit and the five implicate identifiers overlap: stopifnot( all( sort( intersect( names( scf2022 ) , names( scf2022s ) ) ) == c( 'y1' , 'yy1' ) ) ) stopifnot( all( sort( intersect( names( scf2022 ) , names( scf2022rw ) ) ) == c( 'y1' , 'yy1' ) ) ) stopifnot( all( sort( intersect( names( scf2022s ) , names( scf2022rw ) ) ) == c( 'y1' , 'yy1' ) ) ) # Convert column names to lower case in all sets, per Damico script names(scf2022) <- tolower(names(scf2022)) names(scf2022rw) <- tolower(names(scf2022rw)) names(scf2022s) <- tolower(names(scf2022s)) # Per Damico script # Remove implicate identifier from RW table, and add column of fives for weighting scf2022rw[, 'y1'] <- NULL scf2022[,'five'] <- 5 save(scf2022, scf2022s, scf2022rw, file = "SCF 2022 Raw Data Tables.RData") # Save the data # PART FIVE: MERGE MAIN AND SUMMARY DATA # Merge Summary and Raw Data Tables by 'y1' scf2022 <- merge(scf2022, scf2022s, by = "y1", sort = T) # PART SIX: RECAST DATA AS FIVE SEPARATE IMPLICATES # Splitting data set into five separate sets of individual implicates scf_1 <- subset(scf2022, as.numeric(substr(scf2022$y1, nchar(scf2022$y1), nchar(scf2022$y1))) == 1) scf_2 <- subset(scf2022, as.numeric(substr(scf2022$y1, nchar(scf2022$y1), nchar(scf2022$y1))) == 2) scf_3 <- subset(scf2022, as.numeric(substr(scf2022$y1, nchar(scf2022$y1), nchar(scf2022$y1))) == 3) scf_4 <- subset(scf2022, as.numeric(substr(scf2022$y1, nchar(scf2022$y1), nchar(scf2022$y1))) == 4) scf_5 <- subset(scf2022, as.numeric(substr(scf2022$y1, nchar(scf2022$y1), nchar(scf2022$y1))) == 5) # Clean Up Subject Identifier in Individual Implicates for (i in 1:5){ temp <- get(paste0("scf_", i)) temp$yy1 <- temp$yy1.x temp$yy1.x <- NULL temp$yy1.y <- NULL assign(paste0("scf_", i), temp) } # Compile Individual Implicates to a List scf_data_list <- list(scf_1, scf_2, scf_3, scf_4, scf_5) # Removing unnecessary objects to save memory and space in Environment window rm(scf_1, scf_2, scf_3, scf_4, scf_5) rm(i, temp) gc() # PART SIX: CLEANING DATA # Replace missing replicate weights with zeros to prevent downstream bugs scf2022rw[ is.na( scf2022rw ) ] <- 0 # Rescale weights, per documentation scf2022rw[ , paste0( 'wgt' , 1:999 ) ] <- scf2022rw[ , paste0( 'wt1b' , 1:999 ) ] * scf2022rw[ , paste0( 'mm' , 1:999 ) ] # Using Damico's strategy of storing as a data table with y1 and wgts* scf2022rw <- scf2022rw[ , c( 'yy1' , paste0( 'wgt' , 1:999 ) ) ] # Check if yy1 values match across the datasets and RW table all(scf_data_list[[1]]$yy1 == scf2022rw$yy1) all(scf_data_list[[2]]$yy1 == scf2022rw$yy1) all(scf_data_list[[3]]$yy1 == scf2022rw$yy1) all(scf_data_list[[4]]$yy1 == scf2022rw$yy1) all(scf_data_list[[5]]$yy1 == scf2022rw$yy1) # Survey Design Object following Damico scf_design <- svrepdesign( weights = ~wgt, repweights = scf2022rw[ , -1 ] , data = imputationList( scf_data_list ) , scale = 1 , rscales = rep( 1 / 998 , 999 ) , mse = FALSE , type = "other" , combined.weights = TRUE ) scf_design_single <- svrepdesign( weights = ~wgt, repweights = scf2022rw[ , -1 ] , data = scf_data_list[[1]] , scale = 1 , rscales = rep( 1 / 998 , 999 ) , mse = FALSE , type = "other" , combined.weights = TRUE ) # Damico's function to combine implicates to give summary estimate from each of the five sets. scf_MIcombine <- function (results, variances, call = sys.call(), df.complete = Inf, ...) { m <- length(results) oldcall <- attr(results, "call") if (missing(variances)) { variances <- suppressWarnings(lapply(results, vcov)) results <- lapply(results, coef) } vbar <- variances[[1]] cbar <- results[[1]] for (i in 2:m) { cbar <- cbar + results[[i]] # MODIFICATION: # vbar <- vbar + variances[[i]] } cbar <- cbar/m # MODIFICATION: # vbar <- vbar/m evar <- var(do.call("rbind", results)) r <- (1 + 1/m) * evar/vbar df <- (m - 1) * (1 + 1/r)^2 if (is.matrix(df)) df <- diag(df) if (is.finite(df.complete)) { dfobs <- ((df.complete + 1)/(df.complete + 3)) * df.complete * vbar/(vbar + evar) if (is.matrix(dfobs)) dfobs <- diag(dfobs) df <- 1/(1/dfobs + 1/df) } if (is.matrix(r)) r <- diag(r) rval <- list(coefficients = cbar, variance = vbar + evar * (m + 1)/m, call = c(oldcall, call), nimp = m, df = df, missinfo = (r + 2/(df + 3))/(r + 1)) class(rval) <- "MIresult" rval } save(scf_design, scf_design_single, scf_data_list, scf_MIcombine, file = "SCF22.RData") |
Here, the focus is on the household net worth, which is the market value of a household’s assets, less the costs of settling their outstanding debts. This can be treated as a metric that captures a family’s ability to come up with money. The metric is less useful for making fine distinctions between households with similar wealth levels and better for assessing differences on the order of hundreds of thousands or millions of dollars.
We are examining how the distribution of net worth varies according to the age of the household head. In the SCF, the household head is the individual or partners who are described as “economically dominant” in a primary economic unit. It includes the person(s) who cover the household’s basic living costs for themselves and all economic dependents who live within that household. We group households according to the age of the SCF’s “reference person”, whom the Survey considers to be the male in a mixed-sex pair and the older person in a single-sex couple. This convention is likely maintained to create a constancy over time in the measurement of the SCF concept of “reference person”. My own expectation is that the choice of reference person does not strongly affect the wealth accumulation differentials that we observe, but I would test different ways of counting the reference person (i.e., the oldest in any relationship, the mean age when people are paired) if there were interest. I chose to group them by decades, mostly so that the research findings conform to commonplace terms used to discuss personal finance and age.
1 |
load("SCF22.RData")<br><br>scf_design <- update(scf_design,<br> age_group = cut(age, breaks = c(0, 30, 40, 50, 60, 70, 999),<br> labels = c("<30", "30s", "40s", "50s", "60s", "70s+"))) |
Analysis
The table below describes the distribution of net worth across our age groups. The columns represent age groups, and the rows represent different percentile net worth values for each age group.
Percentile Net Worth Values by Age Group
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
p10 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.1 , se = TRUE , interval.type = 'quantile'))) p25 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.25 , se = TRUE , interval.type = 'quantile'))) p50 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.5 , se = TRUE , interval.type = 'quantile'))) p75 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.75 , se = TRUE , interval.type = 'quantile'))) p90 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.9 , se = TRUE , interval.type = 'quantile'))) p95 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.95 , se = TRUE , interval.type = 'quantile'))) p99 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.99 , se = TRUE , interval.type = 'quantile'))) pcttab_10 <- p10$coefficients pcttab_25 <- p25$coefficients pcttab_50 <- p50$coefficients pcttab_75 <- p75$coefficients pcttab_90 <- p90$coefficients pcttab_95 <- p95$coefficients pcttab_99 <- p99$coefficients pct_tab <- data.frame(cbind(pcttab_10, pcttab_25, pcttab_50, pcttab_75, pcttab_90, pcttab_95, pcttab_99)) pct_tab$age_group <- c("<30", "30s", "40s", "50s", "60s", "70s+") pct_tab <- pct_tab[,c(8, 1:7)] rownames(pct_tab) <- NULL t_pct_tab <- data.frame(t(pct_tab)) colnames(t_pct_tab) <- t_pct_tab[1,] t_pct_tab <- t_pct_tab[-1,] t_pct_tab <- apply(t_pct_tab, 2, function(x) scales::dollar(as.numeric(as.character(x)), accuracy = 1, trim = TRUE)) rownames(t_pct_tab) <- c("10th Percentile", "25th Percentile", "50th Percentile", "75th Percentile", "90th Percentile", "95th Percentile", "99th Percentile") kable_table <- kable(t_pct_tab, format = "pipe", caption = "Percentile Net Worth Values by Age Group") kable_table |
<30 | 30s | 40s | 50s | 60s | 70s+ | |
10th Percentile | -$13,278 | -$3,274 | $736 | $4,086 | $5,420 | $9,536 |
25th Percentile | $1,122 | $15,552 | $31,326 | $69,840 | $78,254 | $99,192 |
50th Percentile | $21,054 | $107,752 | $189,746 | $306,886 | $394,394 | $367,860 |
75th Percentile | $117,796 | $323,606 | $600,252 | $1,074,428 | $1,106,748 | $1,035,780 |
90th Percentile | $280,300 | $750,544 | $1,388,340 | $2,711,830 | $2,962,016 | $2,976,530 |
95th Percentile | $411,880 | $1,223,800 | $2,704,490 | $5,382,202 | $6,639,864 | $5,875,320 |
99th Percentile | $2,128,300 | $5,026,540 | $7,670,380 | $15,530,000 | $18,921,600 | $18,552,600 |
First, note that the distribution of household wealth is much more compressed before age 30. Many young people are still in school, which often involves accumulating debt without earning money. Many of the poorer households in this age category will ultimately end up among the wealthier later in life. Households that enter the workforce immediately may get an earlier start accumulating property, but my sense of the data and evidence is that advanced training has much higher returns.
Differences open up over people’s 30s and 40s, and eventually stabilize in the 50s onward. A household’s 30s and 40s may be a period in which they experience the run-up of their career and property accumulation, setting them on the trajectory into which they will settle in the 50s onward. Households that spend their 30s and 40s with the privilege of creating a successful business, ascending quickly and going far in their professional field, getting an early start on their retirement savings, or having successful first forays into real estate. Households accumulate earning power in part through business or professional success, and a growing portfolio of money-making or price-appreciating property.
The distribution of wealth seems to steady when we get beyond people’s fifties. It does not mean that people’s personal fates are static. Some people will run up professional success and savings in their 30s and 40s, and then find themselves pressed out of the workforce when they hit their 50s. Others will have earned more steadily and will be able to stretch out their working years into their 70s. But the general distribution of wealth remains pretty static in the aggregate beyond the 50s. Another reason that wealth may stabilize after the 50s is that people are most likely to receive intergenerationally-transmitted wealth by this decade. These transmissions may also propel the run-up of wealth in people’s 30s and 40s, as the incidence of parental death rises.
I close the analysis with a visualization that resembles the age-weight charts at the pediatrician. It offers a way to look up wealth levels over the life course. I can create a year-by-year one if there’s interest. The first version is on a log scale because we would not be able to see clear differences in non-millionaires when we include multi-millionaires in the figure.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
p20 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.2 , se = TRUE , interval.type = 'quantile'))) p30 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.3 , se = TRUE , interval.type = 'quantile'))) p40 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.4 , se = TRUE , interval.type = 'quantile'))) p60 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.6 , se = TRUE , interval.type = 'quantile'))) p70 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.7 , se = TRUE , interval.type = 'quantile'))) p80 <- scf_MIcombine( with( scf_design, svyby(~networth, ~age_group, svyquantile, 0.8 , se = TRUE , interval.type = 'quantile'))) pcttab_20 <- p20$coefficients pcttab_30 <- p30$coefficients pcttab_40 <- p40$coefficients pcttab_60 <- p60$coefficients pcttab_70 <- p70$coefficients pcttab_80 <- p80$coefficients pct_fig <- data.frame(rbind(pcttab_10, pcttab_20, pcttab_30, pcttab_40, pcttab_50, pcttab_60, pcttab_70, pcttab_80, pcttab_90, pcttab_95, pcttab_99)) names(pct_fig) <- paste(c("Twenties","Thirties","Forties", "Fifties", "Sixties", "Seventies")) rownames(pct_fig) <- c("P10", "P20", "P30", "P40", "P50", "P60", "P70", "P80", "P90", "P95", "P99") pct_fig <- round(pct_fig, 0) pct_table <- pct_fig pct_long <- pct_fig %>% t() %>% as.data.frame() %>% rownames_to_column("AgeGroup") %>% gather(key = "Percentile", value = "NetWorth", -AgeGroup) # Convert the AgeGroup and Percentile to factors for proper ordering in the plot pct_long$AgeGroup <- factor(pct_long$AgeGroup, levels = c("Twenties","Thirties","Forties", "Fifties", "Sixties", "Seventies")) pct_long$Percentile <- factor(pct_long$Percentile, levels = c("P10", "P20", "P30", "P40", "P50", "P60", "P70", "P80", "P90", "P95", "P99")) # Need log y-scale pct_long$NetWorth <- ifelse(pct_long$NetWorth <= 0, 1, pct_long$NetWorth) # Create the plot library(ggplot2) library(dplyr) # Extend the plot space to the right and reduce the label offset fig1 <- ggplot(pct_long, aes(x = AgeGroup, y = NetWorth, group = Percentile, color = Percentile)) + geom_point() + geom_line() + stat_smooth(se = FALSE, method = 'loess') + scale_y_log10(breaks = c(10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000, 100000, 250000, 500000, 1e6, 2.5e6, 5e6, 1e7, 2.5e7, 5e7, 1e8, 2.5e8, 5e8), labels = scales::dollar_format()) + labs(title = "Net Worth Percentiles by Age Group, US Households, 2022", x = "Age Group", y = "Net Worth (Log Scale)", color = "Percentile") + theme_minimal() + theme(legend.position = "none", aspect.ratio = 3/2) + scale_x_discrete(limits = c("Twenties", "Thirties", "Forties", "Fifties", "Sixties", "Seventies", ""), expand = c(0.01, 0.8)) + geom_text(data = pct_long %>% group_by(Percentile) %>% slice(n()), aes(label = paste0(sub("P", "", Percentile), "th percentile"), x = AgeGroup, y = NetWorth), hjust = 0, vjust = 0.5, size = 2, nudge_x = 0.2, check_overlap = TRUE) + annotate("text", x = "", y = 3, label = "J. Cohen, Queens College (josephnathancohen.info)", hjust = 1, vjust = 0, size = 3.5, color = "black") ggsave(fig1, file = "Fig1.png", device = "png") ## Saving 5 x 4 in image ## `geom_smooth()` using formula = 'y ~ x' fig1 ## `geom_smooth()` using formula = 'y ~ x' |
Keep the log scale in mind when interpreting the data. Wealth levels get larger in multiplicative rather than additive terms as we move up the y-axis. The differences look pretty evenly spread out on this scale, but this is the figure without the logged axis (showing only up to the 90th percentile):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
pct_long_b <- subset(pct_long, Percentile != "P95" & Percentile != "P99") ggplot(pct_long_b, aes(x = AgeGroup, y = NetWorth, group = Percentile, color = Percentile)) + geom_point() + geom_line() + stat_smooth(se = FALSE, method = 'loess') + scale_y_continuous(labels = scales::dollar_format(), breaks = seq(0, max(pct_long_b$NetWorth, na.rm = TRUE), by = 250000)) + labs(title = "Net Worth Percentiles by Age Group, US Households, 2022", x = "Age Group", y = "Net Worth", color = "Percentile") + theme_minimal() + theme(legend.position = "none", aspect.ratio = 7/5, plot.title = element_text(hjust = 0), plot.margin = margin(6, 6, 20, 6)) + scale_x_discrete(limits = c("Twenties", "Thirties", "Forties", "Fifties", "Sixties", "Seventies", ""), expand = c(0.01, 0.8)) + geom_text(data = pct_long_b %>% group_by(Percentile) %>% slice(n()), aes(label = paste0(sub("P", "", Percentile), "th percentile"), x = AgeGroup, y = NetWorth), hjust = 0, vjust = 0.5, size = 3, nudge_x = 0.2, check_overlap = TRUE) |
The graph gets much taller when we include 95th and 99th percentile scores. All of this is to clarify the fact that the run-up of wealth that occurs during people’s working years differs on multiplicative scales.