Utilizing Social Vulnerability Factors to Predict Health Outcomes¶

Fall 2025 Data Science Project

Bhavika Buddi, Srinidhi Gubba, Avishi Gupta, Betty Au, Manasa Vinod Kumar

Contributions¶

  • Project Idea
    • Everyone - Various ideas were proposed until one was decided on; the topic was then reworked when we decided to pivot. Both stages involved extensive group discussion.
  • Dataset Curation and Preprocessing
    • Bhavika and Srinidhi each identified credible data sources, both before and after the team's pivot. Bhavika focused on merging datasets while Srinidhi focused on filtration.
  • Dataset Exploration and Summary Statistics
    • Everyone - Avishi, Betty, and Manasa were involved in the data exploration for the second checkpoint, while Bhavika and Srinidhi contributed after the pivot.
  • ML Algorithm Design/Development
    • Avishi and Manasa - Avishi worked on the linear regression, random forest regressor, and k-means clustering. The remaining models were done by Manasa.
  • ML Algorithm Training and Test Data Analysis
    • Avishi and Manasa - Avishi worked on the linear regression, random forest regressor, and k-means clustering. The remaining models were done by Manasa.
  • Visualizations, Result Analysis, Conclusion
    • Everyone - Srinidhi and Bhavika focused on visualizations while Avishi and Manasa focused on result analysis and Betty focused on fine-tuning our conclusion.
  • Final Tutorial Creation
    • Betty Au - wrote out explanations of the pipeline and justified why decisions for various steps were made.

Introduction¶

This project explores the factors that contribute to the health of Maryland counties. It first examines whether the balance of small versus large businesses in a county correlates with health outcomes. However, the available data suggests a lack of any relationship, which makes it difficult to develop good machine learning models, since those rely on relationships and patterns. The focus therefore pivots to the social vulnerability index (SVI) and its relationship with health outcomes. SVI summarizes a set of demographic and socioeconomic factors that adversely affect communities. The pivot to SVI exemplifies a natural part of the data science pipeline, in which barriers encountered along the way require taking a step back.

A key step toward creating healthier communities is first understanding what affects community health. Data science is well suited to this task: it can efficiently break down a large table of numbers, process a myriad of factors, and identify which are most significant in affecting health.

In this project, we go through the data science pipeline: data curation to clean the dataset until it is understandable and contains only what is relevant; exploratory data analysis as a starting point for deeper insights; primary analysis to explore the data with more complexity; and visualization to help articulate what the numbers mean.

Imports¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import f_oneway
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.cluster.hierarchy import fcluster

from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor

Data Curation¶

The first dataset we will look at contains data on the health of each county in Maryland. This data is provided by County Health Rankings & Roadmaps, a program of the University of Wisconsin Population Health Institute. We first load in the dataset and observe what information is given.

In [ ]:
#Load health data
df_health = pd.read_csv("sample_data/health.csv", header=1)
print(df_health.shape)
df_health
(25, 249)
Out[ ]:
FIPS State County Unreliable Deaths Years of Potential Life Lost Rate 95% CI - Low 95% CI - High Z-Score YPLL Rate (AIAN) ... % Drive Alone (Hispanic) 95% CI - Low % Drive Alone (Hispanic) 95% CI - High % Drive Alone (white) % Drive Alone (white) 95% CI - Low % Drive Alone (white) 95% CI - High # Workers who Drive Alone % Long Commute - Drives Alone 95% CI - Low.19 95% CI - High.19 Z-Score.34
0 24000 Maryland NaN NaN 75564 7547 7467 7627 NaN 4720.0 ... 64.0 67.0 78.0 78.0 79.0 3047112 50 50 51 NaN
1 24001 Maryland Allegany NaN 1199 9538 8662 10414 0.74 NaN ... NaN NaN 86.0 84.0 87.0 26999 22 19 24 -1.90
2 24003 Maryland Anne Arundel NaN 6579 6937 6693 7180 -0.45 NaN ... 68.0 77.0 80.0 80.0 81.0 302608 48 46 49 0.24
3 24005 Maryland Baltimore NaN 11566 8371 8141 8601 0.20 NaN ... 64.0 73.0 82.0 81.0 82.0 412698 47 46 48 0.16
4 24009 Maryland Calvert NaN 1038 6531 5911 7151 -0.64 NaN ... 59.0 77.0 77.0 75.0 79.0 47266 62 59 66 1.45
5 24011 Maryland Caroline NaN 519 9003 7820 10187 0.49 NaN ... NaN NaN 82.0 78.0 86.0 15783 52 47 56 0.55
6 24013 Maryland Carroll NaN 2047 6604 6148 7060 -0.61 NaN ... 53.0 70.0 78.0 76.0 80.0 87217 58 56 60 1.08
7 24015 Maryland Cecil NaN 1772 10405 9673 11138 1.14 NaN ... 46.0 88.0 80.0 77.0 82.0 49939 49 45 53 0.35
8 24017 Maryland Charles NaN 1965 7786 7284 8288 -0.06 NaN ... 66.0 82.0 82.0 80.0 84.0 82067 68 65 71 1.88
9 24019 Maryland Dorchester NaN 575 10398 8973 11822 1.13 NaN ... 32.0 62.0 81.0 78.0 85.0 14310 42 37 47 -0.22
10 24021 Maryland Frederick NaN 2568 5885 5543 6227 -0.94 NaN ... 63.0 71.0 78.0 76.0 80.0 134016 50 48 51 0.40
11 24023 Maryland Garrett NaN 446 7521 6282 8760 -0.19 NaN ... NaN NaN NaN NaN NaN 13668 31 26 35 -1.16
12 24025 Maryland Harford NaN 3103 7005 6619 7391 -0.42 NaN ... 62.0 75.0 80.0 79.0 82.0 130855 53 51 55 0.66
13 24027 Maryland Howard NaN 2256 4342 4065 4618 -1.64 NaN ... 66.0 76.0 81.0 80.0 83.0 169456 48 46 50 0.26
14 24029 Maryland Kent NaN 308 7810 6273 9347 -0.05 NaN ... 27.0 76.0 78.0 73.0 83.0 9258 38 32 44 -0.54
15 24031 Maryland Montgomery NaN 7369 4304 4155 4453 -1.66 NaN ... 61.0 65.0 69.0 68.0 70.0 549986 54 52 55 0.72
16 24033 Maryland Prince George's NaN 11369 7484 7285 7684 -0.20 NaN ... 62.0 67.0 80.0 79.0 81.0 475260 61 59 62 1.31
17 24035 Maryland Queen Anne's NaN 583 6384 5509 7260 -0.71 NaN ... 33.0 72.0 74.0 71.0 76.0 25283 54 51 58 0.79
18 24037 Maryland St. Mary's NaN 1348 6954 6417 7491 -0.45 NaN ... 61.0 86.0 82.0 80.0 84.0 56674 40 37 43 -0.36
19 24039 Maryland Somerset NaN 444 8834 7547 10121 0.42 NaN ... NaN NaN 85.0 79.0 90.0 9200 34 28 41 -0.86
20 24041 Maryland Talbot NaN 504 8256 6975 9537 0.15 NaN ... 54.0 88.0 79.0 76.0 82.0 17089 30 26 34 -1.21
21 24043 Maryland Washington NaN 2384 9490 8919 10061 0.72 NaN ... 64.0 89.0 83.0 82.0 84.0 67878 39 36 41 -0.51
22 24045 Maryland Wicomico NaN 1518 8924 8250 9597 0.46 NaN ... 58.0 81.0 84.0 82.0 87.0 48898 25 23 28 -1.60
23 24047 Maryland Worcester NaN 782 6632 5796 7469 -0.59 NaN ... 61.0 71.0 82.0 79.0 85.0 23732 29 26 32 -1.30
24 24510 Maryland Baltimore City NaN 13322 14845 14497 15193 3.17 NaN ... 55.0 63.0 76.0 75.0 77.0 276972 43 41 44 -0.18

25 rows × 249 columns

In [ ]:
print(list(df_health.columns))
['FIPS', 'State', 'County', 'Unreliable', 'Deaths', 'Years of Potential Life Lost Rate', '95% CI - Low', '95% CI - High', 'Z-Score', 'YPLL Rate (AIAN)', 'YPLL Rate (AIAN) 95% CI - Low', 'YPLL Rate (AIAN) 95% CI - High', 'YPLL Rate (AIAN) Unreliable', 'YPLL Rate (Asian)', 'YPLL Rate (Asian) 95% CI - Low', 'YPLL Rate (Asian) 95% CI - High', 'YPLL Rate (Asian) Unreliable', 'YPLL Rate (Black)', 'YPLL Rate (Black) 95% CI - Low', 'YPLL Rate (Black) 95% CI - High', 'YPLL Rate (Black) Unreliable', 'YPLL Rate (Hispanic)', 'YPLL Rate (Hispanic) 95% CI - Low', 'YPLL Rate (Hispanic) 95% CI - High', 'YPLL Rate (Hispanic) Unreliable', 'YPLL Rate (white)', 'YPLL Rate (white) 95% CI - Low', 'YPLL Rate (white) 95% CI - High', 'YPLL Rate (white) Unreliable', '% Fair or Poor Health', '95% CI - Low.1', '95% CI - High.1', 'Z-Score.1', 'Average Number of Physically Unhealthy Days', '95% CI - Low.2', '95% CI - High.2', 'Z-Score.2', 'Average Number of Mentally Unhealthy Days', '95% CI - Low.3', '95% CI - High.3', 'Z-Score.3', 'Unreliable.1', '% Low birthweight', '95% CI - Low.4', '95% CI - High.4', 'Z-Score.4', '% LBW (AIAN)', '% LBW (AIAN) 95% CI - Low', '% LBW (AIAN) 95% CI - High', '% LBW (Asian)', '% LBW (Asian) 95% CI - Low', '% LBW (Asian) 95% CI - High', '% LBW (Black)', '% LBW (Black) 95% CI - Low', '% LBW (Black) 95% CI - High', '% LBW (Hispanic)', '% LBW (Hispanic) 95% CI - Low', '% LBW (Hispanic) 95% CI - High', '% LBW (white)', '% LBW (white) 95% CI - Low', '% LBW (white) 95% CI - High', '% Smokers', '95% CI - Low.5', '95% CI - High.5', 'Z-Score.5', '% Adults with Obesity', '95% CI - Low.6', '95% CI - High.6', 'Z-Score.6', 'Food Environment Index', 'Z-Score.7', '% Physically Inactive', '95% CI - Low.7', '95% CI - High.7', 'Z-Score.8', '% With Access to Exercise Opportunities', 'Z-Score.9', '% Excessive Drinking', '95% CI - Low.8', '95% CI - High.8', 'Z-Score.10', '# Alcohol-Impaired Driving Deaths', '# Driving Deaths', '% Driving Deaths with Alcohol Involvement', '95% CI - 
Low.9', '95% CI - High.9', 'Z-Score.11', '# Chlamydia Cases', 'Chlamydia Rate', 'Z-Score.12', 'Teen Birth Rate', '95% CI - Low.10', '95% CI - High.10', 'Z-Score.13', 'Teen Birth Rate (AIAN)', 'Teen Birth Rate (AIAN) 95% CI - Low', 'Teen Birth Rate (AIAN) 95% CI - High', 'Teen Birth Rate (Asian)', 'Teen Birth Rate (Asian) 95% CI - Low', 'Teen Birth Rate (Asian) 95% CI - High', 'Teen Birth Rate (Black)', 'Teen Birth Rate (Black) 95% CI - Low', 'Teen Birth Rate (Black) 95% CI - High', 'Teen Birth Rate (Hispanic)', 'Teen Birth Rate (Hispanic) 95% CI - Low', 'Teen Birth Rate (Hispanic) 95% CI - High', 'Teen Birth Rate (white)', 'Teen Birth Rate (white) 95% CI - Low', 'Teen Birth Rate (white) 95% CI - High', '# Uninsured', '% Uninsured', '95% CI - Low.11', '95% CI - High.11', 'Z-Score.14', '# Primary Care Physicians', 'Primary Care Physicians Rate', 'Primary Care Physicians Ratio', 'Z-Score.15', '# Dentists', 'Dentist Rate', 'Dentist Ratio', 'Z-Score.16', '# Mental Health Providers', 'Mental Health Provider Rate', 'Mental Health Provider Ratio', 'Z-Score.17', 'Preventable Hospitalization Rate', 'Z-Score.18', 'Preventable Hosp. Rate (AIAN)', 'Preventable Hosp. Rate (Asian)', 'Preventable Hosp. Rate (Black)', 'Preventable Hosp. Rate (Hispanic)', 'Preventable Hosp. 
Rate (white)', '% With Annual Mammogram', 'Z-Score.19', '% Screened (AIAN)', '% Screened (Asian)', '% Screened (Black)', '% Screened (Hispanic)', '% Screened (white)', '% Vaccinated', 'Z-Score.20', '% Vaccinated (AIAN)', '% Vaccinated (Asian)', '% Vaccinated (Black)', '% Vaccinated (Hispanic)', '% Vaccinated (white)', '# Completed High School', 'Population', '% Completed High School', '95% CI - Low.12', '95% CI - High.12', 'Z-Score.21', '# Some College', 'Population.1', '% Some College', '95% CI - Low.13', '95% CI - High.13', 'Z-Score.22', '# Unemployed', 'Labor Force', '% Unemployed', 'Z-Score.23', '% Children in Poverty', '95% CI - Low.14', '95% CI - High.14', 'Z-Score.24', '% Children in Poverty (AIAN)', '% Children in Poverty (Asian)', '% Children in Poverty (Black)', '% Children in Poverty (Hispanic)', '% Children in Poverty (white)', '80th Percentile Income', '20th Percentile Income', 'Income Ratio', 'Z-Score.25', '# Children in Single-Parent Households', '# Children in Households', '% Children in Single-Parent Households', '95% CI - Low.15', '95% CI - High.15', 'Z-Score.26', '# Associations', 'Social Association Rate', 'Z-Score.27', 'Annual Average Violent Crimes', 'Violent Crime Rate', 'Z-Score.28', '# Injury Deaths', 'Injury Death Rate', '95% CI - Low.16', '95% CI - High.16', 'Z-Score.29', 'Injury Death Rate (AIAN)', 'Injury Death Rate (AIAN) 95% CI - Low', 'Injury Death Rate (AIAN) 95% CI - High', 'Injury Death Rate (Asian)', 'Injury Death Rate (Asian) 95% CI - Low', 'Injury Death Rate (Asian) 95% CI - High', 'Injury Death Rate (Black)', 'Injury Death Rate (Black) 95% CI - Low', 'Injury Death Rate (Black) 95% CI - High', 'Injury Death Rate (Hispanic)', 'Injury Death Rate (Hispanic) 95% CI - Low', 'Injury Death Rate (Hispanic) 95% CI - High', 'Injury Death Rate (white)', 'Injury Death Rate (white) 95% CI - Low', 'Injury Death Rate (white) 95% CI - High', 'Average Daily PM2.5', 'Z-Score.30', 'Presence of Water Violation', 'Z-Score.31', '% Severe Housing 
Problems', '95% CI - Low.17', '95% CI - High.17', 'Severe Housing Cost Burden', 'Severe Housing Cost Burden 95% CI - Low', 'Severe Housing Cost Burden 95% CI - High', 'Overcrowding', 'Overcrowding 95% CI - Low', 'Overcrowding 95% CI - High', 'Inadequate Facilities', 'Inadequate Facilities 95% CI - Low', 'Inadequate Facilities 95% CI - High', 'Z-Score.32', '% Drive Alone to Work', '95% CI - Low.18', '95% CI - High.18', 'Z-Score.33', '% Drive Alone (AIAN)', '% Drive Alone (AIAN) 95% CI - Low', '% Drive Alone (AIAN) 95% CI - High', '% Drive Alone (Asian)', '% Drive Alone (Asian) 95% CI - Low', '% Drive Alone (Asian) 95% CI - High', '% Drive Alone (Black)', '% Drive Alone (Black) 95% CI - Low', '% Drive Alone (Black) 95% CI - High', '% Drive Alone (Hispanic)', '% Drive Alone (Hispanic) 95% CI - Low', '% Drive Alone (Hispanic) 95% CI - High', '% Drive Alone (white)', '% Drive Alone (white) 95% CI - Low', '% Drive Alone (white) 95% CI - High', '# Workers who Drive Alone', '% Long Commute - Drives Alone', '95% CI - Low.19', '95% CI - High.19', 'Z-Score.34']

We see that data is given for all 23 counties in Maryland, plus Baltimore City (an independent city) and the state overall. The state-level row is dropped so that we can focus on county-level data; Baltimore City is kept and treated like a county throughout. Since all of these areas are in Maryland, the "State" column is removed, as is the "FIPS" column, which is irrelevant here. Many health-related features are provided, but from this dataset we only intend to use a single measurement of health. We isolate "% Fair or Poor Health", "Average Number of Physically Unhealthy Days", and "Average Number of Mentally Unhealthy Days" as candidates for that single measure. "% Fair or Poor Health" is the percentage of adults who report fair or poor health, while the other two are the average numbers of days per month that adults report being physically or mentally unhealthy.

In [ ]:
# cleaning the health data, keeping only the columns that we are using to determine health outcomes
cols_to_keep = ['County', '% Fair or Poor Health', 'Average Number of Physically Unhealthy Days', 'Average Number of Mentally Unhealthy Days']
df_filtered_health = df_health[cols_to_keep]
df_filtered_health = df_filtered_health.drop(df_filtered_health.index[0])
df_filtered_health
Out[ ]:
County % Fair or Poor Health Average Number of Physically Unhealthy Days Average Number of Mentally Unhealthy Days
1 Allegany 19 4.2 5.0
2 Anne Arundel 14 3.2 4.1
3 Baltimore 16 3.4 4.3
4 Calvert 14 3.3 4.3
5 Caroline 20 4.3 5.1
6 Carroll 13 3.2 4.3
7 Cecil 16 3.8 4.8
8 Charles 17 3.4 4.3
9 Dorchester 19 4.2 4.9
10 Frederick 14 3.2 4.1
11 Garrett 18 4.0 5.0
12 Harford 14 3.3 4.3
13 Howard 11 2.7 3.5
14 Kent 16 3.7 4.6
15 Montgomery 13 2.9 3.6
16 Prince George's 17 3.5 4.1
17 Queen Anne's 13 3.2 4.3
18 St. Mary's 15 3.5 4.4
19 Somerset 25 4.8 5.2
20 Talbot 14 3.4 4.3
21 Washington 18 4.0 4.8
22 Wicomico 20 4.1 4.9
23 Worcester 16 3.6 4.6
24 Baltimore City 22 4.2 4.8

Now we take a look at the businesses in Maryland. The dataset, "Statistics of U.S. Businesses", is provided by the United States Census Bureau, dated 2022, and includes data on businesses throughout the USA. First we filter to Maryland businesses, and later we remove the statewide rows. We also isolate the features relevant to a business' size: "County Name", "NAICS Description", "Enterprise Size", and "Establishments".

In [ ]:
#load in the business data, filtering for only MD counties
df_county = pd.read_csv("sample_data/county_3digitnaics_2022.csv", header=2)
df_county
df_county_MD = df_county[df_county['State Name'].str.contains('Maryland', case=False, na=False)]
print(df_county_MD.shape)
(6099, 15)
In [ ]:
df_county = pd.read_csv("sample_data/county_3digitnaics_2022.csv", header=2)
# print(df_county.shape) # Keep this commented out for now

# Inspect the DataFrame to understand the column names and data
print("df_county head:")
display(df_county.head())
print("Unique values in 'State Name' column:")
print(df_county['State Name'].unique())

df_county_MD = df_county[df_county['State Name'].str.contains('Maryland', case=False, na=False)]
print(df_county_MD.shape)
df_county head:
State State Name County County Name NAICS NAICS Description Enterprise Size Firms Establishments Employment Employment Noise Flag Annual Payroll\n($1,000) Annual Payroll Noise Flag Receipts\n($1,000) Receipts\nNoise Flag
0 1 Alabama 1 Autauga -- Total 1: Total 874 948 12,409 G 496,158 G 3,422,297 G
1 1 Alabama 1 Autauga -- Total 2: <20 employees 578 579 2,591 G 90,641 G 562,413 G
2 1 Alabama 1 Autauga -- Total 3: 20-99 employees 91 101 2,517 G 98,457 G 530,120 G
3 1 Alabama 1 Autauga -- Total 4: 100-499 employees 38 51 1,406 H 51,404 H 205,532 H
4 1 Alabama 1 Autauga -- Total 5: 500+ employees 167 217 5,895 G 255,656 G 2,124,232 G
Unique values in 'State Name' column:
[' Alabama' 'Alabama' ' Alaska' 'Alaska' ' Arizona' 'Arizona' ' Arkansas'
 'Arkansas' ' California' 'California' ' Colorado' 'Colorado'
 ' Connecticut' 'Connecticut' ' Delaware' 'Delaware'
 ' District of Columbia' ' Florida' 'Florida' ' Georgia' 'Georgia'
 ' Hawaii' 'Hawaii' ' Idaho' 'Idaho' ' Illinois' 'Illinois' ' Indiana'
 'Indiana' ' Iowa' 'Iowa' ' Kansas' 'Kansas' ' Kentucky' 'Kentucky'
 ' Louisiana' 'Louisiana' ' Maine' 'Maine' ' Maryland' 'Maryland'
 ' Massachusetts' 'Massachusetts' ' Michigan' 'Michigan' ' Minnesota'
 'Minnesota' ' Mississippi' 'Mississippi' ' Missouri' 'Missouri'
 ' Montana' 'Montana' ' Nebraska' 'Nebraska' ' Nevada' 'Nevada'
 ' New Hampshire' 'New Hampshire' ' New Jersey' 'New Jersey' ' New Mexico'
 'New Mexico' ' New York' 'New York' ' North Carolina' 'North Carolina'
 ' North Dakota' 'North Dakota' ' Ohio' 'Ohio' ' Oklahoma' 'Oklahoma'
 ' Oregon' 'Oregon' ' Pennsylvania' 'Pennsylvania' ' Rhode Island'
 'Rhode Island' ' South Carolina' 'South Carolina' ' South Dakota'
 'South Dakota' ' Tennessee' 'Tennessee' ' Texas' 'Texas' ' Utah' 'Utah'
 ' Vermont' 'Vermont' ' Virginia' 'Virginia' ' Washington' 'Washington'
 ' West Virginia' 'West Virginia' ' Wisconsin' 'Wisconsin' ' Wyoming'
 'Wyoming']
(6099, 15)
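As an aside, the unique values above show both padded (' Maryland') and unpadded ('Maryland') state names, which is why a substring match with `str.contains` is used rather than an exact comparison. A minimal sketch (on toy data, not the real CSV) of an alternative approach, stripping the whitespace so an exact equality filter works:

```python
import pandas as pd

# Toy example: normalize padded state names before filtering.
toy = pd.DataFrame({'State Name': [' Maryland', 'Maryland', ' Virginia']})
toy['State Name'] = toy['State Name'].str.strip()

# With whitespace stripped, an exact comparison catches both spellings.
toy_md = toy[toy['State Name'] == 'Maryland']
print(toy_md.shape[0])
```

Either approach works here; stripping up front has the advantage that later exact joins and group-bys on the column behave predictably.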
In [ ]:
#Clean the data, keeping only the columns that are needed for analysis
columns_to_keep = ['County Name', 'NAICS Description', 'Enterprise Size', 'Establishments']
df_filtered = df_county_MD[df_county_MD['NAICS Description'].str.contains('Total', case=False, na=False)][columns_to_keep]
display(df_filtered)
County Name NAICS Description Enterprise Size Establishments
188081 Allegany Total 1: Total 1,468
188082 Allegany Total 2: <20 employees 931
188083 Allegany Total 3: 20-99 employees 137
188084 Allegany Total 4: 100-499 employees 106
188085 Allegany Total 5: 500+ employees 294
... ... ... ... ...
194096 Statewide Maryland Total 1: Total 855
194097 Statewide Maryland Total 2: <20 employees 6
194098 Statewide Maryland Total 3: 20-99 employees 10
194099 Statewide Maryland Total 4: 100-499 employees 64
194100 Statewide Maryland Total 5: 500+ employees 775

125 rows × 4 columns

In [ ]:
#remove unneccessary rows (don't need statewide totals, for example)
df_filtered = df_filtered[~df_filtered['Enterprise Size'].str.contains('Total', case=False, na=False)]
df_filtered = df_filtered[~df_filtered['County Name'].str.contains('Statewide Maryland', case=False, na=False)]
df_filtered
Out[ ]:
County Name NAICS Description Enterprise Size Establishments
188082 Allegany Total 2: <20 employees 931
188083 Allegany Total 3: 20-99 employees 137
188084 Allegany Total 4: 100-499 employees 106
188085 Allegany Total 5: 500+ employees 294
188296 Anne Arundel Total 2: <20 employees 9,698
... ... ... ... ...
193519 Worcester Total 5: 500+ employees 243
193725 Baltimore city Total 2: <20 employees 8,190
193726 Baltimore city Total 3: 20-99 employees 1,260
193727 Baltimore city Total 4: 100-499 employees 658
193728 Baltimore city Total 5: 500+ employees 2,257

96 rows × 4 columns

"Enterprise size" is used to interpret whether a business is a small or large business. Those with a size smaller than 100 is classified as a small businesses while those with a size 100 or larger are considered large businesses. A column is added to carry this label, and helps to add meaning to otherwise confusing numbers.

In [ ]:
#classify the size as small or large business
def identify_size(size):
    size = size.lower()
    if '<20' in size or '20-99' in size:
        return 'Small'
    elif '100-499' in size or '500+' in size:
        return 'Large'
    else:
        return None

df_filtered['Business Size'] = df_filtered['Enterprise Size'].apply(identify_size)
df_filtered
Out[ ]:
County Name NAICS Description Enterprise Size Establishments Business Size
188082 Allegany Total 2: <20 employees 931 Small
188083 Allegany Total 3: 20-99 employees 137 Small
188084 Allegany Total 4: 100-499 employees 106 Large
188085 Allegany Total 5: 500+ employees 294 Large
188296 Anne Arundel Total 2: <20 employees 9,698 Small
... ... ... ... ... ...
193519 Worcester Total 5: 500+ employees 243 Large
193725 Baltimore city Total 2: <20 employees 8,190 Small
193726 Baltimore city Total 3: 20-99 employees 1,260 Small
193727 Baltimore city Total 4: 100-499 employees 658 Large
193728 Baltimore city Total 5: 500+ employees 2,257 Large

96 rows × 5 columns

Right now, the dataset contains one row per business-size bracket, but we want per-county data. Using the number of "Establishments" in each row, we total the number of small businesses and the number of large businesses for each county. This gives a dataset where each county has two rows, one for large businesses and one for small businesses. The percentage of small (or large) businesses can then be calculated and added as another column.

In [ ]:
#remove , to convert to numeric type
df_filtered['Establishments'] = df_filtered['Establishments'].str.replace(',', '', regex=False).astype(int)
In [ ]:
#summarize establishments by size for each county
df_summary = (
    df_filtered
    .groupby(['County Name', 'Business Size'], as_index=False)['Establishments']
    .sum()
)

df_summary
Out[ ]:
County Name Business Size Establishments
0 Allegany Large 400
1 Allegany Small 1068
2 Anne Arundel Large 3401
3 Anne Arundel Small 11244
4 Baltimore Large 4554
5 Baltimore Small 15812
6 Baltimore city Large 2915
7 Baltimore city Small 9450
8 Calvert Large 291
9 Calvert Small 1442
10 Caroline Large 107
11 Caroline Small 510
12 Carroll Large 662
13 Carroll Small 3602
14 Cecil Large 345
15 Cecil Small 1405
16 Charles Large 621
17 Charles Small 2135
18 Dorchester Large 127
19 Dorchester Small 539
20 Frederick Large 1371
21 Frederick Small 5097
22 Garrett Large 163
23 Garrett Small 779
24 Harford Large 1198
25 Harford Small 4416
26 Howard Large 2154
27 Howard Small 7712
28 Kent Large 106
29 Kent Small 487
30 Montgomery Large 5868
31 Montgomery Small 21899
32 Prince George's Large 3747
33 Prince George's Small 12497
34 Queen Anne's Large 206
35 Queen Anne's Small 1231
36 Somerset Large 67
37 Somerset Small 288
38 St. Mary's Large 538
39 St. Mary's Small 1541
40 Talbot Large 277
41 Talbot Small 1201
42 Washington Large 993
43 Washington Small 2439
44 Wicomico Large 701
45 Wicomico Small 1786
46 Worcester Large 354
47 Worcester Small 1880
In [ ]:
#calculate the percentages of small vs large businesses for each county
df_summary['Total per County'] = df_summary.groupby('County Name')['Establishments'].transform('sum')
df_summary['Percentage'] = (df_summary['Establishments'] / df_summary['Total per County']) * 100
df_summary['Percentage'] = df_summary['Percentage'].round()
df_summary
Out[ ]:
County Name Business Size Establishments Total per County Percentage
0 Allegany Large 400 1468 27.0
1 Allegany Small 1068 1468 73.0
2 Anne Arundel Large 3401 14645 23.0
3 Anne Arundel Small 11244 14645 77.0
4 Baltimore Large 4554 20366 22.0
5 Baltimore Small 15812 20366 78.0
6 Baltimore city Large 2915 12365 24.0
7 Baltimore city Small 9450 12365 76.0
8 Calvert Large 291 1733 17.0
9 Calvert Small 1442 1733 83.0
10 Caroline Large 107 617 17.0
11 Caroline Small 510 617 83.0
12 Carroll Large 662 4264 16.0
13 Carroll Small 3602 4264 84.0
14 Cecil Large 345 1750 20.0
15 Cecil Small 1405 1750 80.0
16 Charles Large 621 2756 23.0
17 Charles Small 2135 2756 77.0
18 Dorchester Large 127 666 19.0
19 Dorchester Small 539 666 81.0
20 Frederick Large 1371 6468 21.0
21 Frederick Small 5097 6468 79.0
22 Garrett Large 163 942 17.0
23 Garrett Small 779 942 83.0
24 Harford Large 1198 5614 21.0
25 Harford Small 4416 5614 79.0
26 Howard Large 2154 9866 22.0
27 Howard Small 7712 9866 78.0
28 Kent Large 106 593 18.0
29 Kent Small 487 593 82.0
30 Montgomery Large 5868 27767 21.0
31 Montgomery Small 21899 27767 79.0
32 Prince George's Large 3747 16244 23.0
33 Prince George's Small 12497 16244 77.0
34 Queen Anne's Large 206 1437 14.0
35 Queen Anne's Small 1231 1437 86.0
36 Somerset Large 67 355 19.0
37 Somerset Small 288 355 81.0
38 St. Mary's Large 538 2079 26.0
39 St. Mary's Small 1541 2079 74.0
40 Talbot Large 277 1478 19.0
41 Talbot Small 1201 1478 81.0
42 Washington Large 993 3432 29.0
43 Washington Small 2439 3432 71.0
44 Wicomico Large 701 2487 28.0
45 Wicomico Small 1786 2487 72.0
46 Worcester Large 354 2234 16.0
47 Worcester Small 1880 2234 84.0

Having information on both small businesses and large businesses is redundant, as it creates two rows for each county; moreover, focusing on only one is sufficient, since the other can be derived from it if needed. We choose to focus on small businesses.

The percentage of small businesses spans a narrow range without much variability (71% to 86%). To align with our exploration of whether a higher or lower small business presence relates to better county health, we use the percentage to divide counties into "Low" or "High" small business presence, where "Low" counties are those whose small business percentage falls at or below the median (79%).

In [ ]:
#categorize the small business presence as high vs low based on the median percentage value
#use .copy() so we modify an independent DataFrame rather than a view of df_summary
df_categorize_presence = df_summary[df_summary['Business Size'] == "Small"].copy()
median_value = df_categorize_presence['Percentage'].median()

df_categorize_presence['Small_Business_Presence'] = np.where(
    df_categorize_presence['Percentage'] <= median_value,
    'Low',
    'High'
)

cols_to_keep = ['County Name', 'Percentage', 'Small_Business_Presence']
df_final = df_categorize_presence[cols_to_keep]

df_final
Out[ ]:
County Name Percentage Small_Business_Presence
1 Allegany 73.0 Low
3 Anne Arundel 77.0 Low
5 Baltimore 78.0 Low
7 Baltimore city 76.0 Low
9 Calvert 83.0 High
11 Caroline 83.0 High
13 Carroll 84.0 High
15 Cecil 80.0 High
17 Charles 77.0 Low
19 Dorchester 81.0 High
21 Frederick 79.0 Low
23 Garrett 83.0 High
25 Harford 79.0 Low
27 Howard 78.0 Low
29 Kent 82.0 High
31 Montgomery 79.0 Low
33 Prince George's 77.0 Low
35 Queen Anne's 86.0 High
37 Somerset 81.0 High
39 St. Mary's 74.0 Low
41 Talbot 81.0 High
43 Washington 71.0 Low
45 Wicomico 72.0 Low
47 Worcester 84.0 High

The two datasets for county health and business statistics are merged together into a single dataset showing the percentage of small businesses, whether the presence of small businesses is low or high, the percentage of adults who report fair or poor health, the average number of physically unhealthy days, and the average number of mentally unhealthy days.

In [ ]:
# create FINAL MERGED DATASET
df_filtered_health = df_filtered_health.rename(columns={'County': 'County Name'})
df_final.loc[7, 'County Name'] = "Baltimore City"
df_final = pd.merge(df_final, df_filtered_health, on='County Name', how='left')
df_final
Out[ ]:
County Name Percentage Small_Business_Presence % Fair or Poor Health Average Number of Physically Unhealthy Days Average Number of Mentally Unhealthy Days
0 Allegany 73.0 Low 19 4.2 5.0
1 Anne Arundel 77.0 Low 14 3.2 4.1
2 Baltimore 78.0 Low 16 3.4 4.3
3 Baltimore City 76.0 Low 22 4.2 4.8
4 Calvert 83.0 High 14 3.3 4.3
5 Caroline 83.0 High 20 4.3 5.1
6 Carroll 84.0 High 13 3.2 4.3
7 Cecil 80.0 High 16 3.8 4.8
8 Charles 77.0 Low 17 3.4 4.3
9 Dorchester 81.0 High 19 4.2 4.9
10 Frederick 79.0 Low 14 3.2 4.1
11 Garrett 83.0 High 18 4.0 5.0
12 Harford 79.0 Low 14 3.3 4.3
13 Howard 78.0 Low 11 2.7 3.5
14 Kent 82.0 High 16 3.7 4.6
15 Montgomery 79.0 Low 13 2.9 3.6
16 Prince George's 77.0 Low 17 3.5 4.1
17 Queen Anne's 86.0 High 13 3.2 4.3
18 Somerset 81.0 High 25 4.8 5.2
19 St. Mary's 74.0 Low 15 3.5 4.4
20 Talbot 81.0 High 14 3.4 4.3
21 Washington 71.0 Low 18 4.0 4.8
22 Wicomico 72.0 Low 20 4.1 4.9
23 Worcester 84.0 High 16 3.6 4.6
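One caveat worth checking after a left merge: key values that fail to match exactly (such as the "Baltimore city" vs "Baltimore City" casing corrected above) survive the join as rows of NaN rather than raising an error. A minimal sketch on toy data (not the project's actual tables) of how such rows can be detected:

```python
import pandas as pd

# Toy example: a casing mismatch between the two tables leaves a NaN row
# after a left merge, which we can detect explicitly.
left = pd.DataFrame({'County Name': ['Allegany', 'Baltimore city'],
                     'Percentage': [73.0, 76.0]})
right = pd.DataFrame({'County Name': ['Allegany', 'Baltimore City'],
                      '% Fair or Poor Health': [19, 22]})

merged = pd.merge(left, right, on='County Name', how='left')
unmatched = merged.loc[merged['% Fair or Poor Health'].isna(), 'County Name'].tolist()
print(unmatched)
```

In the merged table above, no health columns contain NaN, confirming that every county name matched.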

Exploratory Data Analysis¶

Now that we have prepped our data, let's take a look at how the parts of the data relate to one another.

Summary Statistics¶

We'll start with some descriptive statistics and get a general idea of how the data in each of our features looks.

For All Counties¶

In [ ]:
# Overall summary statistics for all maryland counties

numeric_cols = ['Percentage', '% Fair or Poor Health',
                'Average Number of Physically Unhealthy Days',
                'Average Number of Mentally Unhealthy Days']

summary_stats = df_final[numeric_cols].describe()

iqr = df_final[numeric_cols].quantile(0.75) - df_final[numeric_cols].quantile(0.25)
summary_stats.loc['IQR'] = iqr

summary_stats.loc['median'] = df_final[numeric_cols].median()

summary_stats = summary_stats.loc[['count', 'mean', 'median', 'std', 'min', '25%', '50%', '75%', 'max', 'IQR']]

summary_stats = summary_stats.round(2)

print(summary_stats)
print()
print("Interpretation:")
print("- Percentage: % of establishments that are small businesses in each county")
print("- % Fair or Poor Health: % of adults reporting fair or poor health")
print("- Physically Unhealthy Days: Average number of physically unhealthy days per month")
print("- Mentally Unhealthy Days: Average number of mentally unhealthy days per month")
        Percentage  % Fair or Poor Health  \
count        24.00                  24.00   
mean         79.08                  16.42   
median       79.00                  16.00   
std           3.99                   3.27   
min          71.00                  11.00   
25%          77.00                  14.00   
50%          79.00                  16.00   
75%          82.25                  18.25   
max          86.00                  25.00   
IQR           5.25                   4.25   

        Average Number of Physically Unhealthy Days  \
count                                         24.00   
mean                                           3.63   
median                                         3.50   
std                                            0.51   
min                                            2.70   
25%                                            3.28   
50%                                            3.50   
75%                                            4.03   
max                                            4.80   
IQR                                            0.75   

        Average Number of Mentally Unhealthy Days  
count                                       24.00  
mean                                         4.48  
median                                       4.35  
std                                          0.44  
min                                          3.50  
25%                                          4.30  
50%                                          4.35  
75%                                          4.82  
max                                          5.20  
IQR                                          0.53  

Interpretation:
- Percentage: % of establishments that are small businesses in each county
- % Fair or Poor Health: % of adults reporting fair or poor health
- Physically Unhealthy Days: Average number of physically unhealthy days per month
- Mentally Unhealthy Days: Average number of mentally unhealthy days per month

Here we can see information such as the average percentage of small businesses and the average % Fair or Poor Health across all counties, along with the averages for physically and mentally unhealthy days. For both Average Number of Physically Unhealthy Days and Average Number of Mentally Unhealthy Days, the standard deviation is quite low (0.51 and 0.44, respectively), which suggests little variation across counties in these two features.
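One caveat when reading these numbers: standard deviation is scale-dependent, so the small SDs for the unhealthy-days features partly reflect their small scale. A quick way to compare spread across differently-scaled features is the coefficient of variation (SD divided by mean); the sketch below plugs in the values from the summary table above.

```python
# Coefficient of variation (CV = std / mean) is unitless, so it lets us
# compare spread across features measured on different scales.
# (mean, std) pairs copied from the summary table above.
stats = {
    "Percentage": (79.08, 3.99),
    "% Fair or Poor Health": (16.42, 3.27),
    "Physically Unhealthy Days": (3.63, 0.51),
    "Mentally Unhealthy Days": (4.48, 0.44),
}

cv = {name: round(sd / mean, 3) for name, (mean, sd) in stats.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value}")
```

By this relative measure, % Fair or Poor Health (CV ≈ 0.20) and the unhealthy-days features (CV ≈ 0.14 and 0.10) actually vary more across counties than small business percentage (CV ≈ 0.05) does.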

Grouped by Small Business Presence¶

Let's see what else we can find by comparing descriptive statistics between counties with a low small business presence and those with a high small business presence.

In [ ]:
grouped_stats = df_final.groupby('Small_Business_Presence')[numeric_cols].agg([
    'count', 'mean', 'median', 'std', 'min', 'max',
    lambda x: x.quantile(0.25),
    lambda x: x.quantile(0.75),
    lambda x: x.quantile(0.75) - x.quantile(0.25)
])

grouped_stats = grouped_stats.rename(columns={
    '<lambda_0>': 'Q1',
    '<lambda_1>': 'Q3',
    '<lambda_2>': 'IQR'
})

grouped_stats = grouped_stats.round(2)

print(grouped_stats)
print()

print("\nNumber of counties in each category:")
print(df_final['Small_Business_Presence'].value_counts().sort_index())
                        Percentage                                        \
                             count   mean median   std   min   max    Q1   
Small_Business_Presence                                                    
High                            11  82.55   83.0  1.75  80.0  86.0  81.0   
Low                             13  76.15   77.0  2.76  71.0  79.0  74.0   

                                   % Fair or Poor Health  ...  \
                           Q3  IQR                 count  ...   
Small_Business_Presence                                   ...   
High                     83.5  2.5                    11  ...   
Low                      78.0  4.0                    13  ...   

                        Average Number of Physically Unhealthy Days  \
                                                                IQR   
Small_Business_Presence                                               
High                                                           0.75   
Low                                                            0.80   

                        Average Number of Mentally Unhealthy Days        \
                                                            count  mean   
Small_Business_Presence                                                   
High                                                           11  4.67   
Low                                                            13  4.32   

                                                                 
                        median   std  min  max   Q1    Q3   IQR  
Small_Business_Presence                                          
High                       4.6  0.35  4.3  5.2  4.3  4.95  0.65  
Low                        4.3  0.47  3.5  5.0  4.1  4.80  0.70  

[2 rows x 36 columns]


Number of counties in each category:
Small_Business_Presence
High    11
Low     13
Name: count, dtype: int64
In [ ]:
# detailed comparison of high vs low business presence
comparison = pd.DataFrame()

for col in numeric_cols:
    low_data = df_final[df_final['Small_Business_Presence'] == 'Low'][col]
    high_data = df_final[df_final['Small_Business_Presence'] == 'High'][col]

    comparison[col] = [
        f"Low: {low_data.mean():.2f}, High: {high_data.mean():.2f}",
        f"Low: {low_data.median():.2f}, High: {high_data.median():.2f}",
        f"Low: {low_data.std():.2f}, High: {high_data.std():.2f}",
        f"Low: {low_data.min():.2f}, High: {high_data.min():.2f}",
        f"Low: {low_data.max():.2f}, High: {high_data.max():.2f}",
        f"Low: {(low_data.quantile(0.75) - low_data.quantile(0.25)):.2f}, High: {(high_data.quantile(0.75) - high_data.quantile(0.25)):.2f}"
    ]

comparison.index = ['Mean', 'Median', 'Std Dev', 'Min', 'Max', 'IQR']

print(comparison.to_string())
                      Percentage    % Fair or Poor Health Average Number of Physically Unhealthy Days Average Number of Mentally Unhealthy Days
Mean     Low: 76.15, High: 82.55  Low: 16.15, High: 16.73                       Low: 3.51, High: 3.77                     Low: 4.32, High: 4.67
Median   Low: 77.00, High: 83.00  Low: 16.00, High: 16.00                       Low: 3.40, High: 3.70                     Low: 4.30, High: 4.60
Std Dev    Low: 2.76, High: 1.75    Low: 3.08, High: 3.61                       Low: 0.49, High: 0.51                     Low: 0.47, High: 0.35
Min      Low: 71.00, High: 80.00  Low: 11.00, High: 13.00                       Low: 2.70, High: 3.20                     Low: 3.50, High: 4.30
Max      Low: 79.00, High: 86.00  Low: 22.00, High: 25.00                       Low: 4.20, High: 4.80                     Low: 5.00, High: 5.20
IQR        Low: 4.00, High: 2.50    Low: 4.00, High: 4.50                       Low: 0.80, High: 0.75                     Low: 0.70, High: 0.65

Let's also visualize this data to complement the numbers. Boxplots are helpful here because they map out range and centrality at a glance. Below, boxplots compare counties with a low small business presence to those with a high small business presence on each of the measurements of health.

In [ ]:
#visual representation of summary statistics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Distribution of Variables by Small Business Presence', fontsize=16, fontweight='bold')

variables = [
    ('Percentage', 'Small Business Percentage'),
    ('% Fair or Poor Health', '% Fair or Poor Health'),
    ('Average Number of Physically Unhealthy Days', 'Avg Physically Unhealthy Days'),
    ('Average Number of Mentally Unhealthy Days', 'Avg Mentally Unhealthy Days')
]

for idx, (col, title) in enumerate(variables):
    ax = axes[idx // 2, idx % 2]

    data_low = df_final[df_final['Small_Business_Presence'] == 'Low'][col]
    data_high = df_final[df_final['Small_Business_Presence'] == 'High'][col]

    bp = ax.boxplot([data_low, data_high],
                     tick_labels=['Low', 'High'],
                     patch_artist=True,
                     showmeans=True,
                     meanprops=dict(marker='D', markerfacecolor='red', markersize=8))

    # Color the boxes
    bp['boxes'][0].set_facecolor('lightblue')
    bp['boxes'][1].set_facecolor('lightgreen')

    ax.set_xlabel('Small Business Presence', fontweight='bold')
    ax.set_ylabel(title, fontweight='bold')
    ax.set_title(title, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
[Figure: boxplots of each variable, grouped by low vs. high small business presence]
In [ ]:
# key insights from summary statistics
low_group = df_final[df_final['Small_Business_Presence'] == 'Low']
high_group = df_final[df_final['Small_Business_Presence'] == 'High']

print(f"1. SMALL BUSINESS DENSITY:")
print(f"   - Low group: Mean = {low_group['Percentage'].mean():.2f}%, Median = {low_group['Percentage'].median():.2f}%")
print(f"   - High group: Mean = {high_group['Percentage'].mean():.2f}%, Median = {high_group['Percentage'].median():.2f}%")
print(f"   - Difference in means: {high_group['Percentage'].mean() - low_group['Percentage'].mean():.2f} percentage points")
print()

print(f"2. FAIR OR POOR HEALTH:")
print(f"   - Low group: Mean = {low_group['% Fair or Poor Health'].mean():.2f}%, Median = {low_group['% Fair or Poor Health'].median():.2f}%")
print(f"   - High group: Mean = {high_group['% Fair or Poor Health'].mean():.2f}%, Median = {high_group['% Fair or Poor Health'].median():.2f}%")
print(f"   - Difference in means: {high_group['% Fair or Poor Health'].mean() - low_group['% Fair or Poor Health'].mean():.2f} percentage points")
print()

print(f"3. PHYSICALLY UNHEALTHY DAYS:")
print(f"   - Low group: Mean = {low_group['Average Number of Physically Unhealthy Days'].mean():.2f} days")
print(f"   - High group: Mean = {high_group['Average Number of Physically Unhealthy Days'].mean():.2f} days")
print(f"   - Difference in means: {high_group['Average Number of Physically Unhealthy Days'].mean() - low_group['Average Number of Physically Unhealthy Days'].mean():.2f} days")
print()

print(f"4. MENTALLY UNHEALTHY DAYS:")
print(f"   - Low group: Mean = {low_group['Average Number of Mentally Unhealthy Days'].mean():.2f} days")
print(f"   - High group: Mean = {high_group['Average Number of Mentally Unhealthy Days'].mean():.2f} days")
print(f"   - Difference in means: {high_group['Average Number of Mentally Unhealthy Days'].mean() - low_group['Average Number of Mentally Unhealthy Days'].mean():.2f} days")
print()

print("5. VARIABILITY:")
print(f"   - The IQR for small business percentage is {df_final['Percentage'].quantile(0.75) - df_final['Percentage'].quantile(0.25):.2f}")
print(f"   - The IQR for % Fair/Poor Health is {df_final['% Fair or Poor Health'].quantile(0.75) - df_final['% Fair or Poor Health'].quantile(0.25):.2f}")
print(f"   - Counties show {'high' if df_final['Percentage'].std() > 10 else 'moderate'} variability in small business density (SD = {df_final['Percentage'].std():.2f})")
1. SMALL BUSINESS DENSITY:
   - Low group: Mean = 76.15%, Median = 77.00%
   - High group: Mean = 82.55%, Median = 83.00%
   - Difference in means: 6.39 percentage points

2. FAIR OR POOR HEALTH:
   - Low group: Mean = 16.15%, Median = 16.00%
   - High group: Mean = 16.73%, Median = 16.00%
   - Difference in means: 0.57 percentage points

3. PHYSICALLY UNHEALTHY DAYS:
   - Low group: Mean = 3.51 days
   - High group: Mean = 3.77 days
   - Difference in means: 0.27 days

4. MENTALLY UNHEALTHY DAYS:
   - Low group: Mean = 4.32 days
   - High group: Mean = 4.67 days
   - Difference in means: 0.35 days

5. VARIABILITY:
   - The IQR for small business percentage is 5.25
   - The IQR for % Fair/Poor Health is 4.25
   - Counties show moderate variability in small business density (SD = 3.99)

Small business percentage differs substantially between the low and high presence groups (a 6.39 percentage point difference in means), but this is expected since Small_Business_Presence was computed directly from the small business percentage. The feature with the next largest difference between the two groups is % Fair or Poor Health (0.57 percentage points), followed by Average Number of Mentally Unhealthy Days (0.35 days).

Statistical Method #1¶

From our summary statistics, we observe that of the 3 measurements of health, % Fair or Poor Health has the greatest potential to differ between counties with low versus high small business presence. As we start to target our question, "does the presence of small businesses in Maryland counties have an impact on health outcomes in the respective county?", we can use % Fair or Poor Health in our definition of "health".

Is the density of small businesses in a county related to the health of that county? This is explored using Pearson's R.

But first, "health" is usually thought of in positive terms ("good health"), while our dataset reports the opposite: "% Fair or Poor Health". We address this by creating a new column, "% Good Health", computed as 100 minus "% Fair or Poor Health".

In [ ]:
#create a table with only the relevant data (county name, density of small businesses, and percent of poor/fair health)
#include a new column for percentage of good health (100 - percentage of fair/poor health)
df_density_v_health = df_final[['County Name', '% Fair or Poor Health', 'Percentage']].copy()  # .copy() avoids SettingWithCopyWarning
df_density_v_health['% Good Health'] = 100 - df_density_v_health['% Fair or Poor Health']
df_density_v_health
Out[ ]:
County Name % Fair or Poor Health Percentage % Good Health
0 Allegany 19 73.0 81
1 Anne Arundel 14 77.0 86
2 Baltimore 16 78.0 84
3 Baltimore City 22 76.0 78
4 Calvert 14 83.0 86
5 Caroline 20 83.0 80
6 Carroll 13 84.0 87
7 Cecil 16 80.0 84
8 Charles 17 77.0 83
9 Dorchester 19 81.0 81
10 Frederick 14 79.0 86
11 Garrett 18 83.0 82
12 Harford 14 79.0 86
13 Howard 11 78.0 89
14 Kent 16 82.0 84
15 Montgomery 13 79.0 87
16 Prince George's 17 77.0 83
17 Queen Anne's 13 86.0 87
18 Somerset 25 81.0 75
19 St. Mary's 15 74.0 85
20 Talbot 14 81.0 86
21 Washington 18 71.0 82
22 Wicomico 20 72.0 80
23 Worcester 16 84.0 84

Now we can run Pearson's R.

H0 (Null Hypothesis): Counties with a higher density of small businesses do not differ in the percentage of adults with good health

HA (Alternative Hypothesis): Counties with a higher density of small businesses have a higher percentage of adults with good health

In [ ]:
#run and print out results of Pearson's R
p = pearsonr(df_density_v_health['% Good Health'], df_density_v_health['Percentage'])
print("correlation coefficient:", p.statistic)
print("p value:", p.pvalue)
correlation coefficient: 0.2361976963530344
p value: 0.26649506614445556
In [ ]:
#create and show a scatterplot
plt.scatter(df_density_v_health['Percentage'], df_density_v_health['% Good Health'])
plt.ylabel('% Good Health')
plt.xlabel('Density of Small Businesses')
plt.title('Correlation between Density of Small Businesses and Good Health')
m, b = np.polyfit(df_density_v_health['Percentage'], df_density_v_health['% Good Health'], 1)
plt.plot(df_density_v_health['Percentage'], m*df_density_v_health['Percentage'] + b, color='purple')
plt.show()
[Figure: scatterplot of % good health vs. density of small businesses, with fitted regression line]

The Pearson's R correlation coefficient between counties' density of small businesses and percentage of good health is 0.24, which indicates a weak positive correlation: as the density of small businesses increases, the percentage of good health tends only weakly to increase. Furthermore, the p-value of 0.27 is greater than our significance level (α = 0.05), so this relationship is not statistically significant. The scatterplot shows the data points spread at varying distances from the regression line rather than clustering tightly around it. We therefore fail to reject the null hypothesis: there is not enough evidence to conclude that the percentage of good health changes as the density of small businesses within a county increases.
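A side note on the test itself: our alternative hypothesis is directional (higher density goes with better health), while `pearsonr` runs a two-sided test by default. A minimal sketch of the one-sided version, assuming SciPy ≥ 1.9 for the `alternative` parameter and copying the 24 values from the table above:

```python
from scipy.stats import pearsonr

# Density of small businesses and % good health, copied from the table above
density = [73, 77, 78, 76, 83, 83, 84, 80, 77, 81, 79, 83,
           79, 78, 82, 79, 77, 86, 81, 74, 81, 71, 72, 84]
good_health = [81, 86, 84, 78, 86, 80, 87, 84, 83, 81, 86, 82,
               86, 89, 84, 87, 83, 87, 75, 85, 86, 82, 80, 84]

two_sided = pearsonr(density, good_health)
one_sided = pearsonr(density, good_health, alternative='greater')

print(f"r = {two_sided.statistic:.2f}")
print(f"two-sided p = {two_sided.pvalue:.3f}")
print(f"one-sided p = {one_sided.pvalue:.3f}")  # half the two-sided p when r > 0
```

Even the one-sided p-value (about 0.13) stays well above α = 0.05, so the conclusion is unchanged.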

Statistical Method #2¶

How about if we focused on a specific aspect of health, such as mental health?

Does the mean number of mentally unhealthy days differ significantly between counties with low vs. high proportions of small businesses? This is explored using the ANOVA Test.

Again, let's create a dataset with the information we need.

In [ ]:
# create a clean dataset with only the relevant columns
df_density_v_mental_health = df_final[['County Name', 'Percentage', 'Small_Business_Presence', 'Average Number of Mentally Unhealthy Days']]
df_density_v_mental_health
Out[ ]:
County Name Percentage Small_Business_Presence Average Number of Mentally Unhealthy Days
0 Allegany 73.0 Low 5.0
1 Anne Arundel 77.0 Low 4.1
2 Baltimore 78.0 Low 4.3
3 Baltimore City 76.0 Low 4.8
4 Calvert 83.0 High 4.3
5 Caroline 83.0 High 5.1
6 Carroll 84.0 High 4.3
7 Cecil 80.0 High 4.8
8 Charles 77.0 Low 4.3
9 Dorchester 81.0 High 4.9
10 Frederick 79.0 Low 4.1
11 Garrett 83.0 High 5.0
12 Harford 79.0 Low 4.3
13 Howard 78.0 Low 3.5
14 Kent 82.0 High 4.6
15 Montgomery 79.0 Low 3.6
16 Prince George's 77.0 Low 4.1
17 Queen Anne's 86.0 High 4.3
18 Somerset 81.0 High 5.2
19 St. Mary's 74.0 Low 4.4
20 Talbot 81.0 High 4.3
21 Washington 71.0 Low 4.8
22 Wicomico 72.0 Low 4.9
23 Worcester 84.0 High 4.6

Now we can run our ANOVA.

H0 (Null Hypothesis): The mean number of mentally unhealthy days is the same across counties with low and high percentages of small businesses.

HA (Alternative Hypothesis): The mean number of mentally unhealthy days differs significantly across counties with low and high proportions of small businesses.

In [ ]:
low = df_density_v_mental_health[df_density_v_mental_health['Small_Business_Presence'] == 'Low']
high = df_density_v_mental_health[df_density_v_mental_health['Small_Business_Presence'] == 'High']
statistic, p_value = f_oneway(low['Average Number of Mentally Unhealthy Days'], high['Average Number of Mentally Unhealthy Days'])
print(f"P-value = {p_value}")
P-value = 0.05222867166235262
In [ ]:
plt.figure(figsize=(8, 6))
plt.boxplot([low['Average Number of Mentally Unhealthy Days'],
             high['Average Number of Mentally Unhealthy Days']],
            tick_labels=['Low', 'High'])
plt.xlabel('Small Business Presence')
plt.ylabel('Average Number of Mentally Unhealthy Days')
plt.title('Comparison of Low vs High Small Business Presence \nwith Average Mentally Unhealthy Days')
plt.show()
[Figure: boxplots of average mentally unhealthy days for low vs. high small business presence]

A one-way ANOVA test was performed to examine whether there is a significant difference in the mean number of mentally unhealthy days between counties with low and high proportions of small businesses. The resulting p-value of 0.052 is (just barely) greater than the significance level (α = 0.05), indicating no statistically significant difference, so we fail to reject the null hypothesis (H0). Post-hoc tests apply only when an ANOVA with three or more groups is significant, so none are needed here; with just two groups there are no additional pairwise comparisons to make in any case. There is not enough evidence to conclude that the mean number of mentally unhealthy days differs between counties with low and high proportions of small businesses.
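Worth noting: with only two groups, a one-way ANOVA is mathematically equivalent to an independent two-sample t-test (F = t²), so either test yields the same p-value. A quick check, using the mentally-unhealthy-days values copied from the table above:

```python
from scipy.stats import f_oneway, ttest_ind

# Average mentally unhealthy days, copied from the table above
low = [5.0, 4.1, 4.3, 4.8, 4.3, 4.1, 4.3, 3.5, 3.6, 4.1, 4.4, 4.8, 4.9]
high = [4.3, 5.1, 4.3, 4.8, 4.9, 5.0, 4.6, 4.3, 5.2, 4.3, 4.6]

f_stat, p_anova = f_oneway(low, high)
t_stat, p_ttest = ttest_ind(low, high)  # two-sided, equal variances assumed

print(f"ANOVA:  F = {f_stat:.3f}, p = {p_anova:.4f}")
print(f"t-test: t = {t_stat:.3f}, p = {p_ttest:.4f}")  # same p; F = t^2
```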

Modeling¶

Now that we know how our data looks, what if we wanted to observe other data? What if we looked at the business data of Baltimore city or of a county from a different state? Could we use what we know about Maryland counties to predict the health of another location?

We in fact can! We can create a machine learning model and train it on Maryland county data using business data as the features and use it to predict the health outcomes of a location.

Here, we use a decision tree regressor model to predict the % Fair or Poor Health of an area. Decision trees work by identifying the feature splits that best explain the variability in the data, repeatedly partitioning the dataset into subsets of higher purity; the ordered sequence of these splits becomes the model. This is regression because our target is a continuous value rather than a discrete label, and it is supervised learning because the model learns from labelled data and is expected to produce similar outputs.
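To make the splitting idea concrete, here is a small sketch (with made-up values, not our county data) of how a depth-1 `DecisionTreeRegressor` chooses the single threshold that most reduces squared error, then predicts the mean of each resulting subset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: one feature, a target with an obvious break near 80
X = np.array([[70], [72], [75], [78], [81], [84], [86], [88]])
y = np.array([20.0, 19.0, 18.5, 18.0, 14.0, 13.5, 13.0, 12.5])

# A depth-1 tree (a "stump") makes exactly one split
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

threshold = stump.tree_.threshold[0]  # chosen split point on the feature
left, right = stump.tree_.children_left[0], stump.tree_.children_right[0]
left_mean = stump.tree_.value[left][0][0]    # prediction for X <= threshold
right_mean = stump.tree_.value[right][0][0]  # prediction for X > threshold

print(f"split: feature <= {threshold:.1f}")
print(f"left-leaf prediction:  {left_mean:.3f}")   # mean of y below the split
print(f"right-leaf prediction: {right_mean:.3f}")  # mean of y above the split
```

The stump splits midway between 78 and 81, where the jump in the target is largest, and each leaf predicts the mean of its subset.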

First, we need to prep the data to be given to the model. Models usually cannot handle categorical strings, so non-numerical data must be converted into numerical data.

Then, we can split our dataset into a subset used for training the model and a subset used for testing it. We chose a 70/30 training-to-testing split, so 70% of the data points are used to train the model.

In [ ]:
# Pre-processing
# Made copy of data
df_model = df_final.copy()

# Encoding the categorical feature since scikit-learn cannot handle categorical strings
l = LabelEncoder()
df_model['Small_Biz_Encoded'] = l.fit_transform(df_model['Small_Business_Presence'])


# define features
X = df_model[['Small_Biz_Encoded']]

# define target
y = df_model['% Fair or Poor Health']

# Modeling

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Using the Decision Tree Regression version since output is a percent
dtree = DecisionTreeRegressor(max_depth=1, random_state=42)
dtree.fit(X_train, y_train)

# Predict
y_pred = dtree.predict(X_test)

# Evaluation

# For regression, we look at how far off the predictions are on average
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)

# model's prediction is off by x percent
print(f"Mean Absolute Error: {mae:.2f}")
# explains variance
print(f"R-Squared Score: {r2:.2f}")
Mean Absolute Error: 3.47
R-Squared Score: -0.38

After training, the model is evaluated on the testing set. Here we calculated the Mean Absolute Error (MAE) and the R-Squared score. The MAE of 3.47 means that, on average, predictions were 3.47 percentage points away from actual values; ideally MAE is close to 0, and given the scale of % Fair or Poor Health, being off by 3.47 points is poor. The R-Squared score of -0.38 further confirms the inaccuracy: a negative R² means the model explains none of the variance and actually performs worse than simply predicting the mean.
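To see why a negative R² means worse-than-baseline: R² = 0 corresponds exactly to always predicting the mean of the test targets, so any score below 0 is doing worse than that constant guess. A small sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets, just to anchor the R^2 scale
y_true = np.array([11.0, 14.0, 16.0, 19.0, 25.0])

# A "model" that always predicts the mean scores exactly 0 ...
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))  # 0.0

# ... so predictions farther from the truth than the mean go negative
worse = np.array([20.0, 8.0, 25.0, 10.0, 15.0])
print(r2_score(y_true, worse))
```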

A likely issue is that the dataset the model is trained on is not complex enough. A dataset lacking complexity can lead the model to generalize too broadly and underfit, meaning it cannot pick up on underlying patterns. We attempt to increase the complexity by adding a second feature (the percentage of small businesses) to the feature set.

In [ ]:
# 2 FEATURES INSTEAD OF ONE, SAME MAX DEPTH OF 1
# Pre-processing
# Made copy of data
df_model = df_final.copy()

# Encoding the categorical feature since scikit-learn cannot handle categorical strings
l = LabelEncoder()
df_model['Small_Biz_Encoded'] = l.fit_transform(df_model['Small_Business_Presence'])


# define features
X = df_model[['Small_Biz_Encoded', 'Percentage']]

# define target
y = df_model['% Fair or Poor Health']

# Modeling

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Using the Decision Tree Regression version since output is a percent
dtree = DecisionTreeRegressor(max_depth=1, random_state=42)
dtree.fit(X_train, y_train)

# Predict
y_pred = dtree.predict(X_test)

# Evaluation

# For regression, we look at how far off the predictions are on average
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)

# model's prediction is off by x percent
print(f"Mean Absolute Error: {mae:.2f}")
# explains variance
print(f"R-Squared Score: {r2:.2f}")
Mean Absolute Error: 3.71
R-Squared Score: -0.48

Again the evaluation metrics show that the model is inaccurate: the Mean Absolute Error is still high and the R-Squared score is still negative.

Given that our initial data exploration suggested no significant relationships between small businesses and different aspects of health, it makes sense that our models are inaccurate. There is no relationship between our variables, and thus no pattern for the model to pick up on.

Evaluation and Pivot¶

When creating the decision trees for this dataset, we utilized two forms of our independent variable: the raw 'Percentage' of small businesses in a county and an encoded version ('Small_Biz_Encoded') representing low versus high presence. We tested Decision Tree Regressors with varying max depths to capture potential patterns.

However, upon evaluation, we observed that the R-Squared score was below 0. This indicates that our model performed worse than a baseline that simply predicts the average for every county. We realized that predicting the percentage of fair or poor health likely needs more inputs, since one or two features are not enough to capture it. We therefore decided to pivot to social vulnerability factors (which include business and economic features) and to examine how they can be used to predict fair or poor health. The SVI is a more comprehensive source that aggregates various economic and social factors. We discussed this pivot with Dr. Alam, and she approved it since our goal is to better predict health behaviors.

To summarize: based on our statistical tests and machine learning models, we determined that there is no statistically significant relationship between health outcomes and small business presence. We therefore pivoted our original research question to focus on the relationship between the social vulnerability index and health outcomes instead.

Data Curation¶

Our new dataset for the independent variable is CDC's Social Vulnerability Index for 2022. We downloaded the data just for Maryland counties.

SVI Dataset Source: https://svi.cdc.gov/map25/data/docs/SVI2022Documentation_ZCTA.pdf

In [ ]:
#load the svi dataset
df_svi = pd.read_csv("sample_data/Maryland_county.csv")
print(df_svi.shape)
df_svi.head(10)
(24, 158)
Out[ ]:
ST STATE ST_ABBR STCNTY COUNTY FIPS LOCATION AREA_SQMI E_TOTPOP M_TOTPOP ... EP_ASIAN MP_ASIAN EP_AIAN MP_AIAN EP_NHPI MP_NHPI EP_TWOMORE MP_TWOMORE EP_OTHERRACE MP_OTHERRACE
0 24 Maryland MD 24001 Allegany County 24001 Allegany County, Maryland 422.198632 68161 0 ... 0.9 0.1 0.1 0.1 0.0 0.1 2.9 0.3 0.2 0.1
1 24 Maryland MD 24003 Anne Arundel County 24003 Anne Arundel County, Maryland 414.806257 588109 0 ... 4.0 0.1 0.1 0.1 0.0 0.1 4.9 0.4 0.5 0.1
2 24 Maryland MD 24005 Baltimore County 24005 Baltimore County, Maryland 598.357965 850737 0 ... 6.1 0.1 0.1 0.1 0.0 0.1 3.6 0.2 0.4 0.1
3 24 Maryland MD 24009 Calvert County 24009 Calvert County, Maryland 213.189517 93244 0 ... 2.0 0.4 0.1 0.1 0.1 0.1 5.2 0.8 0.4 0.2
4 24 Maryland MD 24011 Caroline County 24011 Caroline County, Maryland 319.441587 33320 0 ... 0.5 0.2 0.1 0.1 0.0 0.1 4.1 0.7 0.8 0.5
5 24 Maryland MD 24013 Carroll County 24013 Carroll County, Maryland 447.629615 173225 0 ... 2.2 0.1 0.1 0.1 0.0 0.1 2.6 0.4 0.5 0.2
6 24 Maryland MD 24015 Cecil County 24015 Cecil County, Maryland 346.299723 103876 0 ... 1.3 0.3 0.1 0.1 0.0 0.1 4.0 0.7 0.5 0.3
7 24 Maryland MD 24017 Charles County 24017 Charles County, Maryland 457.825693 167035 0 ... 3.2 0.2 0.3 0.1 0.0 0.1 5.7 0.7 0.5 0.3
8 24 Maryland MD 24019 Dorchester County 24019 Dorchester County, Maryland 540.764325 32557 0 ... 1.3 0.1 0.0 0.1 0.0 0.1 5.1 1.2 0.5 0.3
9 24 Maryland MD 24021 Frederick County 24021 Frederick County, Maryland 660.605502 273829 0 ... 4.9 0.2 0.1 0.1 0.0 0.1 4.4 0.4 0.3 0.1

10 rows × 158 columns

We follow the data science pipeline as before, first loading in our data and then cleaning it up.

We can remove the location columns other than "COUNTY". We also adjust the formatting of the county names by removing the word "County", which both improves legibility and makes this dataset consistent with our health dataset. (Baltimore City is an independent city rather than a county, but the SVI data reports it as a county-equivalent, so it remains in the merged data below.)

From the SVI data, we only need to keep the columns which start with EP_ as those contain the actual percentage data for each variable. We check whether any data is missing that we need to handle.

In [ ]:
#filter the dataset and remove columns that aren't needed. for features, keep columns that start with EP_ as they are percentage values

cols = ["COUNTY"] + [col for col in df_svi.columns if col.startswith("EP_")]

df_svi_filtered = df_svi[cols].copy()  # .copy() so later edits don't trigger SettingWithCopyWarning
#check for null values, overall structure of the new dataset
print(df_svi_filtered.isna().sum())
print(df_svi_filtered.shape)
COUNTY          0
EP_POV150       0
EP_UNEMP        0
EP_HBURD        0
EP_NOHSDP       0
EP_UNINSUR      0
EP_AGE65        0
EP_AGE17        0
EP_DISABL       0
EP_SNGPNT       0
EP_LIMENG       0
EP_MINRTY       0
EP_MUNIT        0
EP_MOBILE       0
EP_CROWD        0
EP_NOVEH        0
EP_GROUPQ       0
EP_NOINT        0
EP_AFAM         0
EP_HISP         0
EP_ASIAN        0
EP_AIAN         0
EP_NHPI         0
EP_TWOMORE      0
EP_OTHERRACE    0
dtype: int64
(24, 25)

There does not seem to be any missing data, so we can go ahead and merge the SVI dataset with the county health dataset where we can use % Fair or Poor Health as a measurement for a county's health outcomes.

In [ ]:
#create a NEW merged final dataset
df_svi_filtered.rename(columns={'COUNTY': 'County Name'}, inplace=True)
df_svi_filtered['County Name'] = df_svi_filtered['County Name'].str.replace("county", '', case=False).str.strip()

df_new_final = pd.merge(df_filtered_health, df_svi_filtered, on='County Name', how='left')

#focusing on '% Fair or Poor Health' as our main target variable
df_new_final = df_new_final.drop(columns=['Average Number of Physically Unhealthy Days', 'Average Number of Mentally Unhealthy Days'])
df_new_final
Out[ ]:
County Name % Fair or Poor Health EP_POV150 EP_UNEMP EP_HBURD EP_NOHSDP EP_UNINSUR EP_AGE65 EP_AGE17 EP_DISABL ... EP_NOVEH EP_GROUPQ EP_NOINT EP_AFAM EP_HISP EP_ASIAN EP_AIAN EP_NHPI EP_TWOMORE EP_OTHERRACE
0 Allegany 19 26.4 6.8 23.3 9.4 4.0 20.8 17.6 18.9 ... 9.6 10.6 17.5 7.2 2.0 0.9 0.1 0.0 2.9 0.2
1 Anne Arundel 14 9.7 4.2 19.3 6.5 4.5 15.4 22.4 10.7 ... 3.8 1.8 5.6 17.2 8.7 4.0 0.1 0.0 4.9 0.5
2 Baltimore 16 16.1 5.2 26.6 8.3 5.3 17.6 21.6 11.9 ... 7.8 2.3 10.0 29.6 6.1 6.1 0.1 0.0 3.6 0.4
3 Calvert 14 7.8 3.2 15.9 5.5 3.2 15.5 23.1 9.6 ... 3.1 0.6 8.1 12.1 4.6 2.0 0.1 0.1 5.2 0.4
4 Caroline 20 26.6 4.6 29.1 15.6 6.8 16.9 23.5 14.6 ... 6.1 1.7 14.1 13.2 8.2 0.5 0.1 0.0 4.1 0.8
5 Carroll 13 8.9 3.1 17.1 5.8 2.9 17.4 21.7 12.5 ... 4.0 2.1 9.5 3.7 4.1 2.2 0.1 0.0 2.6 0.5
6 Cecil 16 17.6 4.9 24.5 8.6 3.9 16.4 22.3 13.0 ... 5.2 1.5 10.3 6.6 4.9 1.3 0.1 0.0 4.0 0.5
7 Charles 17 10.1 4.9 20.0 5.6 4.0 12.9 23.8 10.4 ... 3.3 0.8 8.6 48.3 6.7 3.2 0.3 0.0 5.7 0.5
8 Dorchester 19 25.6 5.5 31.5 12.2 5.2 21.9 21.0 17.0 ... 10.0 1.7 14.1 25.6 6.1 1.3 0.0 0.0 5.1 0.5
9 Frederick 14 9.6 3.7 19.0 6.7 4.5 14.9 23.1 9.9 ... 3.9 1.6 8.0 9.8 11.0 4.9 0.1 0.0 4.4 0.3
10 Garrett 18 20.1 4.6 17.9 9.5 5.8 23.0 18.0 18.2 ... 7.4 2.3 16.9 1.3 1.3 0.4 0.0 0.1 1.0 0.4
11 Harford 14 11.9 3.7 19.6 5.8 3.5 16.7 22.2 11.0 ... 4.0 0.6 7.2 14.0 5.0 2.9 0.0 0.0 3.9 0.5
12 Howard 11 8.7 3.8 17.1 4.8 3.9 14.5 24.0 8.4 ... 4.0 0.8 4.1 19.7 7.5 18.8 0.1 0.1 4.9 0.6
13 Kent 16 17.6 3.3 29.9 9.5 4.3 26.8 15.7 15.0 ... 6.5 8.4 20.1 14.1 4.7 1.0 0.0 0.0 2.8 0.6
14 Montgomery 13 12.0 4.8 22.1 8.7 6.7 16.2 22.9 8.8 ... 7.9 0.8 5.2 18.2 20.0 15.2 0.1 0.0 4.2 0.8
15 Prince George's 17 14.9 6.7 27.9 12.9 10.5 14.1 22.1 9.9 ... 9.3 1.9 7.9 60.4 20.0 4.0 0.2 0.0 3.2 0.6
16 Queen Anne's 13 11.2 3.6 22.2 6.7 5.0 19.7 21.2 10.1 ... 3.1 0.9 13.4 5.8 4.6 1.1 0.0 0.0 3.8 0.6
17 St. Mary's 15 12.5 3.7 19.8 8.2 4.0 13.3 24.0 11.3 ... 4.5 2.2 9.7 14.3 5.7 2.5 0.1 0.0 5.4 0.4
18 Somerset 25 29.5 7.0 31.7 15.3 3.9 17.2 16.7 16.1 ... 7.5 19.8 18.8 38.9 4.0 0.8 0.2 0.0 4.1 0.2
19 Talbot 14 16.4 3.2 24.6 7.8 4.4 29.7 18.2 16.7 ... 7.2 1.2 11.8 12.1 7.3 1.3 0.1 0.0 2.5 0.3
20 Washington 18 20.1 5.0 24.8 11.6 5.5 17.7 21.7 15.3 ... 7.9 5.0 14.1 11.1 6.2 1.7 0.1 0.1 4.3 0.2
21 Wicomico 20 23.0 6.9 28.9 11.5 6.5 16.2 22.1 12.2 ... 7.4 3.8 13.4 25.8 5.7 2.8 0.1 0.0 4.6 0.2
22 Worcester 16 15.0 6.5 27.4 6.5 5.8 28.0 17.1 15.3 ... 5.6 1.4 11.4 12.2 3.8 1.5 0.0 0.0 2.9 0.5
23 Baltimore City 22 29.4 6.9 36.3 12.9 5.5 14.8 20.4 16.4 ... 26.5 3.7 18.8 60.7 5.9 2.5 0.2 0.0 3.2 0.5

24 rows × 26 columns

Exploratory Data Analysis¶

Statistical Method #1¶

In this new dataset, we have a lot of variables to work with. This is great as it can give our data more complexity and help a model learn patterns, but it can also be a weakness. More variables mean more computation is required, and this can be especially inefficient if some variables are highly correlated with one another. As part of our data exploration, we want to see if we need to cut down on any of our variables.

Which features are the most correlated with the main target variable (% of Fair or Poor Health)? Which features are correlated with each other and may negatively impact our analysis?

In [ ]:
#calculate the correlation coefficients with the target variable
corr = df_new_final.corr(numeric_only=True)['% Fair or Poor Health'].sort_values(ascending=False)
print(corr)
#keep only the most important features (correlation of 0.6+)
selected_features = corr[corr > 0.6].index.tolist()
print(selected_features)

#filter the final dataset to only include the selected features
df_selected = df_new_final[selected_features]
% Fair or Poor Health    1.000000
EP_POV150                0.903903
EP_NOHSDP                0.843847
EP_UNEMP                 0.757730
EP_HBURD                 0.742666
EP_NOINT                 0.739843
EP_SNGPNT                0.695800
EP_GROUPQ                0.667893
EP_DISABL                0.639614
EP_MOBILE                0.588332
EP_NOVEH                 0.566284
EP_AFAM                  0.462352
EP_AIAN                  0.326299
EP_UNINSUR               0.239222
EP_CROWD                 0.209673
EP_MINRTY                0.208215
EP_AGE65                 0.000193
EP_TWOMORE              -0.096045
EP_MUNIT                -0.100919
EP_LIMENG               -0.131355
EP_NHPI                 -0.163022
EP_HISP                 -0.227812
EP_OTHERRACE            -0.352418
EP_AGE17                -0.394753
EP_ASIAN                -0.489223
Name: % Fair or Poor Health, dtype: float64
['% Fair or Poor Health', 'EP_POV150', 'EP_NOHSDP', 'EP_UNEMP', 'EP_HBURD', 'EP_NOINT', 'EP_SNGPNT', 'EP_GROUPQ', 'EP_DISABL']
In [ ]:
#correlation heatmap to figure out collinearity
plt.figure(figsize=(8, 6))
sns.heatmap(df_selected.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Selected Features")
plt.show()
In [ ]:
#remove the features that have high collinearity and less correlation with the target variable
#threshold is above 0.8
df_selected_final = df_selected.drop(["EP_NOHSDP", "EP_DISABL"], axis=1)
corr_matrix = df_selected_final.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Selected Features")
plt.show()

Collinearity Elimination¶

When examining the correlation matrix, we utilized the 8 most correlated features as the basis of the next step. These were highly correlated with the target variable, with a correlation of 0.6 or above. To ensure that our models are more reliable and avoid redundancy, we then conducted collinearity analysis. From here, we saw that of the top 8 features, there were two pairs that had correlations higher than 0.8 with one another.

The pairs were:

  1. EP_POV150 and EP_NOHSDP
  2. EP_NOINT and EP_DISABL

We decided to keep EP_POV150 and EP_NOINT and remove the other two as they had lower correlation coefficients with the target variable.
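Pairwise correlations catch pairs like these, but collinearity can also involve several features at once. One quick cross-check is the variance inflation factor (VIF); below is a minimal sketch on synthetic data (the column names merely mimic ours and the values are made up, not our county data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
pov = rng.normal(size=n)
nohsdp = 0.9 * pov + rng.normal(scale=0.3, size=n)  # strongly collinear with pov
unemp = rng.normal(size=n)                          # roughly independent
X = pd.DataFrame({"EP_POV150": pov, "EP_NOHSDP": nohsdp, "EP_UNEMP": unemp})

# For each feature, VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing
# feature i on the others. Equivalently, it is the i-th diagonal entry of the
# inverse correlation matrix. VIF above ~5-10 is a common multicollinearity flag.
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vif.round(2))
```

On this synthetic data, EP_POV150 and EP_NOHSDP get VIFs near 10 while EP_UNEMP stays near 1, mirroring the "keep one of each collinear pair" decision above.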

KEY FOR FINAL FEATURES¶

Moving forward, our dataset will only contain the following SVI variables for each county:

  • EP_POV150 (Percentage of persons below 150% poverty estimate)
  • EP_UNEMP (Unemployment Rate estimate)
  • EP_HBURD (Percentage of housing cost-burdened occupied housing units with annual income less than $75,000 (30%+ of income spent on housing costs) estimate, 2018-2022 ACS)
  • EP_NOINT (Adjunct variable - Percentage of households without an internet subscription estimate, 2018-2022 ACS)
  • EP_SNGPNT (Percentage of single-parent households with children under 18 estimate, 2018-2022 ACS)
  • EP_GROUPQ (Percentage of persons in group quarters estimate, 2018-2022 ACS)

Summary Statistics¶

Like before, let's take a look at the descriptive statistics of our dataset to get an idea of what we are working with.

In [ ]:
print(df_selected_final.describe())
       % Fair or Poor Health  EP_POV150   EP_UNEMP   EP_HBURD   EP_NOINT  \
count              24.000000  24.000000  24.000000  24.000000  24.000000   
mean               16.416667  16.695833   4.825000  24.020833  11.608333   
std                 3.269313   6.939144   1.345282   5.484760   4.553443   
min                11.000000   7.800000   3.100000  15.900000   4.100000   
25%                14.000000  10.925000   3.700000  19.525000   8.075000   
50%                16.000000  15.550000   4.700000  23.900000  10.850000   
75%                18.250000  20.825000   5.750000  28.150000  14.100000   
max                25.000000  29.500000   7.000000  36.300000  20.100000   

       EP_SNGPNT  EP_GROUPQ  
count  24.000000  24.000000  
mean    6.483333   3.229167  
std     1.742479   4.283180  
min     3.900000   0.600000  
25%     5.375000   1.125000  
50%     5.850000   1.750000  
75%     7.925000   2.650000  
max     9.800000  19.800000  
In [ ]:
#create a box plot visualizing the data
plt.figure(figsize=(14, 8))

#melt the dataframe to put all columns on the X-axis
cols_to_plot = [col for col in df_selected_final.columns if col != 'County Name']
df_melted = df_selected_final[cols_to_plot].melt(var_name='Variable', value_name='Percentage')

sns.boxplot(
    x='Variable',
    y='Percentage',
    data=df_melted,
    hue='Variable',
    legend=False,
    palette="plasma"
)

plt.title('Distribution of Health & Socioeconomic Factors', fontsize=16)
plt.ylabel('Percentage (%)')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--')

plt.show()

There are noticeable differences between features. Some features have smaller ranges than others, and variability (standard deviation) differs across features as well. From the boxplot above, we can note a couple of outliers for the EP_GROUPQ feature. However, as these outliers come from real Social Vulnerability Index data rather than a recording glitch, it is better to keep them as they are actual values for Maryland counties.
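The outliers flagged in the boxplot follow the standard 1.5×IQR rule; here is a minimal sketch of that rule on made-up values (illustrative only, not the actual EP_GROUPQ column):

```python
import pandas as pd

# Illustrative values only
s = pd.Series([0.6, 0.8, 1.2, 1.7, 1.9, 2.3, 3.7, 5.0, 8.4, 10.6, 19.8])

# A point is an outlier if it falls more than 1.5 * IQR outside the quartiles
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [19.8]
```

Only the extreme value trips the fence, which is the same behavior seaborn's boxplot uses when it draws individual outlier points.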

Statistical Method #2¶

EP_GROUPQ, the percentage of persons in group quarters, is one of the more visually interesting features in the boxplot: its distribution is the most compressed and it contains the most outliers. Let's see if a Pearson's R test between it and % Fair or Poor Health shows anything interesting.

Is the percentage of persons in group quarters related to the health of that county?

H0 (Null Hypothesis): Counties with a higher percentage of persons in group quarters do not differ in the percentage of adults with good health

HA (Alternative Hypothesis): Counties with a higher percentage of persons in group quarters differ in the percentage of adults with good health

In [ ]:
#create a table with only the relevant data (county name, percentage of persons in group quarters, and percent of poor/fair health)
#include a new column for percentage of good health (100 - percentage of fair/poor health)
df_groupq_v_health = df_new_final[['County Name', '% Fair or Poor Health','EP_GROUPQ']].copy()  #copy to avoid SettingWithCopyWarning
df_groupq_v_health['% Good Health'] = 100 - df_groupq_v_health['% Fair or Poor Health']
df_groupq_v_health
Out[ ]:
County Name % Fair or Poor Health EP_GROUPQ % Good Health
0 Allegany 19 10.6 81
1 Anne Arundel 14 1.8 86
2 Baltimore 16 2.3 84
3 Calvert 14 0.6 86
4 Caroline 20 1.7 80
5 Carroll 13 2.1 87
6 Cecil 16 1.5 84
7 Charles 17 0.8 83
8 Dorchester 19 1.7 81
9 Frederick 14 1.6 86
10 Garrett 18 2.3 82
11 Harford 14 0.6 86
12 Howard 11 0.8 89
13 Kent 16 8.4 84
14 Montgomery 13 0.8 87
15 Prince George's 17 1.9 83
16 Queen Anne's 13 0.9 87
17 St. Mary's 15 2.2 85
18 Somerset 25 19.8 75
19 Talbot 14 1.2 86
20 Washington 18 5.0 82
21 Wicomico 20 3.8 80
22 Worcester 16 1.4 84
23 Baltimore City 22 3.7 78
In [ ]:
#run and print out results of Pearson's R
p = pearsonr(df_groupq_v_health['% Good Health'], df_groupq_v_health['EP_GROUPQ'])
print("correlation coefficient:", p.statistic)
print("p value:", p.pvalue)
correlation coefficient: -0.667892503939244
p value: 0.00036201534815581406
In [ ]:
#create and show a scatterplot
plt.scatter(df_groupq_v_health['EP_GROUPQ'], df_groupq_v_health['% Good Health'])
plt.ylabel('% Good Health')
plt.xlabel('Percentage of Persons Living in Group Quarters')
plt.title('Correlation between Persons in Group Quarters and Good Health')
m, b = np.polyfit(df_groupq_v_health['EP_GROUPQ'], df_groupq_v_health['% Good Health'], 1)
plt.plot(df_groupq_v_health['EP_GROUPQ'], m*df_groupq_v_health['EP_GROUPQ'] + b, color='purple')
plt.show()

The Pearson correlation coefficient between counties' percentage of persons in group quarters and percentage of good health is -0.67, which indicates a moderately strong negative correlation. Furthermore, the p-value of 0.000362 is less than our significance level ($\alpha = 0.05$), so the relationship is statistically significant and we reject the null hypothesis: as the percentage of persons in group quarters increases within a county, the percentage of adults reporting good health tends to decrease.

Modeling¶

Let's try some modeling again. With this dataset, our features are our selected SVI variables which we are using to predict county health, indicated by % Fair or Poor Health.

We split the dataset such that 80% of the data is used for training and 20% is saved for testing.

In [ ]:
# declaring the Target and Feature variables
# We drop '% Fair or Poor Health' from X because it is the target
X = df_selected_final.drop(['% Fair or Poor Health'], axis=1)
y = df_selected_final['% Fair or Poor Health']

# 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training shape: {X_train.shape}")
print(f"Testing shape: {X_test.shape}")
Training shape: (19, 6)
Testing shape: (5, 6)

Linear Regression (The Baseline Model):¶

We started with a Linear Regression model to establish a baseline. This model assumes a straightforward, linear relationship between our independent variables (SVI factors) and the dependent variable (% Fair or Poor Health). We trained the model on 80% of our data and tested it on the remaining 20%, using metrics like R-squared ($R^2$) to measure how well the model explains the variance in health data.

In [ ]:
# declaring the Features and Target variables
features = ['EP_POV150', 'EP_UNEMP', 'EP_HBURD', 'EP_NOINT', 'EP_SNGPNT', 'EP_GROUPQ']
target = '% Fair or Poor Health'

# setting up the smaller subplots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

# creating the Regplot for each feature
for i, feature in enumerate(features):
    sns.regplot(
        x=feature,
        y=target,
        data=df_selected_final,
        ax=axes[i],
        color='purple',
        line_kws={"color": "blue"}
    )
    axes[i].set_title(f'{feature} vs. Health')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('% Fair or Poor Health')

plt.tight_layout()
plt.suptitle('Linear Relationships: Individual SVI Factors vs. Poor Health', y=1.02)
plt.show()
In [ ]:
# training model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# predict
lr_preds = lr_model.predict(X_test)

print("Linear Regression Performance: ")
print(f"MAE: {mean_absolute_error(y_test, lr_preds):.2f}")
print(f"MSE: {mean_squared_error(y_test, lr_preds):.2f}")
print(f"R2 Score: {r2_score(y_test, lr_preds):.2f}")
Linear Regression Performance: 
MAE: 1.89
MSE: 7.34
R2 Score: 0.60

R² Score: 0.60 – Our socioeconomic variables explain 60% of the variation in health outcomes across the held-out Maryland counties. This supports the idea that social vulnerability is a substantial determinant of community health, though a fair amount of variance is left unexplained.

MAE (Mean Absolute Error): 1.89 – On average, our model's prediction is off by about 1.89 percentage points. For example, if the actual percentage of people with poor health is 15%, our model might predict roughly 13% or 17%.

The visual evidence is consistent with these metrics. The data points in, for example, the Poverty vs. Health plot cluster fairly tightly around the regression line, which aligns with our moderately strong $R^2$ score of 0.60, and their proximity to the line reflects the low Mean Absolute Error: the model's predictions stay close to reality.

Let's try some other models and see how they compare.

Random Forest Regressor:¶

Socioeconomic factors often interact in complex ways (e.g., high unemployment might only severely impact health when combined with high housing costs). A Random Forest model uses an ensemble of decision trees (100 in our case) to capture these non-linear relationships and interactions. It also provides a robust way to determine Feature Importance, telling us exactly which vulnerability factors are the strongest drivers of poor health.

In [ ]:
# split the data
X_all = df_selected_final.drop(['% Fair or Poor Health'], axis=1)
X_train_all, X_test_all, y_train, y_test = train_test_split(X_all, y, test_size=0.2, random_state=42)

# Training and creating 100 decision trees
rf_model_all_features = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model_all_features.fit(X_train_all, y_train)

# Predict
rf_preds_all_features = rf_model_all_features.predict(X_test_all)

print("Random Forest Performance (All Features): ")
print(f"MAE: {mean_absolute_error(y_test, rf_preds_all_features):.2f}")
print(f"MSE: {mean_squared_error(y_test, rf_preds_all_features):.2f}")
print(f"R2 Score: {r2_score(y_test, rf_preds_all_features):.2f}")
Random Forest Performance (All Features): 
MAE: 1.55
MSE: 5.43
R2 Score: 0.70

The Random Forest Regressor model performs better than linear regression. It has both a lower MAE and MSE score as well as a higher R-Squared score. Thus it predicts values close to actual values, is less affected by outliers, and explains more variability of the data than linear regression.

In [ ]:
# dataframe for feature importances from the model trained with all features
feature_importance_all = pd.DataFrame({
    'Feature': X_all.columns,
    'Importance': rf_model_all_features.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', hue='Feature', data=feature_importance_all, palette='plasma', legend=False)
plt.title('Feature Importance: Which SVI Factors Predict Poor Health?')
plt.xlabel('Importance Score')
plt.ylabel('SVI Feature')
plt.show()

print("Top Predictor:", feature_importance_all.iloc[0]['Feature'])
Top Predictor: EP_POV150

Here we see the ranking of feature importance produced by the model. It finds that EP_POV150, the percentage of persons below 150% of the poverty line, is the most informative feature, and interestingly, EP_UNEMP, the unemployment rate, is the next best. This suggests that economic context may be a strong driver when predicting the health of counties.
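Impurity-based importances like these can be biased, so permutation importance on held-out data is a common cross-check: shuffle one feature at a time and measure how much the test score drops. A minimal sketch on synthetic data (the feature layout and coefficients are illustrative, not our SVI columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))  # column 0: strong signal, 1: weak, 2: pure noise
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the test set and record the average drop in R^2
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean)  # largest drop -> most important feature
```

The strong-signal column produces by far the largest score drop, so an agreement between this ranking and the impurity-based one would strengthen the EP_POV150 finding.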

K-Means Clustering:¶

To analyze the data without a specific target, we used K-Means Clustering. This technique groups counties solely based on their socioeconomic similarities. By creating distinct "clusters" of counties (e.g., Low vs. High Vulnerability), we can objectively test if these socioeconomic groupings align with health outcomes, effectively creating a geographic "health tier" system.

In [ ]:
# Features for clustering (excluding the target variable and County Name)
X_cluster = df_selected_final.drop(columns=['% Fair or Poor Health'])

# Standardize the features to ensure all variables contribute equally
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)

# Determine the optimal number of clusters using the elbow method or domain knowledge. For this example, we use 3 clusters
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')

# Fit the model and add cluster labels to the DataFrame
df_selected_final['Cluster'] = kmeans.fit_predict(X_scaled)

df_selected_final.head()
Out[ ]:
% Fair or Poor Health EP_POV150 EP_UNEMP EP_HBURD EP_NOINT EP_SNGPNT EP_GROUPQ Cluster
0 19 26.4 6.8 23.3 17.5 5.8 10.6 0
1 14 9.7 4.2 19.3 5.6 5.5 1.8 1
2 16 16.1 5.2 26.6 10.0 7.3 2.3 2
3 14 7.8 3.2 15.9 8.1 4.2 0.6 1
4 20 26.6 4.6 29.1 14.1 9.8 1.7 0
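The choice of k = 3 above can be sanity-checked with the elbow method mentioned in the comment: fit K-Means for a range of k and look for where the drop in inertia levels off. A minimal sketch on synthetic blob data (with our dataset you would pass X_scaled instead):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled SVI matrix: three well-separated groups
X, _ = make_blobs(n_samples=120, centers=3, cluster_std=1.0, random_state=42)

# Inertia (within-cluster sum of squares) for k = 1..8; the "elbow" is where
# adding another cluster stops buying a large drop in inertia
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
            for k in range(1, 9)]
for k, inert in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inert:.1f}")
```

Inertia always decreases as k grows, so we look for the bend rather than the minimum; on data with three true groups the curve flattens sharply after k = 3.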
In [ ]:
# Group by Cluster and calculate the average for all columns
cluster_analysis = df_selected_final.groupby('Cluster').mean().sort_values(by='% Fair or Poor Health', ascending=True)

# Show the table
print("Average Health & SVI Metrics by Cluster: ")
print(cluster_analysis)

# bar graph comparison
plt.figure(figsize=(8, 5))
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['% Fair or Poor Health'], hue=cluster_analysis.index, palette='plasma', legend=False)
plt.title('Average Poor Health by Cluster')
plt.xlabel('Cluster ID')
plt.ylabel('Avg % Fair or Poor Health')
plt.show()
Average Health & SVI Metrics by Cluster: 
         % Fair or Poor Health  EP_POV150  EP_UNEMP   EP_HBURD   EP_NOINT  \
Cluster                                                                     
1                    13.818182  10.800000  3.809091  19.700000   8.290909   
2                    16.714286  17.342857  5.171429  25.571429  12.957143   
0                    20.833333  26.750000  6.283333  30.133333  16.116667   

         EP_SNGPNT  EP_GROUPQ  
Cluster                        
1         5.554545   1.218182  
2         6.271429   3.257143  
0         8.433333   6.883333  
In [ ]:
# dictionary of counties for each cluster
df_selected_final_with_county = df_selected_final.copy()
df_selected_final_with_county['County Name'] = df_new_final['County Name']
cluster_groups = df_selected_final_with_county.groupby('Cluster')['County Name'].apply(list)

for cluster_id, counties in cluster_groups.items():
    print(f"\nCluster {cluster_id}:")
    print(", ".join(counties))
Cluster 0:
Allegany, Caroline, Dorchester, Somerset, Wicomico, Baltimore City

Cluster 1:
Anne Arundel, Calvert, Carroll, Charles, Frederick, Harford, Howard, Montgomery, Queen Anne's, St. Mary's, Talbot

Cluster 2:
Baltimore, Cecil, Garrett, Kent, Prince George's, Washington, Worcester

We want to test if there is a statistically significant difference in health outcomes between counties with high poverty versus low poverty.

Null Hypothesis ($H_0$): There is no difference in the mean '% Fair or Poor Health' between high and low poverty counties.

Alternative Hypothesis ($H_1$): There is a significant difference in the mean '% Fair or Poor Health' between high and low poverty counties.

In [ ]:
median_poverty = df_selected_final['EP_POV150'].median()
high_poverty = df_selected_final[df_selected_final['EP_POV150'] > median_poverty]['% Fair or Poor Health']
low_poverty = df_selected_final[df_selected_final['EP_POV150'] <= median_poverty]['% Fair or Poor Health']
t_stat, p_val = stats.ttest_ind(high_poverty, low_poverty)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
T-statistic: 4.3143
P-value: 0.0003

Since the p-value (0.0003) is well below our significance level ($\alpha = 0.05$), we reject the null hypothesis: counties with above-median poverty have a significantly higher mean '% Fair or Poor Health' than counties with below-median poverty (t = 4.31).

K-Nearest Neighbours¶

The K-Nearest Neighbors (KNN) Regressor model is a non-parametric method used for regression tasks, meaning it does not make any assumptions about the underlying data distribution.

KNN Regressor is a good choice for this analysis for several reasons. Firstly, unlike linear regression, KNN can capture complex, non-linear relationships between the independent variables (SVI factors) and the dependent variable (% Fair or Poor Health). Socioeconomic factors often interact in intricate ways that linear models might miss. KNN makes predictions based on the 'k' most similar data points (neighbors) in the training set. This is beneficial because the impact of social vulnerability factors on health might not be uniform across all counties.

In [ ]:
# Instantiate KNeighborsRegressor model
knn_model = KNeighborsRegressor(n_neighbors=5, weights='distance')

# Train the model
knn_model.fit(X_train, y_train)

# Make predictions on X_test
knn_preds = knn_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, knn_preds)
mse = mean_squared_error(y_test, knn_preds)
r2 = r2_score(y_test, knn_preds)

print("KNeighbors Regressor Performance:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.2f}")
KNeighbors Regressor Performance:
MAE: 1.82
MSE: 7.40
R2 Score: 0.60
In [ ]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, knn_preds, alpha=0.6)
plt.title('K-Nearest Neighbors: Actual vs. Predicted % Fair or Poor Health')
plt.xlabel('Actual % Fair or Poor Health (y_test)')
plt.ylabel('Predicted % Fair or Poor Health (knn_preds)')

min_val = min(y_test.min(), knn_preds.min())
max_val = max(y_test.max(), knn_preds.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect Prediction')

plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

The results of the K-Nearest Neighbors (KNN) Regressor model are:

  • Mean Absolute Error (MAE): 1.82 - On average, the KNN model's predictions for the percentage of fair or poor health are off by approximately 1.82 percentage points. This is a good level of accuracy as the predictions are relatively close to the actual values.

  • Mean Squared Error (MSE): 7.40 - The MSE measures the average squared difference between predicted and actual values; a lower MSE indicates better performance. It exceeds the MAE because squaring penalizes larger errors more heavily.

  • R-squared (R2) Score: 0.60 - The R2 score represents the proportion of the variance in the dependent variable (% Fair or Poor Health) that is predictable from the independent variables (SVI factors). An R2 of 0.60 means that 60% of the variability in the percentage of fair or poor health can be explained by the social vulnerability factors included in the model. This suggests a moderately strong relationship, indicating that the SVI features are reasonably effective in predicting health outcomes.

Comparison with Linear Regression:

Comparing the KNN Regressor's performance to the previously run Linear Regression model (MAE: 1.89, R2: 0.60):

  • MAE: The KNN model's MAE of 1.82 is slightly better than the Linear Regression's MAE of 1.89, showing marginally more accurate predictions on average.
  • R2 Score: Both models achieved an R2 score of 0.60, suggesting they explain a similar amount of variance in the health outcome.

While KNN provides a slight improvement in absolute error, both models perform similarly in terms of explaining the variance in health outcomes with the selected SVI features. KNN's non-linear approach did not significantly outperform the linear model, suggesting either that the relationships are predominantly linear or that more complex models would need a larger dataset to learn intricate patterns.
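With only 24 counties, scores from a single 80/20 split rest on just 5 test points. Leave-one-out cross-validation reduces that sensitivity by scoring the model on every county once; a minimal sketch on synthetic data shaped like ours (24 rows, 6 features; with our dataset you would pass X and y instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 6))  # 24 synthetic "counties", 6 features
y = X @ np.array([2.0, 1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=24)

# Each fold trains on 23 rows and tests on the single held-out row,
# so every row contributes exactly one out-of-sample error
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring='neg_mean_absolute_error')
print(f"LOOCV MAE: {-scores.mean():.2f} (+/- {scores.std():.2f})")
```

Reporting the LOOCV mean and spread gives a more stable model comparison than a single split when the dataset is this small.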

Hierarchical Clustering¶

Hierarchical Clustering is an unsupervised machine learning technique that aims to build a hierarchy of clusters, either by starting with individual data points and merging them into clusters or by starting with one large cluster and dividing it.

Why Hierarchical Clustering for this analysis?¶

Hierarchical Clustering is useful for analyzing this specific dataset and research question for a few main reasons. Unlike K-Means, it reveals natural groupings: we can visualize the clustering process and identify how counties group together based on their social vulnerability profiles without making an initial assumption about the number of clusters. We also don't need to specify 'k' beforehand; the dendrogram, a tree-like diagram, visually represents the nested clusters and helps in deciding the optimal number of clusters post-analysis. The dendrogram also provides insight into the relationships between individual data points (here, counties) and clusters, showing which counties are most similar to each other and at what level of similarity they merge. Lastly, hierarchical clustering can be less sensitive to outliers than K-Means, as it doesn't force every point into a rigid cluster structure defined by a centroid.

We aim to uncover if there are distinct types of counties characterized by specific combinations of social vulnerability factors, and subsequently, how these identified clusters might relate to their health outcomes.

In [ ]:
# Perform hierarchical clustering
linked = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(15, 8))
dendrogram(
    linked,
    orientation='top',
    labels=df_selected_final_with_county['County Name'].tolist(),
    distance_sort='descending',
    show_leaf_counts=True
)
plt.title('Hierarchical Clustering Dendrogram of Maryland Counties (SVI Factors)')
plt.xlabel('County')
plt.ylabel('Euclidean distance (Ward)')
plt.tight_layout()
plt.show()

Results and Interpretation¶

The dendrogram visually represents the hierarchical clustering of Maryland counties based on their social vulnerability factors. This allows us to identify natural groupings and relationships between the counties.

Interpretation:¶

  1. Cluster Formation: The vertical lines represent the clusters, and the height at which two clusters are merged indicates the dissimilarity (Euclidean distance in this case) between them. Shorter lines mean more similar clusters.

  2. Identifying Natural Groups: By drawing a horizontal line across the dendrogram, we can determine the number of clusters. The choice of where to cut the dendrogram often involves looking for the longest vertical lines that do not intersect any other clusters too early. For instance, a cut across the dendrogram at a certain height reveals distinct groups of counties.

Observing the dendrogram, we can see several distinct branches. There appears to be a clear split into approximately three main clusters at a relatively high dissimilarity level. Cluster 1 (Leftmost) seems to contain counties with unique, possibly higher, vulnerability profiles, given its relative isolation from other large groups. For example, Allegany and Somerset counties are often found here, suggesting they share a high degree of social vulnerability across the measured factors. Cluster 2 (Middle) is a larger, more diverse group of counties that are moderately vulnerable. These counties share some similarities but are not as extreme as those in Cluster 1 or as affluent as those in Cluster 3. Finally, Cluster 3 (Rightmost) generally includes counties with lower vulnerability profiles, grouping together counties that are more socioeconomically advantaged.

Counties that are joined together at lower heights are more similar to each other. For example, you might observe that "Calvert" and "Carroll" counties might cluster together early, suggesting similar SVI profiles. Counties like "Baltimore City" and "Somerset" are often grouped with other high-vulnerability counties, reflecting their elevated scores across multiple SVI indicators.

Implications:¶

The identified clusters suggest that a one-size-fits-all approach to health interventions might not be effective. Instead, policies and programs could be tailored to the specific social vulnerability profiles of each cluster. Understanding these natural groupings can aid in more equitable and efficient allocation of resources by allowing policymakers to focus on the unique challenges faced by counties within each cluster. The dendrogram provides a clear visual hypothesis about county groupings, which can be further investigated using other statistical methods or qualitative research to understand the specific drivers behind these clusters.

We analyze the hierarchical clusters by grouping the data by cluster, computing average metrics, visualizing health outcomes, and listing the counties in each cluster.

In [ ]:
# fcluster cuts the linkage tree into flat clusters ('maxclust' caps the count at k)
from scipy.cluster.hierarchy import fcluster

k = 3
df_selected_final['Hierarchical_Cluster'] = fcluster(linked, k, criterion='maxclust')

print("Cluster assignments added to df_selected_final:")
display(df_selected_final.head())
Cluster assignments added to df_selected_final:
% Fair or Poor Health EP_POV150 EP_UNEMP EP_HBURD EP_NOINT EP_SNGPNT EP_GROUPQ Cluster Hierarchical_Cluster
0 19 26.4 6.8 23.3 17.5 5.8 10.6 0 1
1 14 9.7 4.2 19.3 5.6 5.5 1.8 1 2
2 16 16.1 5.2 26.6 10.0 7.3 2.3 2 3
3 14 7.8 3.2 15.9 8.1 4.2 0.6 1 2
4 20 26.6 4.6 29.1 14.1 9.8 1.7 0 1

Grouping the data by these clusters and calculating the mean of all numerical columns, including the SVI factors and '% Fair or Poor Health', helps characterize the average profile of each cluster. The results are sorted by '% Fair or Poor Health' to make the health gradient easy to observe.

In [ ]:
cluster_analysis = df_selected_final.groupby('Hierarchical_Cluster').mean().sort_values(by='% Fair or Poor Health', ascending=True)

print("Average Health & SVI Metrics by Hierarchical Cluster: ")
print(cluster_analysis)
Average Health & SVI Metrics by Hierarchical Cluster: 
                      % Fair or Poor Health  EP_POV150  EP_UNEMP   EP_HBURD  \
Hierarchical_Cluster                                                          
2                                 13.888889  10.133333      3.90  18.877778   
3                                 16.400000  17.200000      4.99  25.470000   
1                                 21.000000  27.500000      6.16  30.380000   

                       EP_NOINT  EP_SNGPNT  EP_GROUPQ  Cluster  
Hierarchical_Cluster                                            
2                      7.333333   5.766667   1.255556      1.0  
3                     12.930000   6.110000   2.870000      1.6  
1                     16.660000   8.520000   7.500000      0.0  

A bar plot is created to visualize the average '% Fair or Poor Health' for each hierarchical cluster.

In [ ]:
plt.figure(figsize=(8, 5))
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['% Fair or Poor Health'], hue=cluster_analysis.index, palette='plasma', legend=False)
plt.title('Average % Fair or Poor Health by Hierarchical Cluster')
plt.xlabel('Hierarchical Cluster ID')
plt.ylabel('Avg % Fair or Poor Health')
plt.show()
[Bar plot: Average % Fair or Poor Health by Hierarchical Cluster]

We then generate a list of counties for each hierarchical cluster to understand the composition of each group.

In [ ]:
df_selected_final_with_county['Hierarchical_Cluster'] = df_selected_final['Hierarchical_Cluster']
cluster_groups_hierarchical = df_selected_final_with_county.groupby('Hierarchical_Cluster')['County Name'].apply(list)

print("Counties in each Hierarchical Cluster:")
for cluster_id, counties in cluster_groups_hierarchical.items():
    print(f"\nCluster {cluster_id}:")
    print(", ".join(counties))
Counties in each Hierarchical Cluster:

Cluster 1:
Allegany, Caroline, Dorchester, Somerset, Baltimore City

Cluster 2:
Anne Arundel, Calvert, Carroll, Charles, Frederick, Harford, Howard, Montgomery, St. Mary's

Cluster 3:
Baltimore, Cecil, Garrett, Kent, Prince George's, Queen Anne's, Talbot, Washington, Wicomico, Worcester

Interpreting Hierarchical Clustering Analysis Results¶

Comparison with K-Means Clustering and Health Outcome Patterns¶

The Hierarchical Clustering analysis, visually represented by the dendrogram, reinforced the presence of natural groupings within Maryland counties based on their social vulnerability factors. After cutting the dendrogram to get three main clusters, we can compare these findings to our earlier K-Means clustering results:

  • Hierarchical Cluster 1 (Highest Vulnerability): This cluster generally groups counties with the highest SVI scores and, consequently, the highest average '% Fair or Poor Health' (approximately 21.0%). This cluster includes counties like Allegany, Caroline, Dorchester, Somerset, and Baltimore City.
  • Hierarchical Cluster 2 (Lowest Vulnerability): This cluster has counties with lower SVI scores and the lowest average '% Fair or Poor Health' (approximately 13.9%). Counties such as Anne Arundel, Calvert, Carroll, Charles, Frederick, Harford, Howard, Montgomery, and St. Mary's fall into this group.
  • Hierarchical Cluster 3 (Moderate Vulnerability): The counties in this cluster exhibit moderate SVI scores and a corresponding moderate average '% Fair or Poor Health' (approximately 16.4%). This includes counties like Baltimore, Cecil, Garrett, Kent, Prince George's, Queen Anne's, Talbot, Washington, Wicomico, and Worcester.

Similarities with K-Means:

Both K-Means and Hierarchical Clustering consistently identified three main vulnerability tiers (low, moderate, high). The assignment of counties to these tiers showed significant overlap:

  • High Vulnerability: Counties like Allegany, Somerset, and Baltimore City were consistently placed in the highest vulnerability cluster by both methods, indicating persistently high social vulnerability profiles.
  • Low Vulnerability: Many of the more affluent, less vulnerable counties (e.g., Howard, Montgomery, Anne Arundel, Calvert, Carroll, Frederick) were grouped together in the lowest vulnerability cluster by both K-Means and Hierarchical Clustering.

Differences:

While the broad patterns were similar, there were some minor reassignments of individual counties between the moderate and high/low groups, reflecting subtle differences in how each algorithm defines cluster boundaries. The dendrogram from hierarchical clustering provided a more granular view of these relationships, showing exactly at what level of dissimilarity specific counties merge or split.
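The degree of overlap between the two partitions can be quantified with a contingency table and the adjusted Rand index; in the notebook this would run on the `Cluster` (K-Means) and `Hierarchical_Cluster` columns of `df_selected_final`. The label vectors below are illustrative stand-ins with one reassigned county:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy stand-ins for df_selected_final['Cluster'] (K-Means) and
# df_selected_final['Hierarchical_Cluster']; the last point is
# assigned differently by the two methods.
kmeans_labels = [0, 1, 2, 1, 0, 0, 2, 1, 2, 0]
hier_labels = [1, 2, 3, 2, 1, 1, 3, 2, 3, 2]

# Contingency table: rows are K-Means clusters, columns hierarchical ones.
print(pd.crosstab(pd.Series(kmeans_labels, name='KMeans'),
                  pd.Series(hier_labels, name='Hierarchical')))

# Adjusted Rand index: 1.0 means identical partitions up to relabeling.
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

A high but sub-1.0 index reflects exactly the pattern described above: the two methods agree on the tiers while differing on a few boundary counties.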

Counties in the highest vulnerability clusters (both K-Means and Hierarchical) consistently reported the highest average '% Fair or Poor Health'. Conversely, counties in the lowest vulnerability clusters reported the lowest average '% Fair or Poor Health'.

This consistency across clustering methods further strengthens the conclusion that social vulnerability factors are strong determinants of community health outcomes.


Summary of Insights from All Models¶

Throughout this analysis, we employed several statistical methods and machine learning models to explore the relationship between social vulnerability factors (SVI) and health outcomes in Maryland counties. Here's a summary of the key insights:

1. Initial Statistical Analysis (Pearson's R and ANOVA):¶

  • Pearson's R: Initially, we found a strong, statistically significant negative correlation (Pearson's R = -0.67, p < 0.001) between the percentage of persons in group quarters (EP_GROUPQ) and the percentage of residents reporting good health: as the proportion of individuals living in group quarters increases in a county, reported good health tends to decrease.
  • ANOVA: An ANOVA test for mean mentally unhealthy days between counties with low vs. high proportions of small businesses showed no statistically significant difference (p > 0.05). This led to a pivot in our research focus towards the more comprehensive SVI data.
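Both tests above are one-liners in scipy. The sketch below uses synthetic stand-in data (not the project's county data) with a built-in negative relationship for the correlation and near-identical group means for the ANOVA, mirroring the pattern of results reported:

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(42)

# Stand-in for EP_GROUPQ and a negatively related "% good health" outcome.
group_quarters = rng.uniform(0, 12, 24)
good_health = 85 - 1.2 * group_quarters + rng.normal(0, 1.5, 24)

r, p = pearsonr(group_quarters, good_health)
print(f"Pearson's r = {r:.2f}, p = {p:.4g}")

# ANOVA: mean mentally unhealthy days in counties with low vs. high
# proportions of small businesses (simulated with similar means).
low_sb = rng.normal(4.0, 0.8, 12)
high_sb = rng.normal(4.1, 0.8, 12)
f_stat, p_anova = f_oneway(low_sb, high_sb)
print(f"ANOVA F = {f_stat:.2f}, p = {p_anova:.3f}")
```

When the ANOVA p-value exceeds 0.05, as it did for the small-business comparison, there is no evidence of a difference in group means, which is what motivated the pivot to SVI data.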

2. Feature Selection and Collinearity:¶

  • Correlation analysis identified several SVI factors highly correlated with '% Fair or Poor Health'.
  • Collinearity analysis was performed, leading to the selection of six key SVI features (EP_POV150, EP_UNEMP, EP_HBURD, EP_NOINT, EP_SNGPNT, EP_GROUPQ) to avoid redundancy and improve model reliability.
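A common way to run the collinearity screen described above is to inspect the upper triangle of the absolute correlation matrix and flag feature pairs above a threshold. The frame below is a toy stand-in (with `EP_NOHSDP`, an SVI education variable, made deliberately collinear with poverty for illustration); the notebook would run this on the real merged SVI columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 30

# Toy SVI-style frame; EP_NOHSDP is constructed to be nearly collinear
# with EP_POV150 so the screen has something to flag.
pov = rng.uniform(5, 30, n)
df_svi = pd.DataFrame({
    'EP_POV150': pov,
    'EP_UNEMP': 0.2 * pov + rng.normal(0, 1.5, n),
    'EP_NOHSDP': 0.9 * pov + rng.normal(0, 0.5, n),
    'EP_NOINT': rng.uniform(3, 18, n),
})

# Keep only the upper triangle so each pair appears once, then flag
# pairs whose absolute correlation exceeds a chosen threshold (0.85).
corr = df_svi.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
flagged = upper.stack()
flagged = flagged[flagged > 0.85]
print(flagged)

# One member of each flagged pair is then dropped before modeling.
df_reduced = df_svi.drop(columns=['EP_NOHSDP'])
```

Dropping one member of each highly correlated pair is what reduces redundancy and keeps the downstream regression coefficients interpretable.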

3. Predictive Modeling:¶

  • Linear Regression (Baseline):

    • MAE: 1.89
    • R2 Score: 0.60
    • This model established a strong baseline, indicating that the selected SVI factors explain 60% of the variance in '% Fair or Poor Health', with predictions, on average, being off by less than 2 percentage points. This confirms a substantial linear relationship between social vulnerability and health outcomes.
  • Random Forest Regressor:

    • MAE: 1.55
    • R2 Score: 0.70
    • The Random Forest model slightly improved upon the linear model, explaining 70% of the variance and reducing the MAE. This suggests that while linear relationships are dominant, some non-linear patterns or interactions are captured, leading to better predictive performance.
    • Feature Importance: EP_POV150 (persons below 150% poverty) was identified as the top predictor, underscoring the critical role of economic vulnerability in health disparities.
  • K-Nearest Neighbors (KNN) Regressor:

    • MAE: 1.82
    • R2 Score: 0.60
    • KNN performed similarly to Linear Regression in terms of R2 score but showed a slight improvement in MAE. This indicates that local patterns among counties with similar SVI factors are somewhat predictive, though they do not significantly outperform the global linear relationships captured by the baseline model.
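The three-model comparison above follows a standard sklearn pattern: split once, then fit and score each model on the same held-out set. This sketch uses a synthetic stand-in for the six-feature SVI matrix and the '% Fair or Poor Health' target, so the printed metrics will differ from the project's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: six SVI-like features, outcome driven mostly by
# the first (poverty-like) feature plus noise.
n = 120
X = rng.uniform(0, 30, (n, 6))
y = 10 + 0.3 * X[:, 0] + 0.1 * X[:, 3] + rng.normal(0, 1.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=0),
    'KNN (k=5)': KNeighborsRegressor(n_neighbors=5),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (mean_absolute_error(y_test, pred),
                     r2_score(y_test, pred))
    print(f"{name}: MAE={results[name][0]:.2f}, R2={results[name][1]:.2f}")
```

Scoring every model on the same split is what makes the MAE and R2 comparisons in the summary above apples-to-apples.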

4. Clustering Analysis:¶

  • K-Means Clustering: Identified three distinct clusters of counties based on SVI factors, which also exhibited varying average '% Fair or Poor Health'. This objective grouping confirmed that counties naturally separate into different vulnerability tiers with corresponding health outcomes.
    • Cluster 1: Highest vulnerability, highest average '% Fair or Poor Health'.
    • Cluster 2: Moderate vulnerability and health outcomes.
    • Cluster 3: Lowest vulnerability, lowest average '% Fair or Poor Health'.
  • Hierarchical Clustering: The dendrogram visually reinforced the presence of natural groupings, particularly highlighting three main clusters of counties with varying SVI profiles and suggesting which counties are most similar in their social vulnerability characteristics.

Overall Conclusion:¶

The analysis consistently demonstrates a robust and significant relationship between social vulnerability factors and health outcomes across Maryland counties. Economic indicators, particularly poverty (EP_POV150), emerge as primary drivers of health disparities. Both supervised (regression) and unsupervised (clustering) methods effectively identified and quantified these relationships, providing valuable insights for targeted public health interventions and resource allocation strategies based on the unique social vulnerability profiles of different county groups.

Summary:¶


Data Analysis Key Findings¶

  • KNN Regressor Performance: The K-Nearest Neighbors (KNN) Regressor model achieved a Mean Absolute Error (MAE) of 1.82, a Mean Squared Error (MSE) of 7.40, and an R-squared (R2) score of 0.60. While slightly improving the MAE over Linear Regression (1.89), it explained the same proportion of variance (R2 of 0.60 for both models).
  • Hierarchical Clustering Identified Three Distinct Clusters:
    • Cluster 1 (Highest Vulnerability): Counties like Allegany, Caroline, Dorchester, Somerset, and Baltimore City exhibited the highest average social vulnerability and the highest average percentage of fair or poor health (approximately 21.0%).
    • Cluster 2 (Lowest Vulnerability): Counties such as Anne Arundel, Calvert, Carroll, Charles, Frederick, Harford, Howard, Montgomery, and St. Mary's showed the lowest average social vulnerability and the lowest average percentage of fair or poor health (approximately 13.9%).
    • Cluster 3 (Moderate Vulnerability): This cluster included counties like Baltimore, Cecil, Garrett, Kent, Prince George's, Queen Anne's, Talbot, Washington, Wicomico, and Worcester, exhibiting moderate social vulnerability and an average percentage of fair or poor health of approximately 16.4%.
  • Consistency Across Clustering Methods: Both K-Means and Hierarchical Clustering consistently identified three main vulnerability tiers with a clear health gradient, where higher vulnerability clusters corresponded to poorer health outcomes.
  • Role of Economic Vulnerability: Throughout the analysis, particularly with the Random Forest Regressor, EP_POV150 (persons below 150% poverty) was identified as the top predictor, underscoring the critical role of economic vulnerability in health disparities.
  • Impact of Group Quarters: Initial analysis showed a strong negative correlation (Pearson's R = -0.67, p < 0.001) between the percentage of persons in group quarters (EP_GROUPQ) and the percentage of good health, indicating that higher proportions of individuals in group quarters are associated with lower reported good health.

Insights or Next Steps¶

Our findings point to several insights about the relationship between county health and social vulnerability.

  • Targeted Interventions: The identified clusters provide a basis for tailoring public health interventions and resource allocation strategies to address the unique social vulnerability profiles and health disparities of different county groups. There is likely no one-size-fits-all solution to increasing health that can be applied to all counties.
  • Focus on Economic Factors: Given the consistent emergence of poverty as a key predictor, further research and policy efforts should prioritize addressing economic vulnerability to improve health outcomes across Maryland counties.

Further steps for this investigation could include comparing these results against additional features, such as average income, degree of urbanization, or prominent local industries, to see how SVI relates to other county characteristics.