library(mosaic)
library(tidyverse)
library(pander)
library(car)
library(DT)
census <- read_csv('C:\\Users\\andre\\OneDrive\\Documents\\GitHub\\math325_stats_notebook_personal\\Statistics-Notebook-master\\Data\\census.csv')
crime_rates = c(409.1,
758.9,
431.5,
645.3,
499.5,
492.5,
150,
383.5,
258.9,
367,
259.6,
241.4,
287.3,
306.2,
286.5,
414.6,
214.1,
628.6,
103.3,
398.5,
322,
461,
280.6,
245,
488,
417.9,
282.8,
454,
125.6,
202.9,
780.5,
429.3,
405.1,
279.6,
293.6,
419.7,
342.4,
279.9,
172.3,
491.3,
377.4,
621.6,
431.9,
241.8,
221.9,
234,
375.6,
277.9,
297,
201.9)
states = c(
'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'
)
states_abbr = c('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL',
'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME',
'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE',
'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY')
#creates data taken from FBI API
crime_by_state = data.frame(
ST = states,
abbr = states_abbr,
rate = crime_rates
)
#replaces states nums with names
states_key <- c(
"01" = "Alabama", "02" = "Alaska", "04" = "Arizona", "05" = "Arkansas",
"06" = "California", "08" = "Colorado", "09" = "Connecticut", "10" = "Delaware",
"11" = "District of Columbia", "12" = "Florida", "13" = "Georgia", "15" = "Hawaii",
"16" = "Idaho", "17" = "Illinois", "18" = "Indiana", "19" = "Iowa",
"20" = "Kansas", "21" = "Kentucky", "22" = "Louisiana", "23" = "Maine",
"24" = "Maryland", "25" = "Massachusetts", "26" = "Michigan", "27" = "Minnesota",
"28" = "Mississippi", "29" = "Missouri", "30" = "Montana", "31" = "Nebraska",
"32" = "Nevada", "33" = "New Hampshire", "34" = "New Jersey", "35" = "New Mexico",
"36" = "New York", "37" = "North Carolina", "38" = "North Dakota", "39" = "Ohio",
"40" = "Oklahoma", "41" = "Oregon", "42" = "Pennsylvania", "44" = "Rhode Island",
"45" = "South Carolina", "46" = "South Dakota", "47" = "Tennessee", "48" = "Texas",
"49" = "Utah", "50" = "Vermont", "51" = "Virginia", "53" = "Washington",
"54" = "West Virginia", "55" = "Wisconsin", "56" = "Wyoming"
)
agg_crime <- census %>% filter(AGEP >= 18) %>% select(POVPIP, PINCP, SCHL, ST) %>% group_by(ST) %>% mutate_all(as.numeric) %>% summarize(avg_pov_inc_ratio = mean(POVPIP), avg_income = mean(PINCP), avg_ed_lvl = mean(SCHL))
agg_crime <- merge(agg_crime, crime_by_state, by='ST')
According to the U.S. Census Bureau, almost 38 million Americans live in poverty. The national violent crime rate in 2022, according to the FBI, was 380.7 occurrences per 100,000 people.
This data pertaining to census information comes from a census survey conducted by the United States Census Bureau called the American Community Survey (ACS). More information about available data can be found on their website. They have multiple free APIs available for public use for data extraction, but the data from this set comes from the ACS 1-Year Estimates Public Use Microdata Sample (PUMS) from 2022. Observations were recorded for 3,373,378 individuals, and reflects the entirety of the United States. The data set was resampled for easier processing. Attributes in this data set are encoded, but better descriptions of each variable can be found at data.census.gov. The most important metric in this analysis is the income to poverty ratio (POVPIP). This represents an individual’s income relative to the federal poverty line expressed as a percentage.
Rates of violent crime by state were extracted from the FBI’s Crime Data Explorer (CDE) API for the year 2022 for every state. Rates are given by occurences per 100,000 people.
datatable(head(census, 100), extensions = 'Responsive', caption = 'Attributes for American Individuals - 2022')
datatable(crime_by_state, caption = 'Crime Rates per 100,000 - 2022')
This section is broken up into two parts. The first part establishes a link between poverty and crime. The second part identifies actionable metrics to focus on.
Grouping the data by state and aggregating the means of education level, income, income to poverty ratio, violent crime rates by state can be compared to these metrics.
Below is the aggregate table representing these numbers.
datatable(agg_crime, caption = 'Averages Aggregated by State - 2022')
A simple linear regression model can be constructed to evaluate the relationship between income-poverty ratio an violent crime. The equation can be modeled by:
\[ \underbrace{Y_i}_\text{Crime Rate} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Average Ratio} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]
This equation models crime rate (\(Y_i\)) by the average income-poverty ratio for each state linearly. The rate of change in crime rate is represented by \(\beta_1\), which can be evaluated with a hypothesis test with the following hypotheses against a significance level of \(\alpha = 0.05\):
\[H_0: \beta_1 = 0\hspace{1cm} \text{(There is no linear relationship)}\] \[H_a: \beta_1 \ne 0\hspace{1cm} \text{(There is a linear relationship)}\]crime2.lm <- lm(rate ~ avg_pov_inc_ratio, data = agg_crime)
summary(crime2.lm) %>% pander()
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1142 | 262.6 | 4.351 | 7.045e-05 |
avg_pov_inc_ratio | -2.468 | 0.8256 | -2.989 | 0.004401 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
50 | 139.2 | 0.1569 | 0.1394 |
Given a p-value of approximately 0.004, there is evidence to suggest that there is a linear relationship. Given that \(\beta_0=1142\), the model predicts that at an average ratio of 0 (which implies an income of $0), the rate of violent crime in that state will be 1142 per 100,000 people, which is very high compared to the national average. Given \(\beta_1 = -2.4688\), the model predicts that on average, the crime rate will decrease by about 2.5 for every 1 unit increase in the income-poverty ratio. The fitted model is given by:
\[ \underbrace{Y_i}_\text{Crime Rate} = 1142 -2.468 \underbrace{X_i}_\text{Average Ratio} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]ggplot(agg_crime, aes(x = rate, y = avg_pov_inc_ratio)) + geom_point(color = 'red') + theme_minimal() + labs(x = 'Income-Poverty Ratio', y = 'Violent Crime Rate', title = 'Violent Crime Rate by Income-Poverty Ratio - 2022') + geom_smooth(, method = 'lm', se = FALSE, formula = y ~ x, color = 'darkred')
As the plot demonstrates, there is a downward trend in crime rates for states that have higher income-poverty ratios. This suggests that the higher income rises above the poverty line, the lower violent crime rates get.
paste0('Correlation Coefficient: ', round(cor(agg_crime$rate ~ agg_crime$avg_pov_inc_ratio), 2)) %>% pander()
Correlation Coefficient: -0.4
The calculated correlation coefficient of -0.4 lends further evidence to the existence of a moderate, negative linear relationship between the variables.
Diagnostic plots can be used to evaluate the validity of the model:
par(mfrow = c(1, 3))
plot(crime2.lm, which = 1)
qqPlot(crime2.lm, id = FALSE)
plot(crime2.lm$residuals)
The quantile-comparison plot of the residuals shows that there are few deviations from normality. There is mild inequality in the variances, and the general trend in the line of best fit in the left-hand plot lends further support to the assumption that the relationship between the variables is linear. Overall, these diagnostic plots do not provide significant evidence to refute the validity of the model. However, it is worth noting that the correlation coefficient and the low \(R^2\) value (\(R^2=0.1569\)) do suggest that the model is a weak fit for the data. It only manages to capture the overall trend in the data rather than serve as an accurate predictor of crime rates.
There is significant evidence to suggest a negative linear relationship between income-poverty ratio and rates of violent crime across states. So what can be done about it?
This section of the analysis aims to evaluate the relationship between race and poverty. If certain demographics can be determined to have lower income-poverty ratios, interested parties can better cater their efforts to the most affected groups in the population. This section attempts to identify whether one or more races have lower ratios than the others. To answer this question, and analysis of variance will be designed and performed. Individuals will be separated into the following racial distinctions: white, black, Asian, mixed race, and other race. The question at hand is, “Do different races experience different income-poverty ratios?”
census <- census %>% mutate(race = case_when(
RAC1P == 1 ~ 'white',
RAC1P == 2 ~ 'black',
RAC1P == 6 ~ 'asian',
RAC1P == 9 ~ 'mixed',
.default = 'other'
))
ggplot(census, aes(y = POVPIP, x = race, fill = race)) + geom_boxplot(alpha = 0.5) +labs(title = "Distribution of Income-Poverty Ratio by Race", x = "Race", y = "Income-Poverty Ratio %") + theme_minimal() + theme(axis.title.x = element_text(margin = margin(t = 20)), axis.title.y = element_text(margin = margin(r = 20)), legend.position = "none")
Based on this plot, black individuals tend to experience lower income to poverty ratios compared to the other groups. This indicates that on average, black individuals may be closer to the poverty line than individuals of other races. Conversely, Asian individuals seem to exhibit the highest ratios. The significance of these differences will be tested.
In this test, the means of each group can be compared to one another to determine whether one or more of the means are different from the rest. The hypotheses for this test are as follows:
\[H_0: \text{The means are all equal.}\] \[H_a: \text{At least one mean is different.}\] As before, as significance level of 0.05 will be used to evaluate the p-value of this test.
#resampling because original data was too large to plot. Resampled data is largely similar in distribution to original
census_sample <- sample(census, size = 3373)
#ggplot(census_sample, aes(y = POVPIP, x = race, fill = race)) + geom_boxplot(alpha = 0.5) +labs(title = "New Sample Plot for Comparison to Full Dataset", x = "Race", y = "Income-Poverty Ratio %") + theme_minimal() + theme(axis.title.x = element_text(margin = margin(t = 20)), axis.title.y = element_text(margin = margin(r = 20)), legend.position = "none")
pov_race.aov <- aov(POVPIP ~ race, data = census_sample)
summary(pov_race.aov) %>% pander(caption = 'ANOVA')
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
race | 4 | 4758830 | 1189707 | 40.74 | 2.17e-33 |
Residuals | 3368 | 98345475 | 29200 | NA | NA |
par(mfrow = c(1,2))
plot(pov_race.aov, which = 1)
qqPlot(pov_race.aov$residuals, id = FALSE)
The results of the ANOVA show that at least one of the means is different (p<0.001). However, there are many issues with this test. The residuals are not normally distributed, and the assumption of constant variance is questionable. As we can see in the plot above, the data for each group do not seem to be normally distributed. Thus, an analysis of variance is likely inappropriate. A non-parametric version will be performed.
A Kruskal-Wallis test can be applied to the data to answer the same question, and it should be more appropriate given the skewed distributions of the groups. The hypotheses for this test concern the distributions of each group rather than the means. If the samples are distributed equally, this implies that their sample statistics will be similar. The test can be compared to a significance level of \(\alpha = 0.05\). They can be written as follows:
\[ H_0:\text{All samples come from the same distribution} \\ H_a:\text{At least one sample's distribution is stochastically different} \]
kruskal.test(POVPIP ~ race, data = census_sample) %>% pander(caption = 'Kruskal-Wallis Rank Sum Test')
Test statistic | df | P value |
---|---|---|
162.3 | 4 | 4.647e-34 * * * |
The p-value of this test (p<0.001) is extremely low compared to \(\alpha\), suggesting there is evidence in support of the alternative hypothesis. It does appear that at least one of the samples is stochastically different. The model is justified because of the skewness of the samples. This suggests that different races do exhibit different average income-poverty ratios. Based on the graphic, it appears that black Americans and other minority racial groups do seem to be closer to the poverty on average than other racial groups in America, especially compared to Asians and white Americans.
Based on the results of this analysis, it is reasonable to assume that there is a relationship between income-poverty ratios and crime rates. Because of this, it is recommended that actions are taken to increase these ratios across the states. Certain racial groups do seem to be more likely to have lower income-poverty ratios, particularly black Americans. It is recommended that steps be taken to improve socioeconomic conditions in this community. Ideally in doing so, violent crime rates across the country may be lowered.