Offense Code Crime Trends

A line plot is created to visualize the crime trend over the years for all offense code groups. The plt.figure(figsize=(15, 18)) line sets the size of the figure to be created, specifying a width of 15 units and a height of 18 units. The sns.lineplot function is utilized for generating the line plot, where the x-axis represents the years (‘YEAR’), the y-axis represents the count of crimes (‘COUNT’), and different offense code groups are distinguished by color, thanks to the ‘hue’ parameter. The data used for plotting is sourced from the crimedf_new DataFrame.

The legend function is applied to place the legend outside the plot area, specifically to the left of the plot. Additional adjustments to the legend’s appearance are made using parameters such as loc, bbox_to_anchor, fancybox, shadow, and borderpad.
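
A minimal sketch of this step is shown below, assuming crimedf_new already holds 'YEAR' and 'COUNT' columns plus an offense-code-group column (the name OFFENSE_CODE_GROUP is an assumption here):

import matplotlib.pyplot as plt
import seaborn as sns

# Large canvas so the many offense code groups stay readable
plt.figure(figsize=(15, 18))

# One line per offense code group, distinguished by colour via hue
sns.lineplot(data=crimedf_new, x="YEAR", y="COUNT", hue="OFFENSE_CODE_GROUP")

# Place the legend outside the axes, to the left of the plot
plt.legend(loc="center right", bbox_to_anchor=(-0.1, 0.5),
           fancybox=True, shadow=True, borderpad=1)
plt.show()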

In summary, this code produces a comprehensive line plot that visually depicts the crime trend over the years, considering various offense code groups. By incorporating different colors for each offense code group and providing a clear legend, the plot facilitates the interpretation of how different types of crimes have evolved over the specified time period. The figure size ensures that the plot is appropriately scaled to accommodate the complexity of visualizing multiple offense code groups over the years.

Crime Occurrences

A line graph is generated to visualize the pattern of crime occurrences over the years. The lineplot3 variable is assigned the result of plotting a line graph based on the crime_count_by_year DataFrame. In this plot, the x-axis represents the years, while the y-axis indicates the count of crimes for each respective year. The figure size is set to 12 units in width and 6 units in height, optimizing the visual representation of the line graph.
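
A sketch of this step, assuming crime_count_by_year contains 'YEAR' and 'COUNT' columns (the column names are assumptions), could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
# Yearly totals: one point per year, joined by a line
lineplot3 = sns.lineplot(data=crime_count_by_year, x="YEAR", y="COUNT")
lineplot3.set(xlabel="Year", ylabel="Number of Crimes")
plt.show()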

The primary objective of this visualization is to observe and interpret the temporal patterns and trends in crime incidents over the specified time period. A line graph is particularly effective in showcasing trends, allowing for the identification of any upward or downward trajectories in crime rates over the years.

In summary, this code contributes to a comprehensive analysis of the temporal dynamics within the crime dataset by extending the visualization to focus on annual trends. The line graph serves as a valuable tool for interpreting patterns and gaining insights into the overall trajectory of crime occurrences across different years.

Visualization

The matplotlib and seaborn libraries are imported for data visualization. The sns.set(style="whitegrid") line sets the style of the seaborn plots to a white grid background. Subsequently, a matplotlib figure (fig) and axis (ax) are created, specifying a size of 20 units in width and 5 units in height.

The main focus of this code is on visualizing the temporal distribution of crimes. It utilizes a bar plot (barplot1) to display the number of crimes versus the date. The data for the plot is derived from the previously created DataFrame (crime_count_by_date). Specifically, the first 30 dates with the maximum number of crimes are selected using iloc[:30, :]. The bar plot is configured to show these date-crime count relationships, with the bars colored in green.

To enhance the readability of the plot, axis labels are set with ax.set(ylabel="Number of Crimes", xlabel="Date"). Additionally, the x-axis labels (dates) are rotated by 45 degrees for better visibility, and their alignment is adjusted to the right. The font weight is set to 'light', and the font size is increased to 'large', ensuring that the x-axis labels remain legible even with the rotation.
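
A sketch of the plotting code described above, assuming crime_count_by_date has 'DATE' and 'COUNT' columns (assumed names) and is already sorted by count in descending order:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(20, 5))

# The 30 dates with the highest crime counts
top_dates = crime_count_by_date.iloc[:30, :]
barplot1 = sns.barplot(data=top_dates, x="DATE", y="COUNT", color="green", ax=ax)

ax.set(ylabel="Number of Crimes", xlabel="Date")
# Rotate and restyle the date labels so they remain legible
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right",
                   fontweight="light", fontsize="large")
plt.show()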

Overall, this code segment creates a visually informative bar plot that effectively communicates the temporal patterns of crimes, highlighting the specific dates with the highest crime counts in the dataset.

Results

The Python code performs various operations on a crime dataset. It begins by importing the necessary libraries, such as pandas for data manipulation, and suppresses warning messages. The dataset is loaded into a pandas DataFrame named crime_df from a CSV file. Information about the DataFrame, including its dimensions and column names, is then displayed. The code checks for the presence of NULL values in the dataset and identifies that there are some. Subsequently, it examines the distribution of values in the ‘SHOOTING’ column and removes this column from the DataFrame. Rows containing any NULL values are dropped, resulting in a cleaned DataFrame named cleaned_crimedf.
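
The loading and cleaning steps might look roughly like the following sketch (the CSV file name is a placeholder):

import warnings
import pandas as pd

warnings.filterwarnings("ignore")              # suppress warning messages

crime_df = pd.read_csv("crime.csv")            # placeholder file name
print(crime_df.shape, crime_df.columns)        # dimensions and column names
print(crime_df.isnull().sum())                 # NULL counts per column
print(crime_df["SHOOTING"].value_counts(dropna=False))

# Drop the sparsely populated SHOOTING column, then drop rows with any NULLs
cleaned_crimedf = crime_df.drop(columns=["SHOOTING"]).dropna()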

The code proceeds to handle temporal data, converting the ‘OCCURRED_ON_DATE’ column from string format to a timestamp and splitting it into separate ‘DATE’ and ‘TIME’ columns. The first five rows of the cleaned DataFrame are then displayed. Grouping the data by date, the code generates a new DataFrame (crime_count_by_date) that represents the count of crimes for each day, sorting the results in descending order based on the count. Finally, the first five rows and general information of this grouped DataFrame are printed, providing insights into the temporal distribution of crimes in the dataset.
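
A sketch of the temporal processing and grouping, continuing from the cleaning sketch above:

import pandas as pd

# Convert the string timestamps and split them into DATE and TIME columns
cleaned_crimedf["OCCURRED_ON_DATE"] = pd.to_datetime(cleaned_crimedf["OCCURRED_ON_DATE"])
cleaned_crimedf["DATE"] = cleaned_crimedf["OCCURRED_ON_DATE"].dt.date
cleaned_crimedf["TIME"] = cleaned_crimedf["OCCURRED_ON_DATE"].dt.time
print(cleaned_crimedf.head())

# Count incidents per day and sort from the busiest day downward
crime_count_by_date = (cleaned_crimedf.groupby("DATE").size()
                       .reset_index(name="COUNT")
                       .sort_values("COUNT", ascending=False))
print(crime_count_by_date.head())
crime_count_by_date.info()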

Method

Crime rate prediction through machine learning involves a systematic approach to harnessing data and leveraging algorithms for accurate forecasting. The following method outlines the key steps involved in developing a machine learning model for predicting crime rates:

Data Collection:

Gather comprehensive and relevant datasets containing historical crime data. Include features such as time, location, type of crime, weather conditions, and socioeconomic factors.

Data Preprocessing:

Clean the dataset by handling missing values, outliers, and inconsistencies. Convert categorical variables into numerical representations through techniques like one-hot encoding. Normalize or standardize numerical features to ensure consistent scaling.
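
As a sketch, continuing with the cleaned crime DataFrame from earlier and using hypothetical column names (DISTRICT as a categorical feature, HOUR and MONTH as numeric ones):

import pandas as pd
from sklearn.preprocessing import StandardScaler

features = cleaned_crimedf[["DISTRICT", "HOUR", "MONTH"]].copy()

# One-hot encode the categorical column
features = pd.get_dummies(features, columns=["DISTRICT"])

# Standardize the numeric columns to a common scale
scaler = StandardScaler()
features[["HOUR", "MONTH"]] = scaler.fit_transform(features[["HOUR", "MONTH"]])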

Feature Engineering:

Select relevant features that are likely to influence crime rates based on domain knowledge and exploratory data analysis. Create new features or transform existing ones to enhance the model’s predictive capabilities.

Data Splitting:

Divide the dataset into training and testing sets to evaluate the model’s performance on unseen data. Optionally, set aside a validation set for fine-tuning hyperparameters during model development.
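
For example, with scikit-learn (X and y below are placeholders for the feature matrix and the crime-rate target):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optionally carve a validation set out of the training portion for tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)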

Model Selection:

Choose a suitable machine learning algorithm for regression tasks, such as Linear Regression, Decision Trees, Random Forest, or Gradient Boosting. Consider ensemble methods for improved predictive accuracy.

Model Training:

Train the selected model on the training dataset, utilizing features to predict crime rates. Adjust hyperparameters to optimize the model’s performance.
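
A sketch of the training step, using a Random Forest as one of the candidate algorithms (the hyperparameter values are illustrative starting points, not tuned results):

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)
model.fit(X_train, y_train)          # learn the mapping from features to crime rates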

Model Evaluation:

Assess the model’s performance on the testing set using evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared. Identify areas where the model may be overfitting or underfitting.
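
These metrics can be computed directly with scikit-learn, assuming the fitted model and the splits from the previous steps:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# Comparing training and test scores hints at over- or underfitting
print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))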

Fine-Tuning:

If necessary, fine-tune the model by adjusting hyperparameters or exploring different algorithms. Utilize techniques like cross-validation for robust model validation.
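
One way to combine both ideas is a cross-validated grid search; this is a sketch and the parameter grid is purely illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200, 400], "max_depth": [5, 10, None]}

# 5-fold cross-validation over the grid, scored by negative MSE
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)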

Deployment:

Once satisfied with the model’s performance, deploy it to predict crime rates in real-world scenarios. Implement monitoring mechanisms to track the model’s ongoing accuracy and make adjustments as needed.

Issues

Data Quality and Availability Challenges:
– Limited or inconsistent information on crime incidents, demographics, and socio-economic factors poses a significant challenge.
– The accuracy of predictive models relies on the quality and availability of comprehensive data.

Bias and Fairness Considerations:
– Addressing biases in the prediction process is crucial to ensure fairness in law enforcement practices.
– Predictive models must be developed and implemented with measures to prevent the perpetuation or exacerbation of existing biases.

Enhancing Proactive Crime Prevention:
– The primary goal of implementing crime rate prediction is to enhance proactive crime prevention strategies in the city of Boston.
– Predictive modeling allows law enforcement to anticipate and respond effectively to emerging crime patterns.

Efficient Resource Allocation:
– Predictive models enable more efficient resource allocation by directing patrols and interventions to high-risk areas.
– This targeted approach serves the dual purpose of deterring criminal activities and maintaining public safety.

Tailored Crime Prevention Initiatives:
– Predictive modeling assists in formulating strategic crime prevention initiatives and community engagement programs.
– These initiatives are tailored to specific neighborhoods and demographics, addressing the unique challenges faced by different communities.

MTH 522 Project 3 Introduction

The analysis of crime incident reports plays a pivotal role in understanding and addressing public safety concerns within communities. These reports serve as a comprehensive record, documenting various criminal activities, incidents, and law enforcement responses. By delving into these reports, law enforcement agencies, policymakers, and researchers gain valuable insights into the patterns, trends, and dynamics of criminal behavior. Such insights are crucial for devising effective crime prevention strategies, allocating resources efficiently, and enhancing overall public safety measures. Crime incident reports provide a detailed narrative of incidents, including time, location, nature of the crime, and the actions taken by law enforcement. This information not only aids in the immediate response to incidents but also contributes to long-term strategies aimed at creating safer environments for residents and businesses alike. As communities strive to create secure living spaces, the examination of crime incident reports emerges as an indispensable tool in fostering evidence-based decision-making and collaborative efforts to mitigate criminal activities.

Moreover, the significance of crime incident reports extends beyond law enforcement circles to encompass broader societal implications. Accessible and transparent reporting mechanisms empower communities by fostering a sense of awareness and accountability. Residents, local organizations, and policymakers can utilize these reports to advocate for improved safety measures, engage in community discussions, and work collaboratively towards creating environments that deter criminal activities.

Whether The Person Has Mental Illness

The bar plot suggests that the presence of signs of mental illness in individuals shot may contribute to the incidence of fatal police shootings in certain cases. Nevertheless, individuals displaying no signs of mental illness were more prone to being shot by the police compared to those with mental health conditions.

Armed Status of People Shot

In this plot, it is evident that the armed status of individuals has an impact on fatal police shootings. A greater percentage of individuals shot by the police were armed with guns, followed by those armed with knives, in comparison to unarmed individuals.

Number of Fatal Police Shootings per Year

Based on the graph, it is evident that the number of fatal police shootings experienced a significant increase from 2016 to 2022. However, a notable and abrupt decrease is observed in 2023. This decline raises the possibility that missing data or discrepancies in reporting may contribute to the observed decrease. To confirm the accuracy of this trend and ascertain whether there were instances where the police shot victims or accused individuals, further investigation and consideration of the actual circumstances are warranted.

Methodology

In conducting this analysis, we employed the Washington Post Police Shootings Database, a comprehensive source offering details on police shootings within the United States from 2015 to 2023. To ensure the robustness of our analysis, we took measures to address missing age values, replacing them appropriately. Additionally, we handled NaN and null values across all other columns before utilizing the data for plotting purposes. This meticulous approach ensures the integrity and completeness of our dataset, allowing us to derive meaningful insights with confidence.

Libraries used:
1. pandas -> for handling datasets
2. matplotlib -> for plotting graphs to find the correlation between the different columns

MTH Project 2

In the second project, our focus revolves around a comprehensive examination of key demographic factors among individuals involved in police shootings. Leveraging data sourced from the Washington Post Police Shootings Database, we aim to delve into the age distribution, race, mental health conditions, gender, and other pertinent variables within the dataset. The primary objective is to extract meaningful insights that shed light on the age demographics of individuals affected by incidents of police violence.

To achieve this goal, our analysis will extend beyond statistical examination. We intend to conduct a thorough exploration of the various demographic dimensions, paying special attention to age groups, racial backgrounds, prevalence of mental health conditions, and gender disparities among those who have experienced police shootings. By scrutinizing these factors, we aspire to construct a detailed and nuanced analysis report that not only quantifies the demographic landscape but also offers valuable contextual insights.

K-Fold Cross Validation & Bootstrapping

K-fold cross-validation is a widely used approach for estimating test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K−1 parts (combined), and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, …, K, and then the results are combined. Since each training set is only (K−1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. The bias is minimized when K = n, but this estimate has high variance.
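
A small sketch of K-fold cross-validation with scikit-learn, using synthetic data purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

# K = 5: each fifth of the data is held out once while the rest trains the model
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
print("Estimated test MSE:", -scores.mean())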

Bootstrapping is a resampling method used in statistics to estimate the sampling distribution of a statistic by repeatedly resampling from the original data with replacement. It’s commonly used for estimating confidence intervals or assessing the variability of a statistic when the population distribution is unknown or difficult to model. Let’s say we have a dataset consisting of exam scores for a class of 50 students.

Bootstrapping involves creating multiple bootstrap samples by randomly selecting data points from the original dataset with replacement. Each bootstrap sample has the same size as the original dataset (50 in this case). We compute the statistic of interest (e.g., mean, median, standard deviation) for each of the bootstrap samples; for this example, let’s calculate the mean of each bootstrap sample. We repeat the resampling and the calculation of the statistic a large number of times (typically thousands of times) to generate a distribution of the statistic. This distribution represents the variability of the statistic under different random samples. We can use the resulting bootstrap distribution to estimate confidence intervals or assess the variability of the statistic. For instance, we might calculate the 95% confidence interval for the mean exam score based on the bootstrap distribution.
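
A sketch of this procedure for the exam-score example (the scores themselves are simulated here for illustration):

import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(40, 100, size=50)        # simulated exam scores for 50 students

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample 50 scores with replacement and record the mean of each resample
    sample = rng.choice(scores, size=scores.size, replace=True)
    boot_means[i] = sample.mean()

# 95% confidence interval for the mean from the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean: {scores.mean():.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")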

Resampling Methods

These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model. For example, they provide estimates of test-set prediction error and of the standard deviation and bias of our parameter estimates.

Distinction between the Test Error and Training Error:

Test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method. In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training. But the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Bias-Variance Trade-off:

Bias and variance together make up the prediction error, and there is a trade-off between them: as model flexibility increases, bias decreases while variance increases. The test error is smallest where the sum of the two is minimized, so bias and variance together determine the test error.

Validation-Set Approach:

Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the response for the observations in the validation set. The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.

Drawbacks of the validation approach:

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set. In the validation approach, only a subset of the observations (those included in the training set rather than in the validation set) are used to fit the model. This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set. Why? In general, the more data one has, the lower the error.

Relation between Pre-Molt and Post-Molt

We examine the relationship between the pre-molt and post-molt sizes of crabs using statistical analysis. When we compare the histograms of crabs’ sizes pre-molt and post-molt side by side, we observe that the shape of the distributions is quite similar. The only notable distinction is a mean difference of 143.898 − 129.212 = 14.6858. The question is whether this difference in means is statistically significant. To tackle this issue, we could employ a common statistical method known as a t-test. The estimated p-value is p = 0.0341998. With a p-value < 0.05, we reject the null hypothesis that there is no real difference.

The primary use of a t-test is to assess whether there is a significant difference in the means of two populations. Furthermore, applying a t-test to compare two means using suitable software may not inherently provide a clear understanding of how the p-value was computed. For these reasons we carry out a Monte-Carlo procedure to calculate a p-value for the observed difference in means under a null hypothesis that assumes no real difference. We have 472 post-molt data points and another set of 472 pre-molt data points. If we combine these two sets into one, resulting in a combined dataset of 944 points, and then randomly divide it into two separate buckets, namely Bucket A with 472 data points and Bucket B containing what remains, we can calculate the difference in means between these buckets. This process is repeated N times, and we keep a record of how many times n the difference in means is greater than or equal to 14.6858. The probability, denoted as P, is then calculated as P = n/N.
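
A sketch of both the t-test and the Monte-Carlo procedure, assuming pre_molt and post_molt are arrays holding the 472 pre-molt and 472 post-molt sizes:

import numpy as np
from scipy import stats

observed_diff = post_molt.mean() - pre_molt.mean()       # about 14.6858 here

# Ordinary two-sample t-test for the difference in means
t_stat, p_ttest = stats.ttest_ind(post_molt, pre_molt)

# Monte-Carlo / permutation procedure under the null of no real difference
combined = np.concatenate([post_molt, pre_molt])
rng = np.random.default_rng(0)
N = 10_000
n_extreme = 0
for _ in range(N):
    rng.shuffle(combined)
    bucket_a, bucket_b = combined[:472], combined[472:]
    if bucket_a.mean() - bucket_b.mean() >= observed_diff:
        n_extreme += 1

p_monte_carlo = n_extreme / N
print(p_ttest, p_monte_carlo)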

Linear Regression Model With More Than One Predictor Variable

We initially have a response variable Y and a simple linear regression mean function:

Y = β0 + β1 x1 + ϵ

Now, let’s introduce a second variable X2, and aim to understand how Y depends on both X1 and X2 simultaneously. By incorporating X2 into the analysis, we create a mean function that considers the values of both X1 and X2:

Y = β0 + β1 x1 + β2 x2 + ϵ

The primary objective in including X2 is to account for the portion of Y that hasn’t already been explained by X1.

% Diabetes (Predicted) ← % Inactivity, % Obesity (Predictors or Factors)

The Generalized Linear Model extends the concept of linear regression by introducing a link function that relates the linear model to the response variable and by permitting the measurement variance to be influenced by the predicted value of each measurement.

Breusch-Pagan Test

A p-value represents the probability of observing data as extreme as ours if the null hypothesis were true, and it is crucial in deciding whether to reject the null hypothesis. The Breusch-Pagan test, on the other hand, is used to detect heteroskedasticity. The test involves running a regression where we predict the squared residuals from the initial regression model using the predictor variables, and then evaluating the significance of these coefficients. If these coefficients significantly deviate from zero, it suggests the presence of heteroskedasticity. Based on the p-value, if we fail to reject the null hypothesis (H0), it implies that the data do not exhibit heteroskedasticity; conversely, opting for the alternative hypothesis suggests the presence of heteroskedasticity in the data. If the p-value of the test falls below a specific significance threshold (e.g., α = .05), we reject the null hypothesis and infer that the regression model exhibits heteroskedasticity.
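
With statsmodels, the test can be run on the residuals of the fitted regression; X and y below are placeholders for the predictor matrix and the response:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit the initial regression
X_const = sm.add_constant(X)
ols_fit = sm.OLS(y, X_const).fit()

# Regress the squared residuals on the predictors and test the coefficients
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X_const)
print("Breusch-Pagan p-value:", lm_pvalue)     # below 0.05 suggests heteroskedasticity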


Linear and Multiple Linear Regression

Linear regression is a statistical technique that involves estimating the value of one variable based on the values of other variables. In the context of our class discussion, we explored how to predict the percentage of diabetes based on the percentage of inactivity alone, represented as (%diabetes = α + β %inactivity + ε), where % diabetes is considered the dependent variable, and % inactivity serves as the independent variable. We can also extend this approach to a multiple linear regression method, which incorporates more than one independent variable, such as (%diabetes = α + β1 %inactivity + β2 %obesity + ε).

When conducting multiple linear regression to predict %diabetes using both %inactivity and %obesity as independent variables, it’s important to note that we have a limited dataset comprising only 354 data points. In this scenario, we need to construct a model that describes the relationship between %diabetes and these two independent variables based on this relatively small dataset.
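
A sketch of this multiple regression with statsmodels; df and the column names pct_diabetes, pct_inactivity, and pct_obesity are placeholders for the 354-row dataset discussed in class:

import statsmodels.api as sm

X = sm.add_constant(df[["pct_inactivity", "pct_obesity"]])   # α plus the two predictors
y = df["pct_diabetes"]

ols_fit = sm.OLS(y, X).fit()
print(ols_fit.summary())       # estimates of α, β1 (%inactivity) and β2 (%obesity)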