Analyzing 911 data¶
By: Neeraj Singh Rawat¶
Quick Introduction to this Dataset¶
Simple but informative -- that's the goal of this dataset.
Introduction : 911¶
Created by Congress in 2004 as the 911 Implementation and Coordination Office (ICO), the National 911 Program is housed within the National Highway Traffic Safety Administration at the U.S. Department of Transportation and is a joint program with the National Telecommunications and Information Administration in the Department of Commerce.
The emergency call system serves as a way to give the public easy and fast access to a public safety answering point. The number 911 was chosen because it is easy to remember, had not previously been used as an office, area, or service code, and fit the switching configuration plans of the telephone industry.
The main purpose of the 911 service is to reach the emergency departments (Fire, EMS, Police and Traffic) and route each case to the specific department that needs to act urgently. In modern times the significance of IT services keeps growing: the volume of data increases at an exponential rate, and so does the number of emergency calls. Processing this much data is not something a human analyst can do alone, which is where machine learning and data analysis models come in; they can find patterns and relations in previous years' data and help predict when certain events are likely to occur, so that some of them can be prevented.
This paper tries to build a model that can extract the location information and assess the importance of newly available transcripts from the continuation records. A side analysis identifying the hot-spot areas that use the 911 service most heavily is also documented.
Here I will try to demonstrate the following:
- Loading the data
- Pivot table
- Various graphs
- Percent change
- Seaborn heatmap
- Import numpy and pandas
- Import visualization libraries and set %matplotlib inline
import pandas as pd #pandas
import numpy as np #NumPy
import datetime
import warnings
warnings.filterwarnings("ignore") #ignore non-fatal warnings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Working with the Data¶
- Read in the csv file as a dataframe called df
df = pd.read_csv('911.csv') #911.csv is file name
df.head() # top 5 rows of data
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |
df.tail() # bottom 5 rows of data
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 99487 | 40.132869 | -75.333515 | MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... | 19401.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:06:00 | NORRISTOWN | MARKLEY ST & W LOGAN ST | 1 |
| 99488 | 40.006974 | -75.289080 | LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... | 19003.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:07:02 | LOWER MERION | LANCASTER AVE & RITTENHOUSE PL | 1 |
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 |
Sneak peek into the dataset¶
So in the dataset we have the location of the call, the reason, the address, etc.
df.info()
print('\n\n\n')
print(df.describe()) # The describe() function in pandas is very handy in getting various summary statistics.
# This function returns the count, mean, standard deviation, minimum and maximum values and
# the quantiles of the data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   lat        99492 non-null  float64
 1   lng        99492 non-null  float64
 2   desc       99492 non-null  object
 3   zip        86637 non-null  float64
 4   title      99492 non-null  object
 5   timeStamp  99492 non-null  object
 6   twp        99449 non-null  object
 7   addr       98973 non-null  object
 8   e          99492 non-null  int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB

                lat           lng           zip        e
count  99492.000000  99492.000000  86637.000000  99492.0
mean      40.159526    -75.317464  19237.658298      1.0
std        0.094446      0.174826    345.344914      0.0
min       30.333596    -95.595595  17752.000000      1.0
25%       40.100423    -75.392104  19038.000000      1.0
50%       40.145223    -75.304667  19401.000000      1.0
75%       40.229008    -75.212513  19446.000000      1.0
max       41.167156    -74.995041  77316.000000      1.0
Summary of Dataset¶
print('Rows :',df.shape[0])
print('Columns :',df.shape[1])
print('\nFeatures :\n :',df.columns.tolist())
print('\nMissing values :',df.isnull().values.sum())
print('\nUnique values : \n',df.nunique())
Rows : 99492
Columns : 9

Features :
 : ['lat', 'lng', 'desc', 'zip', 'title', 'timeStamp', 'twp', 'addr', 'e']

Missing values : 13417

Unique values :
 lat          14579
lng          14586
desc         99455
zip            104
title          110
timeStamp    72577
twp             68
addr         21914
e                1
dtype: int64
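Most of the 13,417 missing values sit in the zip column (the rest are in twp and addr, as the df.info() output above shows); a small sketch using the same df makes the per-column breakdown explicit:
# count missing values per column; only zip, twp and addr have gaps
print(df.isnull().sum())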
Attributes¶
df['twp'].values
array(['NEW HANOVER', 'HATFIELD TOWNSHIP', 'NORRISTOWN', ..., 'NORRISTOWN', 'HORSHAM', 'UPPER GWYNEDD'], dtype=object)
df.index
RangeIndex(start=0, stop=99492, step=1)
df['lat'].dtype
dtype('float64')
All good data analysis projects begin with trying to answer questions. Now that we know what columns we have, let's think of some questions or insights we would like to obtain from the data. Here's a list of questions we'll try to answer using our data analysis skills!¶
First some basic questions:
- Where do most of the calls come from?
- Which townships (twp) have the most calls?
- How many unique titles are there?
- What is the most common reason for a call?
- lat: Float variable, Latitude
- lng: Float variable, Longitude
- desc: String variable, Description of the Emergency Call
- zip: Float variable, Zip code (read as float because of missing values)
- title: String variable, Title
- timeStamp: String variable, YYYY-MM-DD HH:MM:SS
- twp: String variable, Township
- addr: String variable, Address
- e: Integer variable, Dummy variable (always 1)
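Because NaN cannot live in a plain integer column, zip codes display as 19401.0 rather than 19401. If cleaner values are wanted, an optional sketch (assuming a pandas version that supports the nullable Int64 dtype; the rest of this analysis keeps zip as a float) is:
# convert zip to a nullable integer type; missing zips become <NA> instead of NaN
zip_codes = df['zip'].astype('Int64')
zip_codes.head()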
df['zip'].value_counts().head().plot.bar();
plt.xlabel('Zip Code') # Top 5 zip
plt.ylabel('Count')
plt.show()
###################
df['zip'].value_counts().head()
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64
The most calls come from zip code 19401, which corresponds to Norristown, Pennsylvania, United States.
df['twp'].value_counts().head(5).plot.bar();
plt.xlabel('Township') # top 5 townships
plt.ylabel('Count')
plt.show()
################
df['twp'].value_counts().head()
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64
Lower Merion township has the highest number of calls.
df['title'].nunique() # Number of unique title
# OR we can use
#len(df['title'].unique())
110
Creating a column with the reason: The title column contains the general reason for the call together with a more detailed description. There are three basic categories of call: EMS, Fire and Traffic.
x=df['title'].iloc[0]
x.split(':')[0]
'EMS'
df['Reason']=df['title'].apply(lambda x:x.split(':')[0])
df['Reason'].unique()
array(['EMS', 'Fire', 'Traffic'], dtype=object)
With the above transformation we have created a Reason column with the values EMS, Fire and Traffic.
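The same column can also be built without apply, using pandas' vectorized string methods; a minimal equivalent sketch:
# vectorized alternative: split each title on ':' and keep the part before it
df['Reason'] = df['title'].str.split(':').str[0]
df['Reason'].unique()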
# reason for most calls
f,ax=plt.subplots(1,2,figsize=(18,8))
df['Reason'].value_counts().plot.pie(explode=[0,0.1,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Reason for Call')
ax[0].set_ylabel('Count')
sns.countplot(x='Reason',data=df,ax=ax[1],order=df['Reason'].value_counts().index)
ax[1].set_title('Count of Reason')
plt.show()
49.1% of calls are for a medical emergency (EMS), followed by Traffic (35.9%) and Fire (15.0%).
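These percentages can be read off the pie chart, or computed exactly; a short sketch using value_counts with normalize=True:
# share of calls per reason, as percentages
reason_pct = df['Reason'].value_counts(normalize=True) * 100
print(reason_pct.round(1))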
Working with Time Data¶
The time data is stored as strings. We need to convert it to datetime format.
df['timeStamp']=pd.to_datetime(df['timeStamp'])
type(df['timeStamp'].iloc[0])
pandas._libs.tslibs.timestamps.Timestamp
time=df['timeStamp'].iloc[0]
time.hour, time.year, time.month, time.dayofweek
(17, 2015, 12, 3)
df['Hour']=df['timeStamp'].apply(lambda x:x.hour)
df['Month']=df['timeStamp'].apply(lambda x:x.month)
df['DayOfWeek']=df['timeStamp'].apply(lambda x:x.dayofweek)
dmap={0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
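The same three columns can also be derived with the .dt accessor instead of apply, which is the more idiomatic (and usually faster) route; a minimal equivalent sketch:
# vectorized datetime attribute access instead of apply(lambda ...)
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['DayOfWeek'] = df['timeStamp'].dt.dayofweek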
Calls Per Month
byMonth=df.groupby('Month').count()
byMonth['lat'].plot();
sns.lmplot(x='Month',y='twp',data=byMonth.reset_index());
mmap={1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
df['Month']=df['Month'].map(mmap)
df['DayOfWeek']=df['DayOfWeek'].map(dmap)
df.head()
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | Dec | Thu |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | Dec | Thu |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | Dec | Thu |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 | EMS | 17 | Dec | Thu |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 | EMS | 17 | Dec | Thu |
Calls during the Week
sns.set_style('darkgrid')
f,ax=plt.subplots(1,2,figsize=(18,8))
k1=sns.countplot(x='DayOfWeek',data=df,ax=ax[0],palette='viridis')
k2=sns.countplot(x='DayOfWeek',data=df,hue='Reason',ax=ax[1],palette='viridis')
More emergency calls happen on Friday, and EMS calls are the most common on every day of the week.
Calls during the Month
Creating a groupby object called byMonth by grouping the DataFrame by the Month column and using the count() method for aggregation, then calling head() on the result.
byMonth.head()
| Month | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13205 | 13205 | 13205 | 11527 | 13205 | 13205 | 13203 | 13096 | 13205 | 13205 | 13205 | 13205 |
| 2 | 11467 | 11467 | 11467 | 9930 | 11467 | 11467 | 11465 | 11396 | 11467 | 11467 | 11467 | 11467 |
| 3 | 11101 | 11101 | 11101 | 9755 | 11101 | 11101 | 11092 | 11059 | 11101 | 11101 | 11101 | 11101 |
| 4 | 11326 | 11326 | 11326 | 9895 | 11326 | 11326 | 11323 | 11283 | 11326 | 11326 | 11326 | 11326 |
| 5 | 11423 | 11423 | 11423 | 9946 | 11423 | 11423 | 11420 | 11378 | 11423 | 11423 | 11423 | 11423 |
A simple plot of the DataFrame indicating the count of calls per month.
byMonth['twp'].plot()
<AxesSubplot:xlabel='Month'>
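The introduction also promised a look at percent change; a minimal sketch computing the month-over-month change in call volume from the byMonth counts built above (using the e column, which is 1 for every call, so its count is the number of calls):
# month-over-month percent change in the number of calls
monthly_calls = byMonth['e']
print((monthly_calls.pct_change() * 100).round(1))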
- Creating a new column called 'Date' that contains the date from the timeStamp column.
- Groupby Date column with the count() aggregate and create a plot of counts of 911 calls.
df['Date']=df['timeStamp'].apply(lambda x: x.date())
df.groupby('Date').count()['twp'].plot()
plt.tight_layout()
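An equivalent way to get daily counts, without creating a Date column first, is to resample directly on the timestamp index; a minimal alternative sketch:
# set the timestamp as the index and count calls per calendar day
daily = df.set_index('timeStamp').resample('D')['e'].count()
daily.plot()
plt.tight_layout()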
- Recreate the plot but create 3 separate plots with each plot representing a Reason for the 911 call.
df[df['Reason']=='Traffic'].groupby('Date').count()['twp'].plot()
plt.title('Traffic')
plt.tight_layout()
df[df['Reason']=='Fire'].groupby('Date').count()['twp'].plot()
plt.title('Fire')
plt.tight_layout()
df[df['Reason']=='EMS'].groupby('Date').count()['twp'].plot()
plt.title('EMS')
plt.tight_layout()
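The three plots above are identical except for the Reason filter, so they can also be produced in a single loop; a compact equivalent sketch:
# one daily-count plot per reason, avoiding the copy-pasted cells above
for reason in ['Traffic', 'Fire', 'EMS']:
    df[df['Reason'] == reason].groupby('Date').count()['twp'].plot()
    plt.title(reason)
    plt.tight_layout()
    plt.show()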
Creating heatmaps with seaborn and our data. Restructuring the dataframe so that the columns become the Hours and the Index becomes the Day of the Week.
dayHour=df.groupby(by=['DayOfWeek','Hour']).count()['Reason'].unstack()
dayHour.head()
| DayOfWeek / Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |
5 rows × 24 columns
Heat Map using this new DataFrame.
plt.figure(figsize=(12,6))
sns.heatmap(dayHour,cmap='viridis');
Cluster Map based on day
plt.figure(figsize=(12,6));
sns.clustermap(dayHour,cmap='viridis');
<Figure size 864x432 with 0 Axes>
We can see from the heatmap that we get the most calls on Friday and Wednesday between 16:00 and 17:00.
More calls come in during the evening.
Very few calls come in during the night.
We also get fewer 911 calls on weekends.
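Rather than reading the busiest slot off the heatmap by eye, it can be confirmed numerically; a short sketch using the dayHour frame built above:
# (day, hour) combination with the most calls, and its count
busiest_slot = dayHour.stack().idxmax()
print(busiest_slot, dayHour.stack().max())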
Heat Map Based on Month
dayMonth=df.groupby(by=['DayOfWeek','Month']).count()['Reason'].unstack()
plt.figure(figsize=(10,5))
sns.heatmap(dayMonth, cmap='viridis');
Cluster Map Based on Month
plt.figure(figsize=(10,5));
sns.clustermap(dayMonth,cmap='coolwarm');
<Figure size 720x360 with 0 Axes>
We have the highest number of calls on Saturdays in the month of January.
CONCLUSION¶
From the above results, we can draw the following conclusions. While this analysis had three major parts, the overall goal was to better analyze and predict 911 calls through spatial analysis. The hotspot analysis was done to determine areas with high and low concentrations of calls, identifying clusters of points with values higher in magnitude than you would expect to find by random chance.
This is particularly important because if responders can determine which areas have higher or lower call volumes, they can direct resources towards responding to emergencies more effectively.
The data also provides valuable insight into the types of calls in each area. Acting on that is beyond the scope of this analysis, but it is a potential application of work like this.
This analysis provided some insight into how spatial analysis is applied to a dataset. It attempted to determine hot spots of 911 calls and then interpret which factors influence them.
The major goals of this work were successfully accomplished as far as the results and outputs go.
FUTURE WORK¶
A possible extension to this project would be a user interface that could ingest 911 transcripts, print statistics based on the data history, and tag several similar past cases as references for assisting with the current scenario.
With enough computing power, and with model accuracy gradually improving as new cases are added to the training data, we can imagine a system that predicts live cases and routes them to the appropriate department automatically, without the help of an operator.
A major limitation of this project was the spatial and temporal extent of the data. More data over a larger area could possibly lead to different results. Also, because the dataset was relatively small, applying the model to other areas may not produce a valid and acceptable result.
It would also be useful to identify repeated reports of the same incident; a fire, for example, may be reported multiple times. This could in turn help identify fake calls.
ACKNOWLEDGEMENTS¶
I would like to thank Dexlab Analytics for guiding me through each milestone of the project and providing constant feedback and suggestions, which helped me understand the data much better and immensely helped me work towards creating a custom but fairly accurate analysis report.
I would also like to take this opportunity to thank Niharika Rai ma'am for her encouragement, counsel and instruction through each milestone and throughout the preparation, which was an integral part of the project.