Analyzing 911 data¶
By: Neeraj Singh Rawat¶
Quick Introduction to this Dataset¶
Simple but informative -- that's the goal of this dataset.
Introduction : 911¶
Created by Congress in 2004 as the 911 Implementation and Coordination Office (ICO), the National 911 Program is housed within the National Highway Traffic Safety Administration at the U.S. Department of Transportation and is a joint program with the National Telecommunications and Information Administration in the Department of Commerce.
The emergency call system serves as a way to give the public easy and fast access to a public safety answering point. The number 911 was chosen because it is easy to remember, had not previously been used as an office, area, or service code, and fit the switching configuration plans of the telephone industry.
The main purpose of the 911 service is to reach the emergency departments (Fire, EMS, Police and Traffic) and route each case to the specific department that needs to act urgently. In modern times the significance of IT services keeps growing: the volume of data increases at an exponential rate, and so does the number of emergency calls. Processing this much data is not something a human analyst can do alone, which is where machine learning and data analysis models come in; they can find patterns and relations in previous years' data and help predict when certain events are likely to occur, so that some of them can be prevented.
This paper tries to build a model that can extract the location information and assess the importance of newly available transcripts from the continuation records. A side analysis identifying the hot-spot areas that use the 911 service most heavily is also documented.
Here I will try to demonstrate the following:
- Loading the data
- Pivot table
- Various graphs
- Percent change
- Seaborn heatmap
- Import numpy and pandas
- Import visualization libraries and set %matplotlib inline
import pandas as pd #pandas
import numpy as np #NumPy
import datetime
import warnings
warnings.filterwarnings("ignore") #ignore non-fatal warnings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Working with the Data¶
- Read in the csv file as a dataframe called df
df = pd.read_csv('911.csv') #911.csv is file name
df.head() # top 5 rows of data
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |
df.tail() # bottom 5 rows of data
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 99487 | 40.132869 | -75.333515 | MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... | 19401.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:06:00 | NORRISTOWN | MARKLEY ST & W LOGAN ST | 1 |
| 99488 | 40.006974 | -75.289080 | LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... | 19003.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:07:02 | LOWER MERION | LANCASTER AVE & RITTENHOUSE PL | 1 |
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 |
Sneak peek into the dataset¶
So in the dataset we have the location of the call, the reason, the address, etc.
df.info()
print('\n\n\n')
print(df.describe()) # The describe() function in pandas is very handy in getting various summary statistics.
# This function returns the count, mean, standard deviation, minimum and maximum values and
# the quantiles of the data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   lat        99492 non-null  float64
 1   lng        99492 non-null  float64
 2   desc       99492 non-null  object
 3   zip        86637 non-null  float64
 4   title      99492 non-null  object
 5   timeStamp  99492 non-null  object
 6   twp        99449 non-null  object
 7   addr       98973 non-null  object
 8   e          99492 non-null  int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB

                lat           lng           zip        e
count  99492.000000  99492.000000  86637.000000  99492.0
mean      40.159526    -75.317464  19237.658298      1.0
std        0.094446      0.174826    345.344914      0.0
min       30.333596    -95.595595  17752.000000      1.0
25%       40.100423    -75.392104  19038.000000      1.0
50%       40.145223    -75.304667  19401.000000      1.0
75%       40.229008    -75.212513  19446.000000      1.0
max       41.167156    -74.995041  77316.000000      1.0
Summary of Dataset¶
print('Rows :',df.shape[0])
print('Columns :',df.shape[1])
print('\nFeatures :\n :',df.columns.tolist())
print('\nMissing values :',df.isnull().values.sum())
print('\nUnique values : \n',df.nunique())
Rows : 99492
Columns : 9

Features :
 : ['lat', 'lng', 'desc', 'zip', 'title', 'timeStamp', 'twp', 'addr', 'e']

Missing values : 13417

Unique values :
 lat          14579
lng          14586
desc         99455
zip            104
title          110
timeStamp    72577
twp             68
addr         21914
e                1
dtype: int64
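Most of the 13,417 missing values sit in the zip column (the rest are in twp and addr, as the df.info() output above shows); a small sketch using the same df makes the per-column breakdown explicit:
# count missing values per column; only zip, twp and addr have gaps
print(df.isnull().sum())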
Attributes¶
df['twp'].values
array(['NEW HANOVER', 'HATFIELD TOWNSHIP', 'NORRISTOWN', ..., 'NORRISTOWN', 'HORSHAM', 'UPPER GWYNEDD'], dtype=object)
df.index
RangeIndex(start=0, stop=99492, step=1)
df['lat'].dtype
dtype('float64')
All good data analysis projects begin with trying to answer questions. Now that we know what columns we have, let's think of some questions or insights we would like to obtain from the data. Here's a list of questions we'll try to answer using our data analysis skills!¶
First some basic questions:
- Where do most of the calls come from?
- Which townships (twp) have the most calls?
- How many unique titles are there?
- What is the most common reason for a call?
- lat: Float variable, Latitude
- lng: Float variable, Longitude
- desc: String variable, Description of the Emergency Call
- zip: Float variable, Zip code (read as float because of missing values)
- title: String variable, Title
- timeStamp: String variable, YYYY-MM-DD HH:MM:SS
- twp: String variable, Township
- addr: String variable, Address
- e: Integer variable, Dummy variable (always 1)
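Because NaN cannot live in a plain integer column, zip codes display as 19401.0 rather than 19401. If cleaner values are wanted, an optional sketch (assuming a pandas version that supports the nullable Int64 dtype; the rest of this analysis keeps zip as a float) is:
# convert zip to a nullable integer type; missing zips become <NA> instead of NaN
zip_codes = df['zip'].astype('Int64')
zip_codes.head()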
df['zip'].value_counts().head().plot.bar();
plt.xlabel('Zip Code') # Top 5 zip
plt.ylabel('Count')
plt.show()
###################
df['zip'].value_counts().head()
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64
The most calls come from zip code 19401, which corresponds to Norristown, Pennsylvania, United States.
df['twp'].value_counts().head(5).plot.bar();
plt.xlabel('Township') # top 5 townships
plt.ylabel('Count')
plt.show()
################
df['twp'].value_counts().head()
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64
Lower Merion township has the highest number of calls.
df['title'].nunique() # Number of unique title
# OR we can use
#len(df['title'].unique())
110
Creating a column with the reason: The title column contains the general reason for the call together with a more detailed description. There are three basic categories of call: EMS, Fire and Traffic.
x=df['title'].iloc[0]
x.split(':')[0]
'EMS'
df['Reason']=df['title'].apply(lambda x:x.split(':')[0])
df['Reason'].unique()
array(['EMS', 'Fire', 'Traffic'], dtype=object)
With the above transformation we have created a Reason column with the values EMS, Fire and Traffic.
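The same column can also be built without apply, using pandas' vectorized string methods; a minimal equivalent sketch:
# vectorized alternative: split each title on ':' and keep the part before it
df['Reason'] = df['title'].str.split(':').str[0]
df['Reason'].unique()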
# reason for most calls
f,ax=plt.subplots(1,2,figsize=(18,8))
df['Reason'].value_counts().plot.pie(explode=[0,0.1,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Reason for Call')
ax[0].set_ylabel('Count')
sns.countplot(x='Reason',data=df,ax=ax[1],order=df['Reason'].value_counts().index)
ax[1].set_title('Count of Reason')
plt.show()
49.1% of calls are for a medical emergency (EMS), followed by Traffic (35.9%) and Fire (15.0%).
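These percentages can be read off the pie chart, or computed exactly; a short sketch using value_counts with normalize=True:
# share of calls per reason, as percentages
reason_pct = df['Reason'].value_counts(normalize=True) * 100
print(reason_pct.round(1))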
Working with Time Data¶
The time data is stored as strings. We need to convert it to datetime format.
df['timeStamp']=pd.to_datetime(df['timeStamp'])
type(df['timeStamp'].iloc[0])
pandas._libs.tslibs.timestamps.Timestamp
time=df['timeStamp'].iloc[0]
time.hour, time.year, time.month, time.dayofweek
(17, 2015, 12, 3)
df['Hour']=df['timeStamp'].apply(lambda x:x.hour)
df['Month']=df['timeStamp'].apply(lambda x:x.month)
df['DayOfWeek']=df['timeStamp'].apply(lambda x:x.dayofweek)
dmap={0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
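The same three columns can also be derived with the .dt accessor instead of apply, which is the more idiomatic (and usually faster) route; a minimal equivalent sketch:
# vectorized datetime attribute access instead of apply(lambda ...)
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['DayOfWeek'] = df['timeStamp'].dt.dayofweek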
Calls Per Month
byMonth=df.groupby('Month').count()
byMonth['lat'].plot();
sns.lmplot(x='Month',y='twp',data=byMonth.reset_index());
mmap={1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
df['Month']=df['Month'].map(mmap)
df['DayOfWeek']=df['DayOfWeek'].map(dmap)
df.head()
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | Dec | Thu |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | Dec | Thu |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | Dec | Thu |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 | EMS | 17 | Dec | Thu |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 | EMS | 17 | Dec | Thu |
Calls during the Week
sns.set_style('darkgrid')
f,ax=plt.subplots(1,2,figsize=(18,8))
k1=sns.countplot(x='DayOfWeek',data=df,ax=ax[0],palette='viridis')
k2=sns.countplot(x='DayOfWeek',data=df,hue='Reason',ax=ax[1],palette='viridis')
More emergency calls happen on Friday, and EMS calls are the most common on every day of the week.
Calls during the Month
Creating a groupby object called byMonth by grouping the DataFrame by the Month column and using the count() method for aggregation, then calling head() on the result.
byMonth.head()
| Month | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13205 | 13205 | 13205 | 11527 | 13205 | 13205 | 13203 | 13096 | 13205 | 13205 | 13205 | 13205 |
| 2 | 11467 | 11467 | 11467 | 9930 | 11467 | 11467 | 11465 | 11396 | 11467 | 11467 | 11467 | 11467 |
| 3 | 11101 | 11101 | 11101 | 9755 | 11101 | 11101 | 11092 | 11059 | 11101 | 11101 | 11101 | 11101 |
| 4 | 11326 | 11326 | 11326 | 9895 | 11326 | 11326 | 11323 | 11283 | 11326 | 11326 | 11326 | 11326 |
| 5 | 11423 | 11423 | 11423 | 9946 | 11423 | 11423 | 11420 | 11378 | 11423 | 11423 | 11423 | 11423 |
A simple plot of the DataFrame indicating the count of calls per month.
byMonth['twp'].plot()
<AxesSubplot:xlabel='Month'>
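The introduction also promised a look at percent change; a minimal sketch computing the month-over-month change in call volume from the byMonth counts built above (using the e column, which is 1 for every call, so its count is the number of calls):
# month-over-month percent change in the number of calls
monthly_calls = byMonth['e']
print((monthly_calls.pct_change() * 100).round(1))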
- Creating a new column called 'Date' that contains the date from the timeStamp column.
- Groupby Date column with the count() aggregate and create a plot of counts of 911 calls.
df['Date']=df['timeStamp'].apply(lambda x: x.date())
df.groupby('Date').count()['twp'].plot()
plt.tight_layout()
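An equivalent way to get daily counts, without creating a Date column first, is to resample directly on the timestamp index; a minimal alternative sketch:
# set the timestamp as the index and count calls per calendar day
daily = df.set_index('timeStamp').resample('D')['e'].count()
daily.plot()
plt.tight_layout()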
- Recreate the plot but create 3 separate plots with each plot representing a Reason for the 911 call.
df[df['Reason']=='Traffic'].groupby('Date').count()['twp'].plot()
plt.title('Traffic')
plt.tight_layout()
df[df['Reason']=='Fire'].groupby('Date').count()['twp'].plot()
plt.title('Fire')
plt.tight_layout()
df[df['Reason']=='EMS'].groupby('Date').count()['twp'].plot()
plt.title('EMS')
plt.tight_layout()
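The three plots above are identical except for the Reason filter, so they can also be produced in a single loop; a compact equivalent sketch:
# one daily-count plot per reason, avoiding the copy-pasted cells above
for reason in ['Traffic', 'Fire', 'EMS']:
    df[df['Reason'] == reason].groupby('Date').count()['twp'].plot()
    plt.title(reason)
    plt.tight_layout()
    plt.show()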
Creating heatmaps with seaborn and our data. Restructuring the dataframe so that the columns become the Hours and the Index becomes the Day of the Week.
dayHour=df.groupby(by=['DayOfWeek','Hour']).count()['Reason'].unstack()
dayHour.head()
| DayOfWeek / Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |
5 rows × 24 columns
Heat Map using this new DataFrame.
plt.figure(figsize=(12,6))
sns.heatmap(dayHour,cmap='viridis');
Cluster Map based on day
plt.figure(figsize=(12,6));
sns.clustermap(dayHour,cmap='viridis');
<Figure size 864x432 with 0 Axes>
We can see from the heatmap that we get the most calls on Friday and Wednesday between 16:00 and 17:00.
More calls come in during the evening.
Very few calls come in during the night.
We also get fewer 911 calls on weekends.
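Rather than reading the busiest slot off the heatmap by eye, it can be confirmed numerically; a short sketch using the dayHour frame built above:
# (day, hour) combination with the most calls, and its count
busiest_slot = dayHour.stack().idxmax()
print(busiest_slot, dayHour.stack().max())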
Heat Map Based on Month
dayMonth=df.groupby(by=['DayOfWeek','Month']).count()['Reason'].unstack()
plt.figure(figsize=(10,5))
sns.heatmap(dayMonth, cmap='viridis');
Cluster Map Based on Month
plt.figure(figsize=(10,5));
sns.clustermap(dayMonth,cmap='coolwarm');
<Figure size 720x360 with 0 Axes>
We have the highest number of calls on Saturdays in the month of January.
CONCLUSION¶
From the above results, we can draw the following conclusions. While this analysis had three major parts, the overall goal was to better analyze and predict 911 calls through spatial analysis. The hotspot analysis was done to determine areas with high and low concentrations of calls, identifying clusters of points with values higher in magnitude than you would expect to find by random chance.
This is particularly important because if responders can determine which areas have higher or lower call volumes, they can direct resources towards responding to emergencies more effectively.
The data also provides valuable insight into the types of calls in each area. Acting on that is beyond the scope of this analysis, but it is a potential application of work like this.
This analysis provided some insight into how spatial analysis is applied to a dataset. It attempted to determine hot spots of 911 calls and then interpret which factors influence them.
The major goals of this work were successfully accomplished as far as the results and outputs go.
FUTURE WORK¶
A possible extension to this project would be a user interface that could ingest 911 transcripts, print statistics based on the data history, and tag several similar past cases as references for assisting with the current scenario.
With enough computing power, and with model accuracy gradually improving as new cases are added to the training data, we can imagine a system that predicts live cases and routes them to the appropriate department automatically, without the help of an operator.
A major limitation of this project was the spatial and temporal extent of the data. More data over a larger area could possibly lead to different results. Also, because the dataset was relatively small, applying the model to other areas may not produce a valid and acceptable result.
It would also be useful to identify repeated reports of the same incident; a fire, for example, may be reported multiple times. This could in turn help identify fake calls.
ACKNOWLEDGEMENTS¶
I would like to thank Dexlab Analytics for guiding me through each milestone of the project and providing constant feedback and suggestions, which helped me understand the data much better and immensely helped me work towards creating a custom but fairly accurate analysis report.
I would also like to take this opportunity to thank Niharika Rai ma'am for her encouragement, counsel and instruction through each milestone and throughout the preparation, which was an integral part of the project.