18.104.22.168 Lab – San Francisco Crime (Instructor Version)
Demonstrate your knowledge of the Data Analysis Lifecycle using a given data set and two tools: Python and Jupyter Notebook.
- Part 1: Import the Python Packages
- Part 2: Load the Data
- Part 3: Prepare the Data
- Part 4: Analyze the Data
- Part 5: Visualize the Data
Background / Scenario
In this lab, you will import some Python packages required to analyze a data set containing San Francisco crime information. You will then use Python and Jupyter Notebook to prepare this data for analysis, analyze it, graph it, and communicate your findings.
Required Resources
- 1 PC with Internet access
- Raspberry Pi version 2 or higher
- Python libraries: pandas, numpy, matplotlib, folium, datetime, and csv
- Datafiles: Map-Crime_Incidents-Previous_Three_Months.csv
Part 1: Import the Python Packages
In this part, you will import the following Python packages necessary for the rest of this lab.
NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful N-dimensional array object and sophisticated (broadcasting) functions.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Folium is a library for creating interactive maps.
# Code cell 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
Part 2: Load the Data
In this part, you will load the San Francisco Crime Dataset and the Python packages necessary to analyze and visualize it.
Step 1: Load the San Francisco Crime data into a data frame.
In this step, you will import the San Francisco crime data from a comma separated values (csv) file into a data frame.
# code cell 2
# This should be a local path
dataset_path = './Data/Map-Crime_Incidents-Previous_Three_Months.csv'
# read the original dataset (in comma separated values format) into a DataFrame
SF = pd.read_csv(dataset_path)
To view the first five lines of the csv file, the Linux command head is used.
# code cell 3
!head -n 5 ./Data/Map-Crime_Incidents-Previous_Three_Months.csv
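The head command depends on a Unix-like shell. As a rough alternative, the first few records can also be previewed from Python with the nrows parameter of pd.read_csv. The sketch below uses a small in-memory CSV whose rows are hypothetical stand-ins for the lab file; in the lab itself, dataset_path would be passed instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the lab's data file.
csv_text = (
    "Category,PdDistrict,X,Y\n"
    "BURGLARY,MISSION,-122.42,37.76\n"
    "ASSAULT,SOUTHERN,-122.41,37.78\n"
    "VANDALISM,BAYVIEW,-122.39,37.73\n"
)

# nrows limits how many data rows are parsed, which keeps the preview cheap.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)
```

Unlike head, this preview shows parsed rows rather than raw text, so it also reveals how pandas interprets each column.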
Step 2: View the imported data.
a) By typing the name of the data frame variable into a cell, you can visualize the top and bottom rows in a structured way.
# Code cell 4
pd.set_option('display.max_rows', 10)  # Visualize 10 rows
SF
b) Use the columns attribute to view the names of the variables in the DataFrame.
# Code cell 5
SF.columns
How many variables are contained in the SF data frame (ignore the Index)?
c) Use the function len to determine the number of rows in the dataset.
# Code cell 6
len(SF)
Part 3: Prepare the Data
Now that you have the data loaded into the work environment and determined the analysis you want to perform, it is time to prepare the data for analysis.
Step 1: Extract the month and day from the Date field.
lambda is a Python keyword used to define so-called anonymous functions. It allows you to specify a function in one line of code, without using def and without giving the function a name. The syntax for a lambda expression is:
lambda parameters: expression
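As a minimal illustration of this syntax, the sketch below defines the same behavior twice: once with def and once with lambda. The function names and the sample date string are hypothetical and are not part of the lab data.

```python
# A named function defined with def.
def month_from_date(date_string):
    return int(date_string[0:2])


# An equivalent anonymous function; it is bound to a name here only so that
# it can be called on the next lines.
month_from_date_lambda = lambda date_string: int(date_string[0:2])

sample_date = "08/31/2014"  # hypothetical value in MM/DD/YYYY form
print(month_from_date(sample_date))
print(month_from_date_lambda(sample_date))
```

Both calls return the same integer. The lambda form is mainly convenient when the function is passed directly to another function, such as apply.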
In the following code, lambda creates an inline function that selects only the month digits from the Date variable, and int converts that string into an integer. The pandas function apply then applies this function to the entire column (in practice, apply implicitly runs a for loop and passes the values of the column to the lambda function one at a time). The same procedure is used for the day.
# Code cell 7
SF['Month'] = SF['Date'].apply(lambda row: int(row[0:2]))
SF['Day'] = SF['Date'].apply(lambda row: int(row[3:5]))
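The same columns could also be derived without string slicing. The sketch below parses the dates with pd.to_datetime and reads the fields through the .dt accessor. It assumes the Date strings are exactly in MM/DD/YYYY form (the format string would need adjusting otherwise) and uses a hypothetical three-row sample instead of the lab file.

```python
import pandas as pd

# Hypothetical sample rows; the real lab reads them from the CSV file.
sample = pd.DataFrame({"Date": ["08/31/2014", "07/04/2014", "06/15/2014"]})

# Parse the strings once, then read the calendar fields from the .dt accessor.
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
sample["Month"] = parsed.dt.month
sample["Day"] = parsed.dt.day

print(sample)
```

Parsing the dates has the side benefit of raising an error when a value does not match the expected format, whereas slicing would silently pick up whatever characters sit at those positions.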
To verify that these two variables were added to the SF data frame, use the print function to print some values from these columns, and use type on a single element to check that the new columns do contain numerical values.
# Code cell 8
print(SF['Month'][0:2])
print(SF['Day'][0:2])
# Code cell 9
# Inspect one element; type(SF['Month']) would only report that the column is a pandas Series
print(type(SF['Month'][0]))
Step 2: Remove variables from the SF data frame.
a) The column IncidntNum contains many cells with NaN, which means the data is missing. Furthermore, IncidntNum does not add any value to the analysis, so the column can be dropped from the data frame. One way to remove unwanted variables from a data frame is the del statement.
# Code cell 10
del SF['IncidntNum']
b) Similarly, the Location attribute will not be used in this analysis. It can be dropped from the data frame.
Alternatively, you can use the drop function on the data frame, specifying axis=1 to drop a column (axis=0 drops rows) and inplace=True so that the data frame is modified directly and the result does not need to be assigned to another variable.
# Code cell 11
SF.drop('Location', axis=1, inplace=True)
c) Check that the columns have been removed.
# Code cell 12
SF.columns
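For comparison, drop can also remove several columns in one call and return a new data frame instead of modifying the original. The sketch below assumes a hypothetical two-row miniature of the SF data frame; the columns parameter is an alternative to passing axis=1.

```python
import pandas as pd

# Hypothetical miniature of the SF data frame.
sample = pd.DataFrame({
    "IncidntNum": [1.0, None],
    "Category": ["BURGLARY", "ASSAULT"],
    "Location": ["(37.77, -122.42)", "(37.76, -122.41)"],
})

# Without inplace=True, drop returns a new DataFrame and leaves sample unchanged.
trimmed = sample.drop(columns=["IncidntNum", "Location"])

print(list(trimmed.columns))
print(list(sample.columns))
```

Keeping the original data frame intact can be useful in a notebook, because re-running the cell does not fail on columns that were already removed.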
Part 4: Analyze the Data
Now that the data frame has been prepared with the data, it is time to analyze the data.
Step 1: Summarize variables to obtain statistical information.
a) Use the function value_counts to summarize the number of crimes committed by type, then print to display the contents of the CountCategory variable.
# Code cell 13
CountCategory = SF['Category'].value_counts()
print(CountCategory)
b) By default, the counts are ordered in descending order. The value of the optional parameter ascending can be set to True to reverse this behavior.
# Code cell 14
SF['Category'].value_counts(ascending=True)
What type of crime was committed the most?
c) By nesting the two functions into one command, you can accomplish the same result with one line of code.
# Code cell 15
print(SF['Category'].value_counts(ascending=True))
Challenge Question: Which PdDistrict had the most incidents of reported crime? Provide the Python command(s) used to support your answer.
# code cell 16
# Possible code for the challenge question
print(SF['PdDistrict'].value_counts(ascending=True))
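Rather than reading the answer off the printed table, the top entry can also be extracted programmatically. The sketch below is one possible approach using idxmax and max on the result of value_counts; the district values in the sample are hypothetical and do not reflect the real counts.

```python
import pandas as pd

# Hypothetical sample; the real lab would use SF['PdDistrict'].
sample = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN",
                   "NORTHERN", "SOUTHERN", "MISSION"],
})

counts = sample["PdDistrict"].value_counts()
top_district = counts.idxmax()  # index label with the highest count
top_count = counts.max()        # the count itself

print(top_district, top_count)
```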
Step 2: Subset the data into smaller data frames.
a) Logical indexing can be used to select only the rows for which a given condition is satisfied. For example, the following code extracts only the crimes committed in August, and stores the result in a new DataFrame.
# Code cell 17
AugustCrimes = SF[SF['Month'] == 8]
AugustCrimes
How many crime incidents were there for the month of August?
How many burglaries were reported in the month of August?
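# code cell 18
# Possible code for the question: How many burglaries were reported in the month of August?
AugustCrimes = SF[SF['Month'] == 8]
# Filter the August subset (not the full SF data frame) so that only August burglaries are counted
AugustCrimesB = AugustCrimes[AugustCrimes['Category'] == 'BURGLARY']
len(AugustCrimesB)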
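Two conditions can also be combined in a single logical index with the & operator. The sketch below shows this on a hypothetical four-row sample; each condition needs its own parentheses because & binds more tightly than ==.

```python
import pandas as pd

# Hypothetical sample standing in for the SF data frame.
sample = pd.DataFrame({
    "Month": [8, 8, 7, 8],
    "Category": ["BURGLARY", "ASSAULT", "BURGLARY", "BURGLARY"],
})

# Both conditions must hold for a row to be selected.
august_burglaries = sample[(sample["Month"] == 8) & (sample["Category"] == "BURGLARY")]

print(len(august_burglaries))
```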
b) To create a subset of the SF data frame for a specific day, use the query function with the and operator to compare Month and Day at the same time.
# Code cell 19
Crime0704 = SF.query('Month == 7 and Day == 4')
Crime0704
# Code cell 20
SF.columns
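A query string can also refer to Python variables by prefixing them with @, which avoids hard-coding the date. The sketch below wraps this in a hypothetical helper, crimes_on, and checks that it selects the same rows as logical indexing; the sample rows are made up for illustration.

```python
import pandas as pd


def crimes_on(frame, month, day):
    """Return the rows of frame whose Month and Day match the arguments."""
    # @month and @day refer to the local Python variables of this function.
    return frame.query("Month == @month and Day == @day")


# Hypothetical sample standing in for the SF data frame.
sample = pd.DataFrame({
    "Month": [7, 7, 8],
    "Day": [4, 5, 4],
    "Category": ["ASSAULT", "BURGLARY", "ASSAULT"],
})

by_query = crimes_on(sample, 7, 4)
by_mask = sample[(sample["Month"] == 7) & (sample["Day"] == 4)]

print(by_query.equals(by_mask))
```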
Part 5: Visualize the Data
Visualizing and presenting the data provides an overview that might not be apparent from the raw data alone. The SF data frame contains longitude and latitude coordinates that can be used to plot the data.
Step 1: Plot a graph of the SF data frame using the X and Y variables.
a) Use the plot() function to plot the SF data frame. Pass the optional format string 'ro' to draw the points in red (r) with a circle marker (o).
# Code cell 21
plt.plot(SF['X'], SF['Y'], 'ro')
plt.show()
b) Identify the police department districts, then build the dictionary pd_districts_levels to associate each district name with an integer.
# Code cell 22
pd_districts = np.unique(SF['PdDistrict'])
pd_districts_levels = dict(zip(pd_districts, range(len(pd_districts))))
pd_districts_levels
c) Use apply and lambda to add the police department integer ID to a new column of the DataFrame.
# Code cell 23
SF['PdDistrictCode'] = SF['PdDistrict'].apply(lambda row: pd_districts_levels[row])
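The dictionary lookup can also be written without lambda. The sketch below shows two possible alternatives on a hypothetical sample: Series.map with the dictionary, and pd.factorize with sort=True, which assigns codes in sorted order of the labels and should therefore agree with the np.unique-based dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the SF data frame.
sample = pd.DataFrame({"PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN", "BAYVIEW"]})

# Same dictionary construction as in the lab: sorted unique names -> 0, 1, 2, ...
districts = np.unique(sample["PdDistrict"])
levels = dict(zip(districts, range(len(districts))))

# Alternative 1: map looks up every value of the column in the dictionary.
sample["PdDistrictCode"] = sample["PdDistrict"].map(levels)

# Alternative 2: factorize builds the codes and the list of labels in one step.
codes, uniques = pd.factorize(sample["PdDistrict"], sort=True)

print(sample["PdDistrictCode"].tolist())
print(codes.tolist())
```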
d) Use the newly created PdDistrictCode column to automatically change the color of each point.
# Code cell 24
plt.scatter(SF['X'], SF['Y'], c=SF['PdDistrictCode'])
plt.show()
Step 2: Add Map packages to enhance the plot.
In Step 1, you created a simple plot that displays where crime incidents took place in SF County. This plot is useful, but folium provides additional functions that allow you to overlay this plot onto an OpenStreetMap map.
a) Folium requires the color of the marker to be specified as a hexadecimal value. For this reason, we use the colors package from matplotlib and select the necessary colors.
# Code cell 25
from matplotlib import colors
districts = np.unique(SF['PdDistrict'])
print(list(colors.cnames.values())[0:len(districts)])
b) Create a color dictionary for each police department district.
# Code cell 26
color_dict = dict(zip(districts, list(colors.cnames.values())[0:-1:len(districts)]))
color_dict
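The slice with a step in the cell above picks colors spread across the table of named colors instead of the first few entries, which tend to be similar because the names are sorted alphabetically. The sketch below illustrates the same idea under stated assumptions: the three district names are a hypothetical subset, and the stride is computed from the table size rather than taken from the lab code.

```python
from matplotlib import colors

# Hypothetical subset of district names.
districts = ["BAYVIEW", "MISSION", "SOUTHERN"]

# colors.cnames maps color names to hexadecimal strings such as '#F0F8FF'.
hex_values = list(colors.cnames.values())

# Step through the table so the chosen colors are far apart in the list.
stride = len(hex_values) // len(districts)
spread_colors = hex_values[::stride][:len(districts)]

color_dict = dict(zip(districts, spread_colors))
print(color_dict)
```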
c) Create the map, using the mean of the coordinates in the SF data to center it. To reduce the computation time, plotEvery limits the amount of plotted data: only every plotEvery-th row is drawn. Set this value to 1 to plot all the rows (the map might then take a long time to render).
# Code cell 27
# Create map
map_osm = folium.Map(location=[SF['Y'].mean(), SF['X'].mean()], zoom_start=12)
plotEvery = 50
obs = list(zip(SF['Y'], SF['X'], SF['PdDistrict']))
for el in obs[0:-1:plotEvery]:
    # el is (latitude, longitude, district); the district name selects the color
    folium.CircleMarker(el[0:2], color=color_dict[el[2]], fill_color=color_dict[el[2]], radius=10).add_to(map_osm)
# Code cell 28
map_osm