2.5.1.4 Lab – Internet Meter Data Analysis (Instructor Version)
Objectives
- **Part 1: Collect and Store Data**
- **Part 2: Manipulate Data**
Scenario/Background
In this lab, you will acquire Internet speed statistics and store the live data in a comma separated values (csv) file. You will also load the stored data from the csv file to a Python data structure, the Pandas DataFrame, and use its functionalities to explore the data and manipulate it so that it is easily readable.
Required Resources
- 1 PC with Internet access
- Raspberry Pi version 2 or higher
- Python libraries: datetime, csv, subprocess, pandas, numpy
- Datafiles: rpi_data_long.csv
Part 1: Collect and Store Data
The goal of this first part of the lab is to gather internet speed measurements through the Raspberry Pi. Three kinds of measurements will be collected:
- Ping speed
- Download speed
- Upload speed
Step 1: Install Speedtest and Import Python Libraries.
In this step, you will install Speedtest and import Python libraries.
Speedtest-cli is a Python script that measures the upload and download speed of your Internet connection. For more information about speedtest, go to https://github.com/sivel/speedtest-cli.
a) Install speedtest-cli.
# Code cell 1 !pip install speedtest-cli
This cli allows the Jupyter notebook to connect to the website and store the data.
b) Import the necessary Python libraries.
# Code cell 2 # Python library to manage date and time data import datetime # Python library to read and write csv files import csv # Python library to execute bash commands from the notebook. # If you want to know more about this, check this resource: # http://www.pythonforbeginners.com/os/subprocess-for-system-administrators import subprocess
Step 2: Generate timestamps using the datetime package.
In this lab, measurements of Internet speed statistics will be generated. A crucial step in data acquisition for the majority of data analytics applications, is to associate a timestamp to measurements.
a) To generate a timestamp, use the datetime.now function of the datetime package:
# Code cell 3 date_time = datetime.datetime.now() print(date_time, type(date_time))
b) An instance of the class datetime cannot be directly written to in text form. The function strftime parses the date information into a string. The arguments of this function determine the format of the output sting. A description of these parameters can be found in the documentation of the strftime function at https://docs.python.org/2/library/time.html.
# Code cell 4 date_time.strftime('%a, %d %b %Y %H:%M:%S')
After reading the documentation of the strftime function, generate a timestamp and parse it into a string with the following format: YYYY-MM-DD HH:MM:SS.
# Code cell 5 # enter your code
Step 3: Run the process and collect the output with Python.
The speedtest-cli
command, if run from a terminal, returns a string with download and upload speeds. To run the command from this notebook, it is necessary to use the Python module subprocess, which allows running a process directly from the notebook code cell.
a) Run a speed test using the speedtest-cli
command from Python. The output will be stored in the process_output variable.
# Code cell 6 # This string contains the command line to interface with speedtest.net speedtest_cmd = "speedtest-cli --simple" # Execute the process process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE) # Collect the command output process_output = process.communicate()[0]
b) Print the process output. Notice the type for the process_output
variable.
# Code cell 7 print(process_output, type(process_output))
c) The speed test result is split, and a timestamp is appended to the results.
# Code cell 8 # Store the time at which the speedtest was executed date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") process_output = process_output.split() process_output.append(date_time) print(process_output, type(process_output))
d) The speedtest() function is created to return the results from the speedtest-cli command.
# Code cell 9 # function to excute the speed test def speedtest(): # We need to store the time at which the speedtest was executed date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") # This is a string that contains what we would write on the command line #to interface with speedtest.net speedtest_cmd = "speedtest-cli --simple" # We now execute the process: process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE) process_output = process.communicate()[0] process_output = process_output.split() # and we add the date and time process_output.append(date_time) return process_output
What does the speedtest() function return? What is the code to view results from the speedtest() function?
# Code cell 10 # Code to view the results from speedtest() function
Step 4: Store the output of the speedtest() function.
The comma separated values (csv) is the most common import and export format for spreadsheets and databases. To learn more information about working with csv in Python, navigate to https://docs.python.org/2/library/csv.html.
a) Create a file named test.txt in the /tmp directory and write “test_msg” in the file.
# Code cell 11 with open("/tmp/test.txt",'w') as f: f.write('test_msg')
b) Use the Linux command cat
to verify the creation and content of the file.
# Code cell 12 !cat /tmp/test.txt
c) To check that the file was successfully open:
# Code cell 13 with open("/tmp/test.txt",'r') as f: str = f.read() print(str)
d) Understanding the meaning of the with statement, especially in combination with try and except is not required for the rest of this lab, but a useful resource about this is http://effbot.org/zone/python-with-statement.htm.
To write into a csv file, it is necessary to create a csv.writer object. Check https://docs.python.org/2/library/csv.html and find out which function of the ‘csv.writer’ object can be used to add a row to a csv file.
# Code cell 14 # function to save data to csv def save_to_csv(data, filename): try: # If the file exists, we want to append a new line to it, with the #results of the current experiment with open(filename + '.csv', 'a') as f: wr = csv.writer(f) wr.writerow(data) except: # If it does not exist, create the file first with open(filename + '.csv', 'w') as f: # Hint: This is similar to appending new lines to a file. # Create a csv writer object # ADD CODE HERE # Save (write) to file # ADD CODE HERE
Step 5: Check the collected data.
Write a function to open a csv file and print its content to screen. You can find an example in the 13.1.5 section of https://docs.python.org/2/library/csv.html
# Code cell 15 def print_from_csv(filename): with open(filename + '.csv', 'r') as f: re = csv.reader(f) # 1. Loop over the rows # 2. print
Now, all the functions needed to collect and store Internet speed data are finished.
Step 6: Run the Speedtest multiple times and store the data.
a) Write a for loop that calls the speedtest 5 times, prints the output of the tests, and stores the data in a csv file.
# Code cell 16 for i in range(5): speedtest_output = speedtest() print('Test number {}'.format(i)) print(speedtest_output) save_to_csv(speedtest_output, '/tmp/rpi_data_test')
b) Display the file to verify that the data has been saved correctly.
# Code cell 17 print_from_csv('/tmp/rpi_data_test')
If a bigger dataset is needed, the speedtest can be running in the background for more samples.
How would you change the code if you wanted to run the code 100 times?
# Code cell 18 # Code to run 100 times # for i in xrange(100): # speedtest_output = speedtest() # print 'Test number: {}'.format(i) # print speedtest_output # save_to_csv(speedtest_output, '/tmp/rpi_data')
Part 2: Manipulate Data
The Python library pandas is very useful for working with structured data. The full documentation can be found here: http://pandas.pydata.org/pandas-docs/version/0.14.1/</font>
A larger dataset collected in advance will be used for this part of the lab. The filename is rpi_data_long.csv
.
Step 1: Import the Python libraries.
Import pandas and the other libraries used for the next tasks.
# Code cell 19 import datetime import csv import pandas as pd # NumPy is a library that adds support for large, multi-dimensional arrays and matrices # along with high-level mathematical functions to operate on these arrays import numpy as np
Step 2: Load the csv file into a DataFrame object using pandas.
A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. The pandas library function read_csv automatically converts a csv file into a DataFrame object.
Read the read_csv documentation in http://pandas.pydata.org/pandas-docs/version/0.14.1/generated/pandas.read_csv.html. This function contains a lot of parameters. The only non-optional one is the filepath, i.e. the location of the csv file. All the other parameters are optional.
In this step, you will import and view the content of the csv file, rpi_data_long.csv. This csv file is located in the same directory as this Jupyter notebook.
a) Assign the file rpi_data_long.csv to the variable data_file.
# Code cell 20 data_file = './Data/rpi_data_long.csv'
b) Use the Linux command head to view the first 10 lines of the csv file.
# Code cell 21 !head -n 5 ./Data/rpi_data_long.csv
c) Use the names parameter of the read_csv function to specify the name of the DataFrame columns.
# Code cell 22 column_names = [ 'Type A', 'Measure A', 'Units A', 'Type B', 'Measure B', 'Units B', 'Type C', 'Measure C', 'Units C', 'Datetime']
d) Use the read_csv function to read from data_file and assign column_names as the column names in the dataframe.
# Code cell 23 with open(data_file, 'r') as f: df_redundant = pd.read_csv(f, names = column_names)
e) The command head() displays the first few rows of the dataframe.
# Code cell 24 # You can specify the number of rows you want to print to screen: # you do so passing the number as an argument to the function # (e.g., head(10)) df_redundant.head()
What is the code to read the first 20 lines of the csv file?
Step 3: Create a concise representation.
In this step, you will create a more compact representation using a copy of the data frame df_redundant.
a) Copy df_redundant into another dataframe called df_compact using copy().
# Code cell 25 df_compact = df_redundant.copy()
b) Rename the columns relative to the measures as shown:
Measure A -> Ping (ms) Measure B -> Download (Mbit/s) Measure C -> Upload (Mbit/s)
# Code cell 26 df_compact.rename(columns={'Measure A':'Ping (ms)', 'Measure B': 'Download (Mbit/s)', 'Measure C': 'Upload (Mbit/s)'}, inplace=True) df_compact.head(3)
c) Because the Types and Units columns are no longer necessary, these columns can be dropped.
# Code cell 27 df_compact.drop(['Type A', 'Type B', 'Type C', 'Units A', 'Units B', 'Units C'], axis=1, inplace=True) df_compact.head()
In the table above, the Datetime field is a string. Pandas and Python offer a number of operations to work with date and time that can be very helpful.
In the next step, the string in the Datetime column will be separated into two new columns.
Step 4: Separate data into two columns.
In this step, you will use Pandas to generate the columns Date and Time from the column Datetime and then drop the Datetime column.
The lambda function is used create two anonymous functions that extract only the date and the time from a datetime object, respectively. Then, use the pandas function apply to apply this function to an entire column (in practice, apply implicitly defines a for loop and passes the rows one by one to our lambda function). Store the result of the apply functions in two new columns of the DataFrame.
a) Apply the lambda function to iterate through the data frame to split the date from the Datetime column.
# Code cell 28 df_compact['Date'] = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str).date())
b) Repeat step a to split time from Datetime column.
# Code cell 29 # Please note, this requires an intermediate step, because of how NaT are treated by the time() function. # Reference: https://github.com/pandas-dev/pandas/issues/11453 temp = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str)) df_compact['Time'] = temp.dt.time
c) All the information for the Datetime column has now been copied to the Date and Time columnns. The Datetime column serves no purpose. The Datetime column can be dropped from the data frame.
Enter the code to drop the Datetime column in the cell below.
# Code cell 30
Enter the code to print out the first 3 rows of the data frame to verify the changes.
# Code cell 31
d) Use the type function to print out the variable type of the values in the Date and Time columns.
# Code cell 32 print(df_compact['Date'][0], type(df_compact['Date'][0]) ) print(df_compact['Time'][0], type(df_compact['Time'][0]) )
Step 5: Save the new data frame.
Save the pandas dataframe df_compact as a csv file called rpi_data_compact:
# Code cell 33 df_compact.to_csv('./Data/rpi_data_compact.csv')