While working on the Statistics unit, you used Microsoft Excel's data analytics capabilities to analyze Southern Water Corp's Pump Data.
Now, Joanna Luez — Southern Water Corp's Lead Scientist — has requested that you convert your earlier analysis in Excel to Python Code. After all, with all the formulas in Excel, it can be tricky for others with less experience in Excel to follow.
Excel is an excellent tool for adhoc analysis, but Python is an invaluable tool thanks to its advanced data analysis capabilities that only take a few lines of code to complete.
Please note that this case study is composed of two parts — once you have completed part 1, which involves descriptive statistics, please submit your work and discuss it with your mentor before moving on to part 2.
Import the libraries you'll need for your analysis. You will need the following libraries:
Matplotlib - This is Python's basic plotting library. You'll use the pyplot and dates function collections from matplotlib throughout this case study so we encourage you to important these two specific libraries with their own aliases. Also, include the line '%matplotlib inline' so that your graphs are easily included in your notebook.
Don't forget that to modify the matplotlib plot sizes so they're at a comfortable reading size you should use the following:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (20,5)
Seaborn - This library will enable you to create aesthetically pleasing plots.
Pandas - This library will enable you to view and manipulate your data in a tabular format.
statsmodels.api - This library will enable you to create statistical models. You will need this library when performing regession analysis in Part 2 of this case study.
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as md
%matplotlib inline
mpl.rcParams['figure.figsize'] = (20,10)
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
The data you've received from Southern Water Corp has been split into two files. The first file, titled DF_Raw_Data contains all the 'raw' Pump Data you will need for your analysis. The second file, titled DF_Rolling_Stdev contains the Rolling Standard Deviation Data you will need for questions 10 Onwards.
We have deliberately set up the data in this manner so please ensure that when you need to perform the rolling standard deviation calculations, you use the DF_Rolling_Stdev.csv file.
i. Import each of the two data sources and store them into their individual dataframes. Suggested names: dataframe_raw & dataframe_stdev respectively. Don't forget to use the header argument to ensure your columns have meaningful names!
ii. Print descriptive statistics for each of the dataframes using .describe() and .info()
dataframe_raw = pd.read_csv('DF_Raw_Data.csv')
dataframe_stdev = pd.read_csv('DF_Rolling_Stdev.csv')
dataframe_raw.head()
Data Source | TIMEFRAME (DD/MM/YYYY) | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Raw | 9/12/2014 0:00 | 41.30 | 41.16 | 98 | 207 | 54 | 3.86 | 74.84 | 0 |
1 | Raw | 9/12/2014 0:01 | 42.40 | 41.39 | 92 | 212 | 46 | 3.71 | 75.25 | 0 |
2 | Raw | 9/12/2014 0:02 | 41.43 | 41.15 | 80 | 207 | 55 | 3.15 | 74.82 | 0 |
3 | Raw | 9/12/2014 0:03 | 42.21 | 40.93 | 83 | 190 | 49 | 3.00 | 74.42 | 0 |
4 | Raw | 9/12/2014 0:04 | 40.51 | 43.32 | 90 | 195 | 50 | 3.34 | 78.76 | 0 |
dataframe_raw.describe()
Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) | |
---|---|---|---|---|---|---|---|---|
count | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 |
mean | 41.802629 | 41.796702 | 90.796576 | 202.851610 | 50.226661 | 3.540897 | 76.015149 | 0.021199 |
std | 3.656576 | 3.654873 | 10.217885 | 22.683977 | 5.298203 | 0.579055 | 6.651633 | 0.144075 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 41.050000 | 41.000000 | 85.000000 | 191.000000 | 48.000000 | 3.220000 | 74.560000 | 0.000000 |
50% | 42.100000 | 42.140000 | 91.000000 | 203.000000 | 51.000000 | 3.480000 | 76.620000 | 0.000000 |
75% | 43.140000 | 43.130000 | 97.000000 | 215.000000 | 53.000000 | 3.780000 | 78.470000 | 0.000000 |
max | 45.900000 | 45.840000 | 124.000000 | 264.000000 | 65.000000 | 7.560000 | 83.450000 | 1.000000 |
dataframe_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2453 entries, 0 to 2452 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Data Source 2453 non-null object 1 TIMEFRAME (DD/MM/YYYY) 2453 non-null object 2 Volumetric Flow Meter 1 2453 non-null float64 3 Volumetric Flow Meter 2 2453 non-null float64 4 Pump Speed (RPM) 2453 non-null int64 5 Pump Torque 2453 non-null int64 6 Ambient Temperature 2453 non-null int64 7 Horse Power 2453 non-null float64 8 Pump Efficiency 2453 non-null float64 9 PUMP FAILURE (1 or 0) 2453 non-null int64 dtypes: float64(4), int64(4), object(2) memory usage: 191.8+ KB
dataframe_stdev.head()
Data Source | TIMEFRAME (DD/MM/YYYY) | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rolling Stdev (30 Minute) | 9/12/2014 0:00 | 1.04 | 0.96 | 5.54 | 11.70 | 3.40 | 0.32 | 1.74 | 0 |
1 | Rolling Stdev (30 Minute) | 9/12/2014 0:01 | 1.06 | 1.01 | 5.49 | 11.73 | 3.36 | 0.31 | 1.83 | 0 |
2 | Rolling Stdev (30 Minute) | 9/12/2014 0:02 | 1.06 | 1.03 | 5.62 | 11.94 | 3.40 | 0.31 | 1.87 | 0 |
3 | Rolling Stdev (30 Minute) | 9/12/2014 0:03 | 1.06 | 1.05 | 5.61 | 12.10 | 3.30 | 0.31 | 1.90 | 0 |
4 | Rolling Stdev (30 Minute) | 9/12/2014 0:04 | 1.07 | 1.03 | 5.61 | 12.31 | 3.36 | 0.30 | 1.88 | 0 |
dataframe_stdev.describe()
Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) | |
---|---|---|---|---|---|---|---|---|
count | 2452.000000 | 2452.000000 | 2452.000000 | 2452.000000 | 2452.000000 | 2452.00000 | 2452.000000 | 2452.000000 |
mean | 1.485126 | 1.497361 | 6.648308 | 13.945338 | 3.436370 | 0.37060 | 2.725232 | 0.021207 |
std | 2.294950 | 2.282053 | 5.722897 | 12.394302 | 3.043042 | 0.29979 | 4.186723 | 0.144104 |
min | 0.380000 | 0.640000 | 0.580000 | 5.000000 | 0.900000 | 0.11000 | 1.170000 | 0.000000 |
25% | 1.070000 | 1.080000 | 5.520000 | 11.210000 | 2.920000 | 0.28000 | 1.960000 | 0.000000 |
50% | 1.160000 | 1.170000 | 5.990000 | 12.180000 | 3.160000 | 0.32000 | 2.120000 | 0.000000 |
75% | 1.230000 | 1.260000 | 6.460000 | 13.110000 | 3.370000 | 0.36000 | 2.270000 | 0.000000 |
max | 21.390000 | 21.530000 | 59.310000 | 124.710000 | 30.650000 | 3.32000 | 39.150000 | 1.000000 |
dataframe_stdev.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2452 entries, 0 to 2451 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Data Source 2452 non-null object 1 TIMEFRAME (DD/MM/YYYY) 2452 non-null object 2 Volumetric Flow Meter 1 2452 non-null float64 3 Volumetric Flow Meter 2 2452 non-null float64 4 Pump Speed (RPM) 2452 non-null float64 5 Pump Torque 2452 non-null float64 6 Ambient Temperature 2452 non-null float64 7 Horse Power 2452 non-null float64 8 Pump Efficiency 2452 non-null float64 9 PUMP FAILURE (1 or 0) 2452 non-null int64 dtypes: float64(7), int64(1), object(2) memory usage: 191.7+ KB
When you look at your dataframe, you should now be able to see the upper and lower quartiles for each row of data from the .info command you used previously.
You should now also have a rough sense of the number of entires in each dataset (~2,452). However, just as you learned when using Excel, creating a visualization of the data using Python is often more informative than viewing the table statistics. Next up — convert the tables you created into a boxplot by following these instructions:
i) Using the dataframe_raw, create a boxplot visualising this information.
ii) Using the dataframe_raw, create a lineplot visualising this information.
Hint: You might want to reference the following .plot function (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)
dataframe_raw.plot.box(title='Boxplot - Raw Data')
dataframe_stdev.plot.box(title='Boxplot - STDEV Data')
<AxesSubplot:title={'center':'Boxplot - STDEV Data'}>
dataframe_raw.plot.line(title='Lineplot - Raw Data')
dataframe_stdev.plot.line(title='Lineplot - STDEV Data')
<AxesSubplot:title={'center':'Lineplot - STDEV Data'}>
In Box plot, we can see because of some outliers it shifts the IQR in **Pump Torque & Pump speed** a little in Ambient temperature. In Line plot, we see tall peaks before pump failure in **Pump Torque, Pump speed and Ambient temperature**.
i) Using the dataframe_raw dataset, create two boxplots specifically for when the Pump has failed (i.e. Pump Failure = 1) and 0 (Pump is in normal operations).
As part of best practice, don't forget to clearly title your box plots so we can identify which plot is for the failure and which plot is for the normal operations.
To do this, you'll have to recall how to apply boolean filtering to a dataframe.
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 0].plot.box(title='Boxplot - PUMP FAILURE = 0')
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 1].plot.box(title='Boxplot - PUMP FAILURE = 1')
<AxesSubplot:title={'center':'Boxplot - PUMP FAILURE = 1'}>