While working on the Statistics unit, you used Microsoft Excel's data analytics capabilities to analyze Southern Water Corp's Pump Data.
Now, Joanna Luez, Southern Water Corp's Lead Scientist, has requested that you convert your earlier Excel analysis into Python code. After all, a spreadsheet full of formulas can be tricky for colleagues with less Excel experience to follow.
Excel is an excellent tool for ad hoc analysis, but Python is invaluable for its advanced data analysis capabilities, many of which take only a few lines of code.
Please note that this case study is composed of two parts — once you have completed part 1, which involves descriptive statistics, please submit your work and discuss it with your mentor before moving on to part 2.
Import the libraries you'll need for your analysis. You will need the following libraries:
Matplotlib - This is Python's basic plotting library. You'll use the pyplot and dates modules from matplotlib throughout this case study, so we encourage you to import these two modules with their own aliases. Also, include the line '%matplotlib inline' so that your graphs are easily included in your notebook.
Don't forget that to modify the matplotlib plot sizes so they're at a comfortable reading size you should use the following:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (20,5)
Seaborn - This library will enable you to create aesthetically pleasing plots.
Pandas - This library will enable you to view and manipulate your data in a tabular format.
statsmodels.api - This library will enable you to create statistical models. You will need this library when performing regression analysis in Part 2 of this case study.
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as md
%matplotlib inline
mpl.rcParams['figure.figsize'] = (20,10)
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
The data you've received from Southern Water Corp has been split into two files. The first file, titled DF_Raw_Data, contains all the 'raw' Pump Data you will need for your analysis. The second file, titled DF_Rolling_Stdev, contains the Rolling Standard Deviation Data you will need from Question 10 onwards.
We have deliberately set up the data in this manner so please ensure that when you need to perform the rolling standard deviation calculations, you use the DF_Rolling_Stdev.csv file.
i. Import each of the two data sources and store them into their individual dataframes. Suggested names: dataframe_raw & dataframe_stdev respectively. Don't forget to use the header argument to ensure your columns have meaningful names!
ii. Print descriptive statistics for each of the dataframes using .describe() and .info()
dataframe_raw = pd.read_csv('DF_Raw_Data.csv')
dataframe_stdev = pd.read_csv('DF_Rolling_Stdev.csv')
dataframe_raw.head()
 | Data Source | TIMEFRAME (DD/MM/YYYY) | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) |
---|---|---|---|---|---|---|---|---|---|---|
0 | Raw | 9/12/2014 0:00 | 41.30 | 41.16 | 98 | 207 | 54 | 3.86 | 74.84 | 0 |
1 | Raw | 9/12/2014 0:01 | 42.40 | 41.39 | 92 | 212 | 46 | 3.71 | 75.25 | 0 |
2 | Raw | 9/12/2014 0:02 | 41.43 | 41.15 | 80 | 207 | 55 | 3.15 | 74.82 | 0 |
3 | Raw | 9/12/2014 0:03 | 42.21 | 40.93 | 83 | 190 | 49 | 3.00 | 74.42 | 0 |
4 | Raw | 9/12/2014 0:04 | 40.51 | 43.32 | 90 | 195 | 50 | 3.34 | 78.76 | 0 |
dataframe_raw.describe()
 | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) |
---|---|---|---|---|---|---|---|---|
count | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 | 2453.000000 |
mean | 41.802629 | 41.796702 | 90.796576 | 202.851610 | 50.226661 | 3.540897 | 76.015149 | 0.021199 |
std | 3.656576 | 3.654873 | 10.217885 | 22.683977 | 5.298203 | 0.579055 | 6.651633 | 0.144075 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 41.050000 | 41.000000 | 85.000000 | 191.000000 | 48.000000 | 3.220000 | 74.560000 | 0.000000 |
50% | 42.100000 | 42.140000 | 91.000000 | 203.000000 | 51.000000 | 3.480000 | 76.620000 | 0.000000 |
75% | 43.140000 | 43.130000 | 97.000000 | 215.000000 | 53.000000 | 3.780000 | 78.470000 | 0.000000 |
max | 45.900000 | 45.840000 | 124.000000 | 264.000000 | 65.000000 | 7.560000 | 83.450000 | 1.000000 |
dataframe_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2453 entries, 0 to 2452
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Data Source              2453 non-null   object
 1   TIMEFRAME (DD/MM/YYYY)   2453 non-null   object
 2   Volumetric Flow Meter 1  2453 non-null   float64
 3   Volumetric Flow Meter 2  2453 non-null   float64
 4   Pump Speed (RPM)         2453 non-null   int64
 5   Pump Torque              2453 non-null   int64
 6   Ambient Temperature      2453 non-null   int64
 7   Horse Power              2453 non-null   float64
 8   Pump Efficiency          2453 non-null   float64
 9   PUMP FAILURE (1 or 0)    2453 non-null   int64
dtypes: float64(4), int64(4), object(2)
memory usage: 191.8+ KB
dataframe_stdev.head()
 | Data Source | TIMEFRAME (DD/MM/YYYY) | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) |
---|---|---|---|---|---|---|---|---|---|---|
0 | Rolling Stdev (30 Minute) | 9/12/2014 0:00 | 1.04 | 0.96 | 5.54 | 11.70 | 3.40 | 0.32 | 1.74 | 0 |
1 | Rolling Stdev (30 Minute) | 9/12/2014 0:01 | 1.06 | 1.01 | 5.49 | 11.73 | 3.36 | 0.31 | 1.83 | 0 |
2 | Rolling Stdev (30 Minute) | 9/12/2014 0:02 | 1.06 | 1.03 | 5.62 | 11.94 | 3.40 | 0.31 | 1.87 | 0 |
3 | Rolling Stdev (30 Minute) | 9/12/2014 0:03 | 1.06 | 1.05 | 5.61 | 12.10 | 3.30 | 0.31 | 1.90 | 0 |
4 | Rolling Stdev (30 Minute) | 9/12/2014 0:04 | 1.07 | 1.03 | 5.61 | 12.31 | 3.36 | 0.30 | 1.88 | 0 |
dataframe_stdev.describe()
 | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) |
---|---|---|---|---|---|---|---|---|
count | 2452.000000 | 2452.000000 | 2452.000000 | 2452.000000 | 2452.000000 | 2452.00000 | 2452.000000 | 2452.000000 |
mean | 1.485126 | 1.497361 | 6.648308 | 13.945338 | 3.436370 | 0.37060 | 2.725232 | 0.021207 |
std | 2.294950 | 2.282053 | 5.722897 | 12.394302 | 3.043042 | 0.29979 | 4.186723 | 0.144104 |
min | 0.380000 | 0.640000 | 0.580000 | 5.000000 | 0.900000 | 0.11000 | 1.170000 | 0.000000 |
25% | 1.070000 | 1.080000 | 5.520000 | 11.210000 | 2.920000 | 0.28000 | 1.960000 | 0.000000 |
50% | 1.160000 | 1.170000 | 5.990000 | 12.180000 | 3.160000 | 0.32000 | 2.120000 | 0.000000 |
75% | 1.230000 | 1.260000 | 6.460000 | 13.110000 | 3.370000 | 0.36000 | 2.270000 | 0.000000 |
max | 21.390000 | 21.530000 | 59.310000 | 124.710000 | 30.650000 | 3.32000 | 39.150000 | 1.000000 |
dataframe_stdev.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2452 entries, 0 to 2451
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Data Source              2452 non-null   object
 1   TIMEFRAME (DD/MM/YYYY)   2452 non-null   object
 2   Volumetric Flow Meter 1  2452 non-null   float64
 3   Volumetric Flow Meter 2  2452 non-null   float64
 4   Pump Speed (RPM)         2452 non-null   float64
 5   Pump Torque              2452 non-null   float64
 6   Ambient Temperature      2452 non-null   float64
 7   Horse Power              2452 non-null   float64
 8   Pump Efficiency          2452 non-null   float64
 9   PUMP FAILURE (1 or 0)    2452 non-null   int64
dtypes: float64(7), int64(1), object(2)
memory usage: 191.7+ KB
Looking at the output of the .describe() command you used previously, you should now be able to see the upper and lower quartiles for each column of data.
You should now also have a rough sense of the number of entries in each dataset (2,453 rows of raw data and 2,452 rows of rolling standard deviations). However, just as you learned when using Excel, creating a visualization of the data using Python is often more informative than viewing the table statistics. Next up, convert the tables you created into a boxplot by following these instructions:
i) Using the dataframe_raw, create a boxplot visualising this information.
ii) Using the dataframe_raw, create a lineplot visualising this information.
Hint: You might want to reference the following .plot function (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)
dataframe_raw.plot.box(title='Boxplot - Raw Data')
dataframe_stdev.plot.box(title='Boxplot - STDEV Data')
dataframe_raw.plot.line(title='Lineplot - Raw Data')
dataframe_stdev.plot.line(title='Lineplot - STDEV Data')
In the box plots, we can see that outliers shift the IQR for **Pump Torque** and **Pump Speed**, and slightly for **Ambient Temperature**. In the line plots, we see tall peaks before pump failure in **Pump Torque**, **Pump Speed**, and **Ambient Temperature**.
i) Using the dataframe_raw dataset, create two boxplots: one for when the pump has failed (i.e. PUMP FAILURE = 1) and one for when the pump is in normal operation (PUMP FAILURE = 0).
As part of best practice, don't forget to clearly title your box plots so we can identify which plot is for the failure and which plot is for the normal operations.
To do this, you'll have to recall how to apply boolean filtering to a dataframe.
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 0].plot.box(title='Boxplot - PUMP FAILURE = 0')
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 1].plot.box(title='Boxplot - PUMP FAILURE = 1')
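If you would like to compare the two states directly, a side-by-side layout with a shared y-axis can help. This is a minimal sketch, assuming the same dataframe_raw and column names as above:

# Side-by-side boxplots on a shared y-axis make the two states easier to compare
fig, (ax_normal, ax_failure) = plt.subplots(1, 2, sharey=True)
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 0].plot.box(ax=ax_normal, title='Boxplot - PUMP FAILURE = 0', rot=90)
dataframe_raw[dataframe_raw['PUMP FAILURE (1 or 0)'] == 1].plot.box(ax=ax_failure, title='Boxplot - PUMP FAILURE = 1', rot=90)
plt.tight_layout()
plt.show()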
Open-ended Question:
When we separate the data by failure state, the outliers disappear from the plots: the **Pump Torque** distribution becomes more compressed and shifts higher, and **Pump Speed** also shifts higher.
When you did this work in Excel, you used the interquartile ranges to remove the outliers from each column. Happily, Python allows you to do this same process more quickly and efficiently, as you'll see when working on Step 5.
i) Create two new variables called Q1 and Q3.
Q1 should contain the 25th percentile for all columns in the dataframe while Q3 should contain the 75th percentile for all the columns in the dataframe.
ii) Calculate the interquartile range (IQR = Q3 - Q1) for all columns in the dataframe and print it to the screen.
Q1 = dataframe_raw.quantile(0.25)
Q3 = dataframe_raw.quantile(0.75)
IQR = Q3 - Q1
IQR
Volumetric Flow Meter 1     2.09
Volumetric Flow Meter 2     2.13
Pump Speed (RPM)           12.00
Pump Torque                24.00
Ambient Temperature         5.00
Horse Power                 0.56
Pump Efficiency             3.91
PUMP FAILURE (1 or 0)       0.00
dtype: float64
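Note that the two text columns (Data Source and the timestamp) are silently dropped by older pandas versions; on newer versions, .quantile() may raise an error on them instead, so you may need to restrict the calculation to numeric columns. A hedged sketch:

# On recent pandas versions, restrict the quantiles to numeric columns explicitly
Q1 = dataframe_raw.quantile(0.25, numeric_only=True)
Q3 = dataframe_raw.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1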
How many outliers do you have? What will happen to your dataset if you remove them all? Let's find out!
i) Calculate how many entries you currently have in the original dataframe.
ii) Using the quartiles and IQR previously calculated, identify the number of entries you'd have if you were to remove the outliers.
iii) Find the proportion of outliers that exist in the dataset.
dataframe_raw.count()
Data Source                2453
TIMEFRAME (DD/MM/YYYY)     2453
Volumetric Flow Meter 1    2453
Volumetric Flow Meter 2    2453
Pump Speed (RPM)           2453
Pump Torque                2453
Ambient Temperature        2453
Horse Power                2453
Pump Efficiency            2453
PUMP FAILURE (1 or 0)      2453
dtype: int64
# Boolean mask: True wherever a value lies more than 1.5 * IQR outside the quartiles
outliers = (dataframe_raw < (Q1 - 1.5 * IQR)) | (dataframe_raw > (Q3 + 1.5 * IQR))
dataframe_raw[outliers].count()
Data Source                 0
TIMEFRAME (DD/MM/YYYY)      0
Volumetric Flow Meter 1    44
Volumetric Flow Meter 2    45
Pump Speed (RPM)           25
Pump Torque                21
Ambient Temperature        24
Horse Power                83
Pump Efficiency            43
PUMP FAILURE (1 or 0)      52
dtype: int64
print(dataframe_raw[outliers].count() / dataframe_raw.count() * 100)
Data Source                0.000000
TIMEFRAME (DD/MM/YYYY)     0.000000
Volumetric Flow Meter 1    1.793722
Volumetric Flow Meter 2    1.834488
Pump Speed (RPM)           1.019160
Pump Torque                0.856095
Ambient Temperature        0.978394
Horse Power                3.383612
Pump Efficiency            1.752956
PUMP FAILURE (1 or 0)      2.119853
dtype: float64
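The percentages above are per column. If you also want to know how many rows would be lost by dropping every row that contains at least one outlier, something along these lines (reusing the outliers mask defined above) should work:

# Count rows containing at least one outlying value in any column
rows_with_outliers = outliers.any(axis=1).sum()
print(rows_with_outliers, 'rows contain at least one outlier')
print(len(dataframe_raw) - rows_with_outliers, 'rows would remain if they were all removed')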
# Indexing with the boolean mask keeps only the outlying values; every other cell becomes NaN
df_new = dataframe_raw[outliers]
df_new
 | Data Source | TIMEFRAME (DD/MM/YYYY) | Volumetric Flow Meter 1 | Volumetric Flow Meter 2 | Pump Speed (RPM) | Pump Torque | Ambient Temperature | Horse Power | Pump Efficiency | PUMP FAILURE (1 or 0) |
---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2448 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2449 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2450 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2451 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2452 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2453 rows × 10 columns
Removing all the outliers can sometimes help, but it doesn't look appropriate here: when we separated the data by pump failure, the outliers disappeared from the plots, which suggests they are largely tied to the failure events themselves.
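For reference, dataframe_raw[outliers] above keeps only the outlying values and masks everything else with NaN, which is why the table is almost entirely NaN. If the next step needs a dataframe with the outlier rows actually removed, a sketch along these lines (again reusing the outliers mask, with a hypothetical variable name) is one way to do it:

# Keep only the rows in which no column contains an outlier
dataframe_no_outliers = dataframe_raw[~outliers.any(axis=1)]
print(len(dataframe_no_outliers), 'rows remain after removing outlier rows')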
With the dataset now stripped of outliers, create the following boxplots:
i) A boxplot when PUMP FAILURE is 1 (Check what the length of the dataframe is before you try and plot this. You may be surprised!)
ii) A boxplot when PUMP FAILURE is 0
df_new[df_new['PUMP FAILURE (1 or 0)'] == 0].plot.box(title='Boxplot - PUMP FAILURE = 0')
df_new[df_new['PUMP FAILURE (1 or 0)'] == 1].plot.box(title='Boxplot - PUMP FAILURE = 1')
As you might recall from the line plot you made earlier, it was hard to see which variables were the most significant with respect to pump failure when all the variables were plotted together. This is why we are going to ITERATE through the dataframe, plot each individual variable on its own, and compare it with Pump Failure.
This will require you to make use of the following syntax:
#Loop through each variable in the dataframe (i.e. dataframe[___].plot
#Specify the dual-axis (i.e. ax.twinx())
#Plot the Pump Failure (1 or 0) on the secondary axes
#Include Plot Titles for each plot (i.e. print ("This is for the attribute " + i))
Note: For each plot, ensure that you have a dual axis set up so you can see the Pump Behaviour (1 or 0) on the second Y-axis and the attribute on the first Y-axis. It might be helpful to give the failure state its own color and add a legend to the axes to make the plots easier to read.
Check out this link to learn how to do this: https://matplotlib.org/gallery/api/two_scales.html
# Columns 2-8 hold the sensor attributes; the last column is PUMP FAILURE (1 or 0)
plot_columns = [2, 3, 4, 5, 6, 7, 8]
for i in plot_columns:
    fig, ax1 = plt.subplots()
    ax1.plot(dataframe_raw.iloc[:, i])
    ax1.set_ylabel(dataframe_raw.columns[i])
    ax1.tick_params(axis='y')
    ax2 = ax1.twinx()  # secondary axis for the failure flag
    ax2.plot(dataframe_raw.iloc[:, -1], color='red', label='PUMP FAILURE (1 or 0)')
    ax2.set_ylabel('PUMP FAILURE (1 or 0)')
    ax2.legend(loc='upper left')
    fig.suptitle('This is for the attribute ' + dataframe_raw.columns[i], fontweight='bold', fontsize=24)
    fig.tight_layout()
    plt.show()
During pump failure, Pump Speed, Pump Torque, Ambient Temperature, and Horse Power all rise.
Of course, given that the attributes have varying units, you might need more than one plot to make sense of all this data. For this next step, let's view the information by comparing the ROLLING standard deviations over a 30-minute window.
This is where we will switch to using the dataframe_stdev that you had previously defined in Q1.
As the deviations will likely be a lot lower, the scale should be much simpler to view on one plot. Make sure that you include the 'PUMP FAILURE 1 or 0' attribute on the secondary Y-axis.
i) Set the index of the dataframe to the TIMEFRAME (DD/MM/YYYY) attribute
ii) Exactly as you did in Q8, re-plot all the variables, now transformed via a rolling standard deviation in dataframe_stdev, for the time period 10/12/2014 13:30 to 10/12/2014 14:30 against Pump Failure.
Note: To effectively filter on the time period you will need to make use of the below syntax
# dataframe_time_filtered = dataframe[(dataframe.index >= "_____") & (dataframe.index <= "_____")]
# Index by timestamp, then filter to the one-hour window specified in the instructions
df_RTP = dataframe_stdev.set_index('TIMEFRAME (DD/MM/YYYY)')
df_rtp_filtered = df_RTP[(df_RTP.index >= '10/12/2014 13:30') & (df_RTP.index <= '10/12/2014 14:30')]

# Columns 1-7 hold the rolling standard deviations; the last column is PUMP FAILURE (1 or 0)
plot_columns = [1, 2, 3, 4, 5, 6, 7]
for i in plot_columns:
    fig, ax1 = plt.subplots()
    ax1.plot(df_rtp_filtered.iloc[:, i])
    ax1.set_ylabel(df_rtp_filtered.columns[i])
    ax1.tick_params(axis='y')
    ax2 = ax1.twinx()  # secondary axis for the failure flag
    ax2.plot(df_rtp_filtered.iloc[:, -1], color='red', label='PUMP FAILURE (1 or 0)')
    ax2.set_ylabel('PUMP FAILURE (1 or 0)')
    ax2.legend(loc='upper left')
    fig.suptitle('This is for the attribute ' + df_rtp_filtered.columns[i], fontweight='bold', fontsize=24)
    fig.tight_layout()
    plt.show()
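One caveat: the filter above compares the index as plain strings, which works for this particular window but can be fragile. A more robust sketch, assuming the timestamps are day-first (as the DD/MM/YYYY column name suggests), is to parse them as datetimes before filtering:

# Parse the timestamps so the window filter compares real datetimes rather than strings
df_RTP = dataframe_stdev.copy()
df_RTP['TIMEFRAME (DD/MM/YYYY)'] = pd.to_datetime(df_RTP['TIMEFRAME (DD/MM/YYYY)'], dayfirst=True)
df_RTP = df_RTP.set_index('TIMEFRAME (DD/MM/YYYY)')
df_rtp_filtered = df_RTP[(df_RTP.index >= '2014-12-10 13:30') & (df_RTP.index <= '2014-12-10 14:30')]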
When you performed inferential statistics for Southern Water Corp using Excel, you made use of the data analysis package to create a heatmap using the correlation function. The heatmap showed the attributes that strongly correlated to Pump Failure.
Now, you'll create a heatmap using Seaborn's heatmap function — another testament to the fact that having Matplotlib and Seaborn in your toolbox will allow you to quickly create beautiful graphics that provide key insights.
i) Using Seaborn's heatmap function, create a heatmap that clearly shows the correlations (including R Squared) for all variables using the dataframe_raw dataset.
Link: (https://seaborn.pydata.org/generated/seaborn.heatmap.html)
sns.heatmap(dataframe_raw.corr(),annot=True)
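Depending on your pandas version, .corr() may raise an error on the two text columns (Data Source and the timestamp); if so, restricting it to numeric columns should do the trick. A sketch:

# Restrict the correlation matrix to numeric columns before plotting the heatmap
sns.heatmap(dataframe_raw.corr(numeric_only=True), annot=True)
plt.show()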
Open-ended Question:
Which variables seem to correlate with Pump Failure?
Horse Power, Pump Speed, Pump Torque, and Ambient Temperature show the strongest correlations with Pump Failure.
Create a barplot that shows the correlated features against PUMP FAILURE (1 or 0), in descending order.
You can do this with the matplotlib plotting backend by calling .plot(kind='bar') on the sorted correlation values.
dataframe_raw.corr().sort_values("PUMP FAILURE (1 or 0)", ascending=False)['PUMP FAILURE (1 or 0)'].plot(kind='bar')
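One small refinement, sketched below: PUMP FAILURE correlates perfectly with itself, so dropping that entry first leaves a bar chart of only the other attributes.

# Drop the trivial self-correlation (always 1.0) before plotting
failure_corr = dataframe_raw.corr()['PUMP FAILURE (1 or 0)'].drop('PUMP FAILURE (1 or 0)')
failure_corr.sort_values(ascending=False).plot(kind='bar', title='Correlation with PUMP FAILURE (1 or 0)')
plt.show()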
Previously, you created a correlation matrix using 'raw' variables. We saw some correlations with the raw data but they weren't necessarily as strong as we would have liked. This time, we'll recreate a Heatmap using the rolling standard deviation dataframe you had imported in Q1.
ii) Using Seaborn's heatmap function, create a heatmap that clearly shows the correlations (including R Squared) for all variables using the dataframe_stdev dataset.
Do any variables stand out? If yes, list these out below your heatmap.
sns.heatmap(dataframe_stdev.corr(),annot=True)
When you worked on this case study in Excel, you went through the tricky process of using the rolling standard deviation variables to generate a regression equation. Happily, this process is much simpler in Python.
For this step, you'll be using the statsmodels.api library you imported earlier and calling its Ordinary Least Squares (OLS) regression to create a multivariate regression model (a linear regression model with more than one independent variable).
i) Using the OLS Regression Model in the statsmodel.api library, create a regression equation that models the Pump Failure (Y-Variable) against all your independent variables, which include every other variable that is not PUMP FAILURE (1 or 0).
In order to fit a linear regression model with statsmodels.api there are a few steps that need to be taken. We have demonstrated this below:
## Add a constant so the regression includes an intercept term (X = sm.add_constant(X))
## Instantiate the Ordinary Least Squares Model with: model = sm.OLS(Y,X) where Y is the dependent variable and X is the independent variable (Make sure you don't include the PUMP FAILURE (1 or 0) in your list of independent variables as this is what you are trying to predict)
## Fit the Model (OLSmodelResult = OLSmodel.fit())
## Print the OLSModel Summary
Link: https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html
ii) Repeat i) but this time use the dataframe_stdev you imported previously. What is the R Squared for the model and what does this signify?
# Dependent variable: the pump failure flag we are trying to predict
YVar = dataframe_raw[['PUMP FAILURE (1 or 0)']]
# Independent variables: every other attribute except the failure flag
XVar = dataframe_raw[['Volumetric Flow Meter 1',
                      'Volumetric Flow Meter 2', 'Pump Speed (RPM)', 'Pump Torque ',
                      'Ambient Temperature', 'Horse Power', 'Pump Efficiency']]
model = sm.OLS(YVar, XVar).fit()
model.summary()
Dep. Variable: | PUMP FAILURE (1 or 0) | R-squared (uncentered): | 0.371 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.369 |
Method: | Least Squares | F-statistic: | 206.1 |
Date: | Thu, 04 Feb 2021 | Prob (F-statistic): | 7.29e-241 |
Time: | 13:46:09 | Log-Likelihood: | 1814.7 |
No. Observations: | 2453 | AIC: | -3615. |
Df Residuals: | 2446 | BIC: | -3575. |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Volumetric Flow Meter 1 | -0.0139 | 0.002 | -7.666 | 0.000 | -0.017 | -0.010 |
Volumetric Flow Meter 2 | -0.0178 | 0.003 | -5.302 | 0.000 | -0.024 | -0.011 |
Pump Speed (RPM) | 0.0027 | 0.001 | 3.305 | 0.001 | 0.001 | 0.004 |
Pump Torque | 0.0009 | 0.000 | 2.444 | 0.015 | 0.000 | 0.002 |
Ambient Temperature | 0.0102 | 0.001 | 14.464 | 0.000 | 0.009 | 0.012 |
Horse Power | 0.0701 | 0.019 | 3.746 | 0.000 | 0.033 | 0.107 |
Pump Efficiency | 0.0020 | 0.002 | 1.010 | 0.313 | -0.002 | 0.006 |
Omnibus: | 2053.340 | Durbin-Watson: | 0.457 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 56905.683 |
Skew: | 3.886 | Prob(JB): | 0.00 |
Kurtosis: | 25.279 | Cond. No. | 2.00e+03 |
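Note that the summary above reports "R-squared (uncentered)" because no constant was added before fitting. A sketch of the version the instructions describe, with an intercept included (reusing the YVar and XVar defined above):

# Add an intercept term as described in the instructions, then refit
XVar_const = sm.add_constant(XVar)
model_const = sm.OLS(YVar, XVar_const).fit()
print(model_const.summary())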
Great job creating those regression equations! You've reached the final step of this case study!
i) Use the regression equation you created in the previous step and apply the .predict() function to the dataframe to see whether or not your model picks up the Pump Failure events.
ii) Plot the rolling linear regression equation against the attribute 'PUMP FAILURE (1 or 0)'
Note: Please ensure all axes are clearly labelled and ensure that you use Dual Axes to plot this.
# Plot the model's predicted values, with the actual failure flag on a secondary axis
ax = model.predict(XVar).plot(title='OLS predictions vs PUMP FAILURE (1 or 0)')
ax.set_xlabel('Observation')
ax.set_ylabel('Predicted value')
ax2 = ax.twinx()
ax2.plot(YVar, color='red')
ax2.set_ylabel('PUMP FAILURE (1 or 0)')
plt.tight_layout()
plt.show()
You've made it to the end of this challenging case study — well done! You've now converted all of the analysis you did for Southern Water Corp using Excel into Python. You created visualizations using Seaborn, manipulated datasets with pandas, and so much more! This case study was designed to give you practice using Python to analyze datasets both large and small — you can now apply these skills to work you do throughout your career as a data analyst.