Welcome to the Southern Water Corp Python Case Study!

While working on the Statistics unit, you used Microsoft Excel's data analytics capabilities to analyze Southern Water Corp's Pump Data.

Now, Joanna Luez — Southern Water Corp's Lead Scientist — has requested that you convert your earlier analysis in Excel to Python Code. After all, with all the formulas in Excel, it can be tricky for others with less experience in Excel to follow.

Excel is an excellent tool for adhoc analysis, but Python is an invaluable tool thanks to its advanced data analysis capabilities that only take a few lines of code to complete.

Please note that this case study is composed of two parts — once you have completed part 1, which involves descriptive statistics, please submit your work and discuss it with your mentor before moving on to part 2.

Let's get started!


Part I: Descriptive Statistics

Step 1: Import Libraries

Import the libraries you'll need for your analysis. You will need the following libraries:

Matplotlib - This is Python's basic plotting library. You'll use the pyplot and dates function collections from matplotlib throughout this case study so we encourage you to important these two specific libraries with their own aliases. Also, include the line '%matplotlib inline' so that your graphs are easily included in your notebook.

Don't forget that to modify the matplotlib plot sizes so they're at a comfortable reading size you should use the following:

import matplotlib as mpl

mpl.rcParams['figure.figsize'] = (20,5)

Seaborn - This library will enable you to create aesthetically pleasing plots.

Pandas - This library will enable you to view and manipulate your data in a tabular format.

statsmodels.api - This library will enable you to create statistical models. You will need this library when performing regession analysis in Part 2 of this case study.

Place your code here


Step 2: Descriptive Statistics

The data you've received from Southern Water Corp has been split into two files. The first file, titled DF_Raw_Data contains all the 'raw' Pump Data you will need for your analysis. The second file, titled DF_Rolling_Stdev contains the Rolling Standard Deviation Data you will need for questions 10 Onwards.

We have deliberately set up the data in this manner so please ensure that when you need to perform the rolling standard deviation calculations, you use the DF_Rolling_Stdev.csv file.

i. Import each of the two data sources and store them into their individual dataframes. Suggested names: dataframe_raw & dataframe_stdev respectively. Don't forget to use the header argument to ensure your columns have meaningful names!

ii. Print descriptive statistics for each of the dataframes using .describe() and .info()

Load Data

Raw Data

Standard Deviation data


Step 3: Create a Boxplot

When you look at your dataframe, you should now be able to see the upper and lower quartiles for each row of data from the .info command you used previously.

You should now also have a rough sense of the number of entires in each dataset (~2,452). However, just as you learned when using Excel, creating a visualization of the data using Python is often more informative than viewing the table statistics. Next up — convert the tables you created into a boxplot by following these instructions:

i) Using the dataframe_raw, create a boxplot visualising this information.

ii) Using the dataframe_raw, create a lineplot visualising this information.

Hint: You might want to reference the following .plot function (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)

Please put your code here

We've included an example of what your Box Plot should look like once you've plotted this using the dataframe_raw dataset