Standard Deviation

Standard deviation describes how far the data deviates from the mean. In other words, it describes the spread of the dataset.

In this post, we will discuss:

  1. Why do you need standard deviation?
  2. How to derive the formula for standard deviation?
  3. What does standard deviation say about your dataset?
  4. What is Empirical Rule or Three Sigma Rule?
  5. How to use standard deviation to find outliers?
  6. Python code to remove outliers in Boston Housing dataset.

1. Why do you need standard deviation?

Is the mean not enough to describe a dataset? Let us see. Take the two example datasets below:

Dataset A: {1, 2, 3, 4, 5}

Dataset B: {3, 3, 3, 3, 3}

Both datasets A and B have a mean of 3, yet the data looks very different. Knowing only the mean of a dataset, you cannot tell what the underlying data looks like.

So, you need a measure that captures the variation or deviation in the data. That is where standard deviation comes in: it tells you how the data is spread out.
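As a quick check, the two datasets can be compared in Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original example.

```python
from statistics import mean

dataset_a = [1, 2, 3, 4, 5]
dataset_b = [3, 3, 3, 3, 3]

mean_a = mean(dataset_a)
mean_b = mean(dataset_b)

# Both means are equal, even though the underlying values differ
print("Mean of A:", mean_a)
print("Mean of B:", mean_b)
print("Same values?", dataset_a == dataset_b)
```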

2. How to derive the formula for standard deviation?

Step – 1: Finding error

To capture the deviation, the natural first step is to calculate the distance of each value from the expected value (the mean). This distance can also be called the error. Similarly, in machine learning, the difference between the actual value and the predicted value is called the error.

Error = value – mean

Step – 2: Finding squared error

If you calculate the error this way and add up the distance from each data point to the mean, the positive and negative errors cancel each other out, and the total tells you nothing.

I will show this for dataset A, whose mean is 3, by calculating the sum of the distances from each data point to the mean.

For dataset A,

Sum of errors =

= (1-3) + (2-3) + (3-3) + (4-3) + (5-3)

= (-2) + (-1) + (0) + (1) + (2)

= 0
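The same cancellation can be reproduced in a few lines of Python. This is only an illustrative sketch of the arithmetic above; the variable names are my own.

```python
dataset_a = [1, 2, 3, 4, 5]
mean_a = sum(dataset_a) / len(dataset_a)

# Error of each data point: value - mean
errors = [value - mean_a for value in dataset_a]
sum_of_errors = sum(errors)

print(errors)         # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum_of_errors)  # 0.0
```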

For dataset A, the deviation from the mean clearly cannot be 0, so this result is meaningless as a measure of spread. Since the signs of the numbers are the problem, we get rid of them by squaring each error.

Step – 3: Calculate sum of squared errors

Adding all the squared errors,

Sum of squared errors = sum of all [(value – mean)²]

For dataset A,

Sum of squared errors =

= (1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²

= 4 + 1 + 0 + 1 + 4

= 10
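A short Python sketch of the same calculation, assuming dataset A from above (the variable names are illustrative):

```python
dataset_a = [1, 2, 3, 4, 5]
mean_a = sum(dataset_a) / len(dataset_a)

# Squaring removes the sign, so errors no longer cancel out
squared_errors = [(value - mean_a) ** 2 for value in dataset_a]
sum_of_squared_errors = sum(squared_errors)

print(squared_errors)         # [4.0, 1.0, 0.0, 1.0, 4.0]
print(sum_of_squared_errors)  # 10.0
```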

Sum of squared errors is a commonly used term in machine learning. As we now know why squaring is needed, let’s move on.

Step – 4: Finding variance

To account for the number of data points, we divide the sum of squared errors by the total number of data points. The result is called the variance.

Variance = Sum of squared errors / Total number of data points.

For the dataset A,

Variance = Sum of squared errors / Total number of data points

= 10 / 5

= 2
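The variance of dataset A can be checked against Python's standard library. One caveat that the derivation above does not cover: dividing by the total number of data points gives what is usually called the population variance, while many tools also offer a sample variance that divides by N - 1. The sketch below shows both so the difference is visible; treat it as an illustration rather than a recommendation of one over the other.

```python
from statistics import pvariance, variance

dataset_a = [1, 2, 3, 4, 5]

# Manual calculation following the steps in the text
mean_a = sum(dataset_a) / len(dataset_a)
manual_variance = sum((v - mean_a) ** 2 for v in dataset_a) / len(dataset_a)

population_variance = pvariance(dataset_a)  # divides by N, matches the text
sample_variance = variance(dataset_a)       # divides by N - 1

print("Manual:", manual_variance)
print("Population:", population_variance)
print("Sample:", sample_variance)
```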

Step – 5: Finding standard deviation

Since we squared the errors earlier, we now undo that operation by taking the square root of the variance. This also brings the measure back to the same units as the data. The square root of the variance is called the standard deviation.

The formula for standard deviation is:

Standard Deviation = Square root of (Variance)

Or,

Standard deviation = Square root of (Sum of squared errors / Total number of data points)

Also written as:

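$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

where σ is the standard deviation, N is the total number of data points, x_i is each value, and μ is the mean.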

And that is how we arrive at the formula for standard deviation. The term we just derived describes the typical deviation of the data points from the mean.

For Dataset A,

Standard Deviation = Square root of (2)

≈ 1.414
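Putting the five steps together, here is a minimal sketch that computes the standard deviation of dataset A by hand and compares it with `statistics.pstdev`, the standard library's population standard deviation. As a side note, NumPy's `np.std` also divides by N by default (`ddof=0`), whereas pandas' `Series.std` divides by N - 1 by default (`ddof=1`), so the two can give slightly different numbers on the same data.

```python
import math
from statistics import pstdev

dataset_a = [1, 2, 3, 4, 5]

# Steps 1 to 4: errors, squared errors, their sum, and the variance
mean_a = sum(dataset_a) / len(dataset_a)
variance_a = sum((v - mean_a) ** 2 for v in dataset_a) / len(dataset_a)

# Step 5: square root of the variance
std_a = math.sqrt(variance_a)

print(round(std_a, 3))              # 1.414
print(round(pstdev(dataset_a), 3))  # 1.414
```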

3. What does standard deviation say about your dataset?

If all values in a dataset are equal (like Dataset B, which is {3, 3, 3, 3, 3}), the standard deviation is 0. More generally:

  • If the standard deviation is small, the values lie close to the mean.
  • If the standard deviation is large, the values lie far away from the mean.
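To see this in numbers, the sketch below compares datasets A and B with a third, more spread-out dataset. Dataset C is hypothetical and was made up for this illustration; all three share a mean of 3.

```python
from statistics import mean, pstdev

datasets = {
    "A": [1, 2, 3, 4, 5],
    "B": [3, 3, 3, 3, 3],
    "C": [-7, -2, 3, 8, 13],  # hypothetical dataset, not from the text
}

# Population standard deviation of each dataset
spreads = {name: pstdev(values) for name, values in datasets.items()}

for name, values in datasets.items():
    print(name, "mean:", mean(values), "std:", round(spreads[name], 3))
```

The means are identical, but the standard deviation grows as the values move further from the mean.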

4. What is Empirical Rule or Three Sigma Rule?

The empirical rule goes by other names, such as the three sigma rule or the 68 – 95 – 99.7 rule. It describes how data is spread in a normal distribution. A normally distributed dataset has a bell-shaped curve with the mean at its center.

The empirical rule states that, in a normal distribution,

  1. About 68 percent of the data lies within one standard deviation of the mean.
  2. About 95 percent of the data lies within two standard deviations of the mean.
  3. About 99.7 percent of the data lies within three standard deviations of the mean.

[Figures: Three Sigma Rule, showing data lying within 1, 2, and 3 standard deviations of the mean]
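
The three percentages can be recovered from the normal distribution itself. For a normal distribution, the fraction of data within k standard deviations of the mean equals erf(k / √2), where erf is the error function. The helper name `fraction_within` below is my own; the sketch uses only Python's `math` module.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: fraction_within(k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print("Within {} sigma: {:.2%}".format(k, fraction))
# Within 1 sigma: 68.27%
# Within 2 sigma: 95.45%
# Within 3 sigma: 99.73%
```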

5. How to use standard deviation to find outliers?

Standard deviation can be used to find outliers if the data follows a normal (Gaussian) distribution.

As discussed in the empirical rule section, nearly all of the data (99.7%) lies within three standard deviations of the mean. The remaining 0.3 percent of data points lie far away from the mean. These can be considered outliers because they sit at the extremes of the distribution.

So, data points below mean - 3*sigma or above mean + 3*sigma can be removed from the dataset.

Please note that this method is reliable only if the dataset follows a normal distribution.
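
One way to express the rule is through z-scores, that is, how many standard deviations each value sits from the mean. The sketch below is an illustration on synthetic data: the values are drawn from a normal distribution with a fixed seed, and two extreme values (120.0 and -30.0) are injected by hand. None of these numbers come from a real dataset.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Synthetic, roughly normal data plus two hand-made extreme values
values = [random.gauss(50, 5) for _ in range(1000)] + [120.0, -30.0]

mu = mean(values)
sigma = pstdev(values)

# z-score: distance from the mean, measured in standard deviations
z_scores = [(v - mu) / sigma for v in values]

outliers = [v for v, z in zip(values, z_scores) if abs(z) > 3]
kept = [v for v, z in zip(values, z_scores) if abs(z) <= 3]

print("Total:", len(values))
print("Outliers:", len(outliers))
print("Kept:", len(kept))
```

Note that extreme values inflate the mean and standard deviation they are judged against, which is one reason the rule works best on data that is close to normal to begin with.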

In reality, we cannot expect every dataset to follow a normal distribution. Other methods, such as the interquartile range (IQR), can be used to remove outliers from a non-Gaussian distribution.
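
For comparison, here is a rough sketch of the IQR approach. It assumes the common convention of flagging values more than 1.5 × IQR below the first quartile or above the third quartile; that multiplier is a widely used rule of thumb, not something defined in this post. The sample data is made up for illustration.

```python
from statistics import quantiles

# Hypothetical skewed data with one obvious extreme value
data_points = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

# quantiles(..., n=4) returns the three quartile cut points
q1, q2, q3 = quantiles(data_points, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

iqr_outliers = [v for v in data_points if v < lower_fence or v > upper_fence]
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", iqr_outliers)
```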

6. Python code to remove outliers – Boston Housing dataset

[A copy of the below code and dataset is also available in my GitHub repository.]

The Boston housing price dataset used here has 489 rows and 4 columns (RM, LSTAT, PTRATIO, MEDV). The MEDV column holds the house price.

We will focus only on the MEDV column here. The values in this column are approximately normally distributed (Gaussian), so we will apply the three sigma rule to remove outliers from the Boston Housing dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (housing.csv must be in the working directory)
data = pd.read_csv("housing.csv")
print("The dataset has {} rows and {} columns".format(data.shape[0], data.shape[1]))

Output: The dataset has 489 rows and 4 columns

data.head()

[Table: Boston Housing Data – Sample rows]

# Plot a histogram of the house prices
plt.hist(data["MEDV"])
plt.xlabel("MEDV")
plt.ylabel("Number of occurrences")
plt.title("Boston Housing dataset - Histogram showing normal distribution")
plt.show()

[Figure: Boston Housing dataset – Histogram showing normal distribution]

# np.std uses the population formula (divides by N) by default
mean = np.mean(data["MEDV"])
sigma = np.std(data["MEDV"])
print("Mean: {:.2f}".format(mean))
print("Standard Deviation: {:.2f}".format(sigma))

Output:
Mean: 454342.94
Standard Deviation: 165171.13

# Three sigma limits: three standard deviations on either side of the mean
lower_range = mean - (3 * sigma)
upper_range = mean + (3 * sigma)
print("Good data should lie between {:.2f} and {:.2f}".format(lower_range, upper_range))

Output: Good data should lie between -41170.45 and 949856.34

# Collect the MEDV values that fall outside the limits
outliers = [i for i in data["MEDV"] if i < lower_range or i > upper_range]
print("Number of outliers:", len(outliers))
print("Outliers:", outliers)

Output:
Number of outliers: 6
Outliers: [1018500.0, 980700.0, 1014300.0, 1024800.0, 953400.0, 966000.0]

# Drop the outlier rows from the DataFrame in place
data.drop(data[(data["MEDV"] < lower_range) | (data["MEDV"] > upper_range)].index, inplace=True)
data.shape

Output: (483, 4)

Note that there were 489 rows in the beginning. After removing the 6 outlier records, 483 rows remain.
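
The same steps can be wrapped in a small reusable function that returns a filtered copy instead of modifying the DataFrame in place. This is a sketch under stated assumptions: the function name `drop_three_sigma_outliers` is hypothetical, and the data below is synthetic (generated with a fixed seed, with two extreme prices injected by hand) so the example runs without housing.csv.

```python
import numpy as np
import pandas as pd

def drop_three_sigma_outliers(df, column):
    """Return a copy of df without rows whose `column` value lies
    more than three standard deviations from the column mean."""
    mean = df[column].mean()
    sigma = df[column].std(ddof=0)  # population form, same as np.std
    # between() is inclusive on both ends by default
    within = df[column].between(mean - 3 * sigma, mean + 3 * sigma)
    return df[within].copy()

# Synthetic stand-in for the MEDV column, not the real dataset
rng = np.random.default_rng(0)
prices = rng.normal(loc=450_000, scale=50_000, size=500)
prices[:2] = [1_500_000, 1_600_000]  # two hand-made extreme values
demo = pd.DataFrame({"MEDV": prices})

cleaned = drop_three_sigma_outliers(demo, "MEDV")
print("Before:", demo.shape)
print("After:", cleaned.shape)
```

Returning a copy keeps the original DataFrame intact, which makes it easier to compare the data before and after filtering.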

(Featured Image: Image by Jill Wellington from Pixabay)
