Many statistical tests rely on the assumption that the residuals of the model are normally distributed. One of the first steps in assessing normality is simply to look at a histogram of the variable in question: we would like to see a bell curve.
The term bell curve describes the mathematical concept called the normal distribution, sometimes referred to as the Gaussian distribution. It refers to the bell shape that appears when a line is plotted from the data points of a variable that meets the criteria of normality. When the data are normal, the bell curve is symmetrical around its center, so the right side is a mirror image of the left. This means that half of the data fall to the left of the mean and half fall to the right.
However, whether we like it or not, the bell often does not look symmetrical, which means that the data are not normally distributed. A possible way to fix non-normal data is to apply a transformation. Data transformation is a method of changing the distribution by applying a mathematical function to each data value.
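For example, positively skewed data are often pulled toward normality with a logarithm or a square root. A minimal sketch in SPSS syntax, where response_time is a placeholder for your own variable:

* Two common transformations for positively skewed data.
* response_time is a placeholder variable name.
COMPUTE log_rt = LN(response_time).
COMPUTE sqrt_rt = SQRT(response_time).
EXECUTE.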
Data transformation is usually applied so that the data more closely meet the assumptions of the statistical inference procedure to be applied, or to improve the interpretability or appearance of graphs. In parametric statistics, normality is one of the assumptions checked during diagnostics, and it should be verified before moving on to the other assumptions. Non-normal residuals often go hand in hand with heteroscedasticity, which can inflate Type I error rates and reduce power.
For the sample data set used below, this is the result of the normality test before the data transformation. Normality is tested with the Kolmogorov-Smirnov and Shapiro-Wilk statistics: use Kolmogorov-Smirnov for samples of more than 50 cases and Shapiro-Wilk for 50 cases or fewer. If the Sig. value is > .05, the data are normally distributed; if the Sig. value is < .05, they are not. In the table below, the Sig. value is .000, which is < .05, meaning the data are not normal.
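In SPSS, both tests come from Analyze > Descriptive Statistics > Explore with normality plots enabled, or from the equivalent syntax sketched below, where per_mean is a placeholder for the variable being tested:

* Normality tests: produces the Kolmogorov-Smirnov and Shapiro-Wilk table.
* per_mean is a placeholder; substitute your own variable name.
EXAMINE VARIABLES=per_mean
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES.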
Follow the steps below to normalize the values using fractional ranks and the inverse distribution function (the same steps are repeated in more detail, with equivalent syntax, in the pictured walkthrough further down):
1. Normalizing / Transforming Ranks
1.1 Open the data set, then click Transform.
1.2 Click Rank Cases.
1.3 A dialog box named Rank Cases will open. Drag the selected variable into the Variable(s) list. On the right side, click Rank Types, tick Fractional Rank, and click Continue.
1.4 Click OK.
2. Inverse Distribution
2.1 Click Transform.
2.2 Click Compute Variable.
2.3 Type a name for the new (normalized) variable, for example per_mean_normal.
2.4 Click Inverse DF.
2.5 Double-click Idf.Normal.
2.6 IDF.NORMAL(?,?,?) will appear in the Numeric Expression box.
2.7 Fill in the (?,?,?) with the corresponding values. For the first ?, delete it and double-click the fractional rank variable found on the left side of the dialog box in the Type & Label section. Replace the second ? with the mean and the third ? with the standard deviation.
3. To Obtain the Mean and Standard Deviation
3.1 Click Analyze.
3.2 Click Descriptive Statistics, then Descriptives. A dialog box named Descriptives will open.
3.3 Click Options, tick Mean and Std. deviation, then click Continue.
3.4 Click OK.
Steps with Pictures:
1. Normalizing / Transforming Ranks
1.1 Open the data set, then click Transform.
1.2 Click Rank Cases.
1.3 A dialog box named Rank Cases will open. Drag the selected variable into the Variable(s) list. On the right side, click Rank Types, tick Fractional Rank, and click Continue.
1.4 Click OK.
1.5 Two additional columns will now appear in Data View. You will use RFR001 (the fractional rank) in the inverse distribution step to normalize the values.
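If you prefer syntax, the Paste button in the Rank Cases dialog generates roughly the following; per_mean again stands in for the variable you are ranking, and SPSS names the new fractional-rank variable automatically (RFR001 in this walkthrough):

* Step 1 in syntax: fractional ranks of the original variable.
RANK VARIABLES=per_mean (A)
  /RFRACTION
  /PRINT=YES
  /TIES=MEAN.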
2. Inverse Distribution
2.1 Click Transform.
2.2 Click Compute Variable.
2.3 Type a name for the new (normalized) variable, for example per_mean_normal, in the Target Variable box.
2.4 Scroll down the Function Group list and click Inverse DF.
2.5 Scroll down Functions and Special Variables and double-click Idf.Normal.
2.6 IDF.NORMAL(?,?,?) will appear in the Numeric Expression box.
2.7 Fill in the (?,?,?) with the corresponding values. For the first ?, delete it and double-click the fractional rank variable (RFR001) found on the left side of the dialog box in the Type & Label section. Replace the second ? with the mean and the third ? with the standard deviation (step 3 below shows how to obtain them).
2.8 You now have the normalized values in a new column under the name you provided (per_mean_normal). You may run the normality test again on this new variable to check whether the values have been normalized.
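In syntax, this whole step is a single COMPUTE statement. The mean (3.15) and standard deviation (0.62) below are placeholders; substitute the values you obtain in step 3:

* Step 2 in syntax: map fractional ranks onto a normal distribution
* with the observed mean and standard deviation (placeholders shown).
* A fractional rank of exactly 0 or 1 lies outside the domain of the
* inverse CDF and yields a system-missing result.
COMPUTE per_mean_normal = IDF.NORMAL(RFR001, 3.15, 0.62).
EXECUTE.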
3. To Obtain the Mean and Standard Deviation
3.1 Click Analyze.
3.2 Click Descriptive Statistics, then Descriptives. A dialog box named Descriptives will open.
3.3 Click Options; this opens the Descriptives: Options dialog. Tick Mean and Std. deviation, then click Continue.
3.4 Click OK.
3.5 Check the SPSS output and read the Mean and Standard Deviation from the table.
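The syntax equivalent, again with per_mean as the placeholder variable:

* Step 3 in syntax: mean and standard deviation of the original variable.
DESCRIPTIVES VARIABLES=per_mean
  /STATISTICS=MEAN STDDEV.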
This is the result of the normality test after the data transformation (run the same test as before, this time on the new variable).
Although transforming the data helps normalize the values, the trade-off is that interpreting the results becomes more difficult. For example, if you run a t-test to check for differences between two groups and the data being compared have been transformed, you cannot simply say that there is a difference between the two groups' means on the original scale; there is an added step of interpreting the results on the transformed scale (for instance, back-transforming if a square-root transformation was used). For this reason, data transformations are not usually recommended unless necessary.
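For the rank-based transformation used here, one way to trace a transformed value back is to recover its fractional rank, since the normal CDF inverts IDF.NORMAL for a fixed mean and standard deviation. A sketch reusing the placeholder names and values from above:

* Recover the fractional rank behind a normalized value.
COMPUTE rfr_back = CDF.NORMAL(per_mean_normal, 3.15, 0.62).
EXECUTE.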
Is satisfying the Normality Assumption really necessary?
According to Stevens (2016) in his book Multivariate Statistics for Social Sciences, for analyses such as dependent and independent samples t-tests, ANOVA, MANOVA, and regression, violations of normality are acceptable for validity as long as the sample size exceeds 50. Therefore, non-normal data do not usually have much impact on validity.
References:
1. Berry, W. (1993). Understanding regression assumptions. Quantitative Applications in the Social Sciences, (92), 81-82.
2. Feingold, E. (2002). Regression-based quantitative-trait-locus mapping in the 21st century. Am J Hum Genet, 71, 217-222.
3. Stevens, J. (2016). Multivariate Statistics for Social Sciences (5th ed.). New York, NY: Routledge/Taylor & Francis Group.
4. Statistics Solutions. Transforming Data for Normality. https://www.statisticssolutions.com/transforming-data-for-normality/. Retrieved October 9, 2019.