Created
December 20, 2018 13:46
Which Transformation?
The main criterion in choosing a transformation is: what works with the data? As the examples above indicate, it is also important to consider two questions.
What makes physical (biological, economic, whatever) sense, for example in terms of limiting behaviour as values get very small or very large? This question often leads to the use of logarithms.
Can we keep dimensions and units simple and convenient? If possible, we prefer measurement scales that are easy to think about.
The cube root of a volume and the square root of an area both have the dimensions of length, so far from complicating matters, such transformations may simplify them. Reciprocals usually have simple units, as mentioned earlier. Often, however, somewhat complicated units are a sacrifice that has to be made.
When to Use What?
The most useful transformations in introductory data analysis are the reciprocal, logarithm, cube root, square root, and square. In what follows, even when it is not emphasised, it is supposed that transformations are used only over ranges on which they yield (finite) real numbers as results.
Reciprocal: The reciprocal, x to 1/x, with its sibling the negative reciprocal, x to -1/x, is a very strong transformation with a drastic effect on distribution shape. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all values are positive. The reciprocal of a ratio may often be interpreted as easily as the ratio itself. Examples:
population density (people per unit area) becomes area per person
persons per doctor becomes doctors per person
rates of erosion become time to erode a unit depth
(In practice, we might want to multiply or divide the results of taking the reciprocal by some constant, such as 1000 or 10000, to get numbers that are easy to manage, but that itself has no effect on skewness or linearity.)
The reciprocal reverses order among values of the same sign: largest becomes smallest, etc. The negative reciprocal preserves order among values of the same sign.
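The ordering behaviour described above can be checked with a small sketch. The density values below are hypothetical, chosen only for illustration, and the helper names are not from the original text.

```python
# Hypothetical population densities (people per square km), all positive.
density = [5.0, 50.0, 500.0, 5000.0]

# Reciprocal: square km per person.
area_per_person = [1.0 / d for d in density]

# Rescaling by a constant (here to square metres per person) only changes
# the units; it does not change order, skewness, or linearity.
sq_m_per_person = [1_000_000.0 * a for a in area_per_person]

# Negative reciprocal: same magnitudes with the sign flipped.
neg_reciprocal = [-1.0 / d for d in density]


def is_increasing(values):
    """True if each value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


def is_decreasing(values):
    """True if each value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


print(is_increasing(density))          # the original values are ascending
print(is_decreasing(area_per_person))  # the reciprocal reverses the order
print(is_increasing(neg_reciprocal))   # the negative reciprocal preserves it
```

The negative reciprocal is convenient when a plot or table should keep "larger" meaning larger after transformation.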
Logarithm: The logarithm, x to log_10 x, or x to log_e x or ln x, or x to log_2 x, is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It cannot be applied to zero or negative values. One unit on a logarithmic scale means a multiplication by the base of logarithms being used.
Exponential growth or decline,
y = a exp(bx),
is made linear by
ln y = ln a + bx,
so that the response variable y should be logged. (Here exp() means raising e, approximately 2.71828, the base of natural logarithms, to the power given.) An aside on this exponential growth or decline equation: put x = 0, and y = a exp(0) = a,
so that a is the amount or count when x = 0. If a and b > 0, then y grows at a faster and faster rate (e.g. compound interest or unchecked population growth), whereas if a > 0 and b < 0, y declines at a slower and slower rate (e.g. radioactive decay).
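As a minimal sketch of this linearisation, the following generates noise-free values from y = a exp(bx), logs the response only, and recovers a and b from a straight-line fit. The values a = 3 and b = 0.5 and the helper fit_line are assumptions made for illustration; real data would carry noise, and the fit would then only estimate the parameters.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical noise-free data from y = a exp(bx) with a = 3, b = 0.5.
true_a, true_b = 3.0, 0.5
xs = [float(i) for i in range(10)]
ys = [true_a * math.exp(true_b * x) for x in xs]

# Log the response only: ln y = ln a + b x is a straight line in x.
log_ys = [math.log(y) for y in ys]
intercept, slope = fit_line(xs, log_ys)

a_hat = math.exp(intercept)  # back-transform the intercept to recover a
b_hat = slope                # the slope estimates b directly

print(a_hat, b_hat)
```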
Power functions,
y = a x^b,
are made linear by
log y = log a + b log x,
so that both variables y and x should be logged. An aside on such power functions: put x = 0, and for b > 0, y = a x^b = 0, so the power function for positive b goes through the origin, which often makes physical or biological or economic sense. Think: does zero for x imply zero for y? This kind of power function is a shape that fits many data sets rather well.
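A short sketch of the log-log relationship, assuming an illustrative power function with a = 2 and b = 1.5 (values not from the original text): the slope between any two points on logged scales equals b, the logged intercept is log a, and the curve passes through the origin when b > 0.

```python
import math

# Hypothetical power function y = a x^b with a = 2, b = 1.5.
a, b = 2.0, 1.5


def power(x):
    """Evaluate y = a x^b for x >= 0."""
    return a * x ** b


# On log-log scales the slope between any two positive x values is b.
x1, x2 = 4.0, 25.0
slope = (math.log(power(x2)) - math.log(power(x1))) / (math.log(x2) - math.log(x1))

# The intercept on the log scale is log a, reached at x = 1 where log x = 0.
intercept = math.log(power(1.0))

# For b > 0 the curve passes through the origin.
y_at_zero = power(0.0)

print(slope, math.exp(intercept), y_at_zero)
```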
Consider ratios y = p / q where p and q are both positive in practice. Examples are:
Males / Females
Dependants / Workers
Downstream length / Downvalley length
Then y is somewhere between 0 and infinity, or in the last case, between 1 and infinity. If p = q, then y = 1. Such definitions often lead to skewed data, because there is a clear lower limit and no clear upper limit. The logarithm, however, namely
log y = log(p / q) = log p - log q,
is somewhere between -infinity and infinity, and p = q means that log y = 0. Hence the logarithm of such a ratio is likely to be more symmetrically distributed.
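The symmetry argument can be illustrated with a few hypothetical counts. Swapping p and q gives ratios that sit at unequal distances from 1, while their logarithms sit at equal distances from 0. The counts below are invented for the example.

```python
import math

# Hypothetical counts: a 2-to-1 imbalance in either direction, plus balance.
pairs = [(200, 100), (100, 200), (150, 150)]

for p, q in pairs:
    ratio = p / q
    log_ratio = math.log(p) - math.log(q)  # same as math.log(p / q)
    print(f"p={p} q={q} ratio={ratio:.3f} log ratio={log_ratio:+.3f}")

# Distances from the balance point (1 for the ratio, 0 for its logarithm).
raw_up, raw_down = 200 / 100 - 1, 1 - 100 / 200
log_up, log_down = math.log(200 / 100), -math.log(100 / 200)
```

On the raw scale the two imbalances look different in size (raw_up and raw_down differ); on the log scale they are mirror images.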
Cube root: The cube root, x to x^(1/3), is a fairly strong transformation with a substantial effect on distribution shape: it is weaker than the logarithm. It is also used for reducing right skewness, and has the advantage that it can be applied to zero and negative values. Note that the cube root of a volume has the units of a length. It is commonly applied to rainfall data.
Applicability to negative values requires a special note. Consider (2)(2)(2) = 8 and (-2)(-2)(-2) = -8. These examples show that the cube root of a negative number has negative sign and the same absolute value as the cube root of the equivalent positive number. A similar property is possessed by any other root whose power is the reciprocal of an odd positive integer (powers 1/3, 1/5, 1/7, etc.).
This property is a little delicate. For example, change the power just a smidgen from 1/3, and we can no longer define the result as a product of precisely three terms. However, the property is there to be exploited if useful.
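The delicacy shows up directly in software. As a sketch in Python (other languages behave differently), a plain fractional power of a negative number does not return the real root, so the sign has to be handled explicitly. The helper signed_root is a hypothetical name introduced here, not a library function.

```python
import math


def signed_root(x, n=3):
    """Real n-th root for an odd positive integer n, keeping the sign of x."""
    if n % 2 == 0:
        raise ValueError("n must be odd for negative inputs to make sense")
    # Take the root of the absolute value, then restore the original sign.
    return math.copysign(abs(x) ** (1.0 / n), x)


print(signed_root(8.0), signed_root(-8.0))  # roughly 2 and -2
print(signed_root(-32.0, n=5))              # roughly -2

# A plain fractional power does not do this in Python: the result for a
# negative base is a complex number, not the real root.
naive = (-8.0) ** (1.0 / 3.0)
print(type(naive).__name__)
```

Because 1/3 is not exactly representable in floating point, the exponent is already "a smidgen" away from one third, which is one way to see why the sign must be treated separately.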
Square root: The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
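The relative strengths described for the square root, cube root, logarithm, and reciprocal can be compared on one sample. The sketch below uses a hypothetical right-skewed sample that is evenly spaced on the log scale, and a simple moment-based skewness (third central moment divided by the cubed standard deviation). The sample and the helper skewness are assumptions for illustration; on other data the numbers will differ, and a strong transformation can overshoot into left skewness, as the negative reciprocal does here.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: evenly spaced on the log scale,
# so it is bunched near zero with a long tail of large values.
u = [-2.0 + 4.0 * i / 100 for i in range(101)]
x = [math.exp(v) for v in u]

# Transformations in increasing order of strength.
transforms = {
    "none": lambda v: v,
    "square root": math.sqrt,
    "cube root": lambda v: v ** (1.0 / 3.0),
    "logarithm": math.log,
    "negative reciprocal": lambda v: -1.0 / v,
}

skew = {name: skewness([f(v) for v in x]) for name, f in transforms.items()}
for name, value in skew.items():
    print(f"{name:20s} {value:+.3f}")
```

The negative reciprocal is used rather than the reciprocal so that all transformed values keep the same order as the original values.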
Square: The square, x to x^2, has a moderate effect on distribution shape and it could be used to reduce left skewness. In practice, the main reason for using it is to fit a response by a quadratic function y = a + b x + c x^2. Quadratics have a turning point, either a maximum or a minimum, although the turning point in a function fitted to data might be far beyond the limits of the observations. The distance of a body from an origin is a quadratic if that body is moving under constant acceleration, which gives a very clear physical justification for using a quadratic. Otherwise quadratics are typically used solely because they can mimic a relationship within the data region. Outside that region they may behave very poorly, because they take on arbitrarily large values for extreme values of x, and unless the intercept a is constrained to be 0, they may behave unrealistically close to the origin.
Squaring usually makes sense only if the variable concerned is zero or positive, given that (-x)^2 and x^2 are identical.
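To see whether the turning point of a fitted quadratic falls inside the data, one can set the derivative b + 2 c x to zero, which gives x = -b / (2c). The sketch below uses hypothetical coefficients and a hypothetical observed range; turning_point is an illustrative helper, not part of any library.

```python
def turning_point(b, c):
    """x at which y = a + b*x + c*x**2 has its maximum or minimum."""
    if c == 0:
        raise ValueError("no turning point: the function is linear")
    return -b / (2.0 * c)


# Hypothetical fitted coefficients and observed range of x.
a, b, c = 1.0, 0.8, -0.01
x_min, x_max = 0.0, 20.0

x_turn = turning_point(b, c)
inside_data = x_min <= x_turn <= x_max
kind = "maximum" if c < 0 else "minimum"  # the sign of c decides which

print(x_turn, kind, inside_data)
```

With these assumed numbers the maximum lies well beyond the observed range, so the fitted curve describes only a rising, flattening relationship within the data.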