Monday, July 8, 2013

What's Wrong with This Picture?

Here are some examples of how not to display performance data. Remember: collecting and analyzing your performance data is only half the battle. The other, equally difficult, half is presenting your performance data and conclusions.

Example 1

This first example is an oldie but a baddie. It will also provide some context for the second example below.

Figure 1
See the problem?
The x-axis is logarithmic.
There's no visual warning about this distortion.
Even worse, it's base-2 logarithms.

Choosing a logarithmic scale causes the otherwise linear intervals on the axis to be transformed in a nonlinear way. That nonlinearity in the scale causes a distortion in the apparent "shape" of the data. It tends to expand the horizontal separation between data points near the origin and contract the horizontal separation between data points that are far away from the origin. In the case of Figure 1, logarithmically rescaling the x-axis results in curves that take on an artificial sigmoid or 'S' shape. The sigmoid distortion presents the wrong visual cue, which in turn can cause the reader (including yourself) to jump to the wrong conclusion.
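To put rough numbers on that stretching and squeezing, here is a minimal sketch in Python. The x-values and the saturating curve y = x / (1 + x) are hypothetical stand-ins, not the data behind Figure 1. The sketch measures the gaps between evenly spaced points once they are placed on a base-2 log axis, and then checks that the saturating curve, which has no 'S' shape on a linear axis, coincides with the logistic function when it is expressed in terms of ln(x). Changing the base of the logarithm only rescales the axis, so the same shape appears with base 2.

```python
import math

def log2_positions(xs):
    """Horizontal position of each x on a base-2 log axis."""
    return [math.log2(x) for x in xs]

def gaps(positions):
    """Distance between each pair of neighbouring positions."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical x-values: evenly spaced on a linear axis (gap of 1 everywhere).
xs = list(range(1, 17))
log_gaps = gaps(log2_positions(xs))

print(f"gap between x=1 and x=2 on the log2 axis:   {log_gaps[0]:.3f}")
print(f"gap between x=15 and x=16 on the log2 axis: {log_gaps[-1]:.3f}")

def saturating(x):
    """Hypothetical curve that rises and levels off, with no inflection in x."""
    return x / (1.0 + x)

def logistic(u):
    """The textbook 'S' curve."""
    return 1.0 / (1.0 + math.exp(-u))

# Substituting x = exp(u) turns the saturating curve into the logistic curve.
for u in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"u = {u:+.1f}  saturating = {saturating(math.exp(u)):.4f}"
          f"  logistic = {logistic(u):.4f}")
```

The first gap is much wider than the last, and the two columns in the loop agree, so the 'S' comes from the axis transformation rather than from the data.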

That's not to say you should never use a log scale, but there are provisos to consider before doing so.

  1. Only use it when it serves to illuminate something hidden in the data, e.g., exponentially distributed data will look linear on a plot with a log-scaled x-axis. I used this to great effect for analyzing Oracle query performance.
  2. Never use it merely for the convenience of compressing data with a large x-range into the available width of a plot window. Use multiple views.
  3. Never use it without alerting the reader that the axis is not a conventional linear scale, e.g., label it as log(x) instead of just x.
  4. Try to indicate the base of the logarithm, e.g., log2(x) or log10(x). Don't worry about using illegible subscripts. The idea is to make the label as visible as possible, not as correct as possible. The reader will figure out the base and other details once their attention has been drawn to it.

More detailed discussions can be found in my previous blog posts on this topic:

Example 2

This example is something I don't recall seeing before and it completely threw me, initially.

Figure 2
See the problem?
The x-axis looks linear (percentages) but it's not.
The sigmoid distortion looks similar to that in Figure 1, but it's not.

At first glance, your visual cortex interprets the x-axis as a linear scale because you see the values range between 1% and 100%. The 0% is not shown because the tick-marks occur at the midpoint of each interval.

On closer inspection, however, you can see that there are also 1%, 2%, 5%, 10% marks that are evenly spaced when their spacing should be roughly doubling. Moreover, the 100% mark is not 10 times farther along the axis than the 10% mark. That suggests some kind of log-scaling, which could indeed lead to a sigmoid shape in the data, just like Figure 1. But the right-hand end of the scale clearly has linear intervals. So, what kind of scaling is this?

Surprise! It's not any kind of mathematical scale transformation. It's the result of mindlessly plotting categorical data in Excel. Although the x-labels look like numerical values, they are not. They are names. There's no warning about this effect because it was probably quite unintentional. It would have been easier to spot this false distortion if Excel had labelled the intervals as "1%", "2%", etc.
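The difference between names and numbers can be sketched in a few lines of Python. The tick labels below are an assumption, loosely modelled on the description of Figure 2 rather than read off the actual chart, and the two helper functions are hypothetical illustrations of the two interpretations, not a model of Excel's internals. A categorical axis gives every label one equal-width slot in the order it appears, whereas a numeric axis places each label at its parsed value.

```python
# Hypothetical tick labels, loosely modelled on the description of Figure 2.
labels = ["1%", "2%", "5%", "10%", "20%", "30%", "40%", "50%",
          "60%", "70%", "80%", "90%", "100%"]

def categorical_positions(names):
    """Treat labels as names: one equal-width slot each, in order."""
    return list(range(len(names)))

def numeric_positions(names):
    """Treat labels as numbers: the position is the parsed percentage."""
    return [float(name.rstrip("%")) for name in names]

def gaps(positions):
    """Distance between each pair of neighbouring positions."""
    return [b - a for a, b in zip(positions, positions[1:])]

cat_gaps = gaps(categorical_positions(labels))
num_gaps = gaps(numeric_positions(labels))

# Each row shows the step from the previous label to this one.
for label, c, n in zip(labels[1:], cat_gaps, num_gaps):
    print(f"{label:>5}  categorical gap = {c}   numeric gap = {n:g}")
```

With these assumed labels, every categorical gap is identical, while the numeric gaps are small for the first few labels and uniform only from 10% onward. That matches the mixed appearance described above: evenly spaced low-end marks next to an apparently linear right-hand end.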

Exercise for the reader: If the x-axis is straightened out by using numerical values, what shape do the curves take?

I'll have more to say about techniques for presenting performance data in the upcoming Guerrilla Capacity Planning classes.