Part F: A Closer Look at....

Analyzing Survival Data

In this section, we'll use the survival experiment to illustrate that both qualitative and quantitative understanding are important, and to show how to appropriately use statistics at several levels. We'll analyze data taken from an earlier experiment.

A First Cut

Some of the data from Table 1 Ultraviolet Lethality and Mutations in Yeast are copied below. Run your eye down the figures and note that the larger the dose the fewer the survivors. Make this qualitative lesson explicit to inspire yourself to figure out more about it. This qualitative observation motivates both quantitative questions such as "How does survival depend on dose?" and also additional qualitative questions, such as "Are all yeast cells impaired or are some simply killed outright?" or "Are cells really killed or is their reproductive capability damaged?" Both types of questions are important. Here we will follow up the quantitative question.

UV Survival Data:

Exposure Time   Count     Count     Count     Concentration
(seconds)       Plate 1   Plate 2   Plate 3   Factor
0               129       127       119       1
5               101       140       109       1
10              96        82        62        1
15              39        32        29        1
15              298       357       322       10
20              149       122       128       10
25              52        38        52        10
30              22        27        24        10

These data show how important the dilution strategy is. If we had worked with the first dilution throughout, then after 30 seconds we'd probably have seen only 2 or 3 colonies/plate and not seen clearly that longer exposures have a greater effect. Alternatively, if we had used the last dilution for the 5 second exposure, then we'd have had to count more than 1,000 colonies!

On the other hand, using these differing dilutions can be confusing. The 300+ survivors at a 15 second dose should be compared to the 1,250+ survivors we would have gotten had we used the same dilution factor for the 0 second dose. Convert all the data to the same dilution factor to simplify the analysis. Here is the table we get if we convert all the data to the 10 dilution factor:

UV Survival Data (Converted to a single effective dilution factor) :

Exposure Time   Count     Count     Count
(seconds)       Plate 1   Plate 2   Plate 3
0               1290      1270      1190
5               1010      1400      1090
10              960       820       620
15              390       320       290
15              298       357       322
20              149       122       128
25              52        38        52
30              22        27        24
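
The conversion to a single dilution factor can be sketched in a few lines of Python. This is only an illustration: the names RAW_COUNTS, TARGET_FACTOR, and to_common_factor are our own and are not part of the original experiment notes. Each count is multiplied by the ratio of the target concentration factor to the factor actually used.

```python
# Raw plate counts: (exposure in seconds, [plate 1, plate 2, plate 3], concentration factor)
RAW_COUNTS = [
    (0, [129, 127, 119], 1),
    (5, [101, 140, 109], 1),
    (10, [96, 82, 62], 1),
    (15, [39, 32, 29], 1),
    (15, [298, 357, 322], 10),
    (20, [149, 122, 128], 10),
    (25, [52, 38, 52], 10),
    (30, [22, 27, 24], 10),
]

TARGET_FACTOR = 10  # express every row as if plated at concentration factor 10


def to_common_factor(counts, factor, target=TARGET_FACTOR):
    """Scale counts taken at `factor` to the counts expected at `target`."""
    return [count * target / factor for count in counts]


converted = [(t, to_common_factor(counts, factor)) for t, counts, factor in RAW_COUNTS]

for t, counts in converted:
    print(t, [round(c) for c in counts])
```

Rows already taken at factor 10 pass through unchanged, so the two 15 second rows can be compared directly.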

We are justified in "massaging the data" in this way because otherwise we might mislead people who haven't done the experiment, and who don't know or care about the details of the dilution factors.
To figure out what is going on, and to explain the results to other people, plot the data on a graph. As an example we have plotted the results for plate #2 in Figure 1. The difference between 25 and 30 second exposures on this graph is hard to see because the graph covers a big range: 1400 to 27. All the exposures are easier to see if you use a semi-log graph. On a semi-log graph, the vertical scale has the same distance between 1 and 10 as between 10 and 100, and between 100 and 1000. These same data are plotted, much more clearly, on a semi-log graph in Figure 2. We can see that all the data are different, but cannot yet draw any conclusion about how survival depends on dose. In fact, this graph suggests that a little UV actually helps colonies grow. If we plot ALL the data from the table on this graph, however, we learn more, as we see in Figure 3.

Figure 1: Linear plot of column two data.

Figure 2: Semi-log plot of column two data.

Figure 3: Semi-log plot of all data.
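
To see why the semi-log graph separates the 25 and 30 second points so much better, it may help to compute how much of the vertical axis lies between them on each kind of scale. The sketch below is our own illustration (the function axis_fraction is a made-up helper, not something from the original notes) and uses the plate #2 values 38 and 27 on an axis running from 27 to 1400.

```python
import math


def axis_fraction(a, b, lo, hi, log=False):
    """Fraction of a vertical axis running from lo to hi that separates a and b."""
    if log:
        # On a semi-log graph, positions are proportional to log10 of the value.
        a, b, lo, hi = (math.log10(v) for v in (a, b, lo, hi))
    return abs(a - b) / (hi - lo)


linear_gap = axis_fraction(38, 27, 27, 1400)
semilog_gap = axis_fraction(38, 27, 27, 1400, log=True)

print(f"linear axis:   {linear_gap:.1%} of the axis height")
print(f"semi-log axis: {semilog_gap:.1%} of the axis height")
```

The same two points sit roughly ten times farther apart on the semi-log axis, which is why they become easy to tell apart in Figure 2.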

Figure 3 is revealing. In the first place, survival more clearly depends on dose. The low dose behavior is still uncertain, but the idea that a little UV helps survival looks less likely. We need more data to study this question. For larger doses, the log of the surviving colonies clearly decreases roughly linearly as the dose increases. Note the smoothing effect of lots of data: each individual data point represents some random fluctuation just like the dilutions do, and each point also contains some experimental errors, but the errors and fluctuations push one point one way, another point in the opposite direction, and so the collection of points becomes much more useful than the individual ones.
This plot can effectively demonstrate the results from an entire class. It is actually a form of "statistics": we can roughly estimate a number of survivors at each dose, note how much variability there is in this number and, using that marvelous analytical engine, our brain, even fill in an approximate line through the data. The next step is to make these rough insights quantitative, but the three steps above are useful in themselves if you have limited time.

First, we have identified an effect to be studied.
Second, we've described the result qualitatively.
Third, we've presented all our data clearly on a useful graph.

A Second Cut

Presenting all the data at each dose gives us too much information because the experimental uncertainties in each point reported may obscure what is really going on. Also, you don't want to report excess information to people who just want to know how dangerous UV is to yeast cells. A useful tactic is to determine a single number at each dose that tells us how many survived. Most classes doing these experiments will decide to "use the average" at each dose for that single number. If all students use the same procedure then this is a good strategy. Returning to our "raw data" we produced this table of the average number of colonies/plate versus dose:

UV Survival Data:

Exposure Time   Average Colonies   Concentration
(seconds)       per Plate          Factor
0               125                1
5               117                1
10              80                 1
15              33                 1
15              325                10
20              133                10
25              47                 10
30              24                 10

But what should we do if five runs at an exposure get 22, 25, 26, 31 and 196 cells? Should we just average these five and say the average is 60? Doesn't this look a little misleading? Discussion will probably suggest that the odd result is likely to be a wrong dilution, or some other procedural problem, and ought to be left out. This may be a proper time to "massage the data" because we should not report data that we have reason to believe are wrong. However, we should first re-check that exposure, keep the odd result in mind, and consider the possibility that it reflects an interesting unanticipated effect.
To understand this better, consider the results of the 5 second exposure in our data. The points are 101, 109, and 140. The 140 is more than any of the 0 second plates, and 140 is really quite far from the 117 average. If we omitted this number, the drop from 0 to 5 seconds would appear more like the other changes. Should we drop it? Before deciding this, we'll calculate the other quantity that we report when using only averages instead of all the data: the "standard deviation."
The standard deviation measures how variable the data are. For example, the standard deviation of the closely bunched 0 second data is just over 4, while that for the 5 second data is 17. You can figure out the standard deviation using the 5 second data as follows.

First, find the average:

(109 + 101 + 140)/3 = 117 cells.

Next, find out how far each point is away from the average and square it:

(109 - 117)^2 = (-8)^2 = 64 cells^2,

(101 - 117)^2 = (-16)^2 = 256 cells^2,

(140 - 117)^2 = (+23)^2 = 529 cells^2.

You square these deviations from the average to get a positive number that measures the deviation. The average of these numbers is

(64 + 256 + 529)/3 = 283 cells^2

and is called the "variance" of the data. To get rid of the peculiar "cells squared," we take the square root:

sqrt(283 cells^2) = 17 cells (rounded to the nearest cell)

For practice, figure out the standard deviation of the 0 second data (the answer is 4.3). Most spreadsheets have statistical functions, and you should experiment with yours to learn how to ask your computer for the average and standard deviation of a bunch of numbers. (Be forewarned that some programs offer two types of standard deviation, and only one of them works here: the one that divides by the number of data points, as in the calculation above, rather than by one less than that number. For reference, we used Lotus 1-2-3.)
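
If you work in Python rather than a spreadsheet, the standard library's statistics module offers both types of standard deviation. The sketch below is our own illustration, not the authors' procedure: pstdev divides by the number of plates, as in the hand calculation, while stdev divides by one less and so gives a larger answer.

```python
from statistics import mean, pstdev, stdev

zero_second = [129, 127, 119]
five_second = [101, 109, 140]

for label, counts in (("0 s", zero_second), ("5 s", five_second)):
    avg = mean(counts)
    # pstdev divides the summed squared deviations by n, like the hand calculation.
    divide_by_n = pstdev(counts)
    # stdev divides by n - 1 instead, so it is larger for the same data.
    divide_by_n_minus_1 = stdev(counts)
    print(f"{label}: average {avg:.1f}, "
          f"SD (divide by n) {divide_by_n:.1f}, "
          f"SD (divide by n - 1) {divide_by_n_minus_1:.1f}")
```

Only the divide-by-n version reproduces the 4.3 and 17 quoted in the text.
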
Now we can construct a table of both the average number of surviving cells and the experimental variability in this number for each dose.

UV Survival, including averages and standard deviation:

Exposure Time   Average Colonies   Standard    Concentration
(seconds)       per Plate          Deviation   Factor
0               125                +/- 4.3     1
5               117                +/- 17      1
10              80                 +/- 14      1
15              33                 +/- 4.2     1
15              325                +/- 24      10
20              133                +/- 12      10
25              47                 +/- 7       10
30              24                 +/- 2       10

Now, should we retain or throw away the 140 cell plate in the 5 second exposure? The standard deviation of 101, 109, and 140 is 17, so 140 is just about 1 1/2 standard deviations away from the average. This is not very far; in fact, in typical experiments about 1/3 of the points will be more than one standard deviation away from the average. On the other hand, if the data for 30 seconds had been 22, 25, 26, 31, and 196, then the average would be 60 and the standard deviation about 68, both inflated by the one odd plate. The other four plates average 26 with a standard deviation of 3.2, so 196 lies more than 50 of their standard deviations above them. Such a result is extremely unlikely, and so we would drop that point and report only on the other four. Thus we would report an average of 26 and standard deviation of 3.2. (See Figure 4 for a plot of the data in this table and the standard deviations used as error bars.)
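
The numbers used in this decision are easy to check. The following is a minimal sketch with variable names of our own choosing. It measures how far the 140 plate sits from the 5 second average, in units of the standard deviation, and then recomputes the summary for the hypothetical 30 second data with the odd plate set aside.

```python
from statistics import mean, pstdev

five_second = [101, 109, 140]
# Distance of the 140 plate from the average, in standard deviations.
z_140 = (140 - mean(five_second)) / pstdev(five_second)
print(f"140 is {z_140:.1f} standard deviations above the 5 s average")

hypothetical = [22, 25, 26, 31, 196]   # the made-up 30 second data from the text
kept = [count for count in hypothetical if count != 196]

print("average with the odd plate:", mean(hypothetical))
print("average without it:", mean(kept))
print(f"standard deviation without it: {pstdev(kept):.1f}")
```

The code only does the arithmetic; whether to drop a point remains a judgment about the procedure that produced it.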

The two sets of data at 15 seconds furnish a final lesson about errors, the usefulness of more data, and the justifiable massaging of data. The standard deviation of the dilution 1 data is about 0.13 of the mean; therefore, we only know the answer to within about 13%. The dilution 10 data are, as is typical, more accurate: we know the mean to about 7%. The larger the mean, the smaller the relative error (= error/mean).

We should probably use the more accurate data, but consider this question before discarding the smaller numbers: How accurately do you suppose the plates with 300+ colonies were counted? Mistakes easily occur when too many colonies grow on a plate; we may miss colonies or have two colonies counted as one because they are growing right on top of each other. If an average of 500 colonies is growing on some plates, we would expect a lot of experimental error because the number of colonies would probably be systematically under-counted. This kind of error is called "systematic error." It is an error that creeps into our data because of problems with our procedure. How would you choose between data with a mean of 500 +/- 25 or of 50 +/- 8? We would probably accept the 50 even though the relative error is 8/50 = 0.16, because the data with the 500 mean is likely to contain a big systematic error.

On the other hand, a further dilution that gives data with a mean of 5 +/- 2 gives even poorer data. Although zero error occurs in counting the colonies, the relative error due to statistical fluctuations is 0.40. This sort of error, called "statistical error," is a big problem when you are dealing with small numbers. We could therefore discard the plates giving 5 +/- 2 because of the large statistical error and just report the 50 +/- 8 data. We should be conscious of statistical and systematic errors when selecting data, and consider which of our procedures are likely to give the most reliable data.
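
The relative errors quoted above are just the standard deviation divided by the mean. As a small sketch (relative_error and the labels are our own names), the comparison can be tabulated like this:

```python
def relative_error(mean_count, sd):
    """Relative error = error / mean."""
    return sd / mean_count


# (label, mean colonies per plate, standard deviation), taken from the text
cases = [
    ("15 s, dilution 1", 33, 4.2),
    ("15 s, dilution 10", 325, 24),
    ("crowded plates", 500, 25),
    ("moderate plates", 50, 8),
    ("sparse plates", 5, 2),
]

for label, mean_count, sd in cases:
    print(f"{label:18s} {mean_count:4d} +/- {sd:<4} relative error {relative_error(mean_count, sd):.2f}")
```

Note that this number captures only the statistical scatter. It says nothing about the systematic under-counting expected on crowded plates, which is why the smallest relative error is not automatically the best data.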

Discard the widespread prejudice that all data are sacred, but don't fall into the temptation to discard results just a little bit higher or lower than you were expecting. You would then report only the data that reinforce your expectations. Do not discard data because they disagree with a pre-existing theory, but subject all results to the same rigorous scrutiny; accept nothing at face value. If you think that a procedure was faulty, then it is only honest to discard the data gathered using that procedure. No magic rule governs such decisions; we just have to think hard and be honest.

Figure 4: Semi-log plot of survival with a straight line fit.

A Third Cut

We should also figure an appropriate standard deviation even when our data appear too "good." In this experiment, the variability in the data at any dose is caused by two factors: variability in technique and procedure, and the unavoidable fluctuations we noted in the serial dilution experiment. (See the notes on statistics and the serial dilution experiment.) If we recall that a fluctuation is roughly the square root of the expected number of cells, and look at the table above, we realize that the 0 second data are too closely bunched; we must have been lucky. When we report these results or try to analyze them, we should report at least the expected variability. In the next table therefore, we report standard deviations at least as big as the square root of the average number of colonies on the plate.

UV Survival Averages and Estimated Errors Including Fluctuations

Exposure Time   Average Colonies   Standard    Concentration
(seconds)       per Plate          Deviation   Factor
0               125                +/- 12      1
5               117                +/- 17      1
10              80                 +/- 14      1
15              325                +/- 25      10
20              133                +/- 12      10
25              47                 +/- 7       10
30              24                 +/- 5       10
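
The rule behind this table, report the measured standard deviation unless it falls below the square root of the average count, can be written as a one-line function. This is a sketch under that reading of the text; reported_sd is our own name, and the table's entries are rounded by hand.

```python
import math


def reported_sd(mean_count, measured_sd):
    """Use the measured scatter, but never less than the expected fluctuation sqrt(mean)."""
    return max(measured_sd, math.sqrt(mean_count))


# (exposure in seconds, average colonies per plate, measured standard deviation)
rows = [(0, 125, 4.3), (5, 117, 17), (10, 80, 14), (30, 24, 2)]

for t, avg, sd in rows:
    print(f"{t:2d} s: measured +/- {sd}, reported +/- {reported_sd(avg, sd):.1f}")
```

Rows whose measured scatter already exceeds the square root, such as the 5 and 10 second data, are left unchanged.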

We can work with the results of two or more classes by calculating the fraction of cells that survive. Another class using a different starting suspension of cells might begin with an average of 170 cells/plate, so all their numbers would differ from ours. The fraction that survive at each dose, however, can be compared.

We've compared our results with those from other experiments in a final table:

Surviving Fraction

                Survival With Dilution Correction   Surviving Fraction
Exposure Time   Average Colonies   Standard         Average   Standard
(seconds)       per Plate          Deviation                  Deviation
0               125                +/- 12           1.00      +/- 0.10
5               117                +/- 17           0.94      +/- 0.14
10              80                 +/- 14           0.64      +/- 0.11
15              32.5               +/- 2.5          0.26      +/- 0.02
20              13.3               +/- 1.2          0.11      +/- 0.01
25              4.7                +/- 0.7          0.04      +/- 0.01
30              2.4                +/- 0.5          0.02      +/- 0.01
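
The fraction columns appear to come from dividing each average, and its standard deviation, by the 0 second average of 125. The sketch below makes that assumption explicit. It ignores the uncertainty in the 0 second average itself, which a more careful error analysis would include.

```python
CONTROL_MEAN = 125  # average colonies per plate with no UV exposure

# (exposure in seconds, dilution-corrected average, standard deviation)
survival = [
    (0, 125, 12),
    (5, 117, 17),
    (10, 80, 14),
    (15, 32.5, 2.5),
    (20, 13.3, 1.2),
    (25, 4.7, 0.7),
    (30, 2.4, 0.5),
]

# Assumption: both the average and its error are simply scaled by the control mean.
fractions = [(t, avg / CONTROL_MEAN, sd / CONTROL_MEAN) for t, avg, sd in survival]

for t, fraction, fraction_sd in fractions:
    print(f"{t:2d} s: {fraction:.2f} +/- {fraction_sd:.2f}")
```

Another class would substitute its own 0 second average for CONTROL_MEAN, after which the two sets of fractions can be compared directly.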

The surviving fraction is plotted in Figure 5. The graph starts out with a bit of a shoulder and then, after 10 seconds of exposure, becomes a fairly straight line. We need more data to study the dose dependence between 0 and 10 seconds, although most likely it is not constant. After 10 seconds a straight line represents the data well. The line in Figure 5 was drawn by Lotus 1-2-3 using linear regression on the last five points, although we could have done as well using a ruler and our eyeball. Many other curves will also fit these data. We are not creating a theory when we draw a line through the points; we are just fitting data, whether we "eyeball it," calculate the "best fit" ourselves, or ask a statistical program to do a "linear regression." The straight line portion of the graph after 10 seconds, which represents our data simply and clearly, has several uses.
One use is to suggest a theory. A downward sloping straight line on a semi-log graph means "exponential decay." Just as the UV intensity drops exponentially as it passes through the ozone layer, so do these cells die off exponentially when exposed to more and more UV. Thus, after the first 10 seconds, only about 0.4 of the cells survive each successive 5 seconds of radiation. At 10 seconds there are 80 cells; 0.4 of these = 32 survive until 15 seconds; 0.4 of these = 13 survive until 20 seconds; 0.4 of these = 5 survive until 25 seconds; and 0.4 of these = 2 survive until 30 seconds. The straight line on the semi-log plot suggests this model of events.
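
For readers who want to calculate the "best fit" themselves, here is a minimal least-squares sketch in Python. It is not a reproduction of the Lotus 1-2-3 calculation: it fits a straight line to log10 of the surviving fraction for the last five points of the table, using the textbook formulas for slope and intercept, and then converts the slope into the fraction surviving each additional 5 seconds.

```python
import math

# Surviving fractions from the final table, 10 seconds onward (the straight-line region).
points = [(10, 0.64), (15, 0.26), (20, 0.11), (25, 0.04), (30, 0.02)]

xs = [t for t, _ in points]
ys = [math.log10(f) for _, f in points]  # a semi-log plot is linear in log10(fraction)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Ordinary least-squares slope and intercept for y = intercept + slope * x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fraction of cells surviving each additional 5 seconds, implied by the slope.
survival_per_5s = 10 ** (5 * slope)

print(f"slope = {slope:.4f} per second")
print(f"fraction surviving each extra 5 s = {survival_per_5s:.2f}")
```

The fitted value lands close to the 0.4 read off the graph, which is all the agreement the rounded table entries can support.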

Figure 5: Semi-log plot of the surviving fraction with a straight line fit.

Another use of the line is to economically express our data: after 10 seconds the fraction that survive is

(fraction at 10 seconds) x (0.4)^(time past 10 seconds/5)

=(0.64) x (0.4)^((t-10)/5).

The term "exponential decay" derives from this expression. The exposure time is in the exponent; it is not just a multiplier.
This form also gives us the useful concept "LD10" for "lethal dose for 10 percent survival." The exposure time needed to get a surviving fraction = 0.1 is about 20 seconds because if we use t = 20 in the above formula the exponent is (20-10)/5 = 2 and the formula becomes (0.64) x (0.4)^2 = 0.10.
However we use the straight line fit to our data, the data themselves provide a valuable monitor of the biologically-important UV intensity where you live. Both the straight line fit and the graph itself reveal that only 0.1 of the cells survive an exposure of 20 seconds to this lamp, the "lethal dose for 10 percent survival," or LD10. When you do this experiment outdoors, you may find that on one day a 4-minute exposure leaves only 0.1 of the cells surviving. Four minutes in the sun would be the LD10, providing the same amount of UV exposure as 20 seconds of the lamp used in this experiment. On the next day the LD10 might be only 3 minutes, suggesting that the UV is more intense than the day before.
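
The exponential decay formula and the LD10 can both be evaluated directly. In this sketch the function names are our own, and the constants 0.64 and 0.4 are the rounded values used in the text, so the result is only as good as those roundings. The LD10 is found by solving the formula for the time at which the fraction equals 0.1.

```python
import math

FRACTION_AT_10S = 0.64   # surviving fraction at 10 seconds, from the table
SURVIVAL_PER_STEP = 0.4  # fraction surviving each additional 5 seconds
STEP_SECONDS = 5.0


def surviving_fraction(t):
    """Straight-line (exponential decay) fit; only meant for t >= 10 seconds."""
    return FRACTION_AT_10S * SURVIVAL_PER_STEP ** ((t - 10) / STEP_SECONDS)


def exposure_for_fraction(target):
    """Invert the formula: exposure time in seconds that leaves `target` surviving."""
    steps = math.log(target / FRACTION_AT_10S) / math.log(SURVIVAL_PER_STEP)
    return 10 + STEP_SECONDS * steps


ld10 = exposure_for_fraction(0.1)
print(f"fraction surviving 20 s: {surviving_fraction(20):.2f}")
print(f"LD10: {ld10:.1f} s")
```

Repeating the fit with outdoor data gives a second LD10, and the ratio of the two exposure times compares the sun with the lamp.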

There are other statistical notions that are useful in special situations, such as "confidence intervals" or "chi-square," but these are not appropriate for a first experience with statistics. (For an application of chi-square testing, see the notes on statistics in our photoreactivation experiment.) When these notions are combined with curve-fitting, we are usually trying to pin down an underlying theory with several parameters. Here we are just starting down that road; our data suggest that there might exist an underlying theory of radiation damage that gives roughly a straight line, but assuming that there is such a theory, and setting confidence levels for its parameters, is premature!

SUMMARY:

  1. Observe effects and investigate causes.
  2. Describe relationships qualitatively.
  3. Plot all results on graphs; experiment with linear and semi-log plots.
  4. Make graphs of averages using standard deviations as error bars, and consider other sources of error.
  5. Look for the "best fit" between a simple curve (often a straight line on some graph) and the data. First, use your eye, then use "least squares."
  6. If a theory exists and specifies parameters for the curve in #5, then you may be able to determine a "confidence interval" for some of the parameters. This is not the standard situation in science: a glance through Physical Review or Genetics will reveal lots of error bars and few confidence levels, though when available these are important.

Last updated Wednesday, 04-Dec-2002 20:16:56 UTC