4 Manual Delineation Experiments
This chapter describes two experiments undertaken to estimate the accuracy obtainable when wound area is measured by manually delineating the boundary of a wound image with a standard PC mouse. Error in area measurements obtained in this way is tentatively attributed to subjective appraisal combined with the limits of manual dexterity and the mechanical quality of the mouse. Subjective appraisal allows differing opinions on the location of the boundary and thus produces bias between the area measurements of different delineators. Error due to manual dexterity and mechanical imperfections of the equipment is expected to manifest as random measurement noise. These effects are investigated in two experiments. The first measures the delineation accuracy (as bias and precision of area measurements) of several volunteer delineators over some widely differing wound images. The second quantifies the effect of the displayed size of a wound upon delineation performance.
4.1 General Assessment of Manual Delineation Performance
4.1.1 Objectives
Notes
4.1.2 Experimental Set-up and Procedure
A set of ten wound images was selected from an existing wound-image library for the experiment and subjectively classified into three groups. Three of the images possessed subjectively good wound boundary contrast and relatively uniform intensity in and around the wound site. Four images were more typical of the wound images in the library: the contrast at the wound boundary was generally lower and more varied than that of the first three images, the surrounding skin included granulation tissue, some slough was present and the scene illumination was evidently non-uniform. Finally, the last three images were poorly defined, with little discernible boundary contrast and poor illumination.

Five subjects with experience in examining wound images were asked to use a graphical computer program which allowed them to delineate the wound boundary in each image under control of a standard mouse. Each delineator was asked to make a total of five delineation attempts at each of the ten wounds. The program was written in the ‘C’ programming language using Borland’s ‘Object Windows’ Library for developing applications to run under the Microsoft Windows™ operating system. The software was capable of displaying colour images of wounds in a 'true colour' graphics mode (approx. 16 million simultaneous colours). The result of each delineation attempt was an image of wound boundary pixels forming a closed border. The area of the wound was then measured by counting the number of pixels lying inside or on the boundary of the wound.
Two effects were identified in advance which could affect the results:
4.1.3 Hypothesis Test for Bias: Analysis of Variance Fixed-Effects Model
Analysis of Variance (ANOVA) is a standard statistical method for testing the equality of several population means given some basic assumptions about the nature of the data:
It has been shown by Box (1954) that for fixed effects ANOVA, the F-test for differences among means is "most robust for α", where α is the significance level of the test. When the sample sizes are equal, Neter et al. (1996) state that the effect of unequal variances is to slightly raise the significance level of the test. They also state that the F-test is robust against non-normality of the residuals provided that the departure from normality is not too serious, with kurtosis having a greater effect than skewness upon the significance level. However, confidence intervals for single comparisons between factor-level means can be greatly affected by non-normality. The most serious infringement of the assumptions occurs when the residuals are not independent of each other. Neter et al. (1996) suggest that "randomisation when serial correlations are expected can be a good insurance policy."

Outliers affect estimates of both factor-level means and, more strongly, variances. In the present case, if one of the five samples is affected by a single outlier, this can exaggerate the variance for that sample and lead to a conclusion of differing variances. There are many types of residual plots that can be used to visually detect the presence of outliers, e.g. dot-plots, stem-and-leaf plots and normal-probability plots. Complementary to the subjective analysis of such plots is the Bonferroni test for outliers using studentised deleted residuals (Neter et al., 1996). This test is useful in cases where graphical methods to identify outliers are considered inconclusive.
Considering the foregoing discussion, the proposed pre-ANOVA analysis to test for serious violation of the assumptions, stated above, is as follows:
4.1.3.1 The ANOVA Bias-Test Model
In the case of each of the ten individual wound images, the null hypothesis states that all delineators measure the same wound area. Specifically, the null hypothesis, H0, and its alternative, H1, may be stated as:
(4.1.1)
H0 : All Ai are equal for i=1..5 (delineators)
H1 : Not all Ai are equal
where Ai is the ‘true’ wound area measured by delineator i.
For any particular wound image, let aij denote the jth delineated wound area observation of delineator i. The test for H0 is thus conducted using the aij data in the standard manner for an ANOVA fixed-effects model.
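The F-statistic used to test H0 can be computed directly from the aij observations. The following is a minimal sketch, assuming five delineators with five measurements each; the function name `one_way_anova_f` and the area values are hypothetical and are not the experimental data. The critical value 2.87 is the F(0.95; 4, 20) figure quoted with Table 4.2.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way fixed-effects ANOVA.

    groups: one list of area measurements per delineator.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Between-delineator and within-delineator sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical pixel-count areas: 5 delineators x 5 attempts on one wound.
areas = [
    [8400, 8455, 8390, 8510, 8470],
    [8600, 8720, 8650, 8580, 8690],
    [8300, 8250, 8340, 8410, 8280],
    [8500, 8530, 8460, 8490, 8550],
    [8800, 8760, 8850, 8790, 8830],
]
f_stat, df1, df2 = one_way_anova_f(areas)
F_CRIT_5PCT = 2.87  # F(0.95; 4, 20), as quoted with Table 4.2
print(f"F({df1},{df2}) = {f_stat:.2f}; reject H0: {f_stat > F_CRIT_5PCT}")
```

With equal sample sizes, as here, the within-delineator mean square is the MSE that is later pooled for the precision estimates.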
4.1.3.2 Pre-ANOVA Checks for Violation of Assumptions
(a) Test for Outliers and Normality of Residuals
For each wound, all N=25 delineated area measurements (aij) are pooled to produce a plot of residual value versus expected normal probability with a reasonable number of data points. The residuals are calculated with (4.1.2) and the expected normal probabilities are estimated using (4.1.4). The expected normal probabilities should be plotted on normal-graph paper. Alternatively, the probabilities may be transformed for plotting on a linear axis scale by calculating the standardised ‘z’ value, z(Pk), of the expected probability (Wadsworth, 1990). The pooled plot is constructed by first ordering the eij over the entire set of residuals according to magnitude (4.1.3) and subsequently plotting ek against either Pk or z(Pk).
(4.1.2) |
|
(4.1.3) |
|
where k=Rank(eij), k=1..N. |
(4.1.4) |
|
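The construction of the pooled normal probability plot can be sketched as follows. Two assumptions are made that the text does not confirm: each residual is taken as the observation minus its own delineator's mean, and the expected probability uses the plotting position (k − 0.375)/(N + 0.25), which is one common choice and may differ from the form used in (4.1.4). The function name is hypothetical.

```python
from statistics import NormalDist, mean

def normal_plot_points(groups):
    """Return (e_k, P_k, z(P_k)) triples for a pooled normal probability plot.

    groups: one list of area measurements per delineator.
    """
    # Residuals about each delineator's own mean (assumed form of the residual).
    residuals = [x - mean(g) for g in groups for x in g]
    ordered = sorted(residuals)            # e_k, k = Rank(e_ij)
    n = len(ordered)
    std_normal = NormalDist()              # mean 0, standard deviation 1
    points = []
    for k, e_k in enumerate(ordered, start=1):
        p_k = (k - 0.375) / (n + 0.25)     # assumed plotting position
        points.append((e_k, p_k, std_normal.inv_cdf(p_k)))
    return points

# Hypothetical data: 2 delineators x 3 attempts.
for e_k, p_k, z_k in normal_plot_points([[1, 2, 3], [4, 6, 8]]):
    print(f"{e_k:6.2f}  {p_k:.3f}  {z_k:+.3f}")
```

Plotting e_k against z(P_k) on linear axes should give an approximately straight line when the residuals are normally distributed; an outlier shows up as a point well off that line at one end.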
(b) Inequality of Variances
Several hypothesis tests for checking the equality of variances have been published, including Bartlett’s test (Montgomery, 1991), Hartley’s test, the Levene test and the Modified Levene test (Neter et al., 1996), Bartlett and Kendall’s ‘log s²’ test and the Burr-Foster ‘Q’ test (Anderson and McLean, 1974). The first three tests are considered sensitive to deviations from normality whereas the modified Levene test is considered robust. The ‘log s²’ test requires at least two samples for each factor level, although a sample containing a minimum of 10 observations could be randomly split into two samples.

The Burr-Foster ‘Q’ statistic for comparing the variances of N samples is considered robust to non-normality of the error terms. When the samples are of equal size, the test statistic, Q, is defined as:
(4.1.5)
$$Q = \frac{\sum_{i=1}^{N} s_i^4}{\left(\sum_{i=1}^{N} s_i^2\right)^2}$$
where si² is the ith sample variance estimate.
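As an illustration, the Q statistic for equal-sized samples can be computed as the sum of squared sample variances divided by the square of their sum. This is a sketch under that assumed form; the function name and the data are hypothetical. The critical value 0.443 is the 1% figure for N=5 samples quoted with Table 4.1.

```python
from statistics import variance

def burr_foster_q(groups):
    """Burr-Foster Q for equal-sized samples: sum(s_i^4) / (sum(s_i^2))^2."""
    variances = [variance(g) for g in groups]   # unbiased sample variances
    return sum(v ** 2 for v in variances) / sum(variances) ** 2

# Hypothetical areas: the last delineator is far more variable than the rest.
areas = [
    [8400, 8455, 8390, 8510, 8470],
    [8600, 8720, 8650, 8580, 8690],
    [8300, 8250, 8340, 8410, 8280],
    [8500, 8530, 8460, 8490, 8550],
    [8800, 8300, 9100, 8500, 9000],
]
q = burr_foster_q(areas)
Q_CRIT_1PCT = 0.443  # 1% critical value for N=5 samples, as quoted with Table 4.1
print(f"Q = {q:.3f}; variances differ at 1%: {q > Q_CRIT_1PCT}")
```

Q takes its minimum value of 1/N when all sample variances are equal and approaches 1 as a single sample dominates, which is why large values indicate unequal variances.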
4.1.3.3 Inspection of Images for Regions of Ambiguity: Median Boundary Method
Where biases are deemed to exist between delineators, that is, when H0 (4.1.1) is rejected, the ‘Median Boundary’ test described here provides a robust way to analyse which regions of each image are ambiguously interpreted. In total, each wound image will be delineated 25 times, possibly with different regions of the wound site included or excluded from the definition of the wound. The test is designed to be largely unaffected by random errors and by delineations which give rise to outlying area measurements.

The method is based on the principle of 'vote taking'. It counts the number of times each pixel is included in a wound definition by all delineators, using binary mask images produced by filling the delineated boundary images. Thus a pixel which is never included by any delineator will have a score (vote) of zero, whereas a pixel which is always included by all delineators will have a maximum score which is set by the total number of area measurements and hence mask images. Pixels with a score of one are least-probable wound pixels, and pixels with a maximum score are most-probable wound pixels. Far outside of the wound the pixel scores will be zero. The centre of the wound should be a region of maximum score pixels. Near the boundary of the wound the scores of the pixels will lie between these extremes and the problem becomes one of setting a ‘likelihood’ threshold to discriminate between less likely and more likely wound pixels. A simple criterion used here is to select as wound pixels only those which have a score of at least 50% of the maximum. This ensures a degree of robustness in the result since neither a rigid intersection of sets nor a loose union of sets will be robust in the presence of random errors, the union over-estimating the area and the intersection under-estimating the area. Also, in the cases of single or infrequent events the union will be affected by outward placement of the boundary and the intersection by inward placement of the boundary. The summation, R(x,y), of N binary regions Pn(x,y) is given by:
(4.1.6)
$$R(x,y) = \sum_{n=1}^{N} P_n(x,y)$$
where
$$P_n(x,y) = \begin{cases} 1 & \text{if pixel } (x,y) \text{ lies within delineation } n \\ 0 & \text{otherwise} \end{cases}$$
i.e. the set of all pixels marked as wound pixels.
The median region M(x,y) is the thresholded set of pixels, where t is the threshold:
(4.1.7)
$$M(x,y) = \{(x,y) : R(x,y) \ge t\}, \qquad t = \frac{N+1}{2}$$
|
Figure 4.1 Regions formed by selecting various thresholds for boundary analysis |
Figure 4.1 demonstrates the practical formation of the median region. The median region is formed when t=(N+1)/2 as in (4.1.7). The general intersection and general union of sets in Figure 4.1 are formed by setting t in (4.1.7) to N and 1 respectively. With this method one can include all 25 delineations for each wound to produce the most likely wound boundary, and one should expect its included area to be near to the median area of the 25 area measurements. The median boundary will be based on only five measurements when highlighting the differences between the boundaries delineated by the two delineators who produce the lowest and highest mean area measurements. The result in this case will be a less robust estimate, but will still allow the regions of ambiguity to be identified.
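The vote-taking scheme can be sketched with binary masks held as nested lists. The function names and the tiny one-row masks below are hypothetical; real masks would be the filled boundary images, one per delineation.

```python
def vote_map(masks):
    """R(x,y): number of masks that mark each pixel as wound."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[y][x] for m in masks) for x in range(cols)]
            for y in range(rows)]

def threshold_region(masks, t):
    """Binary region of pixels whose vote count is at least t."""
    return [[1 if votes >= t else 0 for votes in row]
            for row in vote_map(masks)]

def median_region(masks):
    """Region selected with the median threshold t = (N + 1) / 2."""
    return threshold_region(masks, (len(masks) + 1) / 2)

def area(region):
    """Area as a count of marked pixels."""
    return sum(sum(row) for row in region)

# Three hypothetical 1x5 masks of the same wound.
masks = [
    [[0, 1, 1, 1, 0]],
    [[0, 0, 1, 1, 1]],
    [[0, 1, 1, 0, 0]],
]
n = len(masks)
print(vote_map(masks))                       # [[0, 2, 3, 2, 1]]
print(area(threshold_region(masks, n)))      # intersection: 1
print(area(median_region(masks)))            # median region: 3
print(area(threshold_region(masks, 1)))      # union: 4
```

The single stray pixel on the right enlarges the union and the single omission on the left shrinks the intersection, while the median region ignores both.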
4.1.4 Model for Delineator Precision
The fractional precision estimate for a sample of wound area measurements is defined as
(4.1.8)
$$pf = \frac{\sigma}{A}$$
where σ is the area measurement precision and A is the ‘true’ area of the wound.
A general fractional precision estimate for a wound’s area may be defined as the arithmetic mean of each delineator’s precision values:
(4.1.9)
$$PF_{av} = \frac{1}{N}\sum_{i=1}^{N} pf_i$$
for i=1..N delineators.
The fractional precision estimates of (4.1.8) and (4.1.9) are evidently quotients of two random variables. Linear statistical theory presents no exact formulae for calculating the expected mean value and standard error of such expressions. However, provided two variables are uncorrelated, it can be shown that:
(4.1.10) |
|
|
(4.1.11) |
|
where E{} and V{} are the expectation and variance operators respectively. The approximation in (4.1.10) relies upon A ≫ σ²/n, where n is the sample size. Barford (1985) and Topping (1972) provide descriptions of combinations of standard error.
Assuming the area measurements produced by each delineator i are Gaussian with distribution aij ~ N(Ai, σi²), the standard deviation estimate si ~ σi·χν/√ν, where ν = n−1, and the mean area estimate āi ~ N(Ai, σi²/n). It is now possible to define (4.1.10) in terms of measurable quantities:
(4.1.12) |
|
The standard square error of may now be expressed as:
(4.1.13) |
|
where |
|
and |
|
Equations (4.1.12) and (4.1.13) together define the expected mean and expected standard error of estimate for the fractional precision definition of (4.1.8). The expectation of the estimate of fractional precision given by (4.1.12) is clearly biased with respect to (4.1.8) by the factor c4. To produce an unbiased estimator of fractional precision it is necessary to define the estimator of fractional precision as:
(4.1.14)
$$pf_i = \frac{s_i}{c_4\,\bar{a}_i}$$
Thus the expectation and variance of pfi become, respectively:
(4.1.15) |
|
|
(4.1.16) |
|
The second term in (4.1.16) may be omitted provided yielding:
(4.1.17) |
|
It is a straightforward matter to average the fractional precision estimates produced by each delineator for a given wound. Substituting (4.1.14) into (4.1.9) yields
(4.1.18) |
|
The squared standard error of PFav estimated using (4.1.17) is given by
(4.1.19) |
|
Use of Pooled Variance Estimates
The expected standard error of PFav can be improved and its calculation simplified when the delineator variances, si², are equal for all delineators (for a given wound). The variance estimates, si², can then be pooled, increasing the accuracy with which σ² is estimated. The pooled variance equivalent of (4.1.18) becomes:
(4.1.20) |
|
where |
|
|
The common number of degrees of freedom for c4 and MSE is νpooled = Nν.
Using (4.1.17), the square of the standard error of PFpooled becomes:
(4.1.21) |
|
Note that (4.1.20) does not assume that the distributions from which the samples are drawn have the same mean value. This anticipates the existence of mutual biases among the set of experimental delineators.
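The bias-correction constant and the two precision estimators can be sketched numerically. The sketch assumes that the unbiased per-delineator estimator is s/(c4·mean), with c4 evaluated at ν = n−1 degrees of freedom, and that the pooled version divides the square root of MSE by c4 at the pooled degrees of freedom and by each delineator's own mean; both forms are assumptions, and the function names and data are hypothetical.

```python
from math import gamma, sqrt
from statistics import mean, stdev, variance

def c4(nu):
    """Expected value of s/sigma for a Gaussian sample with nu degrees of freedom."""
    return sqrt(2.0 / nu) * gamma((nu + 1) / 2.0) / gamma(nu / 2.0)

def fractional_precision(sample):
    """Bias-corrected s/mean for one delineator (assumed form of the estimator)."""
    nu = len(sample) - 1
    return stdev(sample) / (c4(nu) * mean(sample))

def averaged_fractional_precision(groups):
    """Arithmetic mean of the per-delineator fractional precisions."""
    return mean(fractional_precision(g) for g in groups)

def pooled_fractional_precision(groups):
    """Pooled-variance version; each delineator keeps its own mean (assumed form)."""
    nu_pooled = sum(len(g) - 1 for g in groups)
    mse = sum((len(g) - 1) * variance(g) for g in groups) / nu_pooled
    pooled_sd = sqrt(mse) / c4(nu_pooled)
    return mean(pooled_sd / mean(g) for g in groups)

print(round(c4(4), 4), round(c4(20), 4))  # 0.94 0.9876

# Hypothetical areas: 5 delineators x 5 attempts on one wound.
areas = [
    [8400, 8455, 8390, 8510, 8470],
    [8600, 8720, 8650, 8580, 8690],
    [8300, 8250, 8340, 8410, 8280],
    [8500, 8530, 8460, 8490, 8550],
    [8800, 8760, 8850, 8790, 8830],
]
print(f"averaged: {100 * averaged_fractional_precision(areas):.2f}%")
print(f"pooled:   {100 * pooled_fractional_precision(areas):.2f}%")
```

Because each delineator's mean enters separately, neither estimator requires the delineators to agree on the wound area, only (for the pooled version) on its variance.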
In total, the five delineators produced 250 area measurements upon which the following statistical tests are performed. The raw area measurements for each image were used to estimate the mean and variance of area measurements for each delineator upon each wound image. The mean and variance estimates for each wound and delineator are presented, respectively, in Tables B.1 and B.2, Appendix B.
4.1.5.1 Summary of Pre-ANOVA Check Results
|
Figure 4.2 Typical normal probability plot of residuals showing one instance of an outlying value |
There is no serious departure from the normality assumption for any of the 10 wounds. Figure 4.2 shows a typical plot of the residuals (taken from image 5), with one outlier marked. The probabilities are regressed twice upon the residuals, once with the outlier included and once with it excluded, showing the effect this outlier has upon the linearity of the plot. No departure from normality is apparent in this plot.
For most of the wounds, delineator variances do not differ significantly. In the case of wound 8, only one delineator produced an anomalous variance, which is the result of a change of opinion rather than variation caused by the limitations of manual dexterity. This analysis is covered in more depth in the next section.
A total of four outliers are detected in the measurements made by the delineators for the ten wound images, and each case is considered to be due to a one-off change of delineator judgment. The effect of these upon the mean and sum-of-square calculations integral to the ANOVA tests is quite small, however.
The results of applying the Burr-Foster Q-test (4.1.2) for equality of several variances to the area measurement data for each wound image are shown in Table 4.1, along with the significance of the test in each case.
Image | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Q* | 0.239 | 0.261 | 0.319 | 0.335 | 0.514 | 0.390 | 0.281 | 0.709 | 0.221 | 0.260 |
Significance | No | No | No | No | <1% | No | No | <0.1% | No | No |
Table 4.1 Burr-Foster Q* values for equality of variance tests. From tables: Q(α=0.010; N=5, ν=4) = 0.443; Q(α=0.001; N=5, ν=4) = 0.552 |
4.1.5.3 Bias Analysis
Table 4.2 shows the calculated ANOVA F-statistic for each wound, along with its associated ‘P*’ value derived from the F-distribution with (4,20) degrees of freedom. The null hypothesis that all five delineators measure the same area was rejected for every wound at the 5% significance level or better. Since this is the case for all 10 wounds, the data are further examined to establish the differences between delineators. A suitable test for this purpose is the Tukey-Kramer multiple comparison procedure (Neter et al., 1996). Figure 4.3 shows a plot of these comparisons among the delineators with 95% confidence intervals for each wound. The maximum difference between the means of the delineators, for each wound in turn, is calculated by:
(4.1.22)
$$D_{max} = \max_i(\bar{a}_i) - \min_i(\bar{a}_i)$$
Table 4.2 contains this metric for each wound along with the associated Tukey-Kramer confidence intervals. The Dmax metric is divided by the grand mean for each wound to provide an approximation to the fractional bias between the delineators who produce the most extreme area measurements.
Image | A | F calc | P* (%) | D max | D max/A (%) |
1 | 8458 | 5.03 | 0.6 | 580 ± 532 | 6.9 ± 6.3 |
2 | 6490 | 10.35 | <0.1 | 499 ± 245 | 7.7 ± 3.8 |
3 | 46194 | 4.14 | 1.3 | 1525 ± 1323 | 3.3 ± 2.9 |
4 | 35747 | 8.99 | <0.1 | 2692 ± 1573 | 7.5 ± 4.4 |
5 | 29152 | 8.94 | <0.1 | 5413 ± 2968 | 19.0 ± 10.2 |
6 | 24161 | 5.19 | <0.1 | 2246 ± 1715 | 9.3 ± 7.1 |
7 | 4072 | 9.51 | <0.1 | 633 ± 327 | 16.0 ± 8.0 |
8 | 38033 | 8.68 | <0.1 | 4478 ± 2617 | 12.0 ± 6.9 |
9 | 42577 | 23.88 | <0.1 | 7696 ± 3202 | 18.0 ± 7.5 |
10 | 43940 | 42.74 | <0.1 | 11252 ± 2936 | 26.0 ± 6.8 |
Table 4.2 Salient results of F-test for delineator agreement and subsequent bias estimation. From statistical tables: F(1-α=0.95; 4, 20) = 2.87; F(1-α=0.99; 4, 20) = 4.43 |
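The fractional bias column of Table 4.2 can be checked with a few lines of arithmetic. The sketch below assumes that column A holds the grand mean area for each wound and that the last column is simply Dmax (and its confidence half-width) divided by A; the function names are hypothetical.

```python
def max_mean_difference(delineator_means):
    """D_max: largest minus smallest delineator mean for one wound."""
    return max(delineator_means) - min(delineator_means)

def fractional_bias_percent(d_max, grand_mean):
    """D_max expressed as a percentage of the grand mean area."""
    return 100.0 * d_max / grand_mean

# Image 1 of Table 4.2: A = 8458, D_max = 580 +/- 532 pixels.
print(round(fractional_bias_percent(580, 8458), 1))  # 6.9
print(round(fractional_bias_percent(532, 8458), 1))  # 6.3

# Hypothetical delineator means, to show how D_max itself is formed.
print(max_mean_difference([8445, 8648, 8316, 8506, 8806]))  # 490
```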
Beyond estimating the magnitude of the bias differences, there is the question of which particular regions of each image cause the bias. This can be answered by inspecting the average differences in boundary positioning achieved by the delineators who generated the extreme values of area. This is the Median Boundary method introduced in §4.1.3.3, which will show whether the biases occur because certain parts of the image are excluded from the wound in a generally consistent manner or whether the regions are distributed in thin bands around the boundary of the wound.
Figure 4.3 Mean areas and Tukey HSD confidence intervals for delineated area bias analysis. |
|
Figure 4.4 Comparison of mean area and area of median boundary method |
Figure 4.4 is a scatter-plot showing the correlation between the Dmax values of Table 4.2 and the difference of the areas produced by applying the median boundary method to those same samples. Clearly, in each case there is good agreement between the estimates of average wound area. Figure 4.5 shows the grey-scale images of the 10 wounds overlaid with the boundaries of the major regions of difference detected by the median boundary analysis. Note that in most of the images the bias between the measurements of the delineators who produced the minimum and maximum mean area estimates is due to the formation of clustered regions of pixels. However, with some wounds, the pixels which cause the bias are scattered around the edge of the wound rather than forming clusters.
4.1.5.4 Test for Overall Delineator Bias
The previous hypothesis tests for bias show that, for each wound, there is at least one delineator whose measurements are biased. This being the case, it is possible to perform an overall significance test for bias which shows whether one or more delineators generally tend to produce area measurements with lower or higher values in relation to their peers. Since the area measurement variances among wounds are different, two-way ANOVA using the F-test is inappropriate. An alternative is to use a non-parametric test for differences, the Friedman test (Neave and Worthington, 1988).
Figure 4.5 Regions of significant differences identified from median boundaries |
The Friedman ‘S’ statistic is approximately χ² distributed with, in this case, 4 degrees of freedom. Let mwi be the mean area measurement produced by delineator i for wound w, modelled as:
(4.1.23) |
|
where
Aw is the area of wound w and Bi is the bias of delineator i.
The null hypothesis, H0, is defined along with its alternative as:
(4.1.24)
H0 : All Bi are equal
H1 : Not all Bi are equal
The differences between the true areas of each wound, Aw, are accounted for by blocking the data, i.e. each wound is defined as a block and the data, mwi, are thus ranked within each block, removing the effect of the different Aw. The rank sums for each delineator taken over the ten wounds are shown in Table 4.3 together with the value of the associated Friedman test statistic, S.
Delineator, i | C | A | B | D | E |
Rank Sum | 17 | 19 | 34 | 35 | 43 |
Friedman Statistic, S | 18.1 |
P* : Probability in tail area χ² > 18.1 | 0.0003 |
Table 4.3 Rank sums and Friedman statistic for general delineator bias test. χ² critical value at the 1% significance level = 13.3. |
As Table 4.3 shows, H0 is rejected and therefore the conclusion is that delineators have a tendency to produce biased estimates on all wounds tested.
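The blocking and ranking procedure can be sketched as follows, using the standard form of the Friedman statistic without a correction for ties (tied values within a block share the average rank). The function names and the small data set are hypothetical and are not the experimental means.

```python
def rank_within_block(values):
    """Ranks 1..k for one block; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    return ranks

def friedman_statistic(blocks):
    """Return (S, rank_sums); blocks[w][i] is the mean area of delineator i on wound w."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for i, r in enumerate(rank_within_block(block)):
            rank_sums[i] += r
    s = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return s, rank_sums

# Hypothetical means: 3 wounds (blocks) x 3 delineators, fully consistent ordering.
s, rank_sums = friedman_statistic([[10, 12, 15], [200, 230, 260], [55, 58, 61]])
print(s, rank_sums)  # 6.0 [3.0, 6.0, 9.0]
```

Ranking within each wound removes the wound-to-wound differences in true area, so only the consistency of the ordering of delineators contributes to S.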
4.1.5.5 Precision Estimates
Table 4.1 shows that not all of the delineator variances for images 5 and 8 are equal. The pre-ANOVA checks identify ‘outliers’ for these images, but these outlying observations must not be discarded since they are not measurement errors. In this case, the averaged fractional precision formula given by (4.1.18) is to be preferred to the pooled-variance version of (4.1.20). The fractional precision estimates for the area measurements of the 10 wound images are shown in Table 4.4, using both the averaged formula and the pooled-variance formula. As the table shows, the two formulae give noticeably different results for images 5 and 8, which have significantly differing variances among delineators.
Image | PF ± s{PF} (%) Pooled | PF ± s{PF} (%) Average |
1 | 3.37 ± 0.24 | 3.46 ± 0.54 |
2 | 2.02 ± 0.14 | 2.05 ± 0.33 |
3 | 1.53 ± 0.11 | 1.49 ± 0.25 |
4 | 2.36 ± 0.17 | 2.29 ± 0.38 |
5* | (5.47 ± 0.39) | 4.82 ± 0.87 |
6 | 3.80 ± 0.27 | 3.56 ± 0.60 |
7 | 4.31 ± 0.30 | 4.16 ± 0.67 |
8* | (3.69 ± 0.26) | 2.99 ± 0.63 |
9 | 4.05 ± 0.29 | 4.19 ± 0.65 |
10 | 3.65 ± 0.26 | 3.74 ± 0.61 |
Table 4.4 Fractional precision estimates. * denotes differing variances. Note: for ν=4, c4=0.9400; for ν=20, c4=0.9876 |
4.1.6 Discussion
Bias Analysis
Table 4.2 shows that for each wound, the F-test for delineator bias (4.1.1) rejected the null hypothesis that all delineators measure the same area. Significant differences in area measurements therefore exist among delineators, so that for each wound image at least one pair of delineators disagreed in their average measurement of wound area. The Tukey confidence intervals shown in Figure 4.3 illustrate that the mutually biased area measurements from different delineators often fall into two or three overlapping groups. The area measurements for the three wound cases which produce the largest biases, images 5, 9 and 10, comprise two distinct non-overlapping groups in each case. This is also the case for the measurements of image 8 (6th largest bias). Except for image 9, the distinction is due to just one delineator in each case producing significantly smaller wound area measurements.
The regions of each wound which have been ambiguously interpreted, and thus are the cause of the biases, are shown in Figure 4.5. Bias arises in most cases when compact regions of the wound, as opposed to elongated regions, are defined as wound by one delineator and as surrounding tissue by another. This is clearest in images 5 and 7 to 10, which have the largest biases.
The measurements for two of the wounds, images 5 and 8, both contain a single outlier which has increased the variances of the respective samples, thus causing the Burr-Foster ‘Q’ test to indicate significantly differing variances (Table 4.1). Inspection of the delineated area bitmap images shows that the outlier represents a change of opinion on the part of the delineator in each case. This will have inflated the mean-square-error (MSE) calculation for the bias analysis, causing wider confidence intervals. In the case of image 5, the delineator who produced the outlier was delineator A. The relevant line-plot in Figure 4.3 shows that delineator A is not one of those who produced an extreme value of mean area. If the outlying observation is tentatively removed, the mean area for delineator A increases but is still not an extreme mean area, thus the estimates of Dmax in Table 4.2 are unaffected, except that, as mentioned, the confidence intervals are broader. In the case of image 8, delineator A again produces the outlier which inflates the sample variance and thus the MSE calculation. Delineator A’s mean area measurement is the lowest of the five delineators and the outlier is a lower-than-average measurement for this delineator. Thus the Dmax estimate is affected by the outlier.

The estimates of mean bias in Table 4.2 generally agree with the subjective three-group wound image classification described in §4.1.2. Biases range from 3.3% to 7.7% for the first group (images 1 to 3) and are generally lower than those in the other two groups and consistently lower than those in the most indistinct group (images 8 to 10), which has a bias range of 12.0% to 26.0%. This assumes that bias is independent of image size, or at least that it is fairly insensitive to it. The second experiment, presented in §4.2 below, will show that there is a bias component to manually-delineated area measurements, although it is small (approx. 2% maximum) in relation to the largest biases of the group containing images 8 to 10.

The Friedman test for bias across the whole wound-set (§4.1.5.4) indicates that the general differences of opinion which exist do so to some extent because of relatively conservative and liberal attitudes on the part of individual delineators, i.e. the magnitude of the overall differences between delineators is significant enough to show that delineators can be biased more often than not in one direction. Thus, some delineators generally measure higher or lower values of area in comparison with others. It is clear from the rank sums in Table 4.3 that the five delineators fall into two distinct groups, with delineators A and C forming the lower group (under-estimators) and the other three delineators forming the upper group (over-estimators). It is possible to apply an analytical comparison procedure to identify significant differences, such as Dunn’s paired comparisons (Neave and Worthington, 1988), although in this instance the differences are evident from the rank sums in Table 4.3. The distinction between the two groups arises because delineators A and C frequently exclude parts of the image from their wound definitions which are included by the others, although they were not always responsible for generating the lowest mean area measurements for each wound.
Precision Analysis
Table 4.4 shows that the fractional precision of area measurements varies, depending upon the particular wound, from 1.53±0.11% for image 3 (pooled variance) up to 4.82±0.87% for image 5 (averaged), although it is not possible to state how much of this variation is due to the size of the wound (in pixels). The inflationary effect of unequal variances upon the pooled-variance version of the precision estimate is clearly seen in the disparity between the entries for the two versions of estimated precision for images 5 and 8. Otherwise, the agreement between the pooled version (equation 4.1.20) and the averaged version (equation 4.1.18) is evident. Note also that the associated standard error estimates when using pooled variances are less than half the respective error estimates produced by the averaged version. In contrast to the case for bias, there is no clear distinction between the precision values for the first three wounds (considered visually unambiguous) and the last three wounds (considered visually ambiguous). Thus it appears that the effect of manual dexterity and the mechanical properties of the delineation equipment (in this case a mouse) has swamped any variation due to vagueness of the wound boundary.