7 Discussion
7.1 General Performance of the Algorithms
Precision
Table 6.1 shows that all four algorithms produce significant improvements in the variance (precision) of area measurements in most of the test cases. The comparison of average precision between the algorithms given in Table 6.3 shows that there are detectable differences in precision between the four algorithms. The differences between the algorithms and the manual precision (including inter-delineator biases) are clearly shown in Figure 6.2: the TN algorithm shows the greatest measurable improvement, although its confidence interval overlaps slightly with that of the MX algorithm. The general precision performances of the GA, MX and MG algorithms are not clearly separable, and it should be noted that there is a small overlap between the manual confidence interval and that of the MG algorithm. More information about the precision performance of the algorithms may be gleaned by considering the number of improvements in precision made. The TN algorithm produces the largest number of cases of measurable improvement in precision over the manual results, with 18 out of 20 cases of improved precision, versus 14 out of 20 for each of the other three algorithms. Conversely, the TN algorithm fails to produce a measurable improvement in precision in only two cases, whereas each of the other three algorithms fails to do so in six cases. This increase in the number of improvement cases is thus in agreement with the result of the overall precision test.
Bias
In contrast to the case for precision, the four algorithms experience some difficulty in returning unbiased results: no algorithm returns non-significant differences in mean area for more than 12 of the 20 wounds. In this respect the TN algorithm again produces the highest number of successful cases, showing significant bias in only 8 out of 20 cases, compared with 12 out of 20 for each of the other three algorithms. Since the TN algorithm also produces the best precision performance, its lower incidence of bias cannot be attributed to higher variances reducing the significance shown by the t-tests.
Combined Improvements
The number of times both performance criteria are met simultaneously again highlights the differences in the overall performance of the algorithms. The TN algorithm produces 12 such cases, i.e. showing a significant increase in precision together with a non-significant difference in mean area. In this respect the other algorithms perform much more weakly, with the minimax algorithms showing simultaneous improvements on fewer than six wounds. From Figure 6.1 it is clear that the major contributor to the failure rate is bias. Since the highest simultaneous success rate of the algorithms tested here is only 60% (12 out of 20 wounds), a large part of this discussion is devoted to finding the causes of the failures.
Table 6.4 shows that the mean precision for manual measurements of 4.3±0.3% (25 measurements) is improved upon by each of the algorithms. The best average precision of 1.8±0.2% is achieved by the TN algorithm, in agreement with the overall precision test and the number of precision-improvement cases stated above. The smallest improvement is produced by the MG algorithm (3.1±0.3%). The GA, MX and MG algorithms each show one instance of significantly increased variance, along with additional cases of non-significant increase that are discussed later in this chapter. In 19 out of 20 cases the highest measurement precision is achieved by either the TN algorithm (14 cases) or the GA algorithm (6 cases), there being one case of a tie between the two algorithms. The GA algorithm shows significant increases in precision compared with the TN algorithm for images 1.6 and 2.3; these cases are examined here:
A. Image 1.6
Both algorithms produce the same bias value (+2.0%) for this wound and start and finish the scale descent at the same scales, but the GA algorithm has a precision of 1.2% compared with 3.4% for the TN algorithm. Figures 7.1(a) and (b) compare the median boundaries of the delineations produced by the GA and TN algorithms. Note that the boundary produced by the TN algorithm is substantially smoother than that produced by the GA algorithm. The GA boundary is a closer match to the wound's edge, deviating from the manual boundary mainly by missing a slightly elongated portion at the top of the wound and by having a slightly thicker 'tail' at the right-hand side. It is clear that the very nature of the GA algorithm has enabled it to surpass the performance of the other algorithms in this instance.
B. Image 2.3
Figures 7.1(c) and (d) compare the median boundaries produced by the GA and TN algorithms. A situation similar to the one described above arises here: the GA boundary is much less smooth than the TN boundary and is clearly a better delineation of the wound's edge, just failing to converge fully with the upper left-hand edge of the wound. The failure of the TN algorithm in this case is due to the effect of heavy image smoothing on a wound of this particular shape. Note from Table 6.2 that the measurements of all algorithms meet both performance criteria stipulated in §6.1. Thus, even though the precision of the TN algorithm is poorer than that of the GA algorithm, it is still a definite improvement upon the manual precision, and despite the algorithm clearly failing to delineate the wound properly, the bias difference is not significant. Nevertheless, the 'failed' delineation is plainly visible, so in practice such a result can be treated with appropriate caution.
Figure 7.1 Comparison of median boundaries representing average wound delineations: (a) and (c) the GA algorithm; (b) and (d) the TN algorithm.
It should be noted that the action of the stiffness parameter β in the TN algorithm is to promote smoother boundaries, whereas in the GA algorithm all second-order deviations from the initial contour cause a progressive increase in contour energy, regardless of whether the contour is becoming smoother or less smooth. The contour energy of the GA algorithm will therefore increase if the action of the image gradient forces is to smooth the initial boundary. This occurs when a manual seed contour is applied at a high scale, and explains why the GA algorithm does not converge with as high a precision as the TN algorithm. The two cases described above are examples of a situation where the gradient energy is insufficient to overcome the smoothing forces in the TN algorithm, i.e. for these image cases the contour is over-regularised.

The MG algorithm is the weakest-performing algorithm in terms of precision. It can be seen from Figure 6.1(d) that 9 of the significant improvements in precision lie in a narrow vertical band between X = 0.5 and X = 0.7, whereas the cases of improved precision for the other algorithms tend to be scattered closer to the X = 0 abscissa. It is also clear that this algorithm does not create any more cases of variance increase than either the GA or MX algorithms. The band may be explained as follows: the contours of minimum energy for the two external forces, the gradient forces and the grey-level forces, do not tend to coincide, so that convergence on the minimum of one external energy term means that the other is not locally minimised, and vice versa. This competition between the two external energies is a contributory factor in producing variance in the results and may therefore explain the higher levels of variance experienced with this algorithm.
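For reference, the conventional active contour (snake) energy, in the form popularised by Kass et al., is sketched below. The precise energy terms used by the TN, GA and MG algorithms may differ in detail, so this should be read as an illustration of the role of the parameters rather than a restatement of those algorithms:

E_{\mathrm{snake}}(v) \;=\; \int_{0}^{1} \left[ \tfrac{1}{2}\left( \alpha \,\lvert v'(s) \rvert^{2} + \beta \,\lvert v''(s) \rvert^{2} \right) + E_{\mathrm{ext}}\bigl(v(s)\bigr) \right] ds

where v(s) = (x(s), y(s)) is the contour, α weights the tension (first-derivative) term and β the stiffness (second-derivative) term; increasing β penalises bending and so promotes smoother boundaries. E_ext is the image-derived external energy, which in the MG algorithm comprises both a gradient term and a grey-level term that, as noted above, compete when their minima do not coincide.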
The lowest average absolute bias of 2.8±0.2% is produced by the MG algorithm (Table 6.4), although this algorithm shows 12 cases of significant bias. In contrast, the TN algorithm, which shows just 8 cases of significant bias, produces the highest average absolute bias of 4.0±0.2%. Figure 6.2(b) shows that the TN algorithm produces three cases of large bias, two in excess of 10% and one in excess of 20%, all of which are significant (images 1.10, 2.8 and 2.10). In comparison, Figure 6.2(d) shows that, although the MG algorithm also produces its largest biases on images 2.8 and 2.10, it contains these biases at the 8-9% level, and it does not produce a significant bias on image 1.10, where its bias is less than 5%.

The large biases are a starting-scale problem. Figure 7.2 shows example median boundary delineations for image 2.10, representing the average of the converged results from the TN and MG algorithms. It is clear from these images that the problem is one of over-regularising the image: both algorithms fail to delineate the 'tails' of this wound. In fact, the MG algorithm performs worse than the TN algorithm when both algorithms are started at a high scale. This suggests the need for a shape-based control to adjust the starting scale. The necessary information could be obtained directly from analysis of the initial contour, or alternatively the person delineating the wound could be presented with several shape templates and asked to indicate which one best approximates the wound. Although the TN algorithm produces fewer cases of significant bias, its performance can vary substantially on some wounds; these cases are discussed in the next section.
Figure 7.2 Comparison of median boundaries for the TN and MG algorithms applied to image 2.10: TN algorithm at (a) s = 8 and (b) s = 2; MG algorithm at (c) s = 8 and (d) s = 2 [pixels].
7.2 Analysis of Failure Cases by Wound
A wound measurement made by one of the active contour algorithms may be labelled a 'failure case' when either one or both of the following failure modes apply:
(1) the mean measured area is significantly biased with respect to the manual mean;
(2) the variance of the measurements is significantly increased with respect to the manual variance.
The purpose of this discussion is to identify a set of properties of the wound images that explains the errors. Both failure modes occur together a total of three times among the measurements, on wounds 2.8 and 2.10 (see Table 6.2); these are also the only cases in which variance is significantly increased. Note that the variance ratio test for identifying cases of improved precision produces three groups of results: (a) significantly decreased variance, (b) no significant difference (the ratio of variance estimates may be numerically less than or greater than 1.0, but not by enough to be significant), and (c) significantly increased variance. The precision improvement criterion stated at the beginning of this chapter corresponds to case (a), whereas the variance failure mode corresponds to case (c); wounds falling into case (b) are counted by neither.
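As an illustration of this three-way classification, the sketch below classifies an algorithm-versus-manual variance ratio using a two-tailed F-test. The significance level (alpha = 0.05) and the exact form of the variance ratio test are assumptions made for the purpose of the example, and the function name is illustrative only.

import numpy as np
from scipy import stats

def classify_variance_ratio(algo_areas, manual_areas, alpha=0.05):
    # Classify the algorithm/manual variance ratio into the three groups
    # described above: 'a' significantly decreased, 'b' not significantly
    # different, 'c' significantly increased. Two-tailed F-test; alpha assumed.
    algo_areas = np.asarray(algo_areas, dtype=float)
    manual_areas = np.asarray(manual_areas, dtype=float)
    n1, n2 = algo_areas.size, manual_areas.size
    ratio = np.var(algo_areas, ddof=1) / np.var(manual_areas, ddof=1)
    lower = stats.f.ppf(alpha / 2, n1 - 1, n2 - 1)        # lower critical value
    upper = stats.f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)    # upper critical value
    if ratio < lower:
        return 'a'   # improved precision (the improvement criterion)
    if ratio > upper:
        return 'c'   # degraded precision (the variance failure mode)
    return 'b'       # neither criterion applies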
7.2.1 Cases of Both Failure Modes
Two of the three cases where application of an algorithm caused a degradation in precision and a significant bias refer to wound 2.8 with the GA and MG algorithms. This image is of a sloughy (yellow) wound on the upper side of the foot just above the toes. Subjectively, the wound appears distinct from the surrounding skin, having a different hue and regular shape (ovoid). In this case the reflectance of the wound tissue is actually higher than that of the surrounding skin and inspection of the edge of the wound shows colour changes – a small amount of red epithelialising tissue, purple-hued swollen tissue and some white flakes of skin – which produce multiple edges. Also, there is a band of shadowing by the toes which at low filter scales will produce two edges having opposing directions.
The MX algorithm gives rise to a significant bias and a significantly increased variance for wound 2.10. The bias differences for this wound are the largest recorded for the set of 20 wounds, ranging from 8.6% to 21.7%. Figure 7.2(a) above shows the general error that occurs in this image: the final contour has pulled away from two of the three points of highest curvature, which have little support in the gradient magnitude image at the starting scale. There is also a dark red spot (probably pooled blood, as opposed to blood perfusing through healing tissue) at the neck of one of the tails of the wound, which gives rise to high gradients at its periphery. The fact that two of the algorithms (TN and MG) produce significantly improved precision values for this wound (both 1.2%) is of little consequence given the dominance of the bias.
7.2.2 Cases of Non-Significant Variance Ratio
Table 6.2 shows that images 1.9 and 1.10 return increases in variance that are not statistically significant. Both of these wounds are examples of some of the most ambiguous images in the wound library. Table 4.1 in Chapter 4 shows that all five manual delineators are at their least consistent with these two images, returning average biases of 18% and 26% respectively. It should also be noted that there is little consistent gradient information in these images; there is therefore insufficient image evidence to allow these differences of opinion to be reconciled. The capacity for variance reduction with image 2.4 is limited by the vagueness of the boundary at the top of the wound, which prevents all of the algorithms from converging to a repeatable result.
7.2.3 Cases Where Bias is Significant
As stated earlier in this chapter, the main manifestation of error in the results is the number of cases in which most or all of the four algorithms tested produce significantly biased results when compared with the manual results. Notably, there are five wounds in the set of 20 for which all four algorithms introduce a statistically significant bias (see Table 6.2), and a further six cases in which three of the four algorithms produce significantly biased results. In four of these latter six cases, the TN algorithm is the only one that produces a non-significant bias. Whilst these biases are statistically significant, their magnitude must also be considered. The following table summarises the failures.
                                      Bias (% mean area)
Image   Summary                   GA      TN      MX      MG
1.3     All algorithms fail      3.4     3.3     3.5     1.7
2.2                             -4.1    -3.7    -5.7    -4.8
2.8                              4.3   -17.7   -15.9    -8.0
2.9                             -3.6    -2.5    -3.3    -5.3
2.10                           -14.0   -21.7   -16.2    -8.6
1.1     3 algorithms fail       -3.9    -1.3    -2.5     2.0
1.4                              3.0    -0.2     3.6     2.4
2.1                             -1.7     0.2    -1.8    -1.5
2.4                              4.9     3.5     4.6    -0.4
2.5                             -1.2    -0.3    -1.3    -1.3
2.6                             -0.4    -4.4    -1.3    -1.8

Table 7.1 Cases of wounds that give rise to significant bias for most algorithms. Values marked in bold typeface are not statistically significant.
Some biases are statistically significant whilst being relatively small, e.g. the 1.7% bias recorded for image 2.1, and represent cases where an algorithm tends to miss a small part of the wound rather than failing catastrophically. It is not clear whether such bias errors propagate systematically through a series of wound measurements taken at regular intervals over the lifetime of a wound, since a wound can be expected to change its appearance, particularly if it is changing in size. Clearly, though, when an algorithm fails to produce a 'correct' measurement (e.g. images 2.8 and 2.10), and the bias is definitely not a difference of opinion between the edge evidence and the manual delineator, the measurement is rendered useless. Since the error is so large, it is easily detected visually by the delineator, so the algorithm's measurement can be discarded and the manual result retained as the best estimate of the wound's area. Small errors are not necessarily obvious, since a small 'noisy' variation in the results is expected, and in a clinical or workplace setting a delineator may not be able to make a series of measurements of a wound.
7.2.4 Identifiable Causes of Bias Errors
By examining the median boundary of each wound measurement sample, three features present in the wound images tested have been identified to which systematic error is attributable. Some of the causes have been identified above in the discussion of failure modes.
(1) Vague boundaries
The bias errors on images 2.2, 2.4 and 2.6 are attributed to a lack of clear image edge evidence. This occurs at the lower edge of 2.2 (see Plate 2), the top part of 2.4 and most of the right-hand side of 2.6. The biases are all in the 2% to 4% range.
(2) Redness at the edge of the wound
Images 1.3 and 2.8 produce maximum biases of 3.5% and -17.7% respectively; in the case of 2.8, precision is also affected. The redness is epithelialising tissue highly perfused with blood, which turns a purple hue over time as the wound heals. This tends to produce multiple boundaries, and the algorithms therefore tend to produce results that include parts of more than one boundary.
(3) Darkened spots near the boundary
This occurs in wound images 1.1 and 2.10. In the case of 1.1, the bias error is limited to 3.9%, and the effect of the spot (blood inside the wound) is to provide an alternative high-gradient boundary segment that attracts the contour.
The ability of the algorithms to significantly improve area measurement precision has a practical purpose: it gives the physician a higher degree of confidence that a sample area measurement is close to some mean value. The disadvantage is that this mean value can be, and often is, biased with respect to the average human judgement. This poses a problem for the use of the algorithms in clinical practice: the objective in measuring wounds is to compare measurements taken at regular times, separated by intervals over which the wound's size and appearance may change. A wound may be classified as 'indolent', indicating no change in size or colour, or it may be healing (getting smaller), although healing may be due more to volume changes than to area changes, in which case area measurement is less informative. Alternatively, a wound may be infected, which may cause it to change colour and probably increase in size.
The results presented here represent a medium-sized sample of the wounds that can be encountered in clinical practice. In a hospital outpatient clinic it is not always possible to take many measurements of a wound and perform a hypothesis test to determine whether it has changed size. Where a hypothesis test between two measurement samples is performed, a significant difference may reflect a genuine change in the wound's area, giving evidence of healing or deterioration (perhaps due to infection), or it may be due to a systematic error (bias) between the two samples.
Detection of wound size changes can be improved by taking a set of measurements and performing a hypothesis test in which the null hypothesis is that the mean of the current measurement sample is the same as that of the previous sample. The equal-variance t-test for the difference between means when the sample sizes are equal is given by (Johnson and Bhattacharyya, 1992):

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pool}}\sqrt{2/N}}, \qquad \text{with } 2(N-1) \text{ d.f.}

where \bar{x}_1 and \bar{x}_2 are the two sample means, N is the common sample size and s_{\mathrm{pool}} is the pooled standard deviation.
Comparing two measurement samples of five measurements each, a mean difference of approximately 1.5 s_pool is required for significance. If the delineator's precision (s_pool) is 5%, then a difference of 7.5% is required between the sample means for a change in the wound's area to be signalled. The sensitivity can be increased by increasing the sample size, although this requires more time to be made available for delineation, and a delineator's performance may degrade as a result. Applying an active contour model improves the precision and thus alleviates some of this burden, allowing a smaller sample size to be used without the test losing its power to detect a change in wound size. However, the effect of the bias errors introduced by the algorithm upon this difference must also be considered. Given the causes of bias identified above, it is possible that such errors will tend to be either all positive or all negative over a sequence of measurements taken over some period of time. The bias error may therefore be a fairly constant component of the measurements, having a relatively small effect on the hypothesis test. However, when a bias error is very large and the resulting contour has failed to delineate a fairly clear wound boundary properly (e.g. image 2.10), the error should not be ignored. As previously stated, errors of this type are visible to the delineator and the algorithm's result should be discarded in favour of the manual result.
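To make the figure of approximately 1.5 s_pool concrete, the following sketch computes the minimum mean difference detectable by the above t-test for a given sample size and pooled precision. The two-tailed test at the 5% level is an assumed convention, and the function and variable names are illustrative only.

from scipy import stats

def min_detectable_difference(s_pool, n, alpha=0.05):
    # Smallest difference between two sample means (same units as s_pool) that
    # the equal-variance t-test flags as significant, for two samples of n
    # measurements each, using a two-tailed test at level alpha (assumed 0.05).
    df = 2 * (n - 1)
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical value
    return t_crit * s_pool * (2.0 / n) ** 0.5

# Example: five measurements per sample and a pooled precision of 5% of mean
# area gives roughly 1.46 * s_pool, i.e. about 7.3%, consistent with the
# approximate 1.5 * s_pool (7.5%) quoted above.
print(min_detectable_difference(s_pool=5.0, n=5))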