6 Active Contour Model Experiments
This chapter describes the experiments undertaken to appraise the performance of the four active contour model algorithms described in Chapter 3. The experiments use the contour regularisation parameters ($\alpha$, $\beta$) derived in Chapter 5 for the GA and TN algorithms (recall that the MX and MG minimax-based algorithms are parameter-less with respect to the energy terms) and the scale-descent start and end scales ($\sigma_s$, $\sigma_e$) for each of the four algorithms. An additional set of ten images is tested with the same parameter and scale settings used for the first set of images, and the overall results for the two sets of images are presented. This is considered necessary because the results for the first set of ten images were obtained using sets of optimised parameters determined by the analysis in Chapter 5; the performance measure is therefore biased in favour of the first ten images and does not account for the performance of the algorithms upon previously untested images. The aim is to produce a more widely based set of results that is not unfairly optimistic due to exclusive use of training-derived data. Secondly, doubling the number of wounds upon which the algorithms are tested from 10 to 20 adds weight to the generality of the results and gives a more reliable indication of the results that should be expected in clinical practice, which is the ultimate purpose of this analysis.
The active contour models require initialisation by manually delineated wound boundaries forming an initial approximation of the final solution. The algorithms can only be considered a useful tool for wound measurement if it is possible to conclude that area measurements produced by active contour models are an improvement upon the measurements produced by manual delineation. For the purposes of this analysis, the specific definition of improvement is two-fold and is given by the following statement:
Definition of Improvement
To conclude that the application of an algorithm to the wound measurement problem is an improvement upon manual wound measurement it is considered necessary to show:
Criterion 1
A significant reduction in the variance of area measurements, taken on a case-by-case basis (i.e. improved precision).
Criterion 2
That any differences in area measurements between the manual and algorithmic mean results can be attributed to random variation, rather than being a manifestation of bias.
Therefore, the main objective of this experiment is to determine whether the algorithms are capable of producing measurements that conform to these criteria. Thus it is necessary to compare the performance of each algorithm against the manual delineation performance. Complementary to this objective is a comparison between the algorithms in order to determine the levels of measurement bias and precision achievable. Note that testing bias requires comparison with a ‘gold standard’ of measurement: the approach adopted here is to consider the mean of the manual results, for each wound, to be the best a priori approximation to such a standard. In practice, the true value of a quantity is defined as the measured value produced by a process that is agreed amongst experts to be an ‘exemplar process’ [Deming, 1950].
The four active contour model algorithms described in Chapter 3, viz. GA, TN, MX and MG, are each used to measure the area of each wound image a total of 25 times, using each manual delineation attempt for that wound image as the initial contour. The potential energy images are derived from the derivative-of-Gaussian gradient magnitude image of the green band image of each wound image. The GA and TN algorithms both use the contour regularisation parameters $\alpha = \beta = 0.4$ determined in Chapter 5 and are run within the scale descent algorithm (§3.1), beginning with a filter-scale standard deviation of $\sigma_s = 8$ pixels and descending to $\sigma_e = 2$ pixels. The MX algorithm descends from scale $\sigma_s = 5.7$ down to $\sigma_e = 2$ and the MG algorithm descends from scale $\sigma_s = 4$ down to $\sigma_e = 2$. The image filter scale $\sigma$ is reduced in half octave steps.
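The scale descent described above can be illustrated with a short sketch. It assumes that a half octave step means dividing the scale by the square root of 2 (an octave being a factor of 2); the function name `half_octave_schedule` is hypothetical and is not taken from Chapter 3.

```python
import math

def half_octave_schedule(sigma_start, sigma_end):
    """List the filter scales from sigma_start down to sigma_end.

    Assumption: each half octave step divides the scale by sqrt(2).
    """
    scales = []
    sigma = float(sigma_start)
    # A small tolerance stops rounding error from dropping the final scale.
    while sigma >= sigma_end * (1.0 - 1e-9):
        scales.append(sigma)
        sigma /= math.sqrt(2.0)
    return scales

# Start scales quoted in the text; the end scale is 2 pixels in every case.
for name, start in (("GA/TN", 8.0), ("MX", 5.7), ("MG", 4.0)):
    print(name, [round(s, 2) for s in half_octave_schedule(start, 2.0)])
```

Under this assumption the GA and TN algorithms visit five scales, and the quoted MX start scale of 5.7 is consistent with 8 divided by the square root of 2 (about 5.66) after rounding.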
This experiment comprises the original ten images used to select the parameters for the finite element models and ten additional wound images. Of the latter set of ten images (shown in Plate 2), four were taken with a hand-held camcorder with ambient-only lighting (2.1 to 2.4), three were taken with the collimated halogen lighting of the MAVIS instrument (2.8 to 2.10), one was taken with a purpose made illumination rig (2.7) and a further two were obtained from a third party wound image library (2.5 and 2.6).
6.2.2 Seeding the Algorithms: Manual Delineation
Two volunteer delineators were asked to provide seed contours for initialising the algorithms in order to run them on the additional image set. To make it possible to compare the results with the first set of ten images, for which there were 25 delineations each, the first delineator was asked to make thirteen attempts at delineating each wound under identical conditions to the first manual delineation trial and the second delineator was asked to make twelve such attempts. The wound areas enclosed by the manual boundary delineations are then used to estimate manual precision (albeit for only two delineators) as a baseline against which improvement by application of the algorithms is judged.
6.3 Development of Analytical Procedures
For each of the 20 wound images the manual delineators produce a set of 25 area measurements. Correspondingly, each algorithm produces a further set of 25 individual contour-area measurements for each of the twenty wounds. A set of area measurements, produced by either manual delineation or by application of an active contour algorithm, may be summarised by calculating the estimates of the corresponding arithmetic mean and standard deviation. These estimates are generated five times for each wound, once for the manual seed contours and four times for the algorithms, providing the data necessary to compare manual and algorithmic performances.
6.3.1 A Note on the Application of Statistical Methods
The results of the first manual delineation experiment in Chapter 4 have been suitably analysed using single-factor fixed effects Analysis of Variance. This requires the area measurements produced at different factor level settings (delineators) to be normally distributed and have equal variances, although these are not absolutely rigid requirements and small departures from them do not much affect the test results. However, the criterion for improvement of precision stated above requires the variances of the area measurements of a wound produced by any one algorithm and by manual delineation to be significantly different, so the equal-variance assumption cannot be expected to hold here. The improvement of precision is the main aim of the application of the active contour models to the problem of wound area measurement.
6.3.2 Improvement of Manual Results
The analytical procedure developed in this section is a comparison between the performance of manual delineation area measurement and the active contour models specified in terms of improvement of precision. The criteria defined in §6.1 may be assessed as a test of relative values and thus conform to standard statistical hypothesis tests. In order to test for variance-reduction (improved precision) the standard F-ratio test for comparing two variances is used. The test employed for bias is Student’s t-test for significant differences between a pair of mean values. The two tests require the definition of the following values:
For any given wound image, let be the respective estimates of the mean and variance of manually delineated area measurements, having mean area
Variance Reduction Test (Criterion 1)
The following equation defines the test to determine if the application of any one active contour model reduces the variance of wound area measurements for a given wound image:
$H_0: \sigma^2_{acm} = \sigma^2_{man}$
$H_1: \sigma^2_{acm} < \sigma^2_{man}$ (6.1)
The test statistic is $F^* = s^2_{acm} / s^2_{man}$. Reject $H_0$ if $F^* < F_{u,v}(\alpha)$, where $u = v = N-1$ and the significance level is $\alpha = 0.05$.
Bias Test (Criterion 2: Equality of Means)
The test of significant differences between means, i.e. a bias between the mean area measurement of a single wound image produced by a sample of manual delineators and the mean area measurement produced by an active contour model, is defined as:
$H_0: \mu_{man} = \mu_{acm}$
$H_1: \mu_{man} \neq \mu_{acm}$ (6.2)
The associated t statistic (modified for unequal variances) is defined as:
(6.3)
with N-1 degrees of freedom.
Note: This test is less powerful than the equal-variance t test, which cannot be used here because the standard equal-variance form of expression for t is not distributed as a t-variable when the variances are unequal [Johnson and Bhattacharyya, 1992]. The consequence of this lower-power test is a broader confidence interval for biases. Also note that this series of tests is not intended to perform a ‘multiple-comparison’ type procedure among the algorithm performances. Development of a test strategy for comparing the overall precision of the algorithms amongst themselves is considered in §6.3.4.
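The two tests can be sketched for a single wound as follows. This is an illustration under stated assumptions rather than the analysis code used for the thesis: the t statistic is taken to be the usual unequal-variance form for two samples of equal size N, with the sign convention of algorithm minus manual, and the critical values are the ones quoted in §6.3.3 for N = 25. The function name and the sample data are hypothetical.

```python
import math
import statistics

# Critical values quoted in the text for N = 25 (24 degrees of freedom).
F_LOWER_CRIT = 0.5   # approximate lower 5% point of F(24, 24)
T_CRIT = 2.064       # two-sided 5% point of t with 24 df

def compare_with_manual(manual_areas, acm_areas):
    """Return F*, t and the outcome of both criteria for one wound."""
    n = len(manual_areas)
    if len(acm_areas) != n:
        raise ValueError("both samples must contain N measurements")
    var_man = statistics.variance(manual_areas)  # s^2_man, N-1 divisor
    var_acm = statistics.variance(acm_areas)     # s^2_acm, N-1 divisor
    f_star = var_acm / var_man
    # Assumed unequal-variance t statistic for equal sample sizes.
    mean_diff = statistics.fmean(acm_areas) - statistics.fmean(manual_areas)
    t_stat = mean_diff / math.sqrt((var_acm + var_man) / n)
    return {
        "F": f_star,
        "t": t_stat,
        "criterion_1": f_star < F_LOWER_CRIT,   # variance significantly reduced
        "criterion_2": abs(t_stat) <= T_CRIT,   # no significant bias detected
    }

# Made-up areas (not thesis data): same mean, one quarter of the spread.
manual = [100.0 + d for d in range(-12, 13)]
acm = [100.0 + 0.25 * d for d in range(-12, 13)]
print(compare_with_manual(manual, acm))
```

Hard-coding the critical values keeps the sketch free of external libraries; for a different N they would have to be looked up again.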
6.3.3 Development of Graphical Procedure
It is convenient to carry out the tests defined in the last section in a graphical form that allows the test results to be visualised, affording easier interpretation. The development of a representative graphical format proceeds as follows:
Firstly, define the axes of the graph by the following equations, for abscissae and ordinates respectively:
(6.4) |
|
(6.5) |
|
These performance measures are plotted for each wound image in Figure 6.1. Equation (6.4) defines a statistic that is related to the F-ratio test of (6.1); thus the abscissa value of a point on the graph may be used to test for the reduction of variance. From standard tables of the F-distribution, the critical value of $F_{24,24}$ at a significance level of $\alpha = 0.05$ is approximately 0.5. This allows the definition of a confidence interval for the variance of measurements made by manual delineations. The result is a pair of vertical lines on the graph at $X_{lower} = 1/\sqrt{2}$ and $X_{upper} = \sqrt{2}$ (the square roots of the critical variance ratios 0.5 and 2), signifying the limits for improved variance and degraded variance respectively.
Equation (6.5) defines a test statistic that is related to the t-test of (6.2), allowing the equality of the mean area measurements of a wound, produced by manual delineation and an algorithm, to be evaluated according to the vertical position of a point on the graph. Let $t_{critical}$ be the critical value of t with N-1 degrees of freedom (df) for a two-sided test at a significance level of $\alpha = 0.05$. Equation (6.3) may now be re-written in terms of (6.4) and (6.5) as:
(6.6) |
|
From tables of the two-sided t distribution with 24 df, $t_{critical}$ has a value of 2.064. Thus in this application (6.6) becomes:
(6.7) |
|
Again, (6.7) defines a pair of upper and lower confidence intervals that may be plotted on the graph.
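The limits of the improvement diagram can be sketched numerically. The sketch assumes that the abscissa X is the ratio of algorithmic to manual standard deviation and that the ordinate Y is the difference between the algorithmic and manual means divided by the manual standard deviation; under those assumptions the unequal-variance t statistic for equal sample sizes becomes t = Y sqrt(N) / sqrt(1 + X^2), so the bias limits are curves in X. This is a reading of the surrounding text and not necessarily the exact form of (6.4) to (6.7); the function names are hypothetical.

```python
import math

N = 25
T_CRIT = 2.064                 # two-sided, 24 df, alpha = 0.05 (from the text)
X_LOWER = 1.0 / math.sqrt(2)   # limit for improved variance (from the text)
X_UPPER = math.sqrt(2)         # limit for degraded variance (from the text)

def bias_limit(x, n=N, t_crit=T_CRIT):
    """Half-width of the no-bias band at abscissa x (assumed form)."""
    return t_crit * math.sqrt((1.0 + x * x) / n)

def classify(x, y):
    """Return (precision outcome, whether a significant bias is indicated)."""
    if x < X_LOWER:
        precision = "improved"
    elif x > X_UPPER:
        precision = "degraded"
    else:
        precision = "indeterminate"
    return precision, abs(y) > bias_limit(x)

print(round(bias_limit(1.0), 3))   # band half-width at the manual point X = 1
print(classify(0.5, 1.0))          # halved standard deviation, large bias
print(classify(0.5, 0.2))          # halved standard deviation, small bias
```

Expressing both axes in units of the manual standard deviation is what allows one pair of limits to serve every wound on the same diagram.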
The area measurement results in terms of measured means and variances are expressed in terms of the manual standard deviation so that an improvement in precision is clearly shown, without regard to the actual levels of manual and algorithmic fractional precision. Therefore, this diagram allows for the drawing of two confidence intervals, one for variance and the other for bias which are equally valid for all of the wounds measured so that the significance of all wounds may be tested and displayed on the same diagram (see Figure 6.1).
6.3.4 Comparison of Algorithms: Overall Test for Precision
The previous analysis allows the comparison of each algorithm’s performance with the manual delineation performance on an image-by-image basis. However, this does not necessarily allow a general comparison to be made between the algorithms over the whole 20 wounds. Since the variances are intended to be altered significantly by the application of each algorithm, and the variances will inevitably be image dependent, a non-parametric or rank test using blocking to remove the variation amongst wounds is used. The Friedman test meets such criteria. The basic datum for the test in this case will be the range of measurements produced by each algorithm upon each wound. The test is defined as follows:
(6.8) |
|
where $E_w$ is defined as the range block-effect, a random variable dependent upon the wound properties and its size. Note from Chapter 4 that both of these quantities affect precision. $R_a$ defines the general range (precision) of each algorithm. Note that $w \in \{1..W\}$ and $a \in \{1..k\}$, where $k = 5$. The ‘treatment’ groups denoted by ‘a’ correspond to the four algorithms and manual delineation, giving a total of $k = 5$ groups. $A_{w,a,i}$ corresponds to the ith area measurement of wound w made with algorithm a, where $i \in \{1..25\}$.
The null hypothesis states that the general range or spread of the area measurements, and hence the true variation of each algorithm’s performance, is the same. It is defined as follows:
$H_0$: All $R_a$ are equal
$H_1$: Not all $R_a$ are equal (6.9)
For a significant difference in range to be detected an algorithm must consistently perform better than the other algorithms. If $H_0$ is rejected, and therefore the algorithms have definite differences in overall precision, it becomes necessary to consider which algorithms differ. Dunn’s multiple paired-comparison procedure is suitable for this purpose [Neave and Worthington, 1988]. The paired differences computed within each block (wound) are approximately distributed as N(0,Wk(k+1)/6). This allows the construction of Gaussian confidence limits for the rank sums from the Friedman test, given by:
(6.10) |
|
where s = N(0,Wk(k+1)/6) |
When the confidence limits are plotted on a line diagram, overlapping ends imply that the ranges of area measurements produced by the two corresponding processes do not differ significantly at the $\alpha$ significance level. Note that the confidence limits apply to the rank sums and not to the original range data.
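The blocking step of the Friedman test can be sketched as follows: the range of the 25 measurements is computed for each wound and process, the k ranges are ranked within each wound, and the ranks are summed per process. The sketch assumes that the smallest range receives rank 1 and that ties share the average rank, which is the usual convention; the function names are hypothetical.

```python
def measurement_range(areas):
    """Range (max minus min) of one set of area measurements."""
    return max(areas) - min(areas)

def rank_within_block(values):
    """Rank the values of one block (wound) from 1 upwards.

    The smallest value gets rank 1; tied values share the average rank.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    return ranks

def rank_sums(ranges_by_wound):
    """Sum the within-wound ranks for each process (column)."""
    sums = [0.0] * len(ranges_by_wound[0])
    for row in ranges_by_wound:
        for column, rank in enumerate(rank_within_block(row)):
            sums[column] += rank
    return sums

# Made-up ranges for two wounds and three processes (not thesis data).
print(rank_sums([[3.0, 1.0, 2.0], [2.0, 1.0, 3.0]]))
```

Ranking within each wound removes the wound-dependent block effect, so a process only achieves a low rank sum by being consistently more precise across wounds.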
6.3.5 Quantification of Bias and Precision
The previous sections define the analytical method for the improvement of manually delineated wound measurements without regard to percentage values of precision or bias. The errors are analysed in terms of variance ratios (precision related) and systematic error (bias related). By their very nature, such relative values and their related hypothesis tests do not contain information about the size of either a precision or bias error. A significant bias in measured area is strong evidence that an active contour model is in general delineating a different boundary to the corresponding mean manual boundary, by including areas not defined as wound by the manual operators or by excluding areas which are defined as wound. However, when comparing manual and algorithmic measurements, obtaining a significant value from an appropriately applied hypothesis test does not necessarily indicate that the bias difference represents a large portion of the mean wound area. For wounds where the precision of both sets of measurements is relatively high, a hypothesis test such as the t-test will be correspondingly more sensitive to bias errors, i.e. the likelihood of detecting smaller fractional biases will increase. The absolute value of this bias is important, since when measurement precision is high, a relatively small bias is a small fractional error which introduces little variation in the final result despite being a major part of the overall error. Clearly, a large fractional bias that is significant can never be dismissed in such a manner. In the cases of wounds where the precision of measurements is poor, the discrimination of bias from the expected mean error under the equal-means hypothesis becomes increasingly blurred. This section therefore proposes an additional method of analysis that accompanies the previous one and which will allow the performance of the algorithms defined in this thesis to be reported explicitly in terms of the average percentage bias and precision.
Bias Measurement Method
The bias performance of each algorithm is assessed by calculating the mean of the absolute biases that exist between the manual and respective algorithmic measurements taken from each wound in this study. The mean absolute bias is calculated as follows:
Firstly, the individual bias for each wound w is defined as:
(6.11) |
|
The estimate of the standard error is dependent upon the ratio in (6.11) and can be shown to be:
(6.12) |
|
The overall mean absolute bias for the set of wounds is then defined as:
(6.13) |
|
where W is the number of wounds over which the average is taken.
This expression has an estimated standard error given by:
(6.14) |
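The point estimates of the bias measures can be sketched as below. The sketch assumes that the individual bias is the difference between the algorithmic and manual mean areas expressed as a percentage of the manual mean area, which matches the description of bias elsewhere in this chapter but may differ in detail from (6.11). The standard errors of (6.12) and (6.14) are not reproduced, and the function names are hypothetical.

```python
import statistics

def percent_bias(acm_areas, manual_areas):
    """Bias of the algorithmic mean relative to the manual mean, in percent."""
    mean_man = statistics.fmean(manual_areas)
    mean_acm = statistics.fmean(acm_areas)
    return 100.0 * (mean_acm - mean_man) / mean_man

def mean_absolute_bias(biases):
    """Average of the absolute per-wound biases over all wounds."""
    return statistics.fmean([abs(b) for b in biases])

# Made-up numbers (not thesis data): one wound over-measured, one under-measured.
biases = [
    percent_bias([110.0, 110.0], [100.0, 100.0]),
    percent_bias([48.0, 48.0], [50.0, 50.0]),
]
print(biases, mean_absolute_bias(biases))
```

Taking absolute values before averaging prevents positive and negative biases on different wounds from cancelling.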
|
Precision Measurement Method
Complementary to the bias calculation is the quantification of the average level of precision achieved by each algorithm. The mean fractional precision for each algorithm is defined by the following equations:
The fractional precision for a set of area measurements of one wound is defined as:
(6.15) |
|
The estimates of fractional precision are made with samples of N=25 measurements; thus the omission of the correction factor $c_4$ associated with estimates of standard deviation has little influence on the results. The standard error of (6.15) may be estimated from:
(6.16) |
|
where it is noted that the standard error of a standard deviation estimate based on a sample of size N is closely approximated by
(6.17) |
|
(6.18) |
|
The manual mean area for a wound is used as the reference point for all measurements, since it represents the combined opinions of several manual delineators. Although it is known that delineators are in general mutually biased
Appendix C contains tables of the bias and precision estimates of the wound area measurements arising from application of the algorithms to the twenty wound images. Estimates of absolute bias and precision for the application of each algorithm to each wound are calculated using (6.11) and (6.15), along with the associated standard errors, calculated with (6.12) and (6.16) respectively. The data in these tables are used to plot the graphs presented in this section.
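As an illustration of the precision measure, the sketch below computes a fractional precision as the sample standard deviation divided by the manual mean area, in percent, and evaluates the correction factor c4 for N = 25 to show why its omission matters little. The form of the fractional precision is an assumption based on the text around (6.15); the formula for c4 is the standard one for normally distributed data. The function names are hypothetical.

```python
import math
import statistics

def fractional_precision(areas, manual_mean):
    """Sample standard deviation as a percentage of the manual mean area."""
    return 100.0 * statistics.stdev(areas) / manual_mean

def c4(n):
    """Bias correction factor for the sample standard deviation of n values."""
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

# Made-up areas (not thesis data) around a manual mean of 100.
print(fractional_precision([98.0, 100.0, 102.0], 100.0))
print(round(c4(25), 4))
```

For N = 25 the factor is within about one percent of unity, so dividing by it would change a fractional precision of a few percent only in the second decimal place.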
6.4.1 Detection and Analysis of Outliers
The measurement of the spread or variation of area measurements is highly sensitive to the presence of outliers in the data. Thus, the hypothesis tests for variance reduction (6.1) and bias (6.2) along with the calculation for precision (6.15) and most definitely the calculation for range (6.8) will be affected. There are three possible sources of outliers in the area measurements:
The manual delineation data for the first set of images has been analysed for outliers in Chapter 4. Four area measurements are considered to be outliers; however, these measurements do not correspond to faults and so are not discarded. The 25 area measurements made by each algorithm are checked for outliers, for each wound, with the following rule:
Label Ai as an outlier if |
|
or |
|
where $Q_L$ and $Q_U$ are the lower and upper quartiles of each measurement sample respectively.
Applying this rule to the basic area data tends to indicate one or more outliers in approximately half of the 20 wound cases for each algorithm, although no case can be attributed to a procedural or equipment fault. Inspection of the regions delineated by each algorithm shows that outliers tend to be indicated when the data is divided into two groups, one being larger than the other, e.g. 20 measurements in one group and the remaining 5 measurements forming a distinct second group. This occurs because of the ambiguity of the edge evidence present in the images. Since the algorithms first seek equilibrium at an increased initial scale, the areas of the delineated regions produced by the algorithms at the initial scales may be similarly inspected for outliers. The result of this inspection shows that few cases of outliers are now indicated. Thus, at high scales, where the ‘multiple edges’ are merged, the data tend to be clustered (except for the most ambiguous wound cases). As the scale is reduced, the merged edges begin to break up and the noise also increases. This combination exploits the small differences that exist in the boundaries produced at higher scales and causes them to diverge as the scale is lowered.
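A quartile-based outlier rule of the kind described can be sketched as follows. The multiplier of 1.5 times the interquartile range (Tukey's fences) is an assumption, since the exact rule is not legible in this copy of the text, and the choice of quartile interpolation method is likewise illustrative. The function name is hypothetical.

```python
import statistics

def flag_outliers(areas, fence=1.5):
    """Return the measurements lying beyond the quartile fences.

    Assumption: a value is flagged if it lies more than `fence` times the
    interquartile range below the lower or above the upper quartile.
    """
    q_lower, _, q_upper = statistics.quantiles(areas, n=4, method="inclusive")
    spread = q_upper - q_lower
    low_fence = q_lower - fence * spread
    high_fence = q_upper + fence * spread
    return [a for a in areas if a < low_fence or a > high_fence]

# Made-up data (not thesis data) imitating the 20 + 5 split described above.
main_group = [100.0 + 0.5 * i for i in range(20)]
second_group = [130.0 + i for i in range(5)]
print(flag_outliers(main_group + second_group))
```

With a 20 to 5 split both quartiles fall inside the larger group, which is why the smaller group is flagged as a whole rather than being treated as a second valid solution.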
6.4.2 Relative Improvement of Algorithms
Figures 6.1 (a)-(d) show the combined results of the hypothesis tests for bias and improved precision defined by (6.1) and (6.2). These diagrams show the results of the algorithm trials in relative terms only, i.e. the diagrams show whether or not the variance of a set of area measurements made by an active contour model on a particular wound is a significant improvement upon the corresponding manual variance and also show if the measured bias differences are insignificant. There is no information in these diagrams to show fractional (percentage) errors, e.g. a point at (0.5,1.0) indicates that the standard deviation of a measurement sample produced by one of the algorithms for a particular wound was half of the corresponding manual standard deviation, regardless of whether the manual standard deviation equates to a precision level of 1% or 10%.
Likewise, an ordinate value of 1.0 represents a statistically significant bias but does not indicate the magnitude of bias expressed as a percentage of the mean manual area. If both manual and algorithmic measurements of a wound’s area are highly precise, then only a small bias is needed to signal a significant difference. If either the manual or the algorithmic measurements are more widely spread, then a corresponding increase in measured bias is required for a given significance; otherwise the significance decreases and any true bias is swamped by the variation in results. The confidence limits in Figures 6.1 (a)-(d) that define the regions corresponding to the outcome of the variance reduction test (6.1) are the vertical lines derived from (6.4). Equation (6.6) defines the function used to plot the upper and lower mean-difference confidence limits. The combination of these two separate intervals defines a key region on the diagram: this region is enclosed between the two t confidence curves and bounded on its right side by the lower F confidence limit. Marked points that lie within this region represent algorithm measurements of wounds where there is a significant reduction in variance and no statistically discernible bias.
[Figure 6.1, four panels: (a) GA Algorithm, (b) TN Algorithm, (c) MX Algorithm, (d) MG Algorithm]
Figure 6.1 Improvement Diagrams: the filled circles represent measurements of wounds from the initial (training) set. The squares represent measurements made on the additional set of images. Image 2.10 is out of range in all graphs due to a large negative t value. Image 2.2 is out of range on graphs (a), (c) and (d) for the same reason. Likewise, image 2.8 is out of range on graphs (b), (c) and (d) and image 2.6 is out of range on graph (b).
The central point at (1.0,0.0) in Figures 6.1 (a)-(d) represents the standard deviation and mean estimates for the manual delineation results for any wound. As a basic summary of the performance of each algorithm with respect to the manual delineation performance, Table 6.1 displays a count of the number of times each algorithm made the following specific improvements:
(a) Rejected the equal-variance hypothesis of (6.1), i.e. the algorithm significantly reduced the variance (Criterion 1).
(b) Did not reject a t-test of mean differences given by (6.2), i.e. the algorithm is not considered to have introduced a bias (Criterion 2).
(c) Both above cases were coincident and thus the point lies in the ‘key region’ mentioned above, i.e. complete improvement (both Criteria).
| Algorithm | Criterion 1: significant improvement | Criterion 2: Equation 6.2 not rejected | Both criteria met |
|---|---|---|---|
| GA | 14 | 8 | 7 |
| TN | 18 | 12 | 12 |
| MX | 14 | 8 | 5 |
| MG | 14 | 8 | 4 |

Table 6.1 Frequency of occurrences of agreement and precision improvement criteria
The hypothesis test outcomes for the application of the four algorithms to individual wounds are presented in a side-by-side comparative form in Table 6.2. From this table it can be observed which wound measurements were improved by most or all of the algorithms and which wounds tended to lead to poor measurements by the algorithms.
(A) Variance Reduction

| Image | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 1.10 | 2.1 | 2.2 | 2.3 | 2.4 | 2.5 | 2.6 | 2.7 | 2.8 | 2.9 | 2.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GA | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ? | ✓ | ? | ✓ | ? | ✓ | ✓ | ✓ | ✗ | ✓ | ? |
| TN | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ✓ | ✓ | ✓ | ? | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| MX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ? | ? | ✓ | ✓ | ✓ | ? | ✓ | ✓ | ✓ | ? | ✓ | ✗ |
| MG | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ? | ? | ? | ✓ | ? | ✓ | ? | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |

(B) Insignificant Bias

| Image | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 1.10 | 2.1 | 2.2 | 2.3 | 2.4 | 2.5 | 2.6 | 2.7 | 2.8 | 2.9 | 2.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GA | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| TN | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| MX | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| MG | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |

Table 6.2 Breakdown of area measurement improvements for each algorithm: (A) Improvement of Precision; (B) Insignificant Bias

Key

| Symbol | (A) Variance Reduction | (B) Insignificant Bias |
|---|---|---|
| ✓ | Significant improvement | Insignificant bias |
| ? | Indeterminate; insignificant change | |
| ✗ | Definite degradation | Significant bias |
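The counts in Table 6.1 can be cross-checked against the per-image outcomes in Table 6.2 with a short tally. The sketch transcribes each row of Table 6.2 as a string of 20 characters in image order 1.1 to 2.10, using 'Y' for a tick, '?' for indeterminate and 'N' for a cross; this encoding and the function name are illustrative choices.

```python
# Rows of Table 6.2 in image order 1.1 ... 1.10, 2.1 ... 2.10.
VARIANCE = {
    "GA": "YYYYYYYY??Y?Y?YYYNY?",
    "TN": "YYYYYYYYY?YYY?YYYYYY",
    "MX": "YYYYYYY???YYY?YYY?YN",
    "MG": "YYYYYYY???Y?Y?YYYNYY",
}
BIAS = {
    "GA": "NYNNYYNYYNNNYNNYYNNN",
    "TN": "YYNYYYYYYNYNYNYNYNNN",
    "MX": "NYNNYNYYYYNNYNNNYNNN",
    "MG": "NYNNYNNYYYNNYYNNYNNN",
}

def tally(algorithm):
    """Return (Criterion 1 count, Criterion 2 count, both criteria count)."""
    variance, bias = VARIANCE[algorithm], BIAS[algorithm]
    both = sum(v == "Y" and b == "Y" for v, b in zip(variance, bias))
    return variance.count("Y"), bias.count("Y"), both

for name in ("GA", "TN", "MX", "MG"):
    print(name, tally(name))
```

The tallies agree with the three columns of Table 6.1 for all four algorithms.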
6.4.3 Overall Precision Test Results
The result of the hypothesis test defined by (6.9), which tests the equality of the general precision of the algorithms and the manual delineations, is summarised in Table 6.3. The confidence intervals for the rank sums defined by (6.10) are used to produce the line plots shown in Figure 6.2. The overall grouping of the different algorithms is clear from this diagram. The results are discussed in Chapter 7.
[Figure 6.2: line plot of rank-sum confidence intervals]
Figure 6.2 Confidence intervals for rank sums of overall precision test
| Algorithm | GA | TN | MX | MG | Manual |
|---|---|---|---|---|---|
| Rank sum | 58 | 35 | 55 | 65 | 87 |

Friedman test statistic, S: 28.2
$\chi^2$ critical value at $\alpha = 0.01$: 13.3
Result: $H_0$ (6.9) is rejected, therefore measurement ranges differ

Table 6.3 Rank sums and Friedman statistic for general precision test
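The Friedman statistic in Table 6.3 can be reproduced from the reported rank sums using the standard formula S = 12 / (W k (k + 1)) * sum of squared rank sums - 3 W (k + 1), with W = 20 wounds and k = 5 processes. The sketch below is a check on the tabulated values only; the function name is hypothetical.

```python
import math

def friedman_statistic(rank_sums, n_blocks):
    """Friedman test statistic from the per-treatment rank sums."""
    k = len(rank_sums)
    sum_of_squares = sum(r * r for r in rank_sums)
    return 12.0 / (n_blocks * k * (k + 1)) * sum_of_squares - 3.0 * n_blocks * (k + 1)

# Rank sums reported in Table 6.3.
reported = {"GA": 58, "TN": 35, "MX": 55, "MG": 65, "Manual": 87}
W, k = 20, len(reported)

s = friedman_statistic(list(reported.values()), n_blocks=W)
print(round(s, 2))                      # compare with the tabulated 28.2
print(math.sqrt(W * k * (k + 1) / 6))   # standard deviation implied by N(0, Wk(k+1)/6)
```

The rank sums total 300, as they must for 20 wounds each contributing ranks 1 to 5, and the computed statistic rounds to the tabulated 28.2, well above the critical value of 13.3.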
6.4.4 Bias and Precision Results
The basic measurement data contained in Appendix C, which was presented in the foregoing section in terms of improvement upon the manual performance, is presented here a second time but in a different guise. As stated previously, the improvement data is expressed in terms of the manual precision, which may be directly interpreted in terms of a hypothesis test but which contains no information about the sizes of random and systematic errors in relation to the ‘true’ area of a wound. The general performance of each algorithm in terms of bias is presented in Table 6.4 as mean absolute bias, accompanied by the counts from Table 6.1 for the number of times the t-test for mean differences was not rejected for each algorithm. Mean absolute bias and its standard error are estimated with (6.13) and (6.14) respectively. Average fractional precision is estimated with (6.17) and its standard error is estimated with (6.18).
| Algorithm | MAB ± std. error [%] | PF av ± std. error [%] | #(6.2) not rejected | #(6.1) rejected |
|---|---|---|---|---|
| GA | 3.4 ± 0.2 | 2.4 ± 0.2 | 8 | 14 |
| TN | 4.0 ± 0.2 | 1.8 ± 0.2 | 12 | 18 |
| MX | 3.8 ± 0.2 | 2.6 ± 0.3 | 8 | 14 |
| MG | 2.8 ± 0.2 | 3.1 ± 0.3 | 8 | 14 |
| Manual | NA | 4.3 ± 0.3 | NA | NA |

Table 6.4 Mean absolute bias and mean precision for algorithms
[Figure 6.3, four panels: (a) GA Algorithm, (b) TN Algorithm, (c) MX Algorithm, (d) MG Algorithm]
Figure 6.3 Bias-precision plots: the filled circles correspond to measurements made on the first set of images. Although the diagrams do not indicate the significance of either of the hypothesis tests previously described, the practically significant cases where bias is large are evident.
The bias and precision estimates for individual wounds are displayed in Figures 6.3 (a)-(d) for the four algorithms. The dashed rectangle plotted in each diagram bounds the region where bias is within ±5% and fractional precision is below 5%, and is included purely as a visual guide for the reader.
This chapter has presented the results of applying the four active contour models described in Chapter 3 to twenty images of leg ulcers. The first objective of the presentation has been to display the performance measures defined in §6.1 that compare the observed performance of the algorithms with the observed performance of the manual delineators (results from §4.1). The performance measures are intended to indicate (a) whether the algorithms can reliably reduce the variability inherent in manual delineation and (b) whether the results of algorithmic area measurements agree with those produced by manual delineation. Secondly, a performance measure has been devised to compare the average measurement precision produced by the algorithms. Emphasis has been placed on the need to evaluate the performance of the algorithms over a varied range of wound images, since such wounds are likely to occur frequently in practice. The results presented will be discussed in detail in the next chapter, where particular attention will be paid to those wound images where the algorithms fail to produce measurable improvements.