
Towards improved 3D reconstruction of cystoscopies through real-time feedback for frame reacquisition


Abstract

Cystoscopic video can be cumbersome to review; however, preservation of data in the form of 3D bladder reconstructions has the potential to improve patient care. Unfortunately, not all cystoscopy videos produce viable reconstructions, because their underlying frames contain artifacts, such as motion blur and bladder debris, that render them unusable for 3D reconstruction. Here, we develop a real-time pipeline, termed the Assessment and Feedback Pipeline (AFP), that alerts clinicians when unusable frames are detected and encourages them to recollect the last few seconds of data. We show that the AFP classifies frames as usable or unusable with a balanced accuracy of 81.60% and demonstrate that use of the AFP improves 3D reconstruction coverage. These results suggest that clinical implementation of the AFP would improve 3D reconstruction quality through real-time detection and recollection of unusable frames.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The high recurrence rate of bladder cancer drives the current clinical standard of frequent surveillance: once diagnosed, individuals with bladder cancer undergo clinical surveillance every three months to one year with white-light cystoscopy, the gold standard for diagnosis and follow-up in the bladder [1]. Data from past cystoscopies are important to monitor progression and recurrence and to aid in planning subsequent treatment and surgery. Unfortunately, comprehensive review of cystoscopic videos is cumbersome and does not fit readily into clinical workflow; the video, if recorded, is often discarded in favor of a scant number of still images and notes from the cystoscopy procedure.

As a data-rich alternative to images and notes, previous works have attempted to convert cystoscopy data into 2D and 3D panoramas [2,3,4], which provide an extended view of a scene. While convenient to review, panoramas do not preserve the bladder shape and can therefore distort the appearance of the bladder wall. In contrast, 3D bladder reconstructions preserve the shape and appearance of the bladder in a form that is also convenient for clinicians to review. Not all cystoscopy videos are easily converted to 3D reconstructions, however. One common cause for reconstruction failure is when the underlying video frames contain artifacts, such as bladder debris and motion blur, that render the frames unusable for reconstruction [5]. We denote such frames as “unusable” in the context of this paper, while frames that can be used in 3D reconstructions are denoted as “usable.” In the absence of sufficient usable frames, reconstruction algorithms are unable to recover the information needed to match and place frames in 3D space.

Several methods have been developed to identify low quality frames from endoscopy videos [6,7,8]. Unfortunately, such classification methods do not fix the problem of reconstruction failure due to insufficient usable frames, as they themselves do not improve underlying video or frame usefulness. Moreover, simply removing the identified frames from the video is not practical as it would lead to gaps where data are missing. Too many missing frames can also lead to overall failure of the reconstruction.

Alternatively, several attempts to improve the quality of endoscopic video frames in post-processing have been demonstrated through methods such as brightness and tone adjustment, noise reduction, contrast enhancement, and temporal filtering [9,10,11,12,13]. However, these methods cannot address artifacts that completely obscure the view of the bladder wall over many sequential frames, such as intense blur and bladder debris.

One potential strategy to ensure that the cystoscopy video has sufficient usable frames to form a complete 3D reconstruction is to provide feedback during the video collection so the clinician can capture usable frames from each region. To this end, here we propose the Assessment and Feedback Pipeline (AFP), a novel endoscopy guidance system that classifies frames in real time on the basis of their usefulness for 3D reconstruction and provides immediate feedback to clinicians to suggest reacquisition of regions where unusable frames have been collected.

To our knowledge, the AFP is the first pipeline to both identify unusable frames and provide real-time feedback to improve the usefulness of cystoscopy videos and the quality of their resulting 3D reconstructions. Ultimately, clinical implementation of this strategy will yield cystoscopy videos of higher quality overall, which translates to improved cystoscopy records, better visibility of the sections of the bladder being imaged, and more successful 3D bladder reconstructions that may enable better disease characterization and patient outcomes, including decreased recurrence and progression. Moreover, the feedback provided from this pipeline may find use in medical training programs.

2. Methods

2.1 AFP overview

The goal of the AFP is to assess the usefulness of incoming frames during the cystoscopy and to cue clinicians to recollect data from regions where unusable frames, which inhibit 3D reconstruction, have been collected. Collection of new frames to replace regions comprising unusable frames, a process we call "recollection," is intended to support the possibility of a successful 3D reconstruction. In practice, during recollection, clinicians would be advised to slow down to reduce the likelihood of motion blur or visually disruptive bladder debris.

The AFP (Fig. 1) is a real-time (>30 fps) frame-level algorithm that contains three primary steps: frame selection, frame classification, and clinician feedback. In brief, for each frame collected, the frame selection step first determines whether the frame should be further processed. In order to maintain real-time operation of the algorithm, it is neither necessary nor practical to process every frame, particularly because the high frame rate of the cystoscopy video makes many sequential frames redundant. Unselected frames are marked as “unprocessed” while selected frames are assessed and classified as usable or unusable. When a frame is marked as unusable, this triggers a visual feedback cue to the clinician to suggest reacquisition of data from that region; these reacquired frames are then also classified by the AFP as usable or unusable. The output of the AFP is thus a full cystoscopy video containing unprocessed, usable, and unusable frames.


Fig. 1. Algorithm flow chart for the AFP. Each incoming cystoscopy frame undergoes the three steps of the AFP. Frames to be processed are chosen through a frame selection step that passes every nth frame to the frame classification step and labels the rest as unprocessed data. The selected frames are subject to frame classification, which classifies and labels frames as usable or unusable. Unusable frames trigger a request for the clinician to recollect the last four seconds of data in the clinician feedback step.


2.2 Choice of 3D reconstruction pipeline

The choice of pipeline was important in helping us develop metrics to classify frames for 3D reconstruction. The AFP was built with the intention of using CYSTO3D, a reconstruction pipeline we previously developed for 3D reconstruction of bladder cystoscopy data [5]. Hence in this work, our decision about the suitability of a frame (i.e., marking it as usable or unusable) for 3D reconstruction is based on whether or not that frame is likely to be included among the reconstructed frames used by CYSTO3D. For this, we use the outcome of whether or not CYSTO3D generates a pose for a given frame as the ground truth label: frames for which a pose is generated are considered usable, and frames that do not yield a pose are considered unusable.

Pose determination, that is, the preliminary assignment of the angle and position of the frame in 3D space, is a critical intermediate step for many 3D reconstruction pipelines. It is often accomplished via structure-from-motion (SfM) [14], which recovers the 3D structure of the bladder by detecting features in the frames and matching frames that share the same features. Frames for which poses are generated generally have clear, high-contrast features of the bladder wall (Fig. 2(A)), such as well-defined vasculature. In contrast, unusable frames often suffer from one or both of two imaging artifacts: motion blur and bladder debris. Motion blur (Fig. 2(B)) decreases contrast and makes features less clear. Bladder debris (Fig. 2(C)) describes particles that obstruct the cystoscope's view of the bladder wall, hiding the features needed to place these frames in the context of the reconstruction. Debris also hinders robust feature extraction and matching because it can yield detected features on the debris itself that do not capture the appearance or structure of the bladder. Because debris moves independently of the mostly stationary bladder wall, such features match poorly with features in other frames that capture the same region of the bladder. Bladder debris comes in different sizes and can be clustered or dispersed. The white arrows in Fig. 2(C) point to many small pieces of debris that obstruct much of a frame, and the bracket marks a single large piece that covers the corner of a frame while leaving the rest unobstructed. Even when much of a frame contains a clear view of the bladder, it is important to note that the SfM step of CYSTO3D selects only entire frames for reconstruction and removes frames with too few features or poor frame matching. Thus, the presence of bladder debris in one area of a frame can render the entire frame unusable.


Fig. 2. Example frames from patient cystoscopy videos collected as part of this study. Visualization of the bladder wall is best in (A) artifact-free frames with a clear view where the cystoscope is focused, close to the bladder wall, and unmoving. Visualization is impaired by the presence of (B) motion blur, which decreases contrast and makes features less clear. Visualization is also impaired by (C) bladder debris, both small (arrows) and large (bracket), which obstructs the view of the bladder wall.


2.3 Dataset

To build and test the AFP, we generated a database comprising 12 cystoscopy videos. All data were collected through the approved Vanderbilt Institutional Review Board (IRB) study #201269, and we obtained consent from participants to save their data. Participants were 18 years or older with planned cystoscopies at the Vanderbilt Urology clinic as part of their standard medical care. Each cystoscopy was performed by a urologist using a digital Ambu cystoscope. Cystoscopy videos ranged in length from 30 seconds to 3 minutes. In general, we asked clinicians to scan slowly and close to the bladder wall to reduce motion blur and enable better visibility of the bladder.

2.4 Design of the frame classifier

At the heart of the AFP is the classification for each frame. We built a support vector machine (SVM) [15,16] classifier using a feature vector based on metrics computed for each frame. We chose to use an SVM because it is highly tunable and is effective in high dimensional spaces, which is helpful in allowing us to incorporate many frame quality metrics.

The following metrics were considered for use in the feature vector, chosen mostly for their ability to detect blur or sudden changes in frame usefulness, such as those caused by bladder debris (a computational sketch of several of these metrics follows the list):

  • A. Feature count: In our analysis of past work, we noted that a common reason for failure in 3D reconstructions was a lack of salient scale invariant feature transform (SIFT) features to enable frame matching in the SfM step [5,17]. Thus, we hypothesized that an insufficient number of features would limit frame utility. To reduce the computation time for feature detection, the AFP computes Oriented FAST and Rotated BRIEF (ORB) features, which can be extracted in real time. As required by ORB, features were computed on grayscale values. Frames with fewer features were more likely to be unusable.
  • B. Standard deviation of feature count: The number of features alone is inadequate to characterize frame utility, because it does not account for the fact that the features, once identified, need to appear in multiple frames so that corresponding features can be matched and triangulated during the matching step. That is, a frame with many features may fail to yield a pose if surrounding frames do not share the same features, such as when vasculature features are obscured by the sudden introduction of blur or bladder debris in surrounding frames. We found that such frames were characterized by a highly varying number of features relative to their near-neighbor frames. We thus used the standard deviation of the feature counts over a sliding window of eleven frames, both processed and unprocessed, to identify frames that would be poor match candidates and, therefore, unlikely to have a pose generated.
  • C. Matched feature count: While the features detected in each frame were indicative of the likelihood of frame matching, we could also assess the performance of some frame matching in real time. Frame matching occurred most often between frames collected within a short time of one another, because such frames overlapped in bladder coverage. Therefore, we matched each frame to its tenth predecessor using the OpenCV Brute-Force Matcher [18] and used the number of matched features between the frames as a metric. We chose the tenth predecessor because we empirically determined that a window size of eleven, which includes the frame being assessed and its tenth predecessor, resulted in the highest classification balanced accuracy.
  • D. Matched feature average distance: The matched feature count did not capture the quality of the matches themselves, which can be quantified by the distance between the matched features. It is important to note in this context that distance does not refer to a physical distance; rather, it refers to the dissimilarity between the matched features. Here, a lower distance indicates a stronger match. Thus, we calculated the average distance between matched features of a frame and its tenth predecessor, again using a window size of eleven.
  • E. Contrast: One potential reason for a low feature count is a lack of image contrast; therefore, we quantified the contrast of a frame as the average root mean square (RMS) contrast across the color channels. The RMS contrast is the standard deviation of the pixel intensities within a single color channel. We averaged these standard deviation values across the color channels; low contrast is indicated by a low average standard deviation.
  • F. Standard deviation of contrast: We observed that the low contrast of some bladders could be simply due to the low amount of vasculature. Hence, we also considered changes in the contrast over multiple frames, which was connected to the introduction of blur or bladder debris. To compute this metric, we calculated the standard deviation of the contrast values over a window size of eleven. Ultimately, this metric was useful in identifying regions of rapid change in contrast, which indicated a change in imaging conditions, such as the introduction of blur or bladder debris.
  • G. Tenengrad metric: We chose to incorporate the Tenengrad method [19] because it has shown strong performance in blur detection [20,21]. To use this method, we applied the Sobel operator to detect intensity gradients (edges) within the frame and then took the magnitude of the resulting gradient image. A higher magnitude indicated more well-defined edges, while a lower magnitude indicated less intense changes in pixel intensity, which is associated with blur.
  • H. Standard deviation of Tenengrad metric: While the Tenengrad method would ideally detect blur in most cases, we recognized that the magnitude of the edges can be low in a usable frame with little vasculature. Thus, we took the standard deviation of the values calculated with the Tenengrad method over a window size of eleven. We included this to assist in the distinction between usable frames with little vasculature and unusable frames, as a change in the value between frames indicates the introduction of blur.
  • I. Isolated pixel ratio: We used the isolated pixel ratio (IPR) [7], originally developed to classify endoscopy frames as informative or non-informative, to assess the continuity of edges, as frames with fewer connected edge pixels are less likely to have defined edges and more likely to be blurry. To detect edges in the frame we used a Sobel operator, because it is a less time-intensive alternative to the Canny edge detection implemented in the original IPR algorithm [7]. Next, we calculated the IPR by computing the ratio of the number of isolated edge pixels (i.e., single pixels that are not connected to other edge pixels) to the total number of edge pixels. Here, a higher IPR indicated blur.
  • J. Standard deviation of isolated pixel ratio: We also chose to take the standard deviation of the IPR, which can be greater in frames with little vasculature, although the frames themselves may or may not be useful. The standard deviation of the IPR assisted in frame classification in these frames, as an increase in the IPR between frames indicated the introduction of blur. We again took the standard deviation of the IPR over a window size of eleven frames.
  • K. Blind/referenceless image spatial quality evaluator (BRISQUE) features: BRISQUE [22] is a method for image quality assessment based on natural scene statistics; it has low computational complexity and has previously been used to classify endoscopy frames using human perception of quality as the reference [8,13]. Rather than using BRISQUE itself, we chose to use the 36 statistics calculated as part of BRISQUE. We did this because BRISQUE is itself a trained model that calculates and processes those statistics for image quality assessment; that trained model is similar in role to the SVM used for our frame classification, so including it was unnecessary.
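
To make these definitions concrete, the following is a minimal Python sketch, using OpenCV and NumPy, of how several of the single-frame metrics above could be computed. The ORB feature budget, Sobel kernel size, and edge threshold are illustrative assumptions rather than the exact parameters used in the AFP; the windowed variants (B, F, H, and J) would then be the standard deviation of these values over an eleven-frame sliding window.

```python
import cv2
import numpy as np

def frame_metrics(bgr):
    """Compute illustrative versions of metrics A, E, G, and I for one frame."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # A. Feature count: ORB keypoints detected on the grayscale frame.
    orb = cv2.ORB_create(nfeatures=1000)          # feature budget is an assumption
    feature_count = len(orb.detect(gray, None))

    # E. Contrast: RMS contrast (std of intensities) averaged over color channels.
    contrast = float(np.mean([bgr[:, :, c].std() for c in range(3)]))

    # G. Tenengrad: magnitude of the Sobel gradient image; low values suggest blur.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx**2 + gy**2)
    tenengrad = float(magnitude.mean())

    # I. Isolated pixel ratio: isolated edge pixels over total edge pixels.
    edges = (magnitude > 100).astype(np.uint8)    # edge threshold is an assumption
    # Count each pixel's 8-connected edge neighbors via a box filter minus self.
    neighbor_count = cv2.filter2D(edges, -1, np.ones((3, 3), np.float32)) - edges
    isolated = int(np.sum((edges == 1) & (neighbor_count == 0)))
    total = int(edges.sum())
    ipr = isolated / total if total else 1.0      # treat edge-free frames as blurry

    return {"feature_count": feature_count, "contrast": contrast,
            "tenengrad": tenengrad, "ipr": ipr}
```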

The SVM we built was optimized to improve frame classification. We note that SVM performance can be inhibited when redundant or poorly performing metrics are used. To check for these effects, we first computed the correlations between the metric values to look for redundancy, generally indicated by a Kendall's tau correlation coefficient [23,24] close to positive or negative one. Kendall's tau quantifies the strength of correlation between metrics, ranging from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no correlation. There is no set threshold in the literature to distinguish redundant from non-redundant metrics; hence, we considered thresholds used by others [25,26,27] as well as our own dataset and application. We decided to include metrics with a value between, but not including, -1 and 1 because we wished to remove only highly redundant features before the feature selection step. We confirmed that the correlation coefficients were greater than -1 and less than 1 for all metric pairs, indicating that the information was not redundant.
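
As a sketch of this redundancy check, the pairwise coefficients can be computed with SciPy's kendalltau. Per the criterion above, only pairs with |tau| = 1 would be flagged; the small tolerance here is our assumption, added to absorb floating-point error.

```python
from itertools import combinations
from scipy.stats import kendalltau

def redundant_metric_pairs(X, names, tol=1e-9):
    """Return metric pairs whose Kendall's tau is (numerically) +1 or -1.

    X: array of shape (n_frames, n_metrics); names: list of metric names.
    """
    flagged = []
    for i, j in combinations(range(X.shape[1]), 2):
        tau, _ = kendalltau(X[:, i], X[:, j])
        if abs(tau) >= 1.0 - tol:   # perfectly (anti)correlated, hence redundant
            flagged.append((names[i], names[j], tau))
    return flagged
```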

Next, we used Recursive Feature Elimination (RFE) [28] to keep the metrics that resulted in the highest balanced accuracy. RFE is a technique to select the most important features in a model by iteratively removing features, fitting the model to the remaining features, and assessing classification performance for each subset of features. For our purposes, we retrained the SVM with every possible subset of metrics and calculated the overall balanced accuracy of frame classification with the newly trained SVM. We then compared balanced accuracies across the subsets of metrics and found that the highest balanced accuracy was achieved when we kept the following metrics: A. Feature count, C. Matched feature count, D. Matched feature average distance, E. Contrast, F. Standard deviation of contrast, H. Standard deviation of Tenengrad metric, I. Isolated pixel ratio, J. Standard deviation of isolated pixel ratio, and a subset of 19 of the 36 K. BRISQUE features, which included all but one of the features related to shape. From this selection, we can infer that metrics related to feature detection and matching; metrics quantifying contrast, blur, and edge detection; and BRISQUE features related to shape are all helpful. This aligns with what we have seen in our data: regions of cystoscopy videos containing frames with high contrast, well-defined edges, and clear structures lend themselves well to 3D reconstruction.
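
This work evaluates every metric subset exhaustively; for larger feature sets, a standard approximation is scikit-learn's recursive elimination with cross-validated balanced accuracy. The sketch below assumes a linear-kernel SVM (RFE ranks features by model weights, which an RBF kernel does not expose) and is therefore a stand-in rather than the exact procedure used here.

```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def select_metrics(X, y):
    """X: (n_frames, n_metrics) feature matrix; y: 1 = usable, 0 = unusable."""
    selector = RFECV(
        estimator=SVC(kernel="linear"),   # linear kernel exposes coef_ for ranking
        step=1,                           # drop one metric per iteration
        cv=StratifiedKFold(n_splits=5),
        scoring="balanced_accuracy",      # the criterion optimized in this work
    )
    selector.fit(X, y)
    return selector.support_              # boolean mask over the candidate metrics
```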

2.5 AFP structure

2.5.1 Frame selection

The frame selection step (Fig. 1, Frame Selection) of the AFP is designed to minimize computation time. Since most clinical systems acquire video data at 30 fps, the relatively slow speed of the motions made by clinicians leads to many frames with nearly redundant appearance. Hence, it is not necessary to process every frame; moreover, most imaging artifacts span multiple frames. For this reason, we perform a simple down-selection whereby we process only one of every four frames (n = 4). We chose this selection rate because unusable frames in our dataset came in clusters of 7.44 ± 3.10 frames on average, and since the AFP only needs to detect one of the unusable frames in each cluster, most unusable frames could be detected with a selection rate of four frames. Frames that are not selected for processing are labeled as "unprocessed." To clarify, while we classify only every fourth frame, we found that including immediately adjacent frames in our standard deviation-based metrics improved classification; we therefore calculated these select metrics for unprocessed frames as well, so that we could take the standard deviation across all frames in the sliding window (see the sketch below).
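
A minimal sketch of this selection logic: every frame feeds the eleven-frame sliding window used by the standard deviation-based metrics, but only every fourth frame is classified. The cheap_metric and classify helpers are hypothetical placeholders, not names from the AFP.

```python
from collections import deque
import numpy as np

N = 4          # classify every 4th frame
WINDOW = 11    # sliding-window size for standard-deviation metrics

window = deque(maxlen=WINDOW)

def on_frame(frame, index, cheap_metric, classify):
    window.append(cheap_metric(frame))        # computed for every frame
    if index % N != 0:
        return "unprocessed"                  # not selected for classification
    window_std = float(np.std(window))        # e.g., metric F (std of contrast)
    return classify(frame, window_std)        # "usable" or "unusable"
```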

2.5.2 Frame classification

The frame classification step classifies a processed frame as usable or unusable in real time, depending on its likelihood of being usable for 3D reconstruction (i.e., of yielding a pose in CYSTO3D). For each frame, we first compute the aforementioned frame quality metrics to generate a feature vector. The feature vector is then input into the SVM, which outputs a binary frame classification of either usable or unusable. Computing metric values and classifying a frame takes approximately 0.02 s on a desktop with an Intel Xeon W-2255 Processor @ 3.70 GHz.

2.5.3 Clinician feedback

Upon identification of an unusable frame, the AFP displays a visual cue on the screen to alert clinicians that they should consider recollecting new frames. In real clinical deployment, we envision this cue taking the form of a small indicator LED that is visible to the clinician but not alarming to the patient, who is most often awake during clinical cystoscopies. When the feedback cue is triggered, it alerts the clinician to begin recollecting the last four seconds of data (Fig. 1, Clinician Feedback). The duration of the recollection period was determined in consultation with our team urologist; if necessary, this timing can be modified in future implementations. During recollection, the clinician should aim to cover the same region of the bladder as that most recently imaged, but in a more deliberate manner. While the region of the bladder in the recollected frames may differ slightly from that in the original frames, this discontinuity is unlikely to negatively affect reconstruction: CYSTO3D uses both initially collected and recollected frames and does not match only sequential frames; rather, each frame in the video is compared to all other frames in the video, so coverage of slightly different areas at different timepoints is still beneficial. Each new incoming frame is subject to the same real-time frame selection and frame classification steps described above; hence, an unusable frame captured during a recollection period can trigger another recollection event.
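
One way the cue could be driven is sketched below, assuming a hypothetical set_indicator display hook and a monotonic clock: an unusable classification (re)starts a four-second cue during which the clinician recollects data, and a new unusable frame restarts the window, mirroring the retriggering behavior described above.

```python
import time

RECOLLECT_SECONDS = 4.0   # recollection duration suggested by our team urologist
_cue_started = None

def on_classification(label, set_indicator, now=None):
    """Update the visual cue after each classified frame."""
    global _cue_started
    now = time.monotonic() if now is None else now
    if label == "unusable":
        _cue_started = now                    # a new unusable frame restarts the cue
    cue_on = _cue_started is not None and (now - _cue_started) < RECOLLECT_SECONDS
    set_indicator(cue_on)                     # e.g., light the indicator LED
```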

The output of the clinician feedback step in the AFP is a cystoscopy video containing unprocessed, usable, and unusable frames, where the regions of the bladder with unusable frames have ideally also been captured in usable frames that were collected through recollection.

2.6 AFP development and testing

We developed the AFP with clinical cystoscopy videos (Sec. 2.6.1) then performed an experiment to determine the effect of the AFP on 3D bladder reconstructions (Sec. 2.6.2).

2.6.1 Frame classifier development and training

To develop the classifier, we first trained the SVM and then tuned it by determining the optimal parameter values (C and gamma) and the thresholds that yielded the highest balanced accuracy for frame classification. We chose to use balanced accuracy when finding these parameters, as opposed to accuracy or the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, because our dataset was imbalanced, meaning there were many more usable frames than unusable frames in most videos (Table 1).
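
For reference, balanced accuracy is the mean of sensitivity and specificity, which makes it robust to the class imbalance in Table 1:

$$\textit{balanced accuracy} = \frac{\textit{sensitivity} + \textit{specificity}}{2} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$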


Table 1. Frame distribution

We used a leave-one-out cross-validation (LOOCV) approach to determine the SVM parameters. Essentially, we selected one video for testing and used the remaining videos for training, then repeated this process for every video in the group. LOOCV was chosen because it reduces bias and underfitting, which are more likely to occur in methods where the SVM is trained on only one set of videos. During the SVM training step for a given test video, we used the scikit-learn grid search function [28] to optimize the values of the constants related to the SVM (C and gamma). This function searches through a user-defined grid of parameters, assesses model performance across the parameter combinations, and returns the parameters that optimize performance. The threshold applied to the SVM output values was the value that resulted in the highest balanced accuracy for frame classification on those particular training videos. The resulting SVM was applied to the test video, and the threshold was applied to the values output by the SVM to predict frame classifications. These predicted classifications were then compared to the ground truth classifications (obtained by determining whether a frame yielded a camera pose in 3D bladder reconstruction). This process was repeated for every video in our dataset. We assessed classification performance by averaging the sensitivity, specificity, accuracy, balanced accuracy, and AUC of the ROC curve across all test videos. While we only aimed to maximize balanced accuracy, we thought it important to measure and report all these values so readers could assess the performance of our method as it relates to their application. The AFP then used the C and gamma values that resulted in the highest balanced accuracy for the most test videos.
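
A minimal scikit-learn sketch of this LOOCV loop, treating each video as a group, might look as follows. The parameter grid shown is an illustrative assumption, and the per-video threshold tuning on the training videos is reduced to a placeholder.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.svm import SVC

# X: per-frame feature vectors; y: usable (1) / unusable (0);
# video_ids: one video identifier per frame, used as the LOOCV group.
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}  # illustrative grid

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=video_ids):
    search = GridSearchCV(SVC(), param_grid, scoring="balanced_accuracy", cv=3)
    search.fit(X[train_idx], y[train_idx])
    threshold = 0.0                                   # placeholder; tuned on the
                                                      # training videos per the text
    scores = search.decision_function(X[test_idx])    # SVM output values
    y_pred = scores > threshold                       # predicted classifications
```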

To mimic clinical deployment, we also wished to show classification performance when a common threshold was used among all videos. Given the small number of videos in the dataset, it was not reasonable to separate videos into training and testing sets. The threshold we used was the mean of those thresholds determined from the prior experiment using LOOCV. It is important to note that this experiment with the common threshold introduced bias, because the videos being tested had also been used during training.

2.6.2 Clinician feedback testing

To evaluate the probable effect of clinician feedback on reconstruction quality, we simulated the use of the AFP during a clinical cystoscopy by modeling the capture of unusable frames and the reacquisition of usable frames to replace them. To achieve this, we identified frame sequences in clinical cystoscopy videos that comprised at least four seconds of clear, usable frames. These sequences became candidates for "usable" recollected data. In each video, we identified at least two such sequences, denoted A and B, of approximately 120 frames each (Table 2). We included two selections of frames for each video to show the importance of recollection, as the final reconstruction coverage depends strongly on which frames in the video are deemed unusable; recollection can help prevent missing content from frames that are critical to enable a large-area reconstruction. We then created an associated "unusable" frame for each sequence by averaging the frames in the sequence [29] and applying a blur filter with the OpenCV filter2D function [18]; the unusable frame thus simulates motion blur. Note that for this evaluation, the nature of the simulated artifact (i.e., motion blur or bladder debris) was not important, as the main goal was to create a frame that would be unusable by CYSTO3D.
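
A sketch of how such an "unusable" frame could be synthesized with OpenCV is shown below; the kernel size and the horizontal blur direction are illustrative assumptions rather than the exact settings used in this work.

```python
import cv2
import numpy as np

def make_unusable_frame(frames, ksize=25):
    """Average a sequence of usable frames, then apply a motion-blur kernel."""
    avg = np.mean(np.stack(frames).astype(np.float32), axis=0)  # temporal average [29]
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize           # horizontal motion-blur kernel
    blurred = cv2.filter2D(avg, -1, kernel)       # blur filter per the text [18]
    return np.clip(blurred, 0, 255).astype(np.uint8)
```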


Table 2. Frame selection for each set of synthetic videos

From each clinical video, we created four synthetic videos of three types for use in our evaluation (Fig. 3):

  • 1. Synthetic original: This video models a cystoscopy video with occasional unusable frames collected without the AFP. Here, both sequences A and B were removed and replaced with their corresponding synthesized unusable counterparts.
  • 2. Synthetic AFP-assisted video A and synthetic AFP-assisted video B: These videos model cystoscopy videos with occasional unusable frames followed by usable, recollected frames of the same region, as if collected with the AFP. Here, either sequence A or B (but not both) is retained, with the synthesized unusable frame inserted just before the retained sequence to model the trigger for recollection. The other sequence was removed and replaced with its unusable counterpart.
  • 3. Synthetic AFP-assisted video A and B: This is similar to the previous case, but here both sequences A and B were retained, each preceded by its unusable counterpart frame.


Fig. 3. Synthetic videos used for testing. Each clinical cystoscopy video was used to create four videos: the synthetic original video to model video contents without the AFP, the synthetic AFP-assisted A and synthetic AFP-assisted B videos to model video contents with occasional use of the AFP, and the synthetic AFP-assisted A and B video to model video contents with continuous use of the AFP.


We then created and compared the 3D reconstructions from the synthetic original and AFP-assisted videos. The reconstructions were compared in terms of two metrics to assess quality: one to assess bladder coverage and one to assess accuracy. We quantified bladder coverage with a metric called the area of reconstruction coverage (ARC), which was the number of pixels contained in the resulting 3D reconstruction [30]. We assessed accuracy with the average reprojection error (RPE), which is commonly used to assess 3D reconstructions.
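
For reference, a common formulation of the average RPE over $N$ reconstructed feature observations is

$$RPE = \frac{1}{N}\sum_{i=1}^{N} \left\lVert \mathbf{x}_i - \pi\!\left(\mathbf{X}_i\right) \right\rVert,$$

where $\mathbf{x}_i$ is an observed 2D feature location, $\mathbf{X}_i$ is its triangulated 3D point, and $\pi(\cdot)$ projects the point back into the image using the estimated camera pose. This general form is our gloss; the exact formulation used by CYSTO3D is not restated here.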

We then used a repeated measures one-way Analysis of Variance (ANOVA) test to compare the reconstructions from the synthetic original, AFP-assisted A, AFP-assisted B, and AFP-assisted A and B videos in terms of their ARC metric and average RPE values. We chose a one-way ANOVA because this test is ideal for comparing three or more groups to determine the effect of a single independent variable (i.e., frame recollection). Here, statistical significance was defined by a 95% confidence interval (p < 0.05).

3. Results and discussion

3.1 Classification performance

For our dataset, the highest balanced accuracy was obtained across most videos with C = 100 and gamma = 0.01. The classification results with these parameters can be seen in the average ROC curve across all clinical videos (Fig. 4). In the context of our frame classification, sensitivity, or the true positive rate (TPR), quantifies the detection of usable frames, while specificity, or the true negative rate (TNR), quantifies the detection of unusable frames. Each point along the curve shows the average TPR and TNR for a set threshold applied to the SVM output values, and the shaded area shows the 95% confidence interval for that threshold. The dashed line indicates the ROC curve of a random classifier, which has an AUC of 0.5 and acts as a reference.


Fig. 4. The average ROC curve for frame classification. The average ROC curve, along with upper and lower bounds for a 95% confidence interval, is shown for classification of frames in the clinical videos. A dashed line also shows an ROC curve of a random classifier for reference.


We also compared the classification performance across all videos (Table 3), where the sensitivity, specificity, accuracy, and balanced accuracy listed are those whose respective threshold resulted in the highest balanced accuracy. We averaged these values and the AUC of the ROC curve to assess the performance of the AFP. The video classifications produced an average ROC AUC of 0.81, an average accuracy of 81.48%, and an average balanced accuracy of 81.60%. The average sensitivity was 82.83%, slightly higher than the average specificity of 80.38%. This indicates that our frame classification step was slightly better at detecting usable frames than unusable frames, because sensitivity quantifies the detection of usable frames ($\textit{sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} = \frac{\text{detected usable frames}}{\text{total usable frames}}$) while specificity quantifies the detection of unusable frames ($\textit{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} = \frac{\text{detected unusable frames}}{\text{total unusable frames}}$).


Table 3. Frame classification performance

The lower performance of our classifier on some videos may be explained by their video content. For example, the low sensitivity and accuracy for video 7 may be due to its lack of vasculature, and therefore insufficient frame quality, compared to other videos. Insufficient frame quality causes frames to be classified as unusable even when they are actually usable but simply have low contrast and few edges, leading to a lower final value output by the SVM. Examples of such false-negative cases are shown in Fig. 5(A)-(B). Additionally, classification of video 10 shows low accuracy and low sensitivity. While most frames in the video can be used in the 3D reconstruction (i.e., the bladder wall can be clearly seen), this bladder has fewer vessels than the bladders in the videos used for training, so there is inherently less contrast. The video therefore has, on average, lower metric values than the training videos, and the threshold set during training was too high to detect many usable frames. These undetected frames are, again, false negatives (Fig. 5(C)).


Fig. 5. Falsely classified frames. Poor classification performance for videos 7 and 10 is tied to their comprising frames, such as those shown in (A and B) and (C), respectively. Example (A-F) false negative and (G-I) false positive frames often suffer from limited vasculature, regions with minimal contrast (marked by an "x"), or saturation (marked by arrows).


More broadly speaking, sensitivity is limited by the false-negative rate, where the classifier mistakenly classifies usable frames as unusable. False negatives are often found in frames with little vasculature (Fig. 5(D)) or frames where much of the frame has minimal contrast ("x" marks in Fig. 5(E)-(F)) despite other regions containing well-defined vasculature that can be used for 3D reconstruction. This case is interesting because it suggests that some frames are partly usable and partly unusable. In fact, the texture-generation step of the CYSTO3D algorithm does not use or require the entire frame, but only portions of it. Hence, every frame selected for reconstruction has regions that are both used and unused in the final reconstruction, so no frame is entirely usable or unusable. Future work may exploit this to reduce the number of recollections triggered. Specificity, in turn, is limited by the false-positive rate, where the classifier mistakes unusable frames for usable frames. False positives are often found in frames with saturated regions (arrows in Fig. 5(G)-(I)). While saturation creates contrast that suggests these frames are usable, the saturated regions do not capture the appearance of the bladder wall, so the frames are actually unusable.

The thresholds selected for the videos were approximately normally distributed, so we applied their average, 0.72, as a common threshold to all cystoscopy videos. Results appear in Table 4. As mentioned previously, this common threshold introduces bias into the experiment, so, as expected, we see increased sensitivity, specificity, accuracy, balanced accuracy, and ROC AUC across all videos.


Table 4. Frame classification performance with a common threshold

3.2 Reconstruction performance

To assess the effect of the AFP on reconstruction quality, we examined the resulting reconstructions from our synthesized videos as described above (Sec. 2.6.2). In our experiment, only 5 of 12 raw clinical videos yielded a reconstruction and were usable for testing; the videos that did not produce reconstructions contained a significant number of unusable frames, which inhibited reconstruction.

We compared the reconstructions produced from these five videos in terms of their ARC metric. For unreconstructed videos, we set ARC = 0 and RPE = undefined. We observed an increase in the average ARC metric (i.e., larger reconstructions) for synthetic AFP-assisted videos compared to the synthetic original videos (Table 5, Fig. 6). Only two of the five synthetic original videos yielded reconstructions. In contrast, all five of the synthetic AFP videos with two recollected sections (i.e., synthetic AFP A and B videos) yielded reconstructions, suggesting that use of the AFP improves the ability to generate reconstructions and that certain sequences of clear, usable frames are necessary to enable reconstruction. Generally, use of the AFP for multiple regions increased the reconstruction size compared to not using the AFP for any (synthetic original) or only one region (i.e., synthetic AFP A and synthetic AFP B). Our one-way ANOVA test indicated that simulated frame recollection had a statistically significant (p = 0.009) effect on ARC metric values for the resulting reconstructions. Thus, simulated frame recollection had a positive effect on the ARC metric of the 3D reconstructions.


Table 5. ARC metric results with synthetic videos used for testing


Fig. 6. Reconstruction results from the synthetic original and synthetic AFP-assisted videos. The appearance of the reconstructions along with their ARC and RPE values are shown for each synthetic video.


The reconstruction results for video ID 6 are unlike those for the other clinical videos because the synthetic original video produces a reconstruction with a larger ARC than its AFP-assisted counterparts. We believe this unusual result occurs because the frames selected for testing the AFP from video ID 6 have limited contrast and contain notable bladder debris that may result in poor frame matches. A lack of contrast and the presence of bladder debris can result in the detection of fewer salient features (e.g., when vessels are not well-defined or when features come from bladder debris rather than from the bladder wall). In these cases, matching may be performed using unreliable features, which inhibits reconstruction. Furthermore, in the case of video ID 6, the regions of the bladder present in frame selections A and B are captured in other frames of the video with higher contrast and no bladder debris. Thus, it is possible that removing the selected frames from the synthetic original video increases reconstruction coverage because the reconstruction algorithm is not forced to incorporate the poorly matched frames. This edge case suggests a possible improvement, namely more extensive and robust frame matching and selection in CYSTO3D, which is outside the scope of this work.

It is important to note that while some of the pixels added to the reconstruction of the synthetic AFP-assisted video come from the usable frames themselves, most come from other frames that can be connected to each other in the reconstruction via the usable frames. This is because only frames with shared camera poses can be connected and included in the reconstruction, so groups of connected usable frames can still be excluded if they cannot be connected to the larger reconstruction.

We also compared the average RPE across videos (Table 6) and found little change between the groups of synthetic videos, with an average RPE of 0.50 pixels for synthetic original videos, 0.53 pixels for synthetic AFP A or AFP B and 0.53 pixels for synthetic AFP A and B. Importantly, an RPE less than one pixel is considered a highly accurate reconstruction, so all reconstructions were sufficient in terms of accuracy. Furthermore, our ANOVA test indicated that there was not a statistically significant difference between the average RPE values. Thus, frame replacement did not negatively affect the high accuracy of the 3D reconstructions.


Table 6. Average RPE results with synthetic videos used for testing

4. Conclusion

To our knowledge, the AFP is the first pipeline to identify unusable frames and provide real-time feedback to guide recollection of cystoscopy data. A key advantage of our strategy of guiding data recollection is that it overcomes the limitation of current post-processing methods, which cannot improve frames that are inherently of extremely low quality. The need for this is particularly evidenced by the fact that 7 of our original 12 videos were unable to generate reconstructions. Had these videos been collected with the AFP, they might have produced enough usable frames to facilitate a reconstruction.

There are, however, some disadvantages of the current implementation of the AFP. First, the process of recollection increases the overall length of the cystoscopy, which, although minimal, is not ideal for the patient or clinical workflow. Ultimately, however, the improved data will save time by augmenting the medical record, which aids in surgical planning and helps avoid unnecessary procedures while ensuring the entire bladder is more clearly visualized. Second, the current AFP incorporates an override feature to mitigate the possibility and frustration of an infinite loop of recollection requests (i.e., if a region of the bladder has unavoidable obstruction or debris and is unlikely to yield usable frames). The override relies on a pre-defined number of repeated recollections. Alternatively, the clinician always has the option to disregard feedback and continue the cystoscopy rather than performing recollection: the clinician can either ignore the feedback or temporarily disable it during the recollection period by pressing a button. In the future, a better implementation would also assess whether the current region of the bladder has already been imaged, in which case no new frames are needed from that region. Third, use of the AFP does not guarantee coverage of the full bladder; the final reconstruction is still limited to the regions that were actually imaged during the cystoscopy. In the future, it would be helpful to validate whether the entire bladder has been imaged to ensure full coverage, which could be done by running the 3D reconstruction in real time. Fourth, we were only able to develop and test our method with 12 cystoscopy videos because large clinical datasets are difficult to acquire. We plan to continue collecting clinical cystoscopy videos and to create a platform for simulating new cystoscopy videos; through these means, we hope to provide improved analysis in future studies.

Point-of-care frame classification and feedback allows for improved usefulness of the collected frames. Improved frame usefulness can result in better record keeping, not only by improving the quality of the few saved frames in the current standard of care, but also by better enabling the creation of 3D bladder reconstructions to be stored in patient medical records. Real-time frame feedback also has the potential to improve the visibility of the bladder during cystoscopies, which may affect diagnosis and treatment.

While not tested here, the labels generated by the AFP should generalize to other 3D reconstruction pipelines besides CYSTO3D, since the metrics we evaluate are based on standard image processing algorithms that are agnostic to the reconstruction pipeline of choice. The AFP could also be applied to other organs, although the image features associated with "usable" frames may vary between organs, and the frame classifier may need to be tuned accordingly. Finally, the AFP could assist in medical training by providing feedback when unusable frames are detected.

Funding

National Institutes of Health (R01DK117236); SyBBURE Searle Undergraduate Research Program at Vanderbilt University.

Acknowledgments

We thank the members of the Vanderbilt University Medical Center Urology Clinic for providing valuable help towards this work.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are available in Ref. [31].

References

1. H. Kobayashi, E. Kikuchi, S. Mikami, et al., “Long term follow-up in patients with initially diagnosed low grade Ta non-muscle invasive bladder tumors: Tumor recurrence and worsening progression,” BMC Urol. 14(1), 5 (2014). [CrossRef]  

2. T. Weibel, C. Daul, D. Wolf, et al., “Graph based construction of textured large field of view mosaics for bladder cancer diagnosis,” Pattern Recognit 45(12), 4138–4150 (2012). [CrossRef]  

3. A. Ben-Hamadou, C. Daul, and C. Soussen, “Construction of extended 3D field of views of the internal bladder wall surface: a proof of concept,” 3D Res 7(3), 19–23 (2016). [CrossRef]  

4. E. J. Seibel, T. D. Soper, M. R. Burkhardt, et al., “Multimodal flexible cystoscopy for creating co-registered panoramas of the bladder urothelium,” Photonic Therapeutics and Diagnostics VIII 8207, 82071A (2012). [CrossRef]  

5. K. L. Lurie, R. Angst, D. V. Zlatev, et al., “3D reconstruction of cystoscopy videos for comprehensive bladder records,” Biomed. Opt. Express 8(4), 2106 (2017). [CrossRef]  

6. N. C. Van Dongen, F. Van Der Sommen, S. Zinger, et al., “Automatic assessment of informative frames in endoscopic video,” in Proceedings - International Symposium on Biomedical Imaging (IEEE Computer Society, 2016), 2016-June, pp. 119–122.

7. J. H. Oh, S. Hwang, J. K. Lee, et al., “Informative frame classification for endoscopy video,” Med. Image Anal. 11(2), 110–127 (2007). [CrossRef]  

8. Z. A. Khan, A. Beghdadi, F. A. Cheikh, et al., “Towards a video quality assessment based framework for enhancement of laparoscopic videos,” SPIE Medical Imaging 11316, 23 (2020). [CrossRef]  

9. S. Van Vliet, A. Sobiecki, and A. C. Telea, “Joint brightness and tone stabilization of capsule endoscopy videos,” in 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP) (2018), pp. 101–112.

10. S. Suman, F. Azmadi Hussin, A. Saeed Malik, et al., “Image enhancement using geometric mean filter and gamma correction for WCE images,” in Lecture Notes in Computer Science (2014), 8836, pp. 276–283.

11. F. Vogt, S. Krüger, H. Niemann, et al., “A system for real-time endoscopic image enhancement,” in MICCAI (Springer, 2003), pp. 356–363.

12. A. Mittal, A. K. Moorthy, and A. C. Bovik, “Blind/referenceless image spatial quality evaluator,” in Conference Record - Asilomar Conference on Signals, Systems and Computers (2011), pp. 723–727.

13. M. Pedersen, O. Cherepkova, and A. Mohammed, “Image quality metrics for the evaluation and optimization of capsule video endoscopy enhancement techniques,” J. Imaging Sci. Technol. 61(4), 040402-1–040402-8 (2017). [CrossRef]  

14. C. Zach and M. Pollefeys, “Practical methods for convex multi-view reconstruction,” Lecture Notes in Computer Science 6314, 354–367 (2010).

15. C. C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines (2001).

16. J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers (1999), pp. 61–74.

17. Y. Zhou, R. L. Eimen, E. J. Seibel, et al., “Cost-efficient video synthesis and evaluation for development of virtual 3d endoscopy,” IEEE J. Transl. Eng. Health Med. 9, 1–11 (2021). [CrossRef]  

18. G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools (2000).

19. J. M. Tenenbaum, “Accommodation in computer vision,” Stanford University (1970).

20. R. A. Pagaduan, M. Christina, R. Aragon, et al., “iBlurDetect: image blur detection techniques assessment and evaluation study,” in International Conference on Culture Heritage, Education, Sustainable Tourism, and Innovation Technologies (CESIT2020) (2020), pp. 286–291.

21. A. N. Almustofa, Y. Nugraha, A. Sulasikin, et al., “Exploration of image blur detection methods on globally blur images,” in 2022 10th International Conference on Information and Communication Technology, ICoICT 2022 (Institute of Electrical and Electronics Engineers Inc., 2022), pp. 275–280.

22. A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. on Image Process. 21(12), 4695–4708 (2012). [CrossRef]  

23. M. G. Kendall, “Further contributions to the theory of paired comparisons,” Biometrics 11(1), 43–62 (1955). [CrossRef]  

24. Maurice G. Kendall and Jean Dickinson Gibbins, Rank Correlation Methods, 5th ed. (Oxford University Press, 1990).

25. W. Altidor, T. M. Khoshgoftaar, and J. Van Hulse, “An empirical study on wrapper-based feature ranking,” in 21st IEEE International Conference on Tools with Artificial Intelligence (2009), pp. 75–82.

26. V. Bolón-Canedo and A. Alonso-Betanzos, Recent Advances in Ensembles for Feature Selection (2018), p. 147.

27. J. Van Hulse, T. M. Khoshgoftaar, A. Napolitano, et al., “Threshold-based feature selection techniques for high-dimensional bioinformatics data,” Netw Model Anal Health Inform Bioinforma 1(1-2), 47–61 (2012). [CrossRef]  

28. F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: machine learning in python,” Journal of Machine Learning Research 12(85), 2825–2830 (2011).

29. G. G. Chrysos, P. Favaro, and S. Zafeiriou, “Motion deblurring of faces,” Int J Comput Vis 127(6-7), 801–823 (2019). [CrossRef]  

30. R. Eimen, H. Krzyzanowska, K. R. Scarparto, et al., “Fiberscopic pattern removal for optimal coverage in 3D bladder reconstructions of fiberscope cystoscopy videos,” medRxiv, 2024.04.16.24305931 (2024). [CrossRef]

31. R. Eimen, M. Pillai, and A. Bowden, “realTimeMetrics,” OSF, 2024, https://osf.io/exf6r/.


