Comments on https://pubs.rsna.org/doi/10.1148/radiol.222176
-
The study looks at how the use of a fake “AI assistant” affects the judgment of radiologists evaluating mammograms with a suspicious lesion, rating each on a scale with these values: 2 -> benign; 3 -> probably benign; 4 -> suspicious; 5 -> highly suggestive of malignancy.
-
They extracted a set of 50 mammograms from a big archive. Either those mammograms had already been scored, or an experienced radiologist scored them afresh (not sure which). I'm going to call this the pre-experiment score. In either case, the appropriateness of the score was confirmed. It's not clear, but I think it went like this: for "suspicious" and "highly suggestive", they only included mammograms for which a biopsy actually found a malignancy; for "benign" and "probably benign", they only included mammograms if four years had passed without otherwise discovering a problem.
-
The participants were shown the image and an AI score, and asked to produce a participant score. A perfect participant would score the image with the pre-experiment score. The core question is: if the AI score is wrong, does that tend to make the participant score wrong too? ("Wrong" here is a little slippery, since you're mapping a binary – cancer or not? – onto a four-point scale which itself divides a continuous reality into four hard-edged bins. Suppose the pre-experiment score was 4 (so a biopsy was taken and cancer confirmed) and the participant score was 5. That seems less of an error (both would prompt a biopsy) than a participant score of 3 (probably benign, so no biopsy).)
-
Participants were told they were “evaluating” the AI system and later asked to judge its accuracy. I assume they knew in advance they were to judge the accuracy, rather than, say, thinking they were judging the UI (written in C++! With Qt!). That is, you’d expect them to be more suspicious of the AI than they would be in ordinary clinical practice.
-
There were three categories of participants: inexperienced reading mammograms, moderately experienced (~1 year), experienced (~11 years). Inexperienced were really inexperienced: like, none or at most on their first radiology rotation. Not relevant to real-world mammogram-reading, so I’m ignoring that. (Dr. Dawn Marick, DVM, MS, ACVIM points out that the actual relevance of the inexperienced category is that if these people are trained with AI, they might not become experienced. Too lazy to rewrite, but I'll add the relevant stats below.)
-
Participants were shown 50 images. In the first 10, the “AI” was always right. Then, in the next 40, the AI was wrong 12 times. It erred in both directions: 6 times it claimed the image was more benign than justified, 6 times more malignant. Within each set of 6, the AI was off by one point in 4 cases (scoring a 2 as a 3, for example) and off by two points in 2 cases.
There were three relevant analyses.
There were cases where the AI score was the same as the pre-experiment score. In this case, both experienced and moderately experienced participants agreed with the pre-experiment score about 80% of the time, though the moderately experienced had more variance. (Moderately experienced: 81.3% ± 10.1; experienced: 82.3% ± 4.2; inexperienced: 79.7% ± 11.7.)
There were cases where the AI score differed from the pre-experiment score by either 1 or 2. It definitely affected all three groups of participants. The moderately experienced agreed with the pre-experiment score only 24.8% of the time (± 11.6). Presumably they agreed with the AI score or, in the cases where the AI score was two off from the pre-experiment score, were dragged either partway or all the way to the AI score. The paper doesn't say anything about cases where the pre-experiment score was 2 (benign), the AI score was 3 (probably benign), but the participant rated it 4 (suspicious), or the case where the pre-experiment score was 3, the AI was 4, and the participant said 2 (erring in the wrong direction), so I guess those didn't happen. The experienced participants did better, but were still heavily influenced: they agreed with the pre-experiment score 45.5% of the time (± 9.1). The inexperienced agreed with the "ground truth" only 19.8% of the time (± 14.0).
Did it matter whether the incorrect AI score was in the "more suspicious" or "less suspicious" direction?
For the cases where the AI was wrong, they looked at the difference between the participant score and the pre-experiment score. Again, it seems the only possibilities were (a) being dragged toward the AI score, not away from it, and (b) not being dragged past the AI score. So the only values were 0 and 1 (for the case where the AI was off by one) and 0, 1, and 2 (for the off-by-two case). They took the sum of those values. Since there were two off-by-two cases and four off-by-one cases, the largest possible cumulative error was 8 and the smallest was 0.
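That cumulative-error metric can be sketched as follows. The six case tuples below are invented for illustration (the paper only reports group totals), but they follow the design: four off-by-one errors and two off-by-two errors in one direction.

```python
# Each tuple: (pre-experiment score, AI score, participant score).
# All individual scores here are invented for illustration.
cases = [
    (4, 5, 5),  # AI off by one; participant dragged to the AI -> error 1
    (4, 5, 4),  # AI off by one; participant unmoved            -> error 0
    (2, 3, 3),  # AI off by one; dragged                        -> error 1
    (3, 4, 3),  # AI off by one; unmoved                        -> error 0
    (3, 5, 4),  # AI off by two; dragged partway                -> error 1
    (2, 4, 4),  # AI off by two; dragged all the way            -> error 2
]

# Per-case error = how far the participant moved from the pre-experiment
# score toward the AI score (never past it, per the reported data).
cumulative_error = sum(abs(participant - truth)
                       for truth, ai, participant in cases)
print(cumulative_error)  # 5 for this invented set; the range is 0 to 8
```

A participant who ignored every wrong AI score would sum to 0; one dragged all the way every time would sum to 8.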
They divided up the data in a way I wouldn't have, comparing moderately experienced to experienced in the "AI erred by being more suspicious" case (giving a higher score), then separately in the "AI erred on the side of guessing no malignancy" case. In neither case was there a significant difference between the two groups (at the p ≤ 0.05 threshold).
What I'd have preferred is a within-group comparison: more-suspicious versus less-suspicious errors for the experienced, and the same for the moderately experienced. They give the averages, but don't say whether the differences are statistically significant. The results are:
When presented with AI scores biased on the high side (more suspicious), the experienced people are tugged toward them by a total of 1.2 points. When the bias is on the low side, the total is 5.0 points. Eyeballing the error bars on their bar chart, the difference looks significant? So radiologists are more easily influenced to be optimistic?
The moderately experienced people show a similar propensity, though wrong AI influences them more, and they have more variance. If the AI is biased on the high side, participant scores change in that direction by a total of 2.4 points. If on the low side, scores change by 6.3 points.
The inexperienced people: If the AI is biased on the high side, participant scores change in that direction by a total of 4.0 points. If on the low side, scores change by 6.3 points (the same as the moderately experienced).
I believe mammograms are often read by pairs of radiologists since disagreement is high. (Even when prompted by the AI giving the correct pre-experiment score, radiologists with over a decade of experience only agreed with it 82% of the time.) I wonder how two radiologists reconcile different scores. One hopes the junior doesn't just defer to the senior, but rather that they argue it out based on the evidence. That poses a problem when the differing score comes from an AI: the current style of AI can't give reasons for its decision because it doesn't have any. Maybe that's OK if the goal is cost reduction: it's expensive to have two radiologists argue.