Friday, February 18, 2011

Constantinople isn't just a Train in Rome: Watson's error rate on Jeopardy

A lot of ink has been spilled about Watson's error in guessing "Toronto" for a U.S. city, but amusing as that goof was, it wasn't one of its worst errors at all. According to IBM scientist David Ferrucci, Watson only had 14% confidence in that answer, hence the multiple question marks. Had it not been a Final Jeopardy question it was forced to answer, it never would have buzzed in.

The far more egregious errors were those that Watson had full confidence in, but was dead wrong about. In fact, given IBM's stated goal of using similar technology to aid doctors in diagnosing patients, those are the most dangerous outcomes possible. It's one thing for Watson to throw its hands up and admit to not having a clue (as it essentially did with the Toronto question), but what if Watson informed a doctor that it was 97% certain that a patient had cancer, when in fact they had no malignant cells at all? That would be far more troubling, and is analogous to Watson's 97% confidence that Latin finis is also a word for where trains originate:

Final Frontiers (400): From the Latin for "end", this is where trains can also originate 
Correct Answer:
terminus/terminal 
Watson's Answers:
finis 97%
Constantinople 13%
Pig Latin 10%



I'm being a little hard on Watson, of course. It did know that finis is a Latin word for end, but basically ignored (or gave little weight to) the second part of the question. Indeed, the error becomes even more amusing when you see Watson's other choices. But therein lies the point--Watson was not paying attention to any of the information in the rest of the clue, and had little idea that Constantinople and Pig Latin make no sense at all as answers.

I admit to being stumped as to where Watson came up with those choices. Where is the connection to "trains" if any? Granted, a human presented with these options (as a doctor using a diagnostic aid might be) would realize immediately that they should be discarded. But the finis answer is more subtly problematic--it is correct for part of the clue (or some of the symptoms) just not the rest. It's the sort of answer that a human who didn't know the right answer might be tempted to accept, and Watson's 97% confidence in it would compound the problem.

By my count there were seven other times when Watson was highly confident of the wrong answer, and in three of those, as in the one above, the right answer didn't occur to Watson at all. Those eight questions weren't the only ones Watson missed, just the ones where it missed badly. In total, I counted 30 missed questions out of the 120 the viewers were shown (there were two other questions we have no data for), meaning that Watson actually missed 25% of the questions. Further, if you count the nine questions where Watson was correct but lacked confidence, it failed almost a third of the time. A great batting average in baseball, and good enough for Jeopardy, but not so great for medicine.
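
For anyone who wants to check the arithmetic, here is a quick sketch in Python. The counts are my own tallies from the broadcast as given above; the variable names are made up purely for illustration.

```python
# Tallies from the post (hand-counted from the broadcast, not official IBM figures).
SHOWN = 120             # clues for which viewers saw Watson's answers
MISSED = 30             # clues Watson got wrong
CORRECT_LOW_CONF = 9    # clues Watson had right but lacked confidence in
CONFIDENT_WRONG = 8     # clues Watson was highly confident in, yet wrong

# Plain miss rate: wrong answers over clues shown.
miss_rate = MISSED / SHOWN

# Broader failure rate: also count correct answers Watson lacked confidence in.
failure_rate = (MISSED + CORRECT_LOW_CONF) / SHOWN

# Share of all clues where Watson was confidently wrong.
confident_wrong_rate = CONFIDENT_WRONG / SHOWN

print(f"miss rate:            {miss_rate:.1%}")
print(f"failure rate:         {failure_rate:.1%}")
print(f"confidently wrong:    {confident_wrong_rate:.1%}")
```

The broader failure rate works out to 39 out of 120, which is where "almost a third" comes from.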

1 comment:

  1. A very good analysis cous! It's interesting to note that I haven't seen the stats showing that Watson missed 30% of the time...all you see is that Watson kicked the humans' butts.

    So from an IBM marketing perspective...this is a huge win for them; even though from an engineering standpoint they're not really close to the results they need.

    Many people having seen the technology work so well on Jeopardy will be far more likely to accept it in other fields without question. That's scary in and of itself but what could be worse is if IBM tries to use that faith to their advantage before fully correcting the problems.
