UPDATE 12/2024: This application is now available on PyPI! Or read the paper.
This is a teaching and understanding tool.
When I learn something new, I prefer to understand it fully, deeply. This can be costly in the short-term, but over the long-term I get faster at fixing problems and pointing others in the right direction.
It's also more fun. When I was learning binary classification, I wanted a tool like this. One exists for the ROC curve on its own; some are great interactive lessons or have depth in a static visualization; but none were as comprehensive and interactive as I wanted. So I built this.
Using it has helped me understand these metrics more deeply; I hope it's useful to others.
A collection of different ways binary classification models are evaluated, including the ROC curve, Precision-Recall curve, and confusion matrix. If that's new to you, look it up first; there are many acronyms. Each metric offers a summary of how well a model (or any binary decision rule) does, emphasizing different strengths and weaknesses. Many resources introduce these tools for different fields. I won't regurgitate that content here, but they're everywhere: medical diagnostics, quantifying perception, the heuristics of your default behaviors, the classic example, and of course machine learning.
More importantly, it's a playground to help deepen understanding of the metrics all at once, in relation to each other[1].
Introductory materials are fine for getting a basic understanding, but I've found that real-world (messy) data, not enough data, or novice mistakes can lead to various curve shapes whose cause isn't immediately obvious from the majority of instructional material.
gain a deeper understanding of why a certain ROC/PR curve shape occurs
better appreciate the pros and cons associated with model evaluation metrics, in order to use them more appropriately
learn new ways of evaluating models
learn a mapping of the ways each graph is linked to -- and differs from -- all others without having to code your own examples
beginning ML practitioners or data scientists who want to learn beyond the basics typically taught in tutorials, boot camps, and textbooks
anyone who wants to explain these terms to others; good teaching requires depth of understanding
anyone who makes predictive models and wants a toy to play with
me, because there aren't enough interactive graphics in the world, and I find it fun and extremely gratifying to fully grok things like this, which in my experience requires 2 things:
graphical display of information (rather than just numerical or tabular format)
play/exploration within that graphical space, which is best done with movement
I hope people learning this topic are exposed to a more comprehensive suite of evaluation metrics, or see puzzling examples as opportunities to deepen their knowledge of what these metrics really convey (or do not convey).
Step by step...
The black and orange normal distributions represent the model predictions across both categories, the class distributions[2]. All of the other plots derive from the relationship between these distributions and the classification threshold in green.
The most "raw" and comprehensive summary of the quality of the model (or decision rule, whatever that may be) is the confusion matrix, but wouldn't it be great if we had one number to summarize what we care about?
We can do that, but we inevitably lose information in this compression step, just like using an average to summarize a distribution. Therefore each of the remaining plots shows a standard metric, each of which measures different relationships between the sections of the confusion matrix.
But each individual metric has trade-offs. These trade-offs are visualized with combinations of metrics as a curve, like the ROC, PR, and MCC-F1 curves. In practice, experienced people check various metrics and curves, and weight their importance based on the problem at hand. For example, perhaps we prefer high Recall at the expense of lower Precision to catch cancer more often than not (or maybe not, there are trade-offs).
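If it helps to see that compression step in code, here's a minimal sketch (not the tool's implementation; it assumes NumPy and scikit-learn) that simulates two score distributions, applies a threshold, and derives the confusion matrix plus a few single-number summaries from it:

```python
# Minimal sketch: two simulated class distributions, one threshold,
# and the metrics that get "compressed" out of the confusion matrix.
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

rng = np.random.default_rng(0)
neg_scores = rng.normal(loc=0.0, scale=1.0, size=1000)   # black distribution
pos_scores = rng.normal(loc=1.5, scale=1.0, size=1000)   # orange distribution
y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(1000, dtype=int)])
scores = np.concatenate([neg_scores, pos_scores])

threshold = 0.75                                  # the green line
y_pred = (scores >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```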
Examples to guide your exploration are below.
There are atypical ROC curve shapes you don't see in introductory examples, but they can very much crop up with messy data or issues with your data prep choices.
If standard deviations are different enough, the result is asymmetrical overlap in the distributions. This gets reflected in the ROC curve with an asymmetrical "flattening" of one edge or the other:
Push the SD even farther and you get this S-like curve, due to the extremely wide spread of the positive class distribution flanking the negative class distribution below (Precision-Recall and MCC-F1 curves show similar S-curves):
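To reproduce that shape yourself, here's a quick sketch (assuming scikit-learn and matplotlib; not part of the tool) with a narrow negative class and a much wider positive class flanking it:

```python
# Sketch: a positive class with a much wider spread than the negative class
# produces the S-shaped ROC curve described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
neg = rng.normal(loc=0.0, scale=0.5, size=5000)   # narrow negative class
pos = rng.normal(loc=0.5, scale=3.0, size=5000)   # very wide positive class
y_true = np.r_[np.zeros(5000), np.ones(5000)]
scores = np.r_[neg, pos]

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "--")                    # chance line
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.show()
```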
Most metrics ignore some section of the confusion matrix. That may seem obvious, but whole papers are more or less dedicated to this problem (Powers 2020, Chicco et al. 2021).
Nearly all metrics only guarantee that 2 (or at most 3) of the 4 confusion matrix sections are "good" (high for hits, low for misses), except the Matthews Correlation Coefficient (Chicco et al. 2021). Many metrics can be misleading when the dataset isn't balanced, i.e. the number of examples from one class is much higher than the other.
Accuracy is probably the worst culprit that trips up beginners. A classic toy example: you want to build a dog picture vs. cat picture detector, but you only have 99 pictures of cats and 1 dog picture (for some reason?). In cases like this, a dumb classifier that always predicts "cat" no matter what would give you...wow, 99% accuracy! But this is clearly a terrible model, and it will fail on every dog. Here's a similar example:
Positively imbalanced example: think 500 dogs, 5000 cats (orange, the best cat color?). The "dumb model" part isn't the underlying data, but the green classification threshold pushed all the way to the left, to always predict "cat". 90% accuracy, not bad! And look, even precision and F1-score are pretty good, and 100% recall! The PR curve is really close to the top right corner, where we want it to be! The ROC curve isn't great, but hey, it's not chance, ship it! Bad idea; pay attention.
The confusion matrix (with no predictions in the negative class) and Matthews Correlation Coefficient (at chance performance) give a more accurate description of this scenario.
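Here's the same scenario in a few lines of hypothetical scikit-learn scoring code, if you want to verify the numbers yourself:

```python
# Sketch of the "dumb model" above: 5000 cats (positive), 500 dogs (negative),
# and a classifier that always says "cat".
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

y_true = np.r_[np.ones(5000, dtype=int), np.zeros(500, dtype=int)]  # 1 = cat, 0 = dog
y_pred = np.ones_like(y_true)                                       # always predict "cat"

print("accuracy: ", accuracy_score(y_true, y_pred))     # ~0.91, looks great
print("precision:", precision_score(y_true, y_pred))    # ~0.91
print("recall:   ", recall_score(y_true, y_pred))       # 1.0
print("F1:       ", f1_score(y_true, y_pred))           # ~0.95
print("MCC:      ", matthews_corrcoef(y_true, y_pred))  # 0.0 -- chance performance
print(confusion_matrix(y_true, y_pred))                 # the predicted-negative column is empty
```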
No modeler worth their salt would look at this example naively like the caption says (right? :) ).
The point is to illustrate certain flaws in a narrow mindset that seeks to maximize one number blindly -- or perhaps a non-expert asks why it's not a good model if the accuracy is so high, and you politely explain that just because a number looks good doesn't mean the model is reliable. This is why we use multiple metrics: they each measure something different[3], often presenting a trade-off of maximizing one number at the expense of another. This is why the ROC and PR curves exist, to find a balance.
Other metrics have failure cases as well; I encourage you to find them. These issues aren't new; the field has documented them well if you look beyond the most introductory material, but beginners don't always appreciate them until they have a hard problem in front of them and are forced to learn the trade-offs. The failure cases aren't surprising if you look at the formulas for each metric, but as a non-mathematician, they weren't obvious to me from the formulas the first time I encountered them. I had to play around with them.
This tool is meant for that play. I chose to exclude certain metrics; again for a comprehensive list go here. Just keep this list handy:
Chicco & Jurman (2023) highlight that a high Matthews Correlation Coefficient means the 4 "basic rates" derived from the confusion matrix (TPR, TNR, PPV, and NPV) are pretty good, while many other attempts at finding a single number that gives a more comprehensive picture get close without guaranteeing that all 4 are high (BA: balanced accuracy, BM: bookmaker informedness, MK: markedness). Original table: Chicco et al. 2021; updated: Chicco & Jurman 2023.
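For reference, here's how those four basic rates fall out of the confusion matrix counts (a plain-Python sketch; zero-denominator edge cases are ignored for brevity):

```python
# Sketch: the four "basic rates" computed directly from confusion matrix counts.
def basic_rates(tp, fp, fn, tn):
    return {
        "TPR (recall/sensitivity)": tp / (tp + fn),  # of actual positives, how many were found
        "TNR (specificity)":        tn / (tn + fp),  # of actual negatives, how many were found
        "PPV (precision)":          tp / (tp + fp),  # of predicted positives, how many were right
        "NPV":                      tn / (tn + fn),  # of predicted negatives, how many were right
    }

print(basic_rates(tp=90, fp=10, fn=20, tn=80))
```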
The extreme cat/dog example, with its data imbalance and dumb model, is contrived for the sake of illustration, but the issue comes up subtly in realistic cases.
Other points to notice:
The PR curve has a baseline equal to the fraction of positive class examples. Therefore, a PR curve that hangs out around high Precision doesn't necessarily mean your model is good if its baseline is also high (i.e. if your problem has mostly positive examples).
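A tiny sketch of that baseline effect (assuming NumPy; the 90%-positive split is made up for illustration):

```python
# Sketch: the PR-curve baseline is just the positive-class prevalence.
# A classifier that predicts "positive" for everything gets precision equal
# to that prevalence, no matter how useless it is.
import numpy as np

y_true = np.r_[np.ones(9000, dtype=int), np.zeros(1000, dtype=int)]  # 90% positive
prevalence = y_true.mean()
always_positive_precision = y_true.sum() / len(y_true)  # TP / (TP + FP) when everything is predicted positive

print(prevalence, always_positive_precision)  # both 0.9 -- "high precision" from doing nothing
```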
The MCC-F1 curve is yet another evaluation curve, but worth the time...
The MCC-F1 curve was introduced as an improvement beyond the limitations of the ROC and PR curves -- it plots what it sounds like, MCC vs. F1-score. The authors outline cases where model selection with ROC or PR curves alone favors one model over another in biased ways -- ultimately subject to the class balance (Cao et al. 2020).
It would be better if we had an evaluation metric that was more agnostic to the proportion of positive and negative examples. The take home message:
If your dataset has more negative class examples, MCC-F1 is more informative than the ROC curve for model selection
If your dataset has more positive examples, MCC-F1 is more informative than the PR curve instead
Even with balanced classes, there can be cases where ROC or PR curves from different models cross, making model selection less clear. Here, the MCC-F1 curve can help disambiguate.
MCC alone was already pretty good for scoring in imbalanced dataset scenarios, since it guarantees the 4 "basic rates" (TPR, TNR, PPV, NPV) are high[4]. Why make another metric? The MCC-F1 curve goes a step farther by combining MCC with a second metric and looking at MCC and F1 at all thresholds, which isn't typical with MCC alone (although 3x Kaggle Grandmasters like jfpuget do).
The best threshold is typically at the curve's apex, stretching (ideally) toward the top right corner. The curve also has its own summary metric, the MCC-F1 score, discussed in the paper.
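Here's a rough sketch of the idea (not the reference implementation from Cao et al. 2020): sweep the threshold, compute unit-normalized MCC and F1 at each step, and look for the point closest to the top-right corner. The paper defines its own summary score and best-threshold procedure; this just shows the mechanics, assuming scikit-learn:

```python
# Sketch of the MCC-F1 curve mechanics on a negatively imbalanced dataset.
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

rng = np.random.default_rng(2)
y_true = np.r_[np.zeros(5000, dtype=int), np.ones(500, dtype=int)]    # many more negatives
scores = np.r_[rng.normal(0.0, 1.0, 5000), rng.normal(1.5, 1.0, 500)] # simulated model scores

thresholds = np.linspace(scores.min(), scores.max(), 200)
unmcc, f1 = [], []
for t in thresholds:
    y_pred = (scores >= t).astype(int)
    unmcc.append((matthews_corrcoef(y_true, y_pred) + 1) / 2)  # shift MCC from [-1, 1] to [0, 1]
    f1.append(f1_score(y_true, y_pred, zero_division=0))

unmcc, f1 = np.array(unmcc), np.array(f1)
best = int(np.argmin(np.hypot(1 - unmcc, 1 - f1)))  # crude "closest to top-right corner" pick
print("best threshold ~", round(thresholds[best], 2),
      "UNMCC:", round(unmcc[best], 2), "F1:", round(f1[best], 2))
```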
Is the MCC-F1 curve the best overall method that we should be using all the time? Not necessarily. In some real world cases, it could be overkill, like MCC itself -- there are many scenarios where a decent Recall or Area Under the ROC curve is good enough as long as you check the confusion matrix, and optimizing MCC would be too rigorous to be practical. But the authors show how it can be more useful in imbalanced cases than ROC or PR curves alone, especially in realistic scenarios of comparing multiple models where ROC or PR curves cross. It all depends on the real world problem to solve.
Nevertheless, MCC-F1 is a great tool to have in the box. Ignoring it would be careless.
What I love the most about learning new mathematical tools is the formal frameworks they reveal, in this case for evaluating trade-offs. Part of growing up with perfectionist tendencies was the eventual realization that most big decisions can't be perfect; instead, they come with trade-offs. This was drilled into me through experience in the lab too -- when I joined, I naively believed that science was all about perfect control. In most cases, it's about as much control as feasible, but often far from perfection. For example, choose solid control of the chemical environment of a brain slice, or preserve the whole brain and record from the whole organism, but with less control over the physiology. Or amplifying vs. filtering those noisy biological signals: find a balance, or filter out your signal/amplify the noise too much. Sometimes flaws are unavoidable. You get the point.
Real world trade-offs and pure math are different. The simple examples in this tool are formalized to a binary decision with neatly quantifiable mistakes, but the concepts are the same and generalize well beyond their concrete details. While ROC curves represent a specific trade-off continuum as the decision threshold changes (just like PR or any other curve), you can easily imagine an abstract trade-off continuum for a real world decision, at least at the extremes[5].
I'm not a mathematician, so when I first learned terms like Pareto front from optimization, I felt relief. Relief that, failing perfection, there are tools that smarter people had already discovered to make decisions in some "best" way, accounting for the messiness of the world at least a little bit. You can extrapolate vaguely from formal mathematical frameworks into simple, mature ways of thinking -- apply shades of grey instead of black and white thinking.
Do we want a criminal justice system tuned toward catching as many criminals as possible, or avoiding false convictions? Neither one will give us perfection; choose which errors are worse.
Is the world better or worse without the Internet, in all of its free (mis)information glory?
Is the strong pattern recognition that the human brain employs a blessing or a curse?
Apply it to whatever you like. Absolutism is an appealing shortcut, but these issues are more nuanced. Beware your inner cognitive miser.
Above all, the math is a reminder to think more clearly about how I navigate trade-offs in life. There are real-world continua that emerge when making decisions. I don't go around calculating the true positive rate of my assumption that the car on the other side of the road won't veer in front of me (but it's high, since I've never been in that kind of car wreck; we learn this implicitly at a young age before we even begin to drive). Being familiar with formal mathematical frameworks crystallizes useful thinking patterns. Most importantly, it does so in a way that allows you to loosely generalize the concepts to everyday decisions -- you get a new mental model to work with. That's the best part.
[1]: Technical note: D3.js or other tools are probably better for this, or any tool that would be easier to embed on this page. But I wanted something I could develop faster and in pure python, so after lots of searching and experimentation I chose bokeh server. I'd love improvements and modifications, please fork the repo or file an issue.
[2]: These 1D distributions aren't the same as the dimensionality of the input data during model training. The values that make the distributions, on the x-axis, represent a model's predictions, while distribution colors represent the class they belong to, i.e. the ground truth label. Take these distributions as they are, just their relative differences and the metrics that result. If this bothers you and you're a modeler, the distributions are the result of feature engineering, imputation, and other representations a human or model creates before it draws the classification boundary. More practically, they're similar to what's returned from scikit-learn's model.predict_proba(), with one big difference: those would be floats from 0.0-1.0, but for UI reasons, I've kept the x-axis range open for these distributions. Their relative difference, overlap, and shape are what's important here. Another way to think about them is that they're just the output layer result of a neural network before softmax is applied.
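For modelers, here's a rough sketch of that analogy on synthetic data (hypothetical scikit-learn code; not something the tool itself does):

```python
# Rough analogue of the tool's two distributions for a scikit-learn modeler.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

scores = model.predict_proba(X)[:, 1]   # per-example score for the positive class
neg_scores = scores[y == 0]             # would be drawn as the black distribution
pos_scores = scores[y == 1]             # would be drawn as the orange distribution
# Here the scores are bounded 0.0-1.0; the tool leaves its x-axis range open.
```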
[3]: If the class balance was reversed, 5000 dogs 500 cats, and you kept the dumb model predicting "cat" all the time, recall would stay at 100%, with everything else at chance or worse than chance levels. Switch to always predicting "dog" on the same data and recall drops but accuracy is 90% again. Metrics can be gamed, who knew! Don't fool yourself either.
[4]: Aficionados will have already noticed that I keep saying "MCC" when I really mean "Unit Normalized MCC", because every MCC I show actually ranges from 0-1. MCC, being a correlation coefficient, ranges from -1 (perfectly wrong predictions) to +1 (perfect predictions), with 0 as chance performance, while Unit Normalized MCC shifts and compresses this range from 0 to 1, with chance being at 0.5. Here I'm choosing to say "MCC" when I really mean "UNMCC", just as Cao et al. (2020) do, because it's more convenient. This is ordinarily something I hate doing: being imprecise with terminology, but UNMCC is so unwieldy and unfamiliar an acronym that I'm happy to accept the imprecision. MCC and UNMCC have the same strengths and weaknesses, and that behavior of the metrics is the main point here. Otherwise we'd be talking about "UNMCC-F1 curve" and jesus christ someone has to draw a line in the absurd alphabet soup we're already in. I'm also firmly of the belief that to teach someone something, you don't have to give them full complexity up front, you can restrict the precision of your explanation a little bit to avoid information overload, then refine later. Plus, if Cao et al. can do it in a paper, I can do it on a website.
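For concreteness, the rescaling is just a linear shift (a trivial sketch):

```python
# The only difference between MCC and unit-normalized MCC is a linear rescaling.
def unit_normalized_mcc(mcc):
    """Map MCC from [-1, 1] to [0, 1]; chance (0) becomes 0.5."""
    return (mcc + 1) / 2
```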
[5]: Visualizing a full trade-off continuum would be more complex, because even a simple binary decision yields more than 2 "ends" in its spectrum. With a simple binary decision, there are 2 ways you can be right and 2 ways you can be wrong, and if you've read any part of what's above, you saw that most of the ways to measure a binary decision are limited, not comprehensive. So the full trade-off curve isn't a 2D curve, but a multi-dimensional surface (4D minimum, depending on how you'd want to plot it). And this is just for a simple binary decision; think of a harder problem with 3 or 5 options.... If we extend the confusion matrix to some decision with n possible choices, the confusion matrix balloons to n² cells. Any trade-off hypercurve visualization would need at least that many dimensions.
Special thanks to Dr. Davide Chicco (@davidechicco) for valuable feedback on this page and the tool.
Powers, DMW (2020) Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. arXiv:2010.16061.
Cao C, Chicco D, Hoffman MM (2020) The MCC-F1 curve: A performance evaluation technique for binary classification. arXiv:2006.11278.
Chicco D, Tötsch N, Jurman G (2021) The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining 14:13, 1-22.
Chicco D, Jurman G (2023) The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Mining 16:4, 1-23.