Last year, Facebook AI released Dynabench, a platform that radically rethinks benchmarking in AI, starting with natural language processing (NLP) models. It has now announced Dynaboard, a new evaluation-as-a-service platform for comprehensive, standardized evaluations of NLP models. Dynaboard performs apples-to-apples comparisons dynamically while avoiding common problems such as bugs in evaluation code, inconsistencies in filtering test data, and issues with backward compatibility, accessibility, and other aspects of reproducibility.
Dynaboard enables AI researchers to customize a new Dynascore metric based on multiple axes of evaluation, including compute, accuracy, robustness, memory, and fairness.
Dynascore allows AI researchers to dynamically adjust the default score by placing more or less weight on particular metrics to evaluate performance comprehensively. This capability is an essential feature of Dynascore since every person who uses leaderboards has different preferences and goals.
Since launching Dynabench, Facebook AI has collected over 400,000 examples and has released two new, challenging data sets. Facebook AI believes that as the AI community continues to build on its open platform, the field will iteratively and rigorously improve how researchers evaluate models, create data sets, and eventually evolve towards better benchmarks.
Dynaboard offers maximum flexibility for users who want to make fine-grained comparisons between models while requiring minimal overhead for model creators who wish to submit their NLP models for evaluation. Dynaboard addresses issues such as reproducibility, accessibility, and compatibility in a single end-to-end solution.
The NLP model evaluation metrics currently supported in the overall “Dynascore” ranking function are Accuracy, Compute, Memory, Robustness, and Fairness.
On Dynaboard, the exact accuracy metric is task-dependent. To account for computation, Facebook AI has proposed to measure the number of examples that a model can evaluate per second on its instance in their evaluation cloud.
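The compute measurement described above can be illustrated with a minimal sketch. It assumes a hypothetical `predict` callable standing in for a model and is not Dynaboard's actual evaluation code; the function name `examples_per_second` and the injectable `clock` parameter are illustrative choices only.

```python
import time
from typing import Callable, Sequence


def examples_per_second(
    predict: Callable[[str], object],
    examples: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return how many examples `predict` handles per second.

    `predict` is a hypothetical stand-in for a model's inference call.
    `clock` is injectable so the timing can be controlled in tests.
    """
    if not examples:
        raise ValueError("need at least one example")
    start = clock()
    for example in examples:
        predict(example)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return len(examples) / elapsed


if __name__ == "__main__":
    # Toy "model" that just returns the length of the input text.
    rate = examples_per_second(len, ["a short example"] * 1000)
    print(f"{rate:.0f} examples/second")
```

Running every model on the same kind of instance, as the article describes, is what makes such throughput numbers comparable across submissions.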
Memory is measured as the amount of memory, in gigabytes, that a model requires, averaged over the duration that the model is running, with measurements taken every N seconds. Robustness is evaluated by measuring how a model’s predictions change after various perturbations are added.
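The memory measurement (periodic readings averaged over a run) could look roughly like the sketch below. The probe `read_bytes`, the helper names, and the choice of 1024³ bytes per gigabyte are all assumptions made for illustration; the article does not say how Dynaboard takes its readings.

```python
import time
from typing import Callable, List

# Assumption: one gigabyte is counted as 1024**3 bytes.
BYTES_PER_GB = 1024 ** 3


def sample_memory(
    read_bytes: Callable[[], int], num_samples: int, interval_s: float
) -> List[int]:
    """Take `num_samples` readings, waiting `interval_s` seconds between them.

    `read_bytes` is a hypothetical probe returning current usage in bytes.
    """
    readings = []
    for i in range(num_samples):
        readings.append(read_bytes())
        if i < num_samples - 1:
            time.sleep(interval_s)
    return readings


def average_memory_gb(readings_bytes: List[int]) -> float:
    """Average a list of byte readings and convert the result to gigabytes."""
    if not readings_bytes:
        raise ValueError("need at least one reading")
    return sum(readings_bytes) / len(readings_bytes) / BYTES_PER_GB


if __name__ == "__main__":
    # Fake probe that reports 1 GB, then 3 GB.
    fake = iter([1 * BYTES_PER_GB, 3 * BYTES_PER_GB])
    readings = sample_memory(lambda: next(fake), num_samples=2, interval_s=0.0)
    print(average_memory_gb(readings))
```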
For fairness, Facebook AI aims to start at the launch of Dynaboard with an initial metric relevant to NLP tasks that can serve as a starting point for collaboration with the broader AI community. A model is considered more “fair” if its predictions don’t change after a perturbation.
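Both the robustness and fairness checks rest on the same idea: perturb the input and see whether the prediction changes. A minimal sketch of that idea follows. The function name `prediction_stability`, the toy classifier, and the typo perturbation are hypothetical; the actual perturbations and scoring used on Dynaboard are not specified here.

```python
from typing import Callable, Sequence


def prediction_stability(
    predict: Callable[[str], str],
    examples: Sequence[str],
    perturb: Callable[[str], str],
) -> float:
    """Return the fraction of examples whose prediction is unchanged
    after `perturb` is applied to the input."""
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        1 for example in examples if predict(example) == predict(perturb(example))
    )
    return unchanged / len(examples)


def toy_sentiment(text: str) -> str:
    """Hypothetical keyword classifier, used only for illustration."""
    return "positive" if "good" in text else "negative"


def add_typo(text: str) -> str:
    """Hypothetical perturbation: introduce a typo into the word 'good'."""
    return text.replace("good", "godo")


if __name__ == "__main__":
    # The typo flips the first prediction and leaves the second alone.
    print(prediction_stability(toy_sentiment, ["good movie", "bad movie"], add_typo))
```

A score of 1.0 means no prediction changed under the perturbation; lower values indicate a model that is more sensitive to it.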
To calculate the rate at which these trade-offs are made, they use the marginal rate of substitution (MRS). In economics, the MRS is the amount of one good that a consumer is willing to trade off for another good while keeping the same utility.
To calculate the default Dynascore, which task owners can customize, they estimate the average rate at which users are willing to trade off each metric for a one-point gain in performance. That rate is then used to convert all metrics into units of performance.
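As a rough illustration of this conversion, the sketch below turns each metric into performance points using a trade-off rate and then takes a weighted average. Everything specific in it is an assumption: the rates are invented numbers rather than estimates from real models, the sign convention for "lower is better" metrics is a simplification, and the real Dynascore computation may differ.

```python
from typing import Dict


def dynascore_sketch(
    metrics: Dict[str, float],
    rates: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted average of metrics after conversion to performance points.

    rates[name] is the assumed number of units of that metric traded for a
    one-point gain in performance. A negative rate marks a metric where
    lower values are better (an assumption of this sketch).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, weight in weights.items():
        if rates[name] == 0:
            raise ValueError(f"rate for {name!r} must be non-zero")
        # Convert the raw metric into units of performance.
        converted = metrics[name] / rates[name]
        score += weight * converted
    return score / total_weight


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    metrics = {"accuracy": 80.0, "throughput": 10.0, "memory": 4.0}
    rates = {"accuracy": 1.0, "throughput": 0.5, "memory": -2.0}
    equal = {"accuracy": 1.0, "throughput": 1.0, "memory": 1.0}
    accuracy_heavy = {"accuracy": 2.0, "throughput": 1.0, "memory": 1.0}
    print(dynascore_sketch(metrics, rates, equal))
    print(dynascore_sketch(metrics, rates, accuracy_heavy))
```

Changing the `weights` dictionary mirrors the adjustment the article describes, where users place more or less weight on particular metrics to reflect their own priorities.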
Facebook AI has used Dynaboard to rank current state-of-the-art NLP models (such as BERT, RoBERTa, ALBERT, T5, and DeBERTa) on the four core Dynabench tasks. Dynascore weights all scoring datasets equally. Even after considering the additional axes of evaluation, DeBERTa, currently the highest-ranked open-source model, still performs best.
As the accuracy of models keeps improving, and more complex and harder dynamic adversarial datasets are being collected, Facebook AI believes that the other axes of evaluation have become more critical. Facebook AI has a long-standing commitment to promote open science and scientific rigor. They hope this framework can help in this pursuit.
They aim to open Dynabench up so that anyone can run their own task, run their own models in the loop for data collection, and host their own dynamic leaderboards. The goal of this platform is to help show the world what state-of-the-art NLP models can achieve today. Facebook AI envisions that Dynabench will help the AI community build systems that make fewer mistakes, are less susceptible to potentially harmful biases, and are more helpful to people in the real world.
Dynaboard is available for AI researchers to submit their own models for evaluation. Facebook AI has also built Dynalab, a command-line interface tool and library. A tutorial on how to deploy models is available in their GitHub repository.