UW Interactive Data Lab

Errudite: Scalable, Reproducible, and Testable Error Analysis

Tongshuang (Sherry) Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel S. Weld

Hypothesized error causes should be (1) formalized in a precise and reproducible manner, (2) applied to all instances rather than a small sample of errors, and (3) tested explicitly via counterfactual analysis.

Abstract

Though error analysis is crucial to understanding and improving NLP models, the common practice of manual, subjective categorization of a small sample of errors can yield biased and incomplete conclusions. This paper codifies model- and task-agnostic principles for informative error analysis, and presents Errudite, an interactive tool for better supporting this process. First, error groups should be precisely defined for reproducibility; Errudite supports this with an expressive domain-specific language. Second, to avoid spurious conclusions, a large set of instances should be analyzed, including both positive and negative examples; Errudite enables systematic grouping of relevant instances with filtering queries. Third, hypotheses about the cause of errors should be explicitly tested; Errudite supports this via automated counterfactual rewriting. We validate our approach with a user study, finding that Errudite (1) enables users to perform high-quality and reproducible error analyses with less effort, (2) reveals substantial ambiguities in previously published error analysis practices, and (3) enhances the error analysis experience by allowing users to test and revise prior beliefs.
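The three principles can be illustrated with a minimal Python sketch. This is not Errudite's domain-specific language or API: the toy model, the four instances, and every name below (`Instance`, `toy_model`, `has_distractor_number`, `group_stats`, `keep_answer_sentences`, `counterfactual_fix_rate`) are hypothetical and exist only for illustration. The sketch assumes a question-answering setting and the hypothesis "errors are caused by a distractor number in the context".

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instance:
    question: str
    context: str
    gold: str


Model = Callable[[str, str], str]  # (question, context) -> predicted answer


def toy_model(question: str, context: str) -> str:
    # Hypothetical stand-in model: answers with the first number in the context,
    # or the first word if the context has no number.
    numbers = re.findall(r"\d+", context)
    return numbers[0] if numbers else context.split()[0]


def is_correct(model: Model, inst: Instance) -> bool:
    return model(inst.question, inst.context) == inst.gold


# Principle 1: the error group is a precise, reusable predicate.
def has_distractor_number(inst: Instance) -> bool:
    return len(re.findall(r"\d+", inst.context)) > 1


# Principle 2: apply the predicate to ALL instances, correct and incorrect.
def group_stats(model: Model, data: List[Instance],
                in_group: Callable[[Instance], bool]) -> Dict[str, float]:
    group = [d for d in data if in_group(d)]
    group_errors = [d for d in group if not is_correct(model, d)]
    total_errors = [d for d in data if not is_correct(model, d)]
    return {
        "group_size": len(group),
        "group_errors": len(group_errors),
        "total_errors": len(total_errors),
        "error_rate_in_group": len(group_errors) / len(group) if group else 0.0,
        "error_coverage": len(group_errors) / len(total_errors) if total_errors else 0.0,
    }


# Principle 3: test the hypothesis by rewriting instances and re-running the model.
def keep_answer_sentences(inst: Instance) -> Instance:
    # Counterfactual rewrite: drop every sentence that does not contain the gold answer.
    kept = [s for s in inst.context.split(". ") if inst.gold in s]
    return Instance(inst.question, ". ".join(kept), inst.gold)


def counterfactual_fix_rate(model: Model, data: List[Instance],
                            in_group: Callable[[Instance], bool],
                            rewrite: Callable[[Instance], Instance]) -> float:
    # Fraction of in-group errors that become correct after the rewrite.
    errors = [d for d in data if in_group(d) and not is_correct(model, d)]
    if not errors:
        return 0.0
    fixed = sum(is_correct(model, rewrite(d)) for d in errors)
    return fixed / len(errors)


# Made-up data for illustration only.
data = [
    Instance("When was the bridge built?", "The bridge was built in 1932.", "1932"),
    Instance("When did the museum open?",
             "The city was founded in 1850. The museum opened in 1901.", "1901"),
    Instance("When was the treaty signed?",
             "The treaty was signed in 1648. It was renewed in 1700.", "1648"),
    Instance("Who designed the tower?", "The tower was designed by Eiffel.", "Eiffel"),
]

stats = group_stats(toy_model, data, has_distractor_number)
fix_rate = counterfactual_fix_rate(toy_model, data, has_distractor_number,
                                   keep_answer_sentences)
print(stats)
print(fix_rate)
```

In this toy data the group contains a correctly answered instance and misses one of the two errors, which is the kind of information a hand-labeled sample of errors alone would not surface. The rewrite step then checks whether removing the suspected cause actually changes the prediction.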

Materials

PDF | Code | Video | BibTeX

Citation

Errudite: Scalable, Reproducible, and Testable Error Analysis

Tongshuang (Sherry) Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel S. Weld

Proc. Association for Computational Linguistics (ACL), 2019
