Data pruning

Angelova, 2004

Category: ML

Overall Rating

2.4/5 (17/35 pts)

Score Breakdown

  • Cross Disciplinary Applicability: 7/10
  • Latent Novelty Potential: 4/10
  • Obscurity Advantage: 4/5
  • Technical Timeliness: 2/10

Synthesized Summary

  • The paper's central concept is using disagreement among multiple learners trained on data subsets as a heuristic to identify troublesome examples for removal before final model training (a minimal sketch of this idea follows this summary).

  • While the specific techniques (shallow models, simple features, a Naive Bayes combiner) are technically outdated and largely superseded, the philosophy of disagreement-based pruning before training offers a conceptual contrast to modern integrated-robustness and post-hoc analysis methods.

  • However, its value for modern research is questionable without significant adaptation and demonstration of unique benefits that surpass existing, more theoretically grounded, and integrated approaches.

  • It therefore serves better as a historical perspective than as a blueprint for actionable modern research, which already has more robust and integrated techniques available.
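
To make the pre-training pruning idea concrete, the following is a minimal sketch, not the paper's actual procedure: it illustrates disagreement-based pruning with a simple vote over bootstrapped decision trees. The learner type, ensemble size, and pruning threshold are illustrative assumptions, not values from the paper, which used shallow models with simple features and a Naive Bayes combiner.

```python
# Hypothetical sketch of disagreement-based pre-training pruning
# (illustrative learners and threshold, not the paper's exact method).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def prune_by_disagreement(X, y, n_learners=15, threshold=0.7, seed=0):
    """Flag examples that most of the shallow learners misclassify."""
    rng = np.random.default_rng(seed)
    n = len(y)
    miss_counts = np.zeros(n)
    for _ in range(n_learners):
        # Each learner is fit on a bootstrap sample of the training set.
        idx = rng.choice(n, size=n, replace=True)
        learner = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        # Count, per example, how many learners get it wrong.
        miss_counts += (learner.predict(X) != y)
    # Keep examples misclassified by at most `threshold` of the learners.
    return (miss_counts / n_learners) <= threshold

# Usage sketch: train the final model only on the retained examples.
# keep = prune_by_disagreement(X_train, y_train)
# final_model = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
```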

Optimist's View

  • The core idea of using opinions from an ensemble of diverse learners to identify and prune troublesome examples before training a final model is a distinct approach to robust learning and data cleaning.

  • The proposed method of leveraging the collective 'opinion' or disagreement of multiple analyses... is a general principle applicable to various data analysis tasks beyond ML training...

  • Modern computational power... makes training multiple models (or fine-tuning pre-trained models) significantly more feasible than in 2004.

  • An unconventional research direction inspired by this paper would be to develop 'Ensemble Opinion Pruning' (EOP) for large foundation models.

Skeptic's View

  • The specific approach outlined in 2004 suffers from several limitations and has been largely superseded, or rendered less relevant, by advances in machine learning.

  • The methods presented here... are not naturally aligned with how modern deep networks are trained...

  • The paper lacks strong theoretical guarantees, being explicitly presented as a heuristic method 'without guarantees of optimality'.

  • The reliance on Naive Bayes for combining opinions is a potentially brittle step... violates the core independence assumption of the algorithm...
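
To make this criticism concrete, here is a minimal sketch of the kind of Naive Bayes combining step at issue: per-example correct/incorrect "votes" from several learners are fused under a conditional-independence assumption. The prior and conditional probabilities below are illustrative placeholders, not values from the paper; since the learners share training data, their votes are correlated and the independence assumption is what makes the step brittle.

```python
# Hypothetical sketch of a Naive Bayes vote combiner (placeholder probabilities).
import numpy as np

def naive_bayes_combine(votes, p_clean=0.9,
                        p_correct_given_clean=0.8,
                        p_correct_given_noisy=0.3):
    """Fuse per-example learner votes into P(example is clean | votes).

    votes: (n_examples, n_learners) array; 1 = the learner classified the
    example correctly, 0 = it did not.
    """
    votes = np.asarray(votes)
    # Likelihood of the observed votes under each hypothesis, assuming the
    # learners vote conditionally independently -- the brittle assumption,
    # since learners trained on overlapping data are correlated.
    like_clean = np.prod(np.where(votes == 1, p_correct_given_clean,
                                  1 - p_correct_given_clean), axis=1)
    like_noisy = np.prod(np.where(votes == 1, p_correct_given_noisy,
                                  1 - p_correct_given_noisy), axis=1)
    post_clean = p_clean * like_clean
    post_noisy = (1 - p_clean) * like_noisy
    return post_clean / (post_clean + post_noisy)
```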

Final Takeaway / Relevance

Watch