Data pruning

Angelova, 2004

Category: ML

Overall Rating

2.4/5 (17/35 pts)

Score Breakdown

  • Cross Disciplinary Applicability: 7/10
  • Latent Novelty Potential: 4/10
  • Obscurity Advantage: 4/5
  • Technical Timeliness: 2/10

Synthesized Summary

  • The paper's central concept is using disagreement among multiple learners trained on data subsets as a heuristic to identify troublesome examples for removal before final model training (a minimal sketch of this idea follows this summary).

  • While the specific techniques (shallow models, simple features, a Naive Bayes combiner) are technically outdated and largely superseded, the philosophy of disagreement-based pruning before training offers a conceptual contrast to modern integrated-robustness and post-hoc analysis methods.

  • However, its value for modern research is questionable without significant adaptation and demonstration of unique benefits that surpass existing, more theoretically grounded, and integrated approaches.

  • It therefore serves better as a historical perspective than as a blueprint for actionable modern research, which already has more robust and integrated techniques available.
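
To make the pre-training pruning idea concrete, the following is a minimal sketch, not the paper's actual procedure: it illustrates disagreement-based pruning with a simple vote over bootstrapped decision trees. The learner type, ensemble size, and pruning threshold are illustrative assumptions, not values from the paper, which used shallow models with simple features and a Naive Bayes combiner.

```python
# Hypothetical sketch of disagreement-based pre-training pruning
# (illustrative learners and threshold, not the paper's exact method).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def prune_by_disagreement(X, y, n_learners=15, threshold=0.7, seed=0):
    """Flag examples that most of the shallow learners misclassify."""
    rng = np.random.default_rng(seed)
    n = len(y)
    miss_counts = np.zeros(n)
    for _ in range(n_learners):
        # Each learner is fit on a bootstrap sample of the training set.
        idx = rng.choice(n, size=n, replace=True)
        learner = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        # Count, per example, how many learners get it wrong.
        miss_counts += (learner.predict(X) != y)
    # Keep examples misclassified by at most `threshold` of the learners.
    return (miss_counts / n_learners) <= threshold

# Usage sketch: train the final model only on the retained examples.
# keep = prune_by_disagreement(X_train, y_train)
# final_model = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
```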

Optimist's View

  • The core idea of using opinions from an ensemble of diverse learners to identify and prune troublesome examples before training a final model is a distinct approach to robust learning and data cleaning.

  • The proposed method of leveraging the collective 'opinion' or disagreement of multiple analyses... is a general principle applicable to various data analysis tasks beyond ML training...

  • Modern computational power... makes training multiple models (or fine-tuning pre-trained models) significantly more feasible than in 2004.

  • An unconventional research direction inspired by this paper would be to develop 'Ensemble Opinion Pruning' (EOP) for large foundation models.

Skeptic's View

  • The specific approach outlined in 2004 suffers from several limitations and has been largely superseded, or rendered less relevant, by advances in machine learning.

  • The methods presented here... are not naturally aligned with how modern deep networks are trained...

  • The paper lacks strong theoretical guarantees, being explicitly presented as a heuristic method 'without guarantees of optimality'.

  • The reliance on Naive Bayes for combining opinions is a potentially brittle step... violates the core independence assumption of the algorithm...
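
To make this criticism concrete, here is a minimal sketch of the kind of Naive Bayes combining step at issue: per-example correct/incorrect "votes" from several learners are fused under a conditional-independence assumption. The prior and conditional probabilities below are illustrative placeholders, not values from the paper; since the learners share training data, their votes are correlated and the independence assumption is what makes the step brittle.

```python
# Hypothetical sketch of a Naive Bayes vote combiner (placeholder probabilities).
import numpy as np

def naive_bayes_combine(votes, p_clean=0.9,
                        p_correct_given_clean=0.8,
                        p_correct_given_noisy=0.3):
    """Fuse per-example learner votes into P(example is clean | votes).

    votes: (n_examples, n_learners) array; 1 = the learner classified the
    example correctly, 0 = it did not.
    """
    votes = np.asarray(votes)
    # Likelihood of the observed votes under each hypothesis, assuming the
    # learners vote conditionally independently -- the brittle assumption,
    # since learners trained on overlapping data are correlated.
    like_clean = np.prod(np.where(votes == 1, p_correct_given_clean,
                                  1 - p_correct_given_clean), axis=1)
    like_noisy = np.prod(np.where(votes == 1, p_correct_given_noisy,
                                  1 - p_correct_given_noisy), axis=1)
    post_clean = p_clean * like_clean
    post_noisy = (1 - p_clean) * like_noisy
    return post_clean / (post_clean + post_noisy)
```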

Final Takeaway / Relevance

Watch