Speculation-aware Resource Allocation for Cluster Schedulers


Ren, 2015

Category: Distributed Systems

Overall Rating

2.4/5 (17/35 pts)

Score Breakdown

  • Latent Novelty Potential: 4/10
  • Cross-Disciplinary Applicability: 5/10
  • Technical Timeliness: 4/10
  • Obscurity Advantage: 4/5

Synthesized Summary

  • The paper introduces the conceptually interesting idea of integrating speculation needs directly into the job scheduling decision through a dynamic "virtual job size."

  • However, its specific theoretical model and algorithms are deeply tied to outdated cluster paradigms (slot-based resources, specific DAG structures, task-duration assumptions) that do not translate directly to modern multi-resource, containerized, and cloud-native environments.

  • Pursuing similar goals today would likely involve developing entirely new models and mechanisms better suited to current infrastructure and workload diversity, rather than reviving Hopper's specific design.
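The "virtual job size" idea above can be sketched in a few lines. This is a hypothetical illustration of the concept, not Hopper's actual allocation rule: a job's virtual size inflates its remaining task count by its expected speculation demand, and the scheduler then splits slots in proportion to virtual sizes. The `straggler_prob` field and the proportional-split rule are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_tasks: int
    straggler_prob: float  # estimated fraction of tasks needing a speculative copy

def virtual_size(job: Job) -> float:
    """Inflate the job's real size by its expected speculation demand."""
    return job.remaining_tasks * (1 + job.straggler_prob)

def allocate_slots(jobs: list[Job], total_slots: int) -> dict[str, int]:
    """Divide cluster slots proportionally to each job's virtual size,
    so speculation needs are part of the core scheduling decision."""
    total_virtual = sum(virtual_size(j) for j in jobs)
    return {
        j.name: int(total_slots * virtual_size(j) / total_virtual)
        for j in jobs
    }

jobs = [
    Job("A", remaining_tasks=80, straggler_prob=0.05),
    Job("B", remaining_tasks=20, straggler_prob=0.25),
]
print(allocate_slots(jobs, total_slots=100))
```

The point of the sketch is that job B, though four times smaller than job A, receives proportionally more capacity than its raw task count alone would justify, because its higher straggler estimate inflates its virtual size.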

Optimist's View

  • This paper's core contribution lies in proposing a jointly designed job scheduler and straggler mitigation system that dynamically allocates resources by introducing the concept of a "virtual job size".

  • A specific, unconventional modern research direction inspired by this is applying this framework to scheduling jobs on heterogeneous, uncertainty-prone edge/federated computing environments, particularly for distributed machine learning inference or continuous data stream processing.

  • For an ML inference task or stream processing query federated across diverse edge devices, the "virtual size" could dynamically incorporate, among other factors, the probability and cost of needing "speculative copies"...

  • This approach is unconventional because it proposes baking the handling of uncertainty and unreliability into the core resource request and allocation primitive itself ("uncertainty-aware virtual size") rather than treating it as a separate, overlaid layer of failure mitigation.
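As a hypothetical illustration of this "uncertainty-aware virtual size" (the formula and names here are assumptions for the sketch, not from the paper): if each failed attempt on an unreliable edge device triggers one speculative retry elsewhere, the expected number of attempts per task is geometric, and a job's virtual size can price that in up front.

```python
def uncertainty_aware_size(device_failure_probs: list[float]) -> float:
    """Expected total task executions for one task per device, assuming
    each failed attempt (probability p, independent) is re-run until it
    succeeds: expected attempts per task = 1 / (1 - p)."""
    return sum(1.0 / (1.0 - p) for p in device_failure_probs)

# A task on a reliable device counts as ~1 execution; a task on a device
# that fails half the time counts as 2 expected executions.
print(uncertainty_aware_size([0.0, 0.5]))
```

Baking unreliability into the resource request itself, as above, is what distinguishes this from treating failure mitigation as a separate overlaid layer.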

Skeptic's View

  • The paper is firmly rooted in the Hadoop/Spark era (evaluated on Hadoop YARN 2.3, Spark 0.7.3). The core scheduling primitive is the "slot", a concept largely superseded by more granular and heterogeneous resource descriptions... in modern container orchestration platforms like Kubernetes.

  • The paper's focus on speculative copies of tasks within a fixed DAG/phase model (like MapReduce or basic Spark DAGs) is less applicable to streaming workloads, serverless functions, or microservice architectures...

  • The Hopper design, while claiming generic applicability, is deeply intertwined with the "slot" allocation model and the specific structure of MapReduce/Spark-like DAGs. The translation to more complex, multi-resource, containerized, or serverless environments is non-obvious...

  • The explicit exclusion of data locality in the theoretical model (p. 11) is a major limitation for data-intensive workloads. While addressed heuristically in the implementation, this disconnect between theory and practice weakens the claim of "provably optimal" scheduling derived from the model.

Final Takeaway / Relevance

Watch