Vision for Social Robots: Human Perception and Pose Estimation

2018

Category: Computer Vision

Overall Rating

2.7/5 (19/35 pts)

This paper highlights specific, under-discussed perceptual challenges (such as how inherent facial structure biases distance estimates, or which minimal depth cues suffice for 3D pose estimation) and empirically validates characteristics of human action data.
While the methods themselves are largely obsolete, these specific identified problems and empirical findings could serve as inspiration and validation targets for developing novel, more robust implicit or self-supervised learning objectives in modern AI systems aiming for nuanced human perception.

The central observation that individual physiognomy significantly biases distance estimation... is a profound insight.
Modern implicit neural representations... could involve training an implicit model... to learn a disentangled latent space representing both 3D facial identity and camera parameters...
The ability to discover interpretable, 3D, rotation-invariant components of motion (movemes) solely from static 2D images is a powerful form of geometric-aware unsupervised learning.
The key finding is that surprisingly little supervision – sparse relative depth orderings... – is sufficient to train a 3D pose estimator to competitive performance.
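
To make the last point concrete, here is a minimal sketch of how a sparse relative depth ordering can act as a training signal. It assumes a pairwise ranking loss of the kind commonly used for ordinal depth supervision; the thesis's exact loss is not given in this review, and the function name `ordinal_depth_loss` and the label convention are hypothetical.

```python
import math


def ordinal_depth_loss(z_i, z_j, relation):
    """Ranking loss for one annotated pair of joints (illustrative sketch).

    z_i, z_j : predicted depths of joints i and j (larger means farther).
    relation : +1 if joint i is labeled closer to the camera than joint j,
               -1 if it is labeled farther,
                0 if the two are labeled as roughly the same depth.
    The label convention is an assumption made for this example.
    """
    if relation == 0:
        # Same-depth pairs: penalize any predicted depth difference.
        return (z_i - z_j) ** 2
    # Ordered pairs: softplus of the signed difference, which is small
    # when the predicted order agrees with the label.
    x = relation * (z_i - z_j)
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# A prediction that matches the label costs less than one that contradicts it.
agree = ordinal_depth_loss(1.0, 3.0, +1)     # i predicted closer, labeled closer
disagree = ordinal_depth_loss(3.0, 1.0, +1)  # i predicted farther, labeled closer
print(agree < disagree)
```

Only the sign of the depth difference is annotated, so a single comparison per image carries no metric depth at all. The loss constrains the ordering of the predicted depths and leaves their scale to be fixed by other terms, such as a 2D reprojection loss.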

The core assumptions and methods feel dated.
...its specific technical implementations proved either too brittle, too narrow in scope, or less powerful than parallel or immediately subsequent research directions.
Chapter 4's reliance on a predefined verb list... struggles with the long tail of real-world actions and doesn't easily integrate with modern end-to-end deep learning pipelines.
Attempting to directly apply these specific perceptual techniques... to embodied social robot control... would be misguided.
