We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end,
we first propose an object-augmented dense CRF in the spatio-temporal domain, which captures long-range dependencies between supervoxels and
imposes consistency between object and supervoxel labels. We develop an efficient mean field inference algorithm that jointly infers the
supervoxel labels, object activations, and their occlusion relations for a moderate number of object hypotheses. To scale up our method,
we adopt an active inference strategy that improves efficiency by adaptively selecting object subgraphs in the object-augmented dense
CRF. We formulate this selection as a Markov Decision Process and learn an approximately optimal policy based on a reward of accuracy
improvement and a set of well-designed model and input features. We evaluate our method on three publicly available multiclass video
semantic segmentation datasets and demonstrate superior efficiency and accuracy.
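To make the inference step concrete, below is a minimal numpy sketch of a generic mean-field update for a dense CRF over supervoxels with Potts compatibility. The `obj_potential` term, the Gaussian kernel on features, and the bandwidth `theta` are illustrative assumptions standing in for the paper's object-consistency potentials, not its exact formulation.

```python
import numpy as np

def mean_field_dense_crf(unary, feats, obj_potential, n_iters=10, theta=1.0):
    """Mean-field inference for a dense CRF over N supervoxels and L labels.

    unary:         (N, L) negative log-likelihoods per supervoxel and label
    feats:         (N, D) spatio-temporal features defining the dense kernel
    obj_potential: (N, L) per-supervoxel cost tying labels to active object
                   hypotheses (a simplified stand-in for object consistency)
    """
    N, L = unary.shape
    # Dense Gaussian pairwise kernel over all supervoxel pairs (O(N^2);
    # fine for a moderate N, which is what supervoxels buy us over pixels).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * theta ** 2))
    np.fill_diagonal(K, 0.0)

    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)
    for _ in range(n_iters):
        M = K @ Q                               # kernel-weighted label marginals
        pairwise = M.sum(1, keepdims=True) - M  # Potts: cost of disagreeing labels
        E = unary + obj_potential + pairwise
        Q = np.exp(-E)
        Q /= Q.sum(1, keepdims=True)
    return Q
```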
With the increasing demand for efficient image and video analysis, the test-time cost of scene parsing becomes critical for many large-scale
or time-sensitive vision applications. We propose a dynamic hierarchical model for anytime scene labeling that achieves flexible
trade-offs between efficiency and accuracy in pixel-level prediction. In particular, our approach incorporates the cost of feature computation
and model inference, and optimizes model performance for any given test-time budget by learning a sequence of image-adaptive hierarchical
models. We formulate this anytime representation learning as a Markov Decision Process with a discrete-continuous state-action space. A high-
quality policy of feature and model selection is learned with an approximate policy iteration method equipped with an action proposal mechanism. We
demonstrate the advantages of our dynamic, non-myopic anytime scene parsing on three semantic segmentation datasets, achieving 90% of
the state-of-the-art performance at 15% of its overall cost.
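As a rough illustration of the anytime roll-out, the sketch below greedily executes the cheapest-to-benefit action under a budget. The names `actions`, `gain_fn`, and `apply_fn` are hypothetical placeholders; `gain_fn` stands in for the value function that approximate policy iteration would learn, and the one-step-greedy rule is a simplification of the paper's non-myopic policy.

```python
def anytime_parse(image, actions, gain_fn, budget):
    """One-step-greedy roll-out of an anytime labeling policy.

    actions: list of (name, cost, apply_fn) candidate features/models;
    gain_fn: (state, action) -> predicted accuracy improvement, standing in
             for the value function learned by approximate policy iteration.
    All names here are illustrative, not the paper's API.
    """
    state = {"image": image, "computed": set(), "prediction": None}
    spent = 0.0
    while True:
        feasible = [a for a in actions
                    if a[0] not in state["computed"] and spent + a[1] <= budget]
        if not feasible:
            break
        # The learned policy in the paper is non-myopic; this sketch simply
        # ranks actions by predicted gain per unit cost.
        name, cost, apply_fn = max(feasible,
                                   key=lambda a: gain_fn(state, a) / a[1])
        state["prediction"] = apply_fn(state)
        state["computed"].add(name)
        spent += cost
    return state["prediction"], spent
```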
Video segmentation has been a popular research topic in recent years. Leveraging spatio-temporal
video segmentation, we decompose a dynamic scene captured by a video into geometrically and semantically
consistent classes, based on a model that combines geometric and semantic information under a
box framework. We also apply a multilayer representation as a higher-order smoothness term. Moreover, our
box representation proves beneficial for understanding both static and dynamic scenes. Lastly,
we show that our full model improves the results. Our method is tested on the CamVid dataset and
achieves 96% pixel-level as well as per-class accuracy on geometric labels. For semantic labels,
we obtain the best per-class accuracy, and our pixel-level accuracy is comparable to the closest
spatio-temporal approaches.
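One plausible way to read the combined model is as a joint energy over semantic and geometric labelings with a multilayer higher-order smoothness term, sketched below as a toy. Every potential here (`sem_unary`, `geo_unary`, `compat`, the robust P^n-style segment penalty) is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def joint_energy(sem_unary, geo_unary, labels_s, labels_g, compat, layers):
    """Toy joint energy over semantic and geometric pixel labelings.

    sem_unary, geo_unary: (N, Ls) / (N, Lg) per-pixel label costs;
    compat:               (Ls, Lg) coupling between the two label spaces;
    layers:               list of (segment_ids, weight) pairs encoding a
                          multilayer higher-order smoothness term.
    """
    idx = np.arange(labels_s.size)
    E = sem_unary[idx, labels_s].sum() + geo_unary[idx, labels_g].sum()
    E += compat[labels_s, labels_g].sum()   # geometric-semantic consistency
    # Robust P^n-style term: each segment in each layer pays `weight`
    # whenever its pixels are not labeled uniformly.
    for seg_ids, weight in layers:
        for s in np.unique(seg_ids):
            if np.unique(labels_s[seg_ids == s]).size > 1:
                E += weight
    return E
```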
We tackle the problem of holistic dynamic scene understanding. In particular, we jointly reason
about multiple high-level relations in video sequences in a consistent manner. We propose a model that
jointly infers geometric, semantic, and object occlusion relations in videos
taken from a monocular camera. Moreover, we exploit temporal information to effectively generate
detection results from weaker detectors and to reason about occlusion relations between spatially
overlapping objects. Finally, our model infers the semantic and geometric labeling, detections, and
the relative ordering between objects in a single one-shot optimization. We demonstrate the effective-
ness of our method on two datasets and show that our model achieves comparable or much better
per-class accuracy than the state of the art with less supervision.
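For intuition on the relative-ordering subproblem in isolation, the sketch below brute-forces a depth ordering over a handful of detections. Both `order_cost` and the exhaustive search are assumptions for illustration; the paper instead solves ordering jointly with labeling and detection in its one-shot optimization.

```python
from itertools import permutations

def best_occlusion_order(objects, order_cost):
    """Exhaustively pick the relative depth ordering of a few detections.

    objects:    list of detection ids (feasible only for small sets);
    order_cost: (front, back) -> cost of placing `front` in front of
                `back`, e.g. derived from occlusion-boundary evidence.
    """
    def total(order):
        # Sum the pairwise cost implied by every front/back pair in the order.
        return sum(order_cost(order[i], order[j])
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))
    return min(permutations(objects), key=total)
```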