We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end,
we first propose an object-augmented dense CRF in the spatio-temporal domain, which captures long-range dependencies between supervoxels and
imposes consistency between object and supervoxel labels. We develop an efficient mean field inference algorithm that jointly infers the
supervoxel labels, object activations, and their occlusion relations for a moderate number of object hypotheses. To scale up our method,
we adopt an active inference strategy that improves efficiency by adaptively selecting object subgraphs in the object-augmented dense
CRF. We formulate this selection as a Markov Decision Process and learn an approximately optimal policy based on a reward of accuracy
improvement and a set of well-designed model and input features. We evaluate our method on three publicly available multiclass video
semantic segmentation datasets and demonstrate superior efficiency and accuracy.
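To make the inference step concrete, below is a minimal numpy sketch of a generic mean-field update for a dense CRF over supervoxels with Potts compatibility. The `obj_potential` term, the Gaussian kernel on features, and the bandwidth `theta` are illustrative assumptions standing in for the paper's object-consistency potentials, not its exact formulation.

```python
import numpy as np

def mean_field_dense_crf(unary, feats, obj_potential, n_iters=10, theta=1.0):
    """Mean-field inference for a dense CRF over N supervoxels and L labels.

    unary:         (N, L) negative log-likelihoods per supervoxel and label
    feats:         (N, D) spatio-temporal features defining the dense kernel
    obj_potential: (N, L) per-supervoxel cost tying labels to active object
                   hypotheses (a simplified stand-in for object consistency)
    """
    N, L = unary.shape
    # Dense Gaussian pairwise kernel over all supervoxel pairs (O(N^2);
    # fine for a moderate N, which is what supervoxels buy us over pixels).
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * theta ** 2))
    np.fill_diagonal(K, 0.0)

    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)
    for _ in range(n_iters):
        M = K @ Q                               # kernel-weighted label marginals
        pairwise = M.sum(1, keepdims=True) - M  # Potts: cost of disagreeing labels
        E = unary + obj_potential + pairwise
        Q = np.exp(-E)
        Q /= Q.sum(1, keepdims=True)
    return Q
```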
With the increasing demand for efficient image and video analysis, the test-time cost of scene parsing becomes critical for many large-scale
or time-sensitive vision applications. We propose a dynamic hierarchical model for anytime scene labeling that achieves flexible
trade-offs between efficiency and accuracy in pixel-level prediction. In particular, our approach incorporates the cost of feature computation
and model inference, and optimizes model performance for any given test-time budget by learning a sequence of image-adaptive hierarchical
models. We formulate this anytime representation learning as a Markov Decision Process with a discrete-continuous state-action space. A high-
quality policy of feature and model selection is learned with an approximate policy iteration method equipped with an action proposal mechanism. We
demonstrate the advantages of our dynamic, non-myopic anytime scene parsing on three semantic segmentation datasets, achieving 90% of
the state-of-the-art performance at 15% of its overall cost.
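As a rough illustration of the anytime roll-out, the sketch below greedily executes the cheapest-to-benefit action under a budget. The names `actions`, `gain_fn`, and `apply_fn` are hypothetical placeholders; `gain_fn` stands in for the value function that approximate policy iteration would learn, and the one-step-greedy rule is a simplification of the paper's non-myopic policy.

```python
def anytime_parse(image, actions, gain_fn, budget):
    """One-step-greedy roll-out of an anytime labeling policy.

    actions: list of (name, cost, apply_fn) candidate features/models;
    gain_fn: (state, action) -> predicted accuracy improvement, standing in
             for the value function learned by approximate policy iteration.
    All names here are illustrative, not the paper's API.
    """
    state = {"image": image, "computed": set(), "prediction": None}
    spent = 0.0
    while True:
        feasible = [a for a in actions
                    if a[0] not in state["computed"] and spent + a[1] <= budget]
        if not feasible:
            break
        # The learned policy in the paper is non-myopic; this sketch simply
        # ranks actions by predicted gain per unit cost.
        name, cost, apply_fn = max(feasible,
                                   key=lambda a: gain_fn(state, a) / a[1])
        state["prediction"] = apply_fn(state)
        state["computed"].add(name)
        spent += cost
    return state["prediction"], spent
```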
Video segmentation has been a popular research topic in recent years. Leveraging spatio-temporal
video segmentation, we decompose a dynamic scene captured by a video into geometrically and semantically
consistent classes, based on a model that combines geometric and semantic information under a
box framework. We also apply a multilayer representation as a higher-order smoothness term. Moreover, our
box representation proves beneficial for understanding both static and dynamic scenes. Lastly,
we show that our full model improves the results. Our method is tested on the CamVid dataset and
achieves 96% pixel-level as well as per-class accuracy on geometric labels. For semantic labels,
we obtain the best per-class accuracy, and our pixel-level accuracy is comparable to the closest
spatio-temporal approaches.
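One plausible way to read the combined model is as a joint energy over semantic and geometric labelings with a multilayer higher-order smoothness term, sketched below as a toy. Every potential here (`sem_unary`, `geo_unary`, `compat`, the robust P^n-style segment penalty) is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def joint_energy(sem_unary, geo_unary, labels_s, labels_g, compat, layers):
    """Toy joint energy over semantic and geometric pixel labelings.

    sem_unary, geo_unary: (N, Ls) / (N, Lg) per-pixel label costs;
    compat:               (Ls, Lg) coupling between the two label spaces;
    layers:               list of (segment_ids, weight) pairs encoding a
                          multilayer higher-order smoothness term.
    """
    idx = np.arange(labels_s.size)
    E = sem_unary[idx, labels_s].sum() + geo_unary[idx, labels_g].sum()
    E += compat[labels_s, labels_g].sum()   # geometric-semantic consistency
    # Robust P^n-style term: each segment in each layer pays `weight`
    # whenever its pixels are not labeled uniformly.
    for seg_ids, weight in layers:
        for s in np.unique(seg_ids):
            if np.unique(labels_s[seg_ids == s]).size > 1:
                E += weight
    return E
```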
We tackle the problem of holistic dynamic scene understanding. In particular, we jointly reason
about multiple high-level relations in video sequences in a consistent manner. We propose a model that
jointly infers geometric, semantic, and object occlusion relations in videos
taken from a monocular camera. Moreover, we exploit temporal information to effectively generate
detection results from weaker detectors and to reason about occlusion relations between spatially
overlapping objects. Finally, our model infers the semantic and geometric labeling, detections, and
the relative ordering between objects in a single one-shot optimization. We demonstrate the effective-
ness of our method on two datasets and show that our model achieves comparable or much better
per-class accuracy than the state of the art with less supervision.
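For intuition on the relative-ordering subproblem in isolation, the sketch below brute-forces a depth ordering over a handful of detections. Both `order_cost` and the exhaustive search are assumptions for illustration; the paper instead solves ordering jointly with labeling and detection in its one-shot optimization.

```python
from itertools import permutations

def best_occlusion_order(objects, order_cost):
    """Exhaustively pick the relative depth ordering of a few detections.

    objects:    list of detection ids (feasible only for small sets);
    order_cost: (front, back) -> cost of placing `front` in front of
                `back`, e.g. derived from occlusion-boundary evidence.
    """
    def total(order):
        # Sum the pairwise cost implied by every front/back pair in the order.
        return sum(order_cost(order[i], order[j])
                   for i in range(len(order))
                   for j in range(i + 1, len(order)))
    return min(permutations(objects), key=total)
```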