Tubelets: Unsupervised Action Proposals from Spatiotemporal Super-Voxels

Mihir Jain, Jan van Gemert, Hervé Jégou, Patrick Bouthemy, Cees G.M. Snoek

Research output: Contribution to journal › Article › Scientific › peer-review

14 Citations (Scopus)
55 Downloads (Pure)

Abstract

This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an unsupervised approach that generates action proposals from spatiotemporal super-voxels; we call these proposals Tubelets. Second, along with the static features from individual frames, our approach exploits motion: we introduce independent motion evidence as a feature that characterizes how the action deviates from the background, and we explicitly incorporate this motion information in various stages of proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach through extensive experiments on action proposal quality and action localization on three public datasets: UCF Sports, MSR-II, and UCF101. For action proposal quality, our unsupervised proposals outperform all other existing approaches on the three datasets. For action localization, we show top performance on the trimmed videos of UCF Sports and UCF101 as well as on the untrimmed videos of MSR-II.

Original language: English
Pages (from-to): 287-311
Number of pages: 25
Journal: International Journal of Computer Vision
Volume: 124
Issue number: 3
Publication status: Published - 2017

Keywords

  • Action classification
  • Action localization
  • Video representation
