Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

1Reality Labs at Meta
2National University of Singapore
CVPR 2022

Assembly101 is a large-scale video dataset for action recognition and markerless motion capture of hand-object interactions, captured in a multi-camera cage setting. The multi-view recordings feature participants assembling and disassembling 101 children's toys.

News


[Aug 1st 2023] Mistake detection annotations are now available on GitHub.
[April 28th 2023] Camera extrinsics can now be found in metadata.zip (see the projection sketch after this list).
[Feb 20th 2023] Code and models for the Temporal Action Segmentation benchmark are now available on GitHub.
[Jan 17th 2023] Code and models for the Action Anticipation benchmark are now available on GitHub.
[Sept 1st 2022] We are pleased to announce awards worth up to $2000 for the top three entries in our 3D Action Recognition Challenge leaderboard.
[August 8th 2022] The 3D Action Recognition Challenge for Assembly101 is online. Results will be presented at the "Human Body, Hands, and Activities from Egocentric and Multi-view Cameras" workshop at ECCV 2022.
[May 20th 2022] Code and models for the Action Recognition benchmark are now available on GitHub.
[May 17th 2022] Annotations for both fine-grained and coarse actions are now available on GitHub.
[May 2nd 2022] Scripts to download the videos are now available on GitHub.
[March 28th 2022] Dataset released on Google Drive.
[March 28th 2022] Paper released on arXiv.
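
For a concrete sense of what the camera extrinsics enable, below is a minimal sketch that projects a 3D hand joint into one static view. Everything here is an illustrative assumption: the 3x4 [R|t] world-to-camera convention, the intrinsics K, and the toy values are not taken from metadata.zip, whose actual format is documented on GitHub.

    import numpy as np

    # Hypothetical sketch: project a 3D point (world coordinates) into pixel
    # coordinates using a 3x4 extrinsic matrix [R | t] and a 3x3 intrinsic
    # matrix K. The conventions in metadata.zip may differ; this is generic
    # multi-view geometry, not the dataset's documented API.

    def project_point(X_world: np.ndarray, K: np.ndarray, Rt: np.ndarray) -> np.ndarray:
        """Project one 3D point into (u, v) pixel coordinates."""
        X_h = np.append(X_world, 1.0)   # homogeneous 3D point, shape (4,)
        X_cam = Rt @ X_h                # world -> camera frame, shape (3,)
        x = K @ X_cam                   # camera frame -> image plane
        return x[:2] / x[2]             # perspective divide

    # Toy stand-in values for one static camera.
    K = np.array([[900.0, 0.0, 640.0],
                  [0.0, 900.0, 360.0],
                  [0.0,   0.0,   1.0]])
    Rt = np.hstack([np.eye(3), [[0.0], [0.0], [1000.0]]])  # camera offset along z

    uv = project_point(np.array([10.0, -20.0, 50.0]), K, Rt)
    print(uv)  # pixel location of the joint in this view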

Abstract


Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed instructions, and the sequences feature rich and natural variations in action ordering, mistakes, and corrections. Assembly101 is the first multi-view action dataset, with simultaneous static (8) and egocentric (4) recordings. Sequences are annotated with more than 100K coarse and 1M fine-grained action segments, and 18M 3D hand poses. We benchmark on three action understanding tasks: recognition, anticipation and temporal segmentation. Additionally, we propose a novel task of detecting mistakes. The unique recording format and rich set of annotations allow us to investigate generalization to new toys, cross-view transfer, long-tailed distributions, and pose vs. appearance. We envision that Assembly101 will serve as a new challenge to investigate various activity understanding problems.
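
To make the two annotation granularities concrete, the sketch below shows plausible records for one coarse and one fine-grained action segment. The field names and label strings are hypothetical; the authoritative schema is defined by the annotation files released on GitHub.

    from dataclasses import dataclass

    # Illustrative only: field names and labels are hypothetical, not the
    # dataset's real schema (see the annotations on GitHub).

    @dataclass
    class ActionSegment:
        video_id: str     # recording identifier
        start_frame: int  # segment start (inclusive)
        end_frame: int    # segment end (exclusive)
        label: str        # action class name

    coarse = ActionSegment("toy042_session3", 1200, 3400, "assemble cabin")
    fine = ActionSegment("toy042_session3", 1510, 1665, "pick up screwdriver")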

Paper


Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, Angela Yao

CVPR Proceedings
Supplementary
arXiv version
Code

Please send feedback and questions to 3dassembly101<at>gmail.com

License


Assembly101 is released under a Creative Commons Attribution-NonCommercial 4.0 International License. The terms of this license are:

Attribution : You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial : You may not use the material for commercial purposes.

Citation


@inproceedings{sener2022assembly101,
    title = {Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities},
    author = {F. Sener and D. Chatterjee and D. Shelepov and K. He and D. Singhania and R. Wang and A. Yao},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2022}
}
We thank Joey Litalien for providing us with the framework for this website.