Action Segmentation with Temporal Domain Adaptation


Despite recent progress in fully supervised action segmentation, performance is still not fully satisfactory. Exploiting larger-scale labeled data and designing more complicated architectures both add annotation and computation costs. We therefore aim to improve performance by exploiting auxiliary unlabeled videos, which are comparatively easy to obtain.


One main challenge of utilizing unlabeled videos is spatio-temporal variation. For example, different people may make tea in their own personal styles even when the recipe is the same. Such intra-class variations degrade performance when a model trained on one group of people is deployed directly on another.

Our Approaches

We exploit unlabeled videos to address this problem by reformulating the action segmentation task as a cross-domain problem with domain discrepancy caused by spatio-temporal variations.
To reduce the discrepancy, we propose two approaches:

  • Mixed Temporal Domain Adaptation (MTDA): align the features embedded with local and global temporal dynamics.
  • Self-Supervised Temporal Domain Adaptation (SSTDA): align the feature spaces across multiple temporal scales with self-supervised learning.
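To make the idea of a domain discrepancy at different temporal scales concrete, here is a minimal sketch. It is not the method from our papers: it uses a simple distance between mean features as a stand-in discrepancy measure, and the function names, toy features, and pooling choice are all assumptions made for illustration. It only shows that frame-level (local) and video-level (global) statistics can disagree about how far apart the labeled source videos and the unlabeled target videos are.

```python
from statistics import fmean

# Hypothetical helpers for illustration; not part of the released code.

def mean_vector(vectors):
    """Average a list of equal-length feature vectors component-wise."""
    return [fmean(column) for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_discrepancy(source_videos, target_videos):
    """Frame-level gap: compare the mean over all frames of each domain."""
    src_frames = [frame for video in source_videos for frame in video]
    tgt_frames = [frame for video in target_videos for frame in video]
    return squared_distance(mean_vector(src_frames), mean_vector(tgt_frames))

def global_discrepancy(source_videos, target_videos):
    """Video-level gap: pool each video over time, then compare domain means."""
    src_pooled = [mean_vector(video) for video in source_videos]
    tgt_pooled = [mean_vector(video) for video in target_videos]
    return squared_distance(mean_vector(src_pooled), mean_vector(tgt_pooled))

# Toy data: each video is a list of per-frame feature vectors (assumed 2-D).
source_videos = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # long source video
    [[4.0, 4.0]],                          # short source video
]
target_videos = [
    [[1.0, 1.0], [1.0, 1.0]],              # one unlabeled target video
]

local_gap = local_discrepancy(source_videos, target_videos)
global_gap = global_discrepancy(source_videos, target_videos)
print("local gap:", local_gap, "global gap:", global_gap)
```

In this toy case the frame-level means of the two domains coincide while the video-level means do not, which is one way to see why aligning features at a single temporal scale may leave part of the discrepancy untouched.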


We evaluate on three challenging benchmark datasets (GTEA, 50Salads, and Breakfast) and achieve the following:

  • Outperform other Domain Adaptation (DA) and video-based self-supervised approaches.
    The comparison of different methods that can learn information from unlabeled target videos (on GTEA).
  • Outperform the current state-of-the-art action segmentation methods by large margins.
  • Achieve comparable performance with fully-supervised methods using only 65% of the labeled training data.
    Comparison with the most recent action segmentation methods on all three datasets.
    The visualization of temporal action segmentation for our methods with color-coding (input example: make coffee)

Please check our papers for more results.


Overview Introduction:

CVPR'20 Presentation (1-min):

CVPR'20 Presentation (5-min):

WACV'20 Presentation:


Papers & Code




Slides (CVPR'20)
Slides (WACV'20)
Poster (CVPR'20)
Poster (WACV'20)

If you find this project useful, please cite our papers:


  title={Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation},
  author={Chen, Min-Hung and Li, Baopu and Bao, Yingze and AlRegib, Ghassan and Kira, Zsolt},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},

  title={Action Segmentation with Mixed Temporal Domain Adaptation},
  author={Chen, Min-Hung and Li, Baopu and Bao, Yingze and AlRegib, Ghassan},
  booktitle={IEEE Winter Conference on Applications of Computer Vision (WACV)},


¹Georgia Institute of Technology   ²Baidu USA
*Work done during an internship at Baidu USA

Min-Hung Chen¹*
Baopu Li²
Yingze Bao²
Ghassan AlRegib¹
Zsolt Kira¹
Min-Hung Chen
Senior Research Scientist

My research interest is learning without full supervision.