VideoGraph: Recognizing Minutes-Long Human Activities in Videos


This repository implements the paper VideoGraph: Recognizing Minutes-Long Human Activities in Videos. We provide implementations for two libraries: Keras and TensorFlow.


Please consider citing this work using this BibTeX entry:

  title     = {VideoGraph: Recognizing Minutes-Long Human Activities in Videos},
  author    = {Hussein, Noureldien and Gavves, Efstratios and Smeulders, Arnold WM},
  booktitle = {ICCV Workshop on Scene Graph Representation and Learning},
  year      = {2019}


We visualize the relationships discovered by the first layer of graph embedding. Each sub-figure corresponds to one of the 10 activities in the Breakfast dataset. In each graph, the nodes represent the latent concepts learned by the graph-attention block. Node size reflects how dominant the concept is, while graph edges emphasize the relationships between the nodes.



The activity of “preparing coffee” can be represented as an undirected graph of unit-actions. The graph is capable of portraying the many ways one can carry out such an activity. Moreover, it preserves the temporal structure of the unit-actions. Reproduced from.

Model Overview

Overview diagram of the proposed VideoGraph. It takes as input a video segment s_i of 8 frames from an activity video v. Then, it represents the segment using a standard 3D CNN, e.g. I3D. The corresponding feature representation is x_i. Then, a node attention block attends to a set of N latent concepts based on their similarities with x_i, which results in the node-attentive representation Z_i. A novel graph embedding layer then processes Z_i to learn the relationships between its latent concepts, and arrives at the final video-level representation. Finally, an MLP is used for classification.
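As a rough illustration of the first step, the sketch below picks the frame indices of 8-frame segments from a video. The uniform spacing of segments and the helper name `sample_segments` are assumptions made for this example only; the sampling used in the repository may differ.

```python
import numpy as np

def sample_segments(n_frames, n_segments, segment_len=8):
    """Return an (n_segments, segment_len) array of frame indices.

    Hypothetical helper: segments are assumed to be uniformly spaced
    over the video, each made of `segment_len` consecutive frames.
    """
    if n_frames < segment_len:
        raise ValueError("video is shorter than one segment")
    # Evenly spaced start positions, so the last segment ends at the last frame.
    starts = np.linspace(0, n_frames - segment_len, n_segments).round().astype(int)
    # Each row holds the consecutive frame indices of one segment s_i.
    return starts[:, None] + np.arange(segment_len)[None, :]

segments = sample_segments(n_frames=100, n_segments=4)
print(segments.shape)  # (4, 8)
```

Each row would then be passed to the 3D CNN to obtain the segment feature x_i.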

Node Attention and Graph Embedding

(a) The node attention block measures the similarities \alpha between the segment feature x_i and the learned nodes \hat{Y}. Then, it attends to each node in \hat{Y} using \alpha. The result is the node-attentive feature Z_i, expressing how similar each node is to x_i. (b) The graph embedding layer models a set of T successive node-attentive features Z using 3 types of convolutions. i. Timewise Conv1D learns the temporal transition between node-attentive features {Z_i, ..., Z_{i+t}}. ii. Nodewise Conv1D learns the relationships between nodes {z_{i,j}, ..., z_{i,j+n}}. iii. Channelwise Conv3D updates the representation for each node z_{i,j}.
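The node attention step in (a) can be sketched in NumPy as follows. This is a simplified illustration, not the repository code: it assumes x_i is a single C-dimensional vector, that similarity is a dot product, and that \alpha is obtained with a sigmoid. The function name `node_attention` is hypothetical.

```python
import numpy as np

def node_attention(x, nodes):
    """Attend to N nodes given one segment feature.

    x:     (C,)   segment feature x_i (assumed flattened to a vector)
    nodes: (N, C) learned nodes Y_hat
    Returns Z_i of shape (N, C) and the similarities alpha of shape (N,).
    """
    scores = nodes @ x                      # dot-product similarity per node (assumption)
    alpha = 1.0 / (1.0 + np.exp(-scores))   # sigmoid activation (assumption)
    z = alpha[:, None] * nodes              # scale each node by its similarity
    return z, alpha

rng = np.random.default_rng(0)
x_i = rng.normal(size=16)          # toy feature with C = 16
y_hat = rng.normal(size=(5, 16))   # toy set of N = 5 nodes
z_i, alpha = node_attention(x_i, y_hat)
print(z_i.shape, alpha.shape)  # (5, 16) (5,)
```

Nodes that are more similar to x_i keep more of their magnitude in Z_i, while dissimilar nodes are scaled toward zero.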

Graph Embedding

(a) Timewise Conv1D learns the temporal transition between successive node embeddings {Z_i, ..., Z_{i+t}} using a kernel k^T of size t. (b) Nodewise Conv1D learns the relationships between consecutive nodes {z_{i,j}, ..., z_{i,j+n}} using a kernel k^N of size n.
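To make the two convolutions concrete, here is a minimal NumPy sketch that slides a kernel along either the time axis or the node axis of a tensor Z of shape (T, N, C). It assumes depthwise kernels (one weight per channel per tap), no padding, and no bias or activation; the helper name `conv1d_along_axis` and the tensor layout are assumptions for this example, and the channelwise step is omitted.

```python
import numpy as np

def conv1d_along_axis(z, kernel, axis):
    """Depthwise 'valid' 1D convolution (cross-correlation) along one axis.

    z:      (T, N, C) stack of node-attentive features
    kernel: (k, C)    one weight per tap and channel (assumption)
    axis:   0 for timewise, 1 for nodewise
    """
    k = kernel.shape[0]
    z = np.moveaxis(z, axis, 0)            # bring the convolved axis to the front
    w = kernel.reshape(k, 1, -1)           # broadcast over the remaining axis
    out_len = z.shape[0] - k + 1
    # Weighted sum over each window of k consecutive positions.
    out = np.stack([np.sum(z[s:s + k] * w, axis=0) for s in range(out_len)])
    return np.moveaxis(out, 0, axis)       # restore the original axis order

Z = np.ones((6, 4, 3))                     # toy input: T = 6, N = 4, C = 3
k_T = np.ones((3, 3))                      # timewise kernel, size t = 3
k_N = np.ones((2, 3))                      # nodewise kernel, size n = 2

timewise = conv1d_along_axis(Z, k_T, axis=0)
nodewise = conv1d_along_axis(Z, k_N, axis=1)
print(timewise.shape)  # (4, 4, 3)
print(nodewise.shape)  # (6, 3, 3)
```

The same operation is used in both cases; only the axis changes, which is what separates learning temporal transitions from learning relationships between nodes.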


How to Use?

Please start from the file. You will find code to reproduce the paper results on three datasets: Epic-Kitchens, Charades and Breakfast.

Python Packages

We use Python 2.7.15, provided by Anaconda 4.6.2, and depend on the following Python packages.


The code and the models in this repo are released under the GNU 3.0 LICENSE.