Repo

arXiv

Relational Group Activity Recognition

Table of Contents

  1. Key Updates
  2. Introduction
  3. Usage
  4. Dataset Overview
  5. Ablation Study

Key Updates

  • ResNet-50 Backbone: Replaced VGG19 with ResNet-50 for stronger feature extraction.
  • Ablation Studies: Comprehensive experiments to evaluate the contribution of each model component.
  • Test-Time Augmentation (TTA): Implemented to improve robustness and reliability during inference.
  • Graph Attention Operator: Implementation of an attention-based relational layer.
  • Improved Performance: Achieves consistently higher accuracy across all baselines compared to the original paper.
  • Modern Implementation: Fully implemented in PyTorch with support from PyTorch Geometric.

Introduction

Traditional pooling methods (max, average, or attention pooling) reduce dimensionality but often discard important spatial and relational details between people. The Hierarchical Relational Network (HRN) addresses this by introducing a relational layer that explicitly models interactions between individuals in a structured relationship graph.

Relational Layer Illustration

How the Relational Layer Works

  1. Graph Construction

    • Each person in a frame is represented as a node.
    • People are ordered based on the top-left corner (x, y) of their bounding boxes (first by x, then by y if tied).
    • Edges connect a person to their neighbors, forming cliques in the graph.
  2. Initial Person Features
    Each person’s initial representation comes from a CNN backbone (e.g., ResNet50):

    $$P_i^0 = \text{CNN}(I_i)$$

    where $I_i$ is the cropped image around person $i$.

  3. Relational Update

Graph Structure Illustration

At relational layer $\ell$, person $i$’s updated representation is:

$$P_i^\ell = \sum_{j \in E_i^\ell} F^\ell(P_i^{\ell-1} \oplus P_j^{\ell-1}; \theta^\ell)$$

  • $E_i^\ell$: neighbors of person $i$ in graph $G^\ell$
  • $\oplus$: concatenation operator
  • $F^\ell$: shared MLP for layer $\ell$ (input size $2N_{\ell-1}$, output size $N_\ell$)
  • This step computes pairwise relation vectors between $i$ and its neighbors, then aggregates them.
  4. Hierarchical Stacking

    Graph Structure Illustration

    • Multiple relational layers are stacked, compressing person features while refining relational context.
    • The architecture supports a variable number of people $K$ (robust to occlusions or false detections).
  5. Scene Representation
    The final scene feature $S$ is obtained by pooling person features from the last relational layer:

    $$S = P_1^L \mathbin{\triangledown} P_2^L \mathbin{\triangledown} \dots \mathbin{\triangledown} P_K^L$$

    where $\triangledown$ is a pooling operator (e.g., concatenation or element-wise max pooling). A minimal code sketch of these steps follows this list.
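Below is a minimal PyTorch sketch of the steps above: ordering people by the top-left corner of their bounding boxes, one relational layer implementing the pairwise update, and pooling into a scene feature. The layer sizes, the random boxes, and the single all-pairs clique are illustrative assumptions, not the repository's exact implementation.

import torch
import torch.nn as nn


class RelationalLayer(nn.Module):
    """P_i^l = sum over neighbors j of F^l(P_i^{l-1} (+) P_j^{l-1}), with F^l a shared MLP."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, persons: torch.Tensor, cliques: list) -> torch.Tensor:
        # persons: (K, in_dim) features for K people; cliques: lists of person indices
        K = persons.shape[0]
        updated = []
        for i in range(K):
            # neighbors of person i = the other members of every clique containing i
            neighbors = [j for clique in cliques if i in clique for j in clique if j != i]
            pairs = torch.cat(
                [persons[i].expand(len(neighbors), -1), persons[neighbors]], dim=1
            )
            updated.append(self.f(pairs).sum(dim=0))  # aggregate the pairwise relation vectors
        return torch.stack(updated)  # (K, out_dim)


# Illustrative example: 12 players with 2048-d backbone features and random boxes.
persons = torch.randn(12, 2048)
boxes = torch.rand(12, 4)  # (x, y, w, h) per person, random here for illustration
order = sorted(range(12), key=lambda i: (boxes[i, 0].item(), boxes[i, 1].item()))
persons = persons[order]   # order people by the top-left corner of their boxes

all_pairs = [list(range(12))]  # one clique containing everyone (all-pairs relations)
layer1 = RelationalLayer(2048, 256)
layer2 = RelationalLayer(256, 128)
refined = layer2(layer1(persons, all_pairs), all_pairs)  # (12, 128)
scene = refined.max(dim=0).values  # element-wise max pooling -> (128,) scene feature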


Usage


1. Clone the Repository

git clone https://github.com/Sh-31/Relational-Group-Activity-Recognition.git

2. Install the Required Dependencies

pip3 install -r requirements.txt

3. Download the Model Checkpoint

This is a manual step that involves downloading the model checkpoint files.

Option 1: Use Python Code

Download the checkpoint with kagglehub, then replace the repository's modeling folder with the downloaded one:

import kagglehub

# Download latest version
path = kagglehub.model_download("sherif31/relational-group-activity-recognition/pyTorch/default")

print("Path to model files:", path)

Option 2: Download Directly

Browse and download the specific checkpoint from Kaggle:
Relational-Group-Activity-Recognition - PyTorch Checkpoint


Dataset Overview

The dataset was created using publicly available YouTube volleyball videos. The authors annotated 4,830 frames from 55 videos, categorizing player actions into 9 labels and team activities into 8 labels.

Example Annotations


  • Figure: A frame labeled as “Left Spike,” with bounding boxes around each player, demonstrating team activity annotations.


Train-Test Split

  • Training Set: 3,493 frames
  • Testing Set: 1,337 frames

Dataset Statistics

Group Activity Labels

| Group Activity Class | Instances |
|----------------------|-----------|
| Right set            | 644       |
| Right spike          | 623       |
| Right pass           | 801       |
| Right winpoint       | 295       |
| Left winpoint        | 367       |
| Left pass            | 826       |
| Left spike           | 642       |
| Left set             | 633       |

Player Action Labels

| Action Class | Instances |
|--------------|-----------|
| Waiting      | 3,601     |
| Setting      | 1,332     |
| Digging      | 2,333     |
| Falling      | 1,241     |
| Spiking      | 1,216     |
| Blocking     | 2,458     |
| Jumping      | 341       |
| Moving       | 5,121     |
| Standing     | 38,696    |

Dataset Organization

  • Videos: 55, each assigned a unique ID (0–54).
  • Train Videos: 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54.
  • Validation Videos: 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51.
  • Test Videos: 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47.
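For convenience, the same split can be written down as plain Python constants (the variable names are illustrative, not the repository's actual configuration):

# Volleyball dataset video-ID splits as listed above (names are illustrative)
TRAIN_VIDEOS = [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39,
                40, 41, 42, 48, 50, 52, 53, 54]
VAL_VIDEOS = [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51]
TEST_VIDEOS = [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47]

# 24 + 15 + 16 video IDs cover all 55 videos exactly once
assert len(TRAIN_VIDEOS) + len(VAL_VIDEOS) + len(TEST_VIDEOS) == 55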

Dataset Download Instructions

  1. Enable Kaggle’s public API. Follow the guide here: Kaggle API Documentation.
  2. Use the provided shell script:
  chmod 600 .kaggle/kaggle.json
  chmod +x script/script_download_volleball_dataset.sh
  ./script/script_download_volleball_dataset.sh

For further information about the dataset, you can check out the paper authors' repository:
link


Ablation Study

Baselines

Single Frame Models:

  • B1-NoRelations: In the first stage, ResNet-50 is fine-tuned and each person is represented with 2048-d features. In the second stage, each person's features are passed through a shared dense layer of 128 units. The person representations (each of length 128) are then pooled and fed to a softmax layer for group activity classification.

  • RCRG-1R-1C: A pretrained ResNet-50 is fine-tuned and each person is represented with 2048-d features, followed by a single relational layer (1R) with all people in one clique (1C), so all-pairs relationships are learned.

  • RCRG-1R-1C-!tuned: Same as the previous variant, but the pretrained ResNet-50 is used without fine-tuning.

  • RCRG-2R-11C: Close to the RCRG-1R-1C variant, but uses 2 relational layers (2R) of sizes 256 and 128. Each of these 2 layers uses a single clique (11C) containing all people. This variant and the next ones explore stacking layers with different graph structures.

  • RCRG-2R-21C: Same as the previous model, but the first layer has 2 cliques, one per team. The second layer is all-pairs relations (1C).

  • RCRG-3R-421C: Three relational layers (of sizes 512, 256, and 128), with the number of cliques per layer set to (4, 2, 1). The first layer has 4 cliques, with each team divided into 2 cliques. A sketch of how these clique graphs can be constructed follows this list.
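The clique notation above only changes which pairs of people are connected. Here is a hedged sketch of how the one-clique (all-pairs) and two-clique (one per team) graphs might be built for 12 ordered players; the helper name and the assumption that the ordering puts one team in the first six positions are mine, not the repository's:

from itertools import combinations


def clique_edges(cliques):
    """All undirected pairs inside each clique; pairs across cliques are not connected."""
    edges = set()
    for clique in cliques:
        edges.update(combinations(sorted(clique), 2))
    return sorted(edges)


players = list(range(12))               # players already ordered by bounding-box position
one_clique = [players]                  # "1C": all-pairs relationships
per_team = [players[:6], players[6:]]   # "2C": one clique per team (assumed first 6 vs last 6)

print(len(clique_edges(one_clique)))    # 66 pairs (all pairs of 12 players)
print(len(clique_edges(per_team)))      # 30 pairs (15 within each team)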

Performance comparison
Original Paper Baseline Scores
My Scores (Accuracy)
| Model              | Test Acc | Test Acc TTA (4) | Paper Test Acc |
|--------------------|----------|------------------|----------------|
| B1-no-relations    | 89.07%   | 89.06%           | 85.1%          |
| RCRG-1R-1C         | 89.42%   | -                | 86.5%          |
| RCRG-1R-1C-untuned | 80.86%   | -                | 75.4%          |
| RCRG-2R-11C        | 89.15%   | -                | 86.1%          |
| RCRG-2R-21C        | 89.49%   | -                | 87.2%          |
| RCRG-3R-421C       | 88.97%   | -                | 86.4%          |
| RCRG-2R-11C-conc   | 89.60%   | 89.71%           | 88.3%          |
| RCRG-2R-21C-conc   | 89.60%   | 89.60%           | 86.7%          |
| RCRG-3R-421C-conc  | 89.23%   | -                | 87.3%          |

Notes:

  • -conc postfix is used to indicate concatenation pooling instead of max-pooling.
  • TTA uses 4 augmentation transforms (see the sketch below).
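A hedged sketch of the test-time augmentation used for the TTA columns, assuming a label-preserving set of four photometric views and a model that returns group-activity logits. The exact transforms in the repository may differ; horizontal flips are avoided here because they would swap the Left/Right activity labels.

import torch
import torchvision.transforms.functional as TF


@torch.no_grad()
def predict_with_tta(model, frames):
    """Average softmax scores over 4 augmented views of the input frames."""
    views = [
        frames,                             # original view
        TF.adjust_brightness(frames, 0.9),  # slightly darker
        TF.adjust_brightness(frames, 1.1),  # slightly brighter
        TF.adjust_contrast(frames, 1.1),    # slightly higher contrast
    ]
    probs = torch.stack([model(v).softmax(dim=-1) for v in views]).mean(dim=0)
    return probs.argmax(dim=-1)             # predicted group-activity class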

Temporal Models:

  • RCRG-2R-11C-conc-temporal: Uses 2 relational layers (2R) of sizes 256 and 128. Each of these 2 layers uses a single clique (11C) containing all people.

  • RCRG-2R-21C: The first layer has 2 cliques, one per team. The second layer is all-pairs relations (1C).

Performance comparison
Original Paper Baseline Scores
My Scores (Accuracy)
| Model                    | Test Acc | Test Acc TTA (3) | Paper Test Acc |
|--------------------------|----------|------------------|----------------|
| B1-no-relations-temporal | 88.93%   | 89.60%           | -              |
| RCRG-2R-11C-conc-V1      | 90.50%   | 90.73%           | 89.5%          |
| RCRG-2R-11C-conc-V2      | 91.55%   | 91.62%           | 89.5%          |
| RCRG-2R-11C-conc-V3      | 91.40%   | 91.77%           | 89.5%          |
| RCRG-2R-21C              | -        | -                | 89.4%          |

Notes:

  • -temporal postfix is used to indicate the model works on a sequence of frames rather than a single frame.
  • -conc postfix is used to indicate concatenation pooling instead of max-pooling.
  • The original paper did not clearly specify where the LSTM unit should be integrated into the model.
    To explore this, I implemented three possible variants (a sketch of the three placements follows these notes):
    • V1: LSTM before the relational layer → allows the relational layer to learn richer spatio-temporal features.
    • V2: LSTM after the relational layer → enhances the relational features with temporal modeling.
    • V3: LSTMs both before and after the relational layer → combines the strengths of V1 and V2.
  • I decided to train RCRG-2R-11C-conc only, since it achieved the best performance in both my implementation and the paper’s results.
  • I implemented B1-no-relations-temporal to evaluate the impact of the relational layer (this model was not included in the original paper).
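A hedged sketch of the three LSTM placements; the module names, hidden sizes, and the per-frame relational-block interface are assumptions made for illustration, not the repository's code.

import torch
import torch.nn as nn


class TemporalRelationalHead(nn.Module):
    """V1/V2/V3: where the LSTM sits relative to the relational layers."""

    def __init__(self, relational, in_dim=2048, out_dim=128, variant="V3"):
        super().__init__()
        self.variant = variant
        self.relational = relational                                  # maps (K, in_dim) -> (K, out_dim) per frame
        self.pre_lstm = nn.LSTM(in_dim, in_dim, batch_first=True)     # used by V1 and V3
        self.post_lstm = nn.LSTM(out_dim, out_dim, batch_first=True)  # used by V2 and V3

    def forward(self, x):
        # x: (T, K, in_dim) person features for T frames and K players
        T = x.shape[0]
        if self.variant in ("V1", "V3"):
            # temporal modeling per person before the relational layers
            x = self.pre_lstm(x.transpose(0, 1))[0].transpose(0, 1)
        x = torch.stack([self.relational(x[t]) for t in range(T)])    # (T, K, out_dim)
        if self.variant in ("V2", "V3"):
            # temporal modeling per person after the relational layers
            x = self.post_lstm(x.transpose(0, 1))[0].transpose(0, 1)
        return x[-1]                                                  # last-frame person features


# Example with a dummy relational block (a shared linear layer) just to exercise the sketch
head = TemporalRelationalHead(nn.Linear(2048, 128), variant="V2")
out = head(torch.randn(9, 12, 2048))  # 9 frames, 12 players -> (12, 128)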

Attention Models (new baseline):

  • Uses 2 relational layers (2R). The graphs of these two layers are one clique (11C) of all players, but this time a graph attentional operator is used instead of an MLP for the relational layers (a sketch using PyTorch Geometric's GATConv appears at the end of this section).
My Scores (Accuracy)
| Model               | Test Acc | Test Acc TTA (3) | Paper Test Acc |
|---------------------|----------|------------------|----------------|
| RCRG-2R-11C-conc-V1 | 91.77%   | 92.00%           | -              |

  • Figure: Confusion matrix for RCRG-2R-11C-conc-V1-Attention.
  • Figure: Confusion matrix for RCRG-2R-11C-conc-V1.
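A hedged sketch of the attention-based relational layer using PyTorch Geometric's GATConv; the all-pairs edge layout and the layer sizes are illustrative assumptions, not the repository's exact configuration.

import torch
from torch_geometric.nn import GATConv

K = 12  # number of players

# Fully connected (all-pairs) directed edges over the K players, as in the 11C variants
edge_index = torch.tensor(
    [[i, j] for i in range(K) for j in range(K) if i != j], dtype=torch.long
).t().contiguous()

# Two attention-based relational layers in place of the shared MLPs (sizes illustrative)
gat1 = GATConv(2048, 256)
gat2 = GATConv(256, 128)

persons = torch.randn(K, 2048)                                # per-person backbone features
refined = gat2(gat1(persons, edge_index).relu(), edge_index)  # (12, 128)
scene = refined.reshape(-1)                                   # concatenation pooling -> (1536,)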