Relational Group Activity Recognition

Key Updates
Introduction
- How the Relational Layer Works
Usage
Dataset Overview
Ablation Study
- Baselines

Key Updates

ResNet-50 Backbone: Replaced VGG19 with ResNet-50 for stronger feature extraction.
Ablation Studies: Comprehensive experiments to evaluate the contribution of each model component.
Test-Time Augmentation (TTA): Implemented to improve robustness and reliability during inference.
Graph Attention Operator: Adds an attention-based relational layer built on a graph attention operator.
Improved Performance: Achieves consistently higher accuracy across all baselines compared to the original paper.
Modern Implementation: Fully implemented in PyTorch with support from PyTorch Geometric.

Introduction

Traditional pooling methods (max, average, or attention pooling) reduce dimensionality but often discard important spatial and relational details between people. The Hierarchical Relational Network (HRN) addresses this by introducing a relational layer that explicitly models interactions between individuals in a structured relationship graph.

Relational Layer Illustration

How the Relational Layer Works

Graph Construction
- Each person in a frame is represented as a node.
- People are ordered based on the top-left corner (x, y) of their bounding boxes (first by x, then by y if tied).
- Edges connect a person to their neighbors, forming cliques in the graph.
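The graph construction steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names and the `(x, y, width, height)` box layout are hypothetical, and splitting the ordered list into contiguous, near-equal groups is an assumption about how cliques are formed.

```python
from itertools import combinations

def order_people(boxes):
    """Return person indices sorted by top-left corner: x first, then y."""
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][0], boxes[i][1]))

def build_cliques(ordered, num_cliques):
    """Split the ordered people into contiguous, near-equal groups (assumption)."""
    size, extra = divmod(len(ordered), num_cliques)
    cliques, start = [], 0
    for c in range(num_cliques):
        end = start + size + (1 if c < extra else 0)
        cliques.append(ordered[start:end])
        start = end
    return cliques

def clique_edges(cliques):
    """Connect every pair of people that share a clique (undirected edges)."""
    return [pair for clique in cliques for pair in combinations(clique, 2)]

# Hypothetical boxes as (x, y, width, height); only x and y are used here.
boxes = [(50, 10, 5, 9), (10, 30, 5, 9), (10, 20, 5, 9), (90, 15, 5, 9)]
ordered = order_people(boxes)                    # [2, 1, 0, 3]
cliques = build_cliques(ordered, num_cliques=2)  # [[2, 1], [0, 3]]
edges = clique_edges(cliques)                    # [(2, 1), (0, 3)]
```

With `num_cliques=1` every pair of people is connected, which corresponds to the all-pairs case.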
Initial Person Features
Each person’s initial representation comes from a CNN backbone (e.g., ResNet-50):
$$P_i^0 = \text{CNN}(I_i)$$
where $I_i$ is the cropped image around person $i$.
Relational Update

Graph Structure Illustration

At relational layer $\ell$, person $i$’s updated representation is:

$$P_i^\ell = \sum_{j \in E_i^\ell} F^\ell(P_i^{\ell-1} \oplus P_j^{\ell-1}; \theta^\ell)$$

$E_i^\ell$: neighbors of person $i$ in graph $G^\ell$
$\oplus$: concatenation operator
$F^\ell$: shared MLP for layer $\ell$ (input size $2N_{\ell-1}$, output size $N_\ell$)

This step computes pairwise relation vectors between $i$ and its neighbors, then aggregates them.
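The update rule can be traced by hand with a dependency-free sketch. It is an illustration of the formula only, not the repository's PyTorch layer: `relational_update` and `make_linear_relu` are hypothetical names, and the toy single-layer function stands in for the shared MLP $F^\ell$.

```python
def make_linear_relu(weights):
    """Toy stand-in for the shared MLP F: one linear layer followed by ReLU."""
    def F(vector):
        return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]
    return F

def relational_update(features, neighbors, F):
    """P_i = sum over j in neighbors[i] of F(P_i concatenated with P_j).

    Assumes every person has at least one neighbor.
    """
    updated = []
    for i, p_i in enumerate(features):
        # List addition concatenates the two feature vectors.
        relation_vectors = [F(p_i + features[j]) for j in neighbors[i]]
        # Element-wise sum of the pairwise relation vectors.
        updated.append([sum(dims) for dims in zip(*relation_vectors)])
    return updated

# Three people in one clique, input size N = 2, output size 1.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
F = make_linear_relu([[1.0, 0.0, 0.0, 1.0]])  # F(P_i, P_j) = relu(P_i[0] + P_j[1])
updated = relational_update(features, neighbors, F)
# Person 0: relu(1 + 4) + relu(1 + 6) = 12
```

Because the same `F` is applied to every pair and the results are summed, the layer does not depend on how many neighbors a person has.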

Hierarchical Stacking
- Multiple relational layers are stacked, compressing person features while refining relational context.
- The architecture supports a variable number of people $K$ (robust to occlusions or false detections).
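As a worked example of the compression, the shared MLP at each layer takes an input of size $2N_{\ell-1}$ and produces $N_\ell$ features. The helper below (a hypothetical name) computes those sizes for a 2048-d backbone and layer sizes 512, 256, and 128, the values used by the three-layer variant in the ablation study.

```python
def relational_layer_dims(backbone_dim, layer_sizes):
    """Return (mlp_input_size, mlp_output_size) for each stacked relational layer."""
    dims, prev = [], backbone_dim
    for out_dim in layer_sizes:
        # Two person vectors of size prev are concatenated before the MLP.
        dims.append((2 * prev, out_dim))
        prev = out_dim
    return dims

dims = relational_layer_dims(2048, (512, 256, 128))
# [(4096, 512), (1024, 256), (512, 128)]
```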
Scene Representation
The final scene feature $S$ is obtained by pooling person features from the last relational layer:
$$S = P_1^L \ \triangledown \ P_2^L \ \triangledown \dots \ \triangledown \ P_K^L$$
where $\triangledown$ is a pooling operator (e.g., concatenation or element-wise max pooling).
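The two pooling choices behave differently as the number of people changes, which a small sketch makes concrete. The function names are hypothetical and the feature values are made up for illustration.

```python
def max_pool(person_features):
    """Element-wise max over people: output size equals the feature size."""
    return [max(dims) for dims in zip(*person_features)]

def concat_pool(person_features):
    """Concatenate person vectors in order: output size is K times the feature size."""
    return [x for person in person_features for x in person]

people = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.5]]  # K = 3 people, 2 features each
scene_max = max_pool(people)        # [0.7, 0.9]
scene_concat = concat_pool(people)  # [0.2, 0.9, 0.7, 0.1, 0.4, 0.5]
```

Max pooling gives a fixed-size scene vector for any $K$, while concatenation keeps per-person detail but its length grows with $K$, so a classifier on top of it would presumably need a fixed number of people per frame.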

Usage

1. Clone the Repository

git clone https://github.com/Sh-31/Relational-Group-Activity-Recognition.git

2. Install the Required Dependencies

pip3 install -r requirements.txt

3. Download the Model Checkpoint

This is a manual step that involves downloading the model checkpoint files.

Option 1: Use Python Code

Download the checkpoint with kagglehub, then replace the modeling folder with the downloaded folder:

import kagglehub

# Download latest version
path = kagglehub.model_download("sherif31/relational-group-activity-recognition/pyTorch/default")

print("Path to model files:", path)
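Moving the downloaded files into place can also be scripted. The sketch below is one possible way to do it, assuming the target is a folder named `modeling` in the repository root; `copy_checkpoint` is a hypothetical helper, and it copies files over the target (overwriting files with the same name) rather than deleting the folder first.

```python
import shutil
from pathlib import Path

def copy_checkpoint(downloaded_dir, target_dir="modeling"):
    """Copy the downloaded files into target_dir, overwriting same-named files."""
    src = Path(downloaded_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"download folder not found: {src}")
    # dirs_exist_ok=True (Python 3.8+) lets the copy write into an existing folder.
    shutil.copytree(src, target_dir, dirs_exist_ok=True)
    return Path(target_dir)

# Usage, with `path` being the value returned by kagglehub.model_download:
# copy_checkpoint(path)
```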

Option 2: Download Directly

Browse and download the specific checkpoint from Kaggle:
Relational-Group-Activity-Recognition - PyTorch Checkpoint

Dataset Overview

The dataset was created using publicly available YouTube volleyball videos. The authors annotated 4,830 frames from 55 videos, categorizing player actions into 9 labels and team activities into 8 labels.

Example Annotations

Figure: A frame labeled as “Left Spike,” with bounding boxes around each player, demonstrating team activity annotations.

Train-Test Split

Training Set: 3,493 frames
Testing Set: 1,337 frames

Dataset Statistics

Group Activity Labels

Group Activity Class	Instances
Right set	644
Right spike	623
Right pass	801
Right winpoint	295
Left winpoint	367
Left pass	826
Left spike	642
Left set	633

Player Action Labels

Action Class	Instances
Waiting	3,601
Setting	1,332
Digging	2,333
Falling	1,241
Spiking	1,216
Blocking	2,458
Jumping	341
Moving	5,121
Standing	38,696

Dataset Organization

Videos: 55, each assigned a unique ID (0–54).
Train Videos: 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54.
Validation Videos: 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51.
Test Videos: 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47.
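For scripting, the split lists above can be written as a lookup table. This is a small sketch with hypothetical names (`SPLITS`, `split_of`); the video IDs are copied from the lists above.

```python
SPLITS = {
    "train": [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39,
              40, 41, 42, 48, 50, 52, 53, 54],
    "val": [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51],
    "test": [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47],
}

# Reverse mapping: video ID -> split name.
VIDEO_TO_SPLIT = {vid: name for name, ids in SPLITS.items() for vid in ids}

def split_of(video_id):
    """Return 'train', 'val', or 'test' for a video ID in 0..54."""
    return VIDEO_TO_SPLIT[video_id]
```

The three lists are disjoint and together cover all 55 video IDs (24 train, 15 validation, 16 test).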

Dataset Download Instructions

Enable Kaggle’s public API. Follow the guide here: Kaggle API Documentation.
Use the provided shell script:

chmod 600 .kaggle/kaggle.json
chmod +x script/script_download_volleball_dataset.sh
./script/script_download_volleball_dataset.sh

For further information about the dataset, see the paper authors’ repository:
link

Ablation Study

Baselines

Single Frame Models:

B1-NoRelations: In the first stage, ResNet-50 is fine-tuned and each person is represented by a 2048-d feature vector. In the second stage, each person is passed through a shared dense layer with 128 features. The person representations (128 features each) are then pooled and fed to a softmax layer for group activity classification.
RCRG-1R-1C: A pretrained ResNet-50 is fine-tuned and each person is represented by a 2048-d feature vector. A single relational layer (1R) is then applied with all people in 1 clique (1C), so all-pairs relationships are learned.
RCRG-1R-1C-!tuned: Same as the previous variant, but the pretrained ResNet-50 is used without fine-tuning (listed as RCRG-1R-1C-untuned in the results table).
RCRG-2R-11C: Close to the RCRG-1R-1C variant, but uses 2 relational layers (2R) of sizes 256 and 128. The graph of each of these 2 layers is 1 clique (11C) of all people. This variant and the next ones explore stacking layers with different graph structures.
RCRG-2R-21C: Same as the previous model, but the first layer has 2 cliques, one per team. The second layer uses all-pairs relations (1C).
RCRG-3R-421C: Three relational layers (of sizes 512, 256, and 128) with the number of cliques per layer set to (4, 2, 1). The first layer has 4 cliques, with each team divided into 2 cliques.
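The variant names encode the architecture: the digit before `R` is the number of relational layers, and each digit before `C` is the number of cliques in the corresponding layer. A small parser makes the convention explicit. `parse_variant` is a hypothetical helper, not part of the repository, and the pooling rule follows the `-conc` convention used in the results notes.

```python
import re

def parse_variant(name):
    """Return (num_layers, cliques_per_layer, pooling) for an RCRG variant name."""
    match = re.search(r"(\d+)R-(\d+)C", name)
    if match is None:
        raise ValueError(f"not a relational variant name: {name}")
    num_layers = int(match.group(1))
    cliques = [int(digit) for digit in match.group(2)]
    if len(cliques) != num_layers:
        raise ValueError(f"expected {num_layers} clique counts in: {name}")
    pooling = "concat" if "-conc" in name else "max"
    return num_layers, cliques, pooling

parse_variant("RCRG-1R-1C")        # (1, [1], "max")
parse_variant("RCRG-3R-421C")      # (3, [4, 2, 1], "max")
parse_variant("RCRG-2R-21C-conc")  # (2, [2, 1], "concat")
```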

Performance comparison

Original Paper Baselines Score

My Scores (Accuracy)

Model	Test Acc	Test Acc TTA (4)	Paper Test ACC
B1-no-relations	89.07%	89.06%	85.1%
RCRG-1R-1C	89.42%	-	86.5%
RCRG-1R-1C-untuned	80.86%	-	75.4%
RCRG-2R-11C	89.15%	-	86.1%
RCRG-2R-21C	89.49%	-	87.2%
RCRG-3R-421C	88.97%	-	86.4%
RCRG-2R-11C-conc	89.60%	89.71%	88.3%
RCRG-2R-21C-conc	89.60%	89.60%	86.7%
RCRG-3R-421C-conc	89.23%	-	87.3%

Notes:

The -conc suffix indicates concatenation pooling instead of max-pooling.
TTA used 4 augmentation transforms.
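Test-time augmentation is commonly implemented by averaging class probabilities over several transformed copies of the input. The sketch below shows that pattern in plain Python under stated assumptions: the four transforms are placeholders (the transforms actually used are not listed here), and `toy_model` is a stand-in for a trained network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_tta(model, image, transforms):
    """Average class probabilities over transformed views of one input."""
    view_probs = [softmax(model(t(image))) for t in transforms]
    mean_probs = [sum(p) / len(view_probs) for p in zip(*view_probs)]
    predicted = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return predicted, mean_probs

# Placeholder transforms on a flat list of pixel values (hypothetical).
transforms = [
    lambda img: img,                     # original view
    lambda img: [x * 1.1 for x in img],  # brighter
    lambda img: [x * 0.9 for x in img],  # darker
    lambda img: [x + 0.05 for x in img], # constant offset
]

def toy_model(img):
    """Stand-in for a trained network: returns two logits."""
    return [sum(img), 1.0]

predicted, mean_probs = predict_with_tta(toy_model, [0.5, 0.5, 0.5, 0.5], transforms)
```

One caveat for this dataset: the group labels distinguish left from right, so a horizontal flip would likely need the predicted left/right labels swapped before averaging.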

Temporal Models:

RCRG-2R-11C-conc-temporal: Uses 2 relational layers (2R) of sizes 256 and 128. The graphs of these 2 layers are 1 clique (11C) of all people.
RCRG-2R-21C: The first layer has 2 cliques, one per team. The second layer is all-pairs relations (1C).

Performance comparison

Original Paper Baselines Score

My Scores (Accuracy)

Model	Test Acc	Test Acc TTA (3)	Paper Test ACC
B1-no-relations-temporal	88.93%	89.60%	-
RCRG-2R-11C-conc-V1	90.50%	90.73%	89.5%
RCRG-2R-11C-conc-V2	91.55%	91.62%	89.5%
RCRG-2R-11C-conc-V3	91.40%	91.77%	89.5%
RCRG-2R-21C	-	-	89.4%

Notes:

The -temporal suffix indicates that the model works on a sequence of frames rather than a single frame.
The -conc suffix indicates concatenation pooling instead of max-pooling.
The original paper did not clearly specify where the LSTM unit should be integrated into the model.
To explore this, I implemented three possible variants:
- V1: LSTM before the relational layer → allows the relational layer to learn richer spatio-temporal features.
- V2: LSTM after the relational layer → enhances the relational features with temporal modeling.
- V3: LSTMs both before and after the relational layer → combines the strengths of V1 and V2.
I decided to train RCRG-2R-11C-conc only, since it achieved the best performance in both my implementation and the paper’s results.
I implemented B1-no-relations-temporal to evaluate the impact of the relational layer (this model was not included in the original paper).

Attention Models (new baseline):

Uses 2 relational layers (2R). The graphs of these two layers are one clique (11C) of all players, but this time using a graph attentional operator instead of an MLP for the relational layers.
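For comparison with the MLP-based update, a graph attention update can be written in the same notation. This assumes the operator follows the standard graph attention (GAT) formulation, which is not confirmed here; $W^\ell$ and $a^\ell$ are the learnable weight matrix and attention vector of that formulation.

$$e_{ij}^\ell = \text{LeakyReLU}\left((a^\ell)^\top \left[W^\ell P_i^{\ell-1} \oplus W^\ell P_j^{\ell-1}\right]\right)$$

$$\alpha_{ij}^\ell = \frac{\exp(e_{ij}^\ell)}{\sum_{k \in E_i^\ell} \exp(e_{ik}^\ell)}, \qquad P_i^\ell = \sum_{j \in E_i^\ell} \alpha_{ij}^\ell \, W^\ell P_j^{\ell-1}$$

Under this formulation, the plain sum over neighbors becomes a weighted sum whose weights $\alpha_{ij}^\ell$ are learned and add up to 1 for each person. Implementations often also include person $i$ itself among its neighbors.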

My Scores (Accuracy)

Model	Test Acc	Test Acc TTA (3)	Paper Test ACC
RCRG-2R-11C-conc-V1	91.77%	92.00%	-

RCRG-2R-11C-conc-V1-Attention Confusion Matrix
