Classification of body-keypoint trajectories of gesture co-occurring with time expressions

Shrey Pandit
9 min read · Aug 31, 2022


Weekly Blog — Link

Code — Link

About this Post:

In this blog, I will explain in detail my project, “Classification of body-keypoint trajectories of gesture co-occurring with time expressions,” carried out as part of Google Summer of Code 2022 with the Red Hen Lab organization.

About My Project

Human interaction is multimodal: we not only articulate words, we also show them. Expressing concepts such as time, place, and emotion comes with speech and accompanying movements called body gestures. In this study, we propose a multimodal method that captures the different patterns of body gestures aligned with an articulated time expression such as “from beginning to end”. Our proposed architecture rests on two neural networks: the Compact Transformer and the Long Short-Term Memory (LSTM) network. The Compact Transformer, the current state-of-the-art structure for temporally distributed images, also performs well on low-resource data, while the LSTM handles temporal data with long-term dependencies. The hypothesis of this project is that a relation exists between the body gestures we make and the time expressions we speak, and we use these neural networks to accept or reject this hypothesis.

About the RAW Dataset

The dataset available is in MP4 (video) format, containing short clips, broadly of discussions by anchors in the newsroom. These clips contain key phrases like “from beginning to end”, “back in time”, etc. The dataset is broadly classified into “Sequential”, “Demarcative”, and “Deictic”, with finer subclass labels also available. An example video is given below —

Example of raw data of the class “Demarcative” and subclass “from beginning to end”

It took a substantial amount of time to figure out a suitable data format for the input to the classifier architecture. We finally settled on representing each video as an image containing the trajectory of keypoints such as both hands, incorporating the time factor through a color-gradient scheme. In simple words, for every video we would have a corresponding image with the entire colored “trace” of the hands. The color starts from one shade, representing the beginning of the clip, and slowly changes to another shade representing the position of the hands at the end of the clip.

This form of input had several benefits. First, it gave us control over the size of the input, which could be adjusted by changing the image dimensions, allowing the required trade-offs between model size and latency. Also, since a large body of work on image processing already exists, any image-based time-series method can be applied to the task.

Sequential To-Do’s of the project

Below I broadly list the subgoals we had to achieve to complete this project.

  1. Extracting the pose coordinates using OpenPose
  2. Processing the extracted JSON files into the BODY_25 keypoint coordinates
  3. Extracting the relevant hand and head coordinates from the full set
  4. Tracing the coordinates into image format
  5. Creating a neural network architecture for classification
  6. Creating a DataLoader and training script for the model
  7. Extracting PRAAT features from the audio part of the dataset
  8. Future work

In the rest of the blog, I will go through each of these subtasks and explain the thought process, the challenges faced, and the solutions found while completing them.

Extracting OpenPose coordinates

For extracting body-pose coordinates, I refer to the OpenPose documentation and GitHub repository. We follow the BODY_25 extraction template, which, in simple words, labels the entire body with 25 keypoints, including both hands and the head, making it ideal for our work. An example of the extraction points is given below (taken from the official documentation) —

Example of the BODY_25 system. The labels of the various body points can be used to slice the coordinates from the JSON file.

Broadly, we extract the coordinates from the videos in a loop, using the snippet below —

!cd openpose && ./build/examples/openpose/openpose.bin --face --hand --video ../clip.mp4 --write_json $path --display 0 --write_video /content/clip_openpose.avi --keypoint_scale 3 --num_gpu -1 --model_pose BODY_25 --part_candidates

The exact code for replication can be found in the Colab notebook shared with this document; the above command is just an example.

Here the arguments “face” and “hand” enable extraction of face and hand keypoints from the video; write_json gives the path to the folder where the output will be stored; we also save the annotated video using the write_video argument; keypoint_scale sets the scaling of the output coordinate system; and model_pose selects the extraction template, which in our case, as mentioned above, is BODY_25.
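To run the extraction over a whole folder of clips, a minimal sketch is given below. The folder names and output layout are hypothetical, and the flags mirror the command above (minus the annotated-video output):

# Sketch: loop OpenPose extraction over a folder of clips.
# CLIPS_DIR and OUT_DIR are hypothetical; adjust to your setup.
import subprocess
from pathlib import Path

CLIPS_DIR = Path("clips")          # hypothetical folder of input .mp4 files
OUT_DIR = Path("openpose_json")    # hypothetical folder for JSON keypoint output

for clip in sorted(CLIPS_DIR.glob("*.mp4")):
    json_out = OUT_DIR / clip.stem
    json_out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "./build/examples/openpose/openpose.bin",
            "--face", "--hand",
            "--video", str(clip.resolve()),
            "--write_json", str(json_out.resolve()),
            "--display", "0",
            "--keypoint_scale", "3",
            "--num_gpu", "-1",
            "--model_pose", "BODY_25",
            "--part_candidates",
        ],
        cwd="openpose",  # run from the OpenPose build directory, as in the shell command
        check=True,
    )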

Extraction of data from created JSON Files

For this step, I use R code written by my mentors Raul and Brian. To run it in Google Colab, a simple extension, shown below, can be loaded.

%load_ext rpy2.ipython
%%R

Using the dfMaker code, I was able to extract the coordinates as CSV files, which I saved for further processing. For ease, I am sharing the exact call of the dfMaker function.

result = dfMaker("PATH to the folder", save.csv = T, return.empty = T, output.folder = "Output PATH")

Here the input and output paths are to be specified, and save.csv = T enables saving the files instead of simply parsing them.

Extraction from JSON file

The next step involves filtering the CSV output from the previous step for the relevant points, in our case points 4 and 7 (refer to the BODY_25 image above). We do this using pandas.

We extract both the left and right hand coordinates for all the people in the frame. (Yes, there were cases with more than one person in a clip; we keep both persons’ hand gestures, assuming only one person moves their hands vigorously during the clip.)

The CSV file also contains more than one candidate coordinate for the same keypoint, each with a probability score; where such conflicts arise, we choose the coordinate with the higher probability. Instead of saving this redundant data, we directly plot it in a loop as a continuous trace. An example of the extracted coordinates is given below —

Example of the various hand coordinates spread through the frames of the video
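As a rough sketch of this filtering step — the column names (“frame”, “points”, “x”, “y”, “c”) are my assumptions about the dfMaker CSV layout, not its guaranteed schema:

# Sketch: keep only wrist keypoints 4 (right) and 7 (left) from the dfMaker CSV,
# resolving duplicate candidates per frame by highest confidence.
import pandas as pd

df = pd.read_csv("clip_keypoints.csv")        # hypothetical dfMaker output file
wrists = df[df["points"].isin([4, 7])]        # BODY_25: 4 = right wrist, 7 = left wrist

# When part candidates give several (x, y, c) rows per frame and keypoint,
# keep the candidate with the highest confidence score.
best = (
    wrists.sort_values("c", ascending=False)
          .drop_duplicates(subset=["frame", "points"], keep="first")
          .sort_values(["frame", "points"])
)

right_hand = best[best["points"] == 4][["frame", "x", "y"]]
left_hand = best[best["points"] == 7][["frame", "x", "y"]]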

The plotting of the image is explained in the next subsection.

Tracing the coordinates into the image

I use matplotlib to plot the coordinates and mycolorpy to get a color scheme representing the time-based component.

The trace representation of the video turns out to be as accurate as expected with OpenPose and the current framework. An example is given below.

The video is kept the same as stated in the previous subsection —

And the corresponding hand trace is shown below —

Image showing the hand trace of the above video

Here the purple color represents the subject’s right hand, and the orange color represents the left hand. The shade moves from lighter to darker as time progresses: light purple represents the initial frames of the video, while the darker shade represents the end of the clip, and the same holds for orange. This can be cross-verified with the video above.

Colour schema for the hand coordinates as time progresses
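A minimal sketch of how such a trace image can be produced is shown below. It uses matplotlib colormaps in place of mycolorpy and assumes the per-frame wrist coordinates from the previous step:

# Sketch: plot the wrist trajectories as one image, encoding time as a
# light-to-dark color gradient (matplotlib colormaps stand in for mycolorpy).
import matplotlib.pyplot as plt
import numpy as np

def plot_trace(right_hand, left_hand, out_path="trace.png"):
    fig, ax = plt.subplots(figsize=(4, 4))
    for coords, cmap in [(right_hand, plt.cm.Purples), (left_hand, plt.cm.Oranges)]:
        n = len(coords)
        # shades run from light (early frames) to dark (late frames)
        colors = cmap(np.linspace(0.3, 1.0, n))
        ax.scatter(coords["x"], coords["y"], c=colors, s=8)
    ax.invert_yaxis()   # OpenPose uses image coordinates, with y growing downward
    ax.axis("off")
    fig.savefig(out_path, dpi=150, facecolor="white")
    plt.close(fig)

The y-axis is inverted so the plotted trace matches the orientation of the original video frame.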

Create a neural network architecture for classification

As intended, I initially designed a vision-transformer-based architecture using the Vformer library. The implementation was successful, but the performance was not up to the mark. A possible explanation is the change in dataset format: transformers work well with time-series data, but in our case we have squashed the time-based information into color gradients in the image. This could be one possible reason for the non-convergence of the model.

Next, we tried a convolutional neural network-based architecture with various manipulations of the image, such as pixel flipping. The reasoning is that the majority of the image is white (1.0 in pixel terms) and the trajectory is only a small part of it, which tends to get ignored; flipping the image turns the entire white region black, which helps backpropagation. Other possible manipulations include Albumentations-style augmentations or adding Gaussian noise to the image for better training; I leave these for future work, and other data-manipulation techniques could also be used to improve performance.
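A small sketch of the pixel-flipping idea, assuming the trace images are loaded through a standard torchvision transform pipeline (the image size is a placeholder):

# Sketch: invert the trace images before feeding them to the CNN, so the large
# white background becomes zeros and the colored trajectory carries the signal.
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),              # hypothetical input size
    T.ToTensor(),                      # pixels in [0, 1]; white background = 1.0
    T.Lambda(lambda img: 1.0 - img),   # pixel flipping: background -> 0.0
])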

Graphs

We see that the accuracy does not increase and the training loss keeps decreasing, while the test loss remains stable and does not decrease. This mainly shows that the model finds it very hard to learn from the representation and thus to converge. A confusion matrix does not make much sense here, as the outputs given by the model are constant, so no useful information can be extracted from it.

Test loss vs. train loss — the training loss keeps decreasing, showing that the model is overfitting, while the test loss does not decrease. The accuracy remains the same and does not increase :(

Creating a DataLoader and Training script

I use PyTorch’s DataLoader to create a loader for the model with different batch-size and random-shuffling options for ideal training. I have also incorporated Weights & Biases for easier tracking of the training procedure. I use cross-entropy loss, as this is a multiclass classification task, and tried both the Adam and SGD optimizers, with a learning rate of around 5e-3 working best.
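A condensed sketch of this training setup is given below; the dataset, model, and Weights & Biases project name are placeholders rather than the exact ones used in the project:

# Sketch: PyTorch DataLoader, cross-entropy loss, Adam at lr = 5e-3, W&B logging.
import torch
from torch import nn
from torch.utils.data import DataLoader
import wandb

def train(model, train_set, test_set, epochs=30, batch_size=32, lr=5e-3):
    wandb.init(project="gesture-trajectory-classification")  # hypothetical project name
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # evaluate on the held-out set each epoch
        model.eval()
        correct, total, test_loss = 0, 0, 0.0
        with torch.no_grad():
            for images, labels in test_loader:
                logits = model(images)
                test_loss += criterion(logits, labels).item()
                correct += (logits.argmax(dim=1) == labels).sum().item()
                total += labels.size(0)

        wandb.log({
            "train_loss": train_loss / len(train_loader),
            "test_loss": test_loss / len(test_loader),
            "test_accuracy": correct / total,
        })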

Extraction of PRAAT features from video

As additional work for the project, I plan to extract PRAAT features from the video data. This would be beneficial for tracking voice modulations and changes in vocal cues.

I use the parselmouth library, along with some online references, integrated into the extraction pipeline. Although we extract all the PRAAT features, we currently plan to use only the second-wise intensity and second-wise pitch of the audio part of the data.

Example of the extracted voice features

For extraction, we first convert the video to WAV-format audio using the FFmpeg library, after which we use the parselmouth library to extract the features and save them in CSV format. An example of the output is shown above.
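A minimal sketch of this pipeline is shown below; resampling the Praat pitch and intensity tracks to one value per second via interpolation is my own simplification, and the file names are placeholders:

# Sketch: convert a clip to WAV with FFmpeg, then pull second-wise pitch and
# intensity with parselmouth and save them to CSV.
import subprocess
import numpy as np
import pandas as pd
import parselmouth

# mono WAV is enough for pitch/intensity extraction
subprocess.run(["ffmpeg", "-y", "-i", "clip.mp4", "-ac", "1", "clip.wav"], check=True)

snd = parselmouth.Sound("clip.wav")
pitch = snd.to_pitch()
intensity = snd.to_intensity()

# resample both tracks to one value per second (simplifying assumption)
seconds = np.arange(0, snd.duration, 1.0)
pitch_hz = np.interp(seconds, pitch.xs(), pitch.selected_array["frequency"])
intensity_db = np.interp(seconds, intensity.xs(), intensity.values[0])

pd.DataFrame(
    {"second": seconds, "pitch_hz": pitch_hz, "intensity_db": intensity_db}
).to_csv("clip_praat_features.csv", index=False)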

Future Work

For the community, I have tried to document the code extensively; should further clarification be needed, I am always available and happy to help. Some future directions, in my opinion, could be using more of the PRAAT features (the extraction code is already written). Gaussian-noise augmentation could also be applied to the images for better classification.

For future work, we could also incorporate pre-trained models for better trajectory recognition. An OpenCV-based implementation that could do the job more efficiently would be a welcome contribution.

An ensemble model combining the trajectory-based classifier with an LSTM-based architecture for the audio data could be a good approach and might give satisfactory results, similar to the figure shown below.

As my mentor Prof. Cristobal correctly pointed out, this is a small piece of a larger puzzle, and there will not be a single, easy architecture that solves the whole problem.

About Me

Hi, I am Shrey Pandit, a student from BITS Pilani, India, graduating in 2023. I am fond of working in deep learning, specifically natural language processing and multimodal analysis. I have worked with some of the finest labs, such as Microsoft Research, Princeton-NLP, and MIDAS: Multimodal Digital Media Analysis Lab, to pursue my interest. I have been fortunate to publish work at top A* conferences, including the ACL 2022 main conference, the NAACL 2022 main conference, MRL-EMNLP 2021, and ECML-PKDD 2020.

Contact — f20190138@goa.bits-pilani.ac.in

Links to various important pages

  1. Link to code notebook — link
  2. Link to the GitHub Repository containing the pre-mid-sem and post-mid-sem code — link
  3. Link to the weekly progress report for this project — link
  4. Link to the Red Hen Project List — link
