GSoC 2022 @ Red Hen Lab

Shrey Pandit
8 min read · Jun 13, 2022


Introduction —

Hi, I am Shrey Pandit, a student at BITS Pilani, India, graduating in 2023. I enjoy working in deep learning, specifically natural language processing and multimodal analysis. To pursue this interest, I have worked with some of the finest labs, such as Princeton NLP and MIDAS: Multimodal Digital Media Analysis Lab. I have been fortunate to publish work at top A* conferences, including the ACL 2022 main conference, the NAACL 2022 main conference, MRL at EMNLP 2021, and ECML-PKDD 2020.

I am looking forward to working with top mentors from Red Hen Lab (Raúl Sánchez, Cristóbal Pagán Cánovas, Brian Herreño Jiménez, Masoumeh Moradipour-Tari, and others) in the domain of multimodal analysis, to explore my interests further and give back to the community.

About My Project

The way humans interact with each other is multimodal: we do not only articulate words, we also show them. Expressing concepts such as time, place, and emotion involves speech together with movements called body gestures. In this study, we propose a multimodal method that captures the different patterns of body gestures aligned with an articulated time expression such as “from beginning to end”. Our proposed architecture rests on two neural networks, the Compact Transformer and the Long Short-Term Memory (LSTM) network. The Compact Transformer, a state-of-the-art structure for temporally distributed images, also performs well on low-resource data, while the LSTM performs well on temporal data with long-term dependencies. The hypothesis of this project is that a relation exists between the body gestures we make and the time expressions we speak, and we use these neural networks to accept or reject this hypothesis.
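To make the sequence-modelling half of this idea concrete, below is a minimal, hypothetical sketch of an LSTM classifier over frame-wise features in PyTorch; the feature size, sequence length, and number of classes are illustrative assumptions, not the project's final configuration.

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """Toy LSTM classifier over a sequence of frame-wise features.

    Assumed input shape: (batch, n_frames, n_features), where n_features could
    be flattened hand/head keypoint coordinates (plus audio features) per frame.
    """
    def __init__(self, n_features=8, hidden_size=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # final hidden state summarizes the sequence
        return self.head(h_n[-1])    # class logits, shape (batch, n_classes)

# Example: 4 clips, 120 frames each, 8 hypothetical features per frame
logits = GestureLSTM()(torch.randn(4, 120, 8))
```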

Week 0

Week 0 was the community bonding period, and I was fortunate enough to have a meet-and-greet session on 8th June 2022 with all my fellow students and mentors from Red Hen Lab. It was a very interactive session, and I met many distinguished members of the lab who are proficient in their domains.

Apart from this meeting, I had to set up the infrastructure on the HPC cluster provided by Red Hen Lab; this was new to me, and I learned how to work with clusters. This week, I also shared the initial dataset that will be used in my project.

Looking forward to a productive upcoming week :)

Week 1

This week marked the official start of GSoC. After a discussion with my mentor Raúl, I started working with the raw dataset to extract the hand coordinates/keypoints from the videos in the dataset. To do this, I used the OpenPose library and separated the hand coordinates from the full set of keypoints. The next task is to represent this set of hand coordinates with a single representative point, which can then be used to trace the path and classify the hand gesture.
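As a rough illustration of what such a representative point could be, the sketch below reduces one frame's OpenPose-style hand keypoints (triplets of x, y, confidence) to a confidence-weighted centroid. The JSON field names follow OpenPose's usual per-frame output, but treat the exact layout and the choice of a centroid as assumptions rather than the project's final method.

```python
import json
import numpy as np

def hand_centroid(keypoints_json_path, hand="right"):
    """Reduce one frame's hand keypoints to a single representative point.

    Assumes OpenPose-style per-frame JSON with a "people" list and
    "hand_right_keypoints_2d"/"hand_left_keypoints_2d" arrays of
    (x, y, confidence) triplets.
    """
    with open(keypoints_json_path) as f:
        frame = json.load(f)
    if not frame.get("people"):
        return None  # no person detected in this frame
    kp = np.array(frame["people"][0][f"hand_{hand}_keypoints_2d"]).reshape(-1, 3)
    kp = kp[kp[:, 2] > 0]            # drop undetected points (confidence == 0)
    if len(kp) == 0:
        return None
    w = kp[:, 2]                     # confidence-weighted centroid
    return (np.average(kp[:, 0], weights=w), np.average(kp[:, 1], weights=w))
```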

I have been thinking about and discussing how to represent the trajectory, and decided on two broad ideas:

  1. The representative point of the hand, extracted from the video after splitting it into frames with their hand coordinates (currently 30 fps), can be traced into a single picture of the point's entire path, with a color gradient representing the frame order (encoding the time-series nature of the data). The benefit of this is simplicity and an efficient way of representing the whole video in a single picture, making it a faster and more conventional approach.
  2. Alternatively, the set of images, each showing the point's position in the full frame (similar to a video of a moving point), could be fed one at a time into a ViT, fully utilizing the transformer inside the ViT (which is theoretically better suited to time-series data). The final representation could then be fed into an MLP and classified.

In the coming week, I plan to work on the audio part of the project.

Week 2

This week, after reaching out to Brian, one of my mentors and the lead for the audio section of the project, I planned to extract frame-wise features such as pitch and intensity, as well as clip-level features such as shimmer and jitter (mainly the Praat features). I used libraries and tools such as FFmpeg (for video-to-audio conversion), librosa, praat-parselmouth, and other audio-processing libraries to extract these features.
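A minimal sketch of this extraction step, with hypothetical file names: the FFmpeg call converts the video to a mono WAV, and praat-parselmouth then gives the frame-wise pitch and intensity tracks plus clip-level jitter and shimmer through the standard Praat commands.

```python
import subprocess
import parselmouth
from parselmouth.praat import call

# Video -> mono 16 kHz WAV (hypothetical file names)
subprocess.run(["ffmpeg", "-y", "-i", "clip.mp4", "-vn", "-ac", "1",
                "-ar", "16000", "clip.wav"], check=True)

snd = parselmouth.Sound("clip.wav")

# Frame-wise features
pitch = snd.to_pitch()                          # Hz per analysis frame (0 = unvoiced)
pitch_values = pitch.selected_array["frequency"]
intensity = snd.to_intensity()                  # dB per analysis frame
intensity_values = intensity.values[0]

# Clip-level features via Praat commands
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, point_process], "Get shimmer (local)",
               0, 0, 0.0001, 0.02, 1.3, 1.6)
```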

These frame-wise audio features will be sent in alongside the frame-wise coordinates of the hand, nose, and head keypoints, passed in as a time series.

For the upcoming weeks, I plan to implement a basic ViT (Compact ViT) structure, as it would be useful in either technique and is crucial for the entire project.

Week 3 & 4

These two weeks were a bit hectic and resulted in some fruitful outcomes. I worked on various topics, starting with finalizing a framework for implementing the Vision Transformer. For the ViT, I used the vformer library and implemented a basic working structure on some standard test datasets. Given the ease of implementation, I finalized this library and would suggest it to anyone wanting to work with vision transformers; one benefit is the ease with which different types of ViT models can be tried out.

Apart from this, I worked on the crucial part of getting the coordinates from the OpenPose library. I planned to get the frame-wise coordinates that could be passed into the architecture, and extracted them for both hands individually and for the head. For this, I took help from my mentors, Brian and Raúl, who had previously worked on a similar problem statement; their help played an immense part in this section of the project.
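The project itself relies on the mentors' R-based extraction for this step (see Week 10), but as a rough Python equivalent, the sketch below walks OpenPose's per-frame JSON output (one file per frame, as written by its --write_json option) and assembles a frame-wise table of left-hand, right-hand, and nose coordinates; the folder layout and column names are assumptions.

```python
import glob
import json
import numpy as np
import pandas as pd

def point(triplets, idx=None):
    """Mean (x, y) of the confident keypoints, or one specific keypoint."""
    kp = np.array(triplets).reshape(-1, 3)
    if idx is not None:
        return kp[idx, 0], kp[idx, 1]
    kp = kp[kp[:, 2] > 0]
    return (kp[:, 0].mean(), kp[:, 1].mean()) if len(kp) else (np.nan, np.nan)

rows = []
# One JSON per frame, as written by OpenPose's --write_json flag (hypothetical folder)
for i, path in enumerate(sorted(glob.glob("openpose_out/*_keypoints.json"))):
    with open(path) as f:
        people = json.load(f).get("people", [])
    if not people:
        continue
    p = people[0]                                  # assume a single speaker per frame
    lx, ly = point(p["hand_left_keypoints_2d"])
    rx, ry = point(p["hand_right_keypoints_2d"])
    nx, ny = point(p["pose_keypoints_2d"], idx=0)  # keypoint 0 is the nose in BODY_25
    rows.append({"frame": i, "lx": lx, "ly": ly, "rx": rx, "ry": ry,
                 "nose_x": nx, "nose_y": ny})

pd.DataFrame(rows).to_csv("frame_coordinates.csv", index=False)
```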

In the coming weeks, I plan to create an image of the trajectory of the hands and head from these coordinates. An issue I have yet to tackle is the presence of multiple people in a video frame; I will also have a look at that :)

Week 5 & 6

In the past weeks, I have implemented the trajectory image from the coordinates, using matplotlib. The image also has a color gradient that changes as time passes, incorporating the time-series information into a single image. A sample image generated from a video is shown below.

Image showing the left- and right-hand traces extracted from a sample video.

The above image can easily be used in a neural network architecture to predict the class; that will be the work of the coming weeks.
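For reference, a minimal sketch of how such a trajectory image can be produced with matplotlib, assuming the frame-wise coordinates are already in a CSV like the one sketched in Week 3 & 4; mapping the frame index to a colormap gives the color gradient.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("frame_coordinates.csv")        # frame-wise coordinates (hypothetical file)
t = np.arange(len(df))                           # frame index drives the color gradient

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(df["lx"], df["ly"], c=t, cmap="viridis", s=6)   # left hand
ax.scatter(df["rx"], df["ry"], c=t, cmap="plasma", s=6)    # right hand
ax.invert_yaxis()                                # image coordinates: y grows downwards
ax.axis("off")
fig.savefig("trajectory.png", bbox_inches="tight", dpi=150)
plt.close(fig)
```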

Also, this week, I worked on scaling this from one video to multiple videos from the dataset. I chose to convert 300 videos (100 per class × 3 classes) and have created CSV and JSON files for this set.

I am currently facing issues with cases where the hands are hidden, which makes the keypoints drop to the value (0, 0). In the upcoming week, I will solve this issue and create a set of 300 images corresponding to the 300 videos, which can then easily be fed into the NN architecture for prediction.
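One possible way to handle these hidden-hand frames, sketched below purely as an assumption rather than the approach finally adopted, is to treat (0, 0) as missing and interpolate short gaps with pandas.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("frame_coordinates.csv")          # hypothetical frame-wise table

# Treat (0, 0) (hand not detected) as missing rather than a real position
coord_cols = ["lx", "ly", "rx", "ry"]
df[coord_cols] = df[coord_cols].replace(0.0, np.nan)

# Fill short gaps by linear interpolation between the surrounding frames
df[coord_cols] = df[coord_cols].interpolate(limit=15, limit_direction="both")
```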

Week 7

This week my main aim was to improve the documentation of the entire code base. Documentation is a key part of any project if it is to be understood by the community and taken forward by someone else.

With input from my mentors, I documented the entire code line by line, explaining the thought process behind the code and its purpose.

I also updated this new code under the GitHub repo section.

Week 8

Scaling to 300 data points —

The next task is to scale the current code to a subset of the entire data, which can be used to train a basic NN architecture. I wrote code to run the entire process in a loop-based structure, giving it a pipeline-style organization.

This is an important step: if we want to scale to a larger dataset in the future, this gives us an easier and much more systematic way to run the computation. The scaling was not a straightforward task, as a fair amount of code had to be written to handle the naming scheme of the videos, which is used to match each video to its trajectory and verify the extraction process. The major roadblock was incorporating the R-based code into the Python script in a loop-based manner, which took some time to resolve.
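A rough sketch of how that R step can be driven from the Python loop via subprocess; the script name, arguments, and folder layout here are hypothetical.

```python
import glob
import os
import subprocess

os.makedirs("coordinates", exist_ok=True)

# Loop over the OpenPose output folders, one per video (hypothetical layout/names)
for video_dir in sorted(glob.glob("openpose_out/video_*")):
    video_id = os.path.basename(video_dir)
    csv_out = f"coordinates/{video_id}.csv"
    # Call the R extraction script for this video via the Rscript CLI
    subprocess.run(["Rscript", "extract_coordinates.R", video_dir, csv_out],
                   check=True)
```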

Week 9

I created a basic neural network architecture along two themes:

  1. A Vision Transformer-based architecture using the vformer library, with data loaders and training loops written for it. For this, I used a batch size of 4; anything higher required a lot of computing power, which was not necessary at this point.
  2. A CNN-based architecture, the natural first choice for an image classification task; this is a much more reliable and faster approach.

To my surprise, the Vision Transformer architecture did not converge well; a possible reason is the nature of the input data: the input is now a single image that carries the features of the entire time series in the form of a color gradient. If using a ViT is necessary, a possible alternative is to feed in a set of images representing the hand points in a time-series structure; this is left for future work.

For now, the CNN-based architecture is working, and we will go ahead with this.
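For illustration, here is a minimal sketch of what a small CNN classifier over the trajectory images could look like in PyTorch, assuming 3 classes, 64×64 RGB inputs, and the batch size of 4 mentioned above; this is a hypothetical stand-in, not the exact architecture used.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TrajectoryCNN(nn.Module):
    """Small CNN over trajectory images (assumed 3-channel, 64x64)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Dummy tensors standing in for the 300 trajectory images and their labels
images, labels = torch.randn(300, 3, 64, 64), torch.randint(0, 3, (300,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

model = TrajectoryCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:                  # one pass over the (dummy) data
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```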

Week 10

Scaling to the entire dataset (3000 videos) —

Now the next aim is scaling the code to the entire dataset. Although this is not a compute-intensive task, it cannot be parallelized, so it is a time-consuming one that requires proper structuring of the code and results.

The cycle of scaling was as follows —

Video dataset ➡ OpenPose extraction ➡ R-based code to extract coordinates from OpenPose output ➡ Getting relevant coordinates from the entire CSV ➡ Plotting the coordinates to image

The extraction now also needs to be done for the audio features:

Video dataset ➡ WAV-format audio file ➡ extract second-wise intensity and pitch
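A sketch of the audio side of this cycle, assuming the WAV file has already been produced by the FFmpeg step shown in Week 2 and that the Praat tracks are simply averaged into one value per second; the helper and file names are hypothetical.

```python
import numpy as np
import parselmouth

snd = parselmouth.Sound("clip.wav")              # WAV produced by the FFmpeg step
pitch = snd.to_pitch()
intensity = snd.to_intensity()

def per_second_mean(times, values, duration):
    """Average a Praat track into one value per second of audio."""
    out = []
    for s in range(int(np.ceil(duration))):
        mask = (times >= s) & (times < s + 1)
        out.append(float(np.nanmean(values[mask])) if mask.any() else np.nan)
    return out

pitch_per_sec = per_second_mean(pitch.xs(), pitch.selected_array["frequency"],
                                snd.duration)
intensity_per_sec = per_second_mean(intensity.xs(), intensity.values[0],
                                    snd.duration)
```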


Shrey Pandit

Research Intern @ Microsoft Research | GSoC '22 | CS @ BITS Pilani