
Abstract

Given an untrimmed video and a query sentence, cross-modal video moment retrieval aims to localize the video moment that best matches the query sentence. Pioneering work typically learns the representations of the textual and visual content separately and then models the interactions or alignments between the two modalities. However, the performance of existing methods suffers from insufficient representation learning, since they neglect the relations among objects in both the video and the query sentence. Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects in the visual and textual content. Specifically, we first introduce a visual relational graph and a textual relational graph to form relation-aware representations via message propagation. Thereafter, a multi-task pre-training scheme is designed to capture domain-specific knowledge about objects and relations, enhancing the structured visual representation. Finally, graph matching and boundary regression are employed to perform the cross-modal retrieval. We conduct extensive experiments on two datasets covering daily activities and cooking activities, demonstrating significant improvements over state-of-the-art solutions.
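To make the relation-aware representation concrete, below is a minimal sketch of message propagation over an object relational graph, assuming a simple GCN-style update; the layer name, feature sizes, and normalization are illustrative and not the exact architecture of MMRG.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """One round of message propagation over an object relational graph (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim) object features; adj: (N, N) relation adjacency matrix.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        messages = adj @ node_feats / deg                       # average messages from related objects
        return torch.relu(self.linear(node_feats + messages))   # relation-aware update

# Example: propagate messages among 5 objects with random relations.
layer = RelationalGraphLayer(dim=16)
feats = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(layer(feats, adj).shape)  # torch.Size([5, 16])
```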

 

Framework

[Figure: Overview of the proposed MMRG framework]

Our proposed MMRG framework consists of three modules: a dual-channel relational graph, graph pre-training, and cross-modal retrieval. Specifically, the dual-channel relational graph module constructs a textual relational graph and a visual relational graph, where the textual relational graph is utilized to filter out irrelevant objects in the visual relational graph. Thereafter, the pre-training module customizes two pre-training tasks, i.e., attribute masking and context prediction, to enhance visual relation reasoning at the node level and the graph level; a sketch of these objectives is given below. Finally, graph matching and boundary regression are utilized to perform the cross-modal retrieval.
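For reference, the following is a hedged sketch of the two pre-training objectives named above, following the common attribute-masking and context-prediction recipes for graph pre-training; the function names, the encoder/classifier interfaces, and the exact loss forms are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def attribute_masking_loss(node_feats, attr_labels, encoder, classifier, mask_ratio=0.15):
    """Node-level task: mask some object attributes and predict them back."""
    num_nodes = node_feats.size(0)
    mask = torch.rand(num_nodes) < mask_ratio
    if not mask.any():                        # always mask at least one node
        mask[torch.randint(num_nodes, (1,))] = True
    masked = node_feats.clone()
    masked[mask] = 0.0                        # zero out the masked node attributes
    hidden = encoder(masked)                  # relation-aware node embeddings (user-supplied encoder)
    logits = classifier(hidden[mask])         # recover the masked attribute labels
    return F.cross_entropy(logits, attr_labels[mask])

def context_prediction_loss(graph_emb, pos_context_emb, neg_context_emb):
    """Graph-level task: a (sub)graph should match its true context better
    than a randomly sampled negative context."""
    pos = (graph_emb * pos_context_emb).sum(-1)
    neg = (graph_emb * neg_context_emb).sum(-1)
    logits = torch.stack([pos, neg])
    targets = torch.stack([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, targets)
```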

 

Contributions

  • To the best of our knowledge, this is the first work that attempts to perform cross-modal video moment retrieval by investigating the interactions among visual and textual objects.

  • We propose a graph-based solution, MMRG, to improve the performance of cross-modal video moment retrieval, which is well suited for modeling the spatio-temporal interactions among objects.

  • Extensive experiments are conducted on two well-known datasets, which demonstrate the effectiveness of our method. Meanwhile, we have released the dataset and implementation to support the research community.

 

Datasets

We experimented with two publicly accessible datasets, Charades-STA and TACoS: the former covers daily activities at home, and the latter covers cooking activities in a lab kitchen. We downloaded the original datasets and constructed moment candidates with unit sizes of 64, 128, 256, and 512 frames and an 80% overlap (see the sketch below). In summary, we ultimately obtained 12,541 and 7,463 video moment-query sentence pairs for Charades-STA and TACoS, respectively.
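As an illustration, the candidate construction described above can be reproduced with a simple sliding-window routine like the one below; the function name and output format are hypothetical, and an 80% overlap corresponds to a stride of 20% of each unit size.

```python
def generate_moment_candidates(num_frames, unit_sizes=(64, 128, 256, 512), overlap=0.8):
    """Slide windows of each unit size over the video with the given overlap."""
    candidates = []
    for size in unit_sizes:
        stride = max(1, int(size * (1 - overlap)))   # 80% overlap -> stride is 20% of the window
        start = 0
        while start + size <= num_frames:
            candidates.append((start, start + size)) # [start_frame, end_frame)
            start += stride
    return candidates

# Example: candidates for a 300-frame video.
print(generate_moment_candidates(300, unit_sizes=(64, 128, 256)))
```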

These two datasets are available for download here: Link

 

Codes

The code for our model and the baselines is available; click the link (coming soon!) to download it. Then download the data from the Datasets section above, place it in the data folder inside the code directory, and the code can be run directly.

Note:

The code is being reorganized and moved to GitHub; please contact yawezeng11@gmail.com for implementation details.

In addition, two of our other works on video moment retrieval have been released:

yawenzeng/STRONG: ACM MULTIMEDIA CONFERENCE 2020 (github.com)
yawenzeng/AVMR: ACM MULTIMEDIA CONFERENCE 2020 (github.com)

Example

A visualization of a retrieved video moment and its query sentence, where a limited set of objects (i.e., person, book, and bag) is mapped and their relations are manifested across consecutive image frames.

 
