Zero-Shot Multi-Object Tracking and Identification
Abstract
Our primary goal is to develop a system for zero-shot object tracking in videos that tracks user-selected objects. With the advent of object tracking frameworks such as MeMOTR, OC-SORT, and Hybrid-SORT, the accuracy and reliability of multi-object tracking in dynamic environments have improved substantially, enabling tracking in complex scenarios involving shape-morphing objects, occlusions, and non-linear motion patterns. Our work aims to build on these advances to push the boundaries of video analysis and offer a tracking solution that combines video with text.
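As a minimal sketch of the video-text idea (not our actual pipeline), the function below selects the detections whose visual embeddings are closest to a text-query embedding. The embeddings `det_embs` and `text_emb` stand in for the output of a hypothetical CLIP-style encoder, and the threshold is an illustrative assumption:

```python
import numpy as np

def select_by_text(det_embs, text_emb, threshold=0.5):
    """Return indices of detections whose visual embeddings have
    cosine similarity >= threshold with the text-query embedding.

    det_embs: (N, d) array of per-detection embeddings (hypothetical
              CLIP-style encoder output); text_emb: (d,) query embedding.
    """
    # Unit-normalize so the dot product equals cosine similarity.
    det_embs = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = det_embs @ text_emb  # (N,) cosine similarities
    return np.flatnonzero(sims >= threshold)
```

In a tracker, the selected indices would seed which detections are initialized as tracks for the user's query.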
Upon completion, we expect to unveil an advanced system capable of consistent novel-object identification and tracking even in the presence of occlusions and rapid movements. We aim for our approach to be compatible with various object tracking frameworks, showcasing its versatility and potential for widespread application in video analysis. To achieve these objectives, we will utilize the Berzelius computing resources for intensive experiments with multi-object tracking models. This will involve integrating customized memory-attention layers into Transformer architectures, together with advanced optimization strategies and machine learning model training. Our evaluation will introduce a new video dataset to validate the effectiveness of our approach across diverse tracking scenarios.
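To make the memory-attention idea concrete, the sketch below shows one simple variant (an illustrative assumption, not our final layer design): a bank of memory tokens is prepended to the keys and values of scaled dot-product attention, so each query can attend to stored track context as well as to the current frame:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(queries, keys, values, memory):
    """Scaled dot-product attention with a memory bank.

    queries: (Q, d); keys/values: (T, d) for the current frame;
    memory: (M, d) stored context tokens (e.g. past track states).
    The memory tokens are prepended to keys and values, so attention
    weights span both stored context and the current frame.
    """
    d = queries.shape[-1]
    k = np.concatenate([memory, keys], axis=0)    # (M+T, d)
    v = np.concatenate([memory, values], axis=0)  # (M+T, d)
    scores = queries @ k.T / np.sqrt(d)           # (Q, M+T)
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ v                            # (Q, d)
```

In a Transformer-based tracker, such a layer would replace standard cross-attention, with the memory bank updated as tracks evolve.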
The technological framework for this project will be based on Python and machine learning libraries (PyTorch), incorporating code from leading research entities such as Amazon and Google Research for foundational insights. The project's findings will be documented in a detailed research article, contributing to the field of video analysis and multi-object tracking.
Results from Local Machines: Our initial experiments on local machines using a subset of our dataset have demonstrated promising results. We have achieved a HOTA tracking score of over 75% on our test split, with a significant reduction in identification errors compared to traditional methods. With the additional computational power from Berzelius, we expect to improve our video-text tracking solution by training on the entirety of our dataset within a reasonable time frame. We hope that our preliminary results underscore the potential impact of our project, and we are confident that access to medium-scale computing resources will significantly enhance our capabilities.
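For reference, at a single localization threshold HOTA is the geometric mean of detection accuracy (DetA) and association accuracy (AssA); the full metric averages this over a range of IoU thresholds. A minimal sketch (the 0.80 and 0.72 values are illustrative, not our measured scores):

```python
import math

def hota_single_threshold(det_a, ass_a):
    """HOTA at one localization threshold: the geometric mean of
    detection accuracy (DetA) and association accuracy (AssA).
    The full HOTA metric averages this over a range of IoU thresholds."""
    return math.sqrt(det_a * ass_a)

# Illustrative values only: DetA = 0.80, AssA = 0.72
print(hota_single_threshold(0.80, 0.72))  # ~0.759
```

The geometric mean means a tracker cannot score well by excelling at detection alone while associating identities poorly, which is why we report HOTA rather than detection-dominated metrics.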