1. Introduction
Soccer (football) is among the world's most popular sports, played by millions of people. This popularity has led many computer vision researchers to work on soccer video analysis, which provides information for team and player performance analysis, referee decision support, video summarization, highlight extraction, and intelligent broadcasting. In such a competitive field, soccer clubs around the world are adopting video analysis methods as training tools for team development: playbacks provide unparalleled coverage of key events, enabling teams to understand their own strengths and weaknesses and facilitating strategy development.
Team/player performance measurement systems have the potential to reveal aspects of the game that are not obvious to the human eye. Such systems can measure the distance covered by players, their speed of movement, the number of sprints, and their positioning relative to other players, and then use these data for individual player performance evaluation, fatigue detection, assessment of the team's tactical performance, and analysis of opponents [2].
Multi-player tracking, i.e., accurately tracking multiple players in soccer video in real time, is the key issue in performance evaluation; it requires detecting all players in the video, finding their positions at regular intervals, and linking the spatiotemporal data to extract their moving trajectories.
Multi-player tracking in a soccer match is a nontrivial task due to various challenges. Unlike vehicles or pedestrians, which have relatively predictable motion patterns, soccer players try to confuse each other with unexpected changes in velocity or direction and usually run in groups. Moreover, soccer players look almost identical because the players of each team wear the same jerseys, and they are frequently involved in possession challenges and tackles, where they can be occluded by others, resulting in ambiguities for tracking. Since the trajectory of a tracking object can be propagated toward another tracking object or another occluding element, the complete occlusion of tracking objects can result in identity switches or identity hijacking. Here, the identity refers to a label with a unique integer value assigned to the trajectory of a tracking object to distinguish it from the others.
The accuracy of tracking depends strongly on the accuracy of object detection, and factors such as video quality, long-range camera defocus, noise, and weather or environmental changes can also make it difficult to detect players accurately [2, 4, 21].
With recent advances in object detection, most state-of-the-art multi-object tracking algorithms adopt a "tracking-by-detection" paradigm. Given the single-frame detection results of a video, different approaches have been proposed to improve data association, motion propagation, and track life-cycle management, but most of these works assume that each detection output is accurately localized. Therefore, data association is usually conducted based on location, optionally combined with abstracted object attributes [18, 19]. This bias is a drawback for the camera modality, which has higher localization uncertainty. Although the latest methods incorporate deep learning-based algorithms to improve the association with high-fidelity features, such as low-level features from feature point clouds or intermediate camera features, these approaches also depend heavily on localization accuracy [17, 20].
The recently published YOLOv8n object detection model has significantly improved the detection of objects with different sizes and orientations, but as shown in Figure 1, problems remain, such as uncertain detections in occlusion regions where multiple players are close to each other, and missed detections due to noise and long-range defocus. Track-query prediction alone can compensate for short-term missed detections, but not for relatively long-term ones.
Figure 1. Detection errors of YOLOv8n player detection result.
In this paper, as a solution, we propose an approach that reasonably combines the tracking history, query prediction, and detection information to improve single-view player tracking, and then integrates multi-view analysis information to improve multi-player tracking performance in soccer videos.
The tracking history over several consecutive frames is relatively reliable compared with the object detection result of a single frame, and its reliability can be considered proportional to the number of frames in the tracking history. A single-frame detection result, in contrast, can be ambiguous and biased. Therefore, reasonably combining the spatiotemporal detection histories of the trajectories, the query predictions based on them, and the current detection result can improve the reliability of the detections, gradually decrease their bias, and thus improve tracking performance. In addition, by excluding false detections in regions where several players occlude one another, the uncertainty in player detection can be reduced to some extent.
Multi-view geometric constraints can exclude false detections in a single view and improve multi-frame association, whereas multi-frame association in each view can compensate for the effect of noise and outliers that hamper multi-view association. Therefore, we attempt to jointly leverage multi-frame and multi-view information. We demonstrate the advantages of our approach by presenting experimental results on several test datasets.
To summarize, our contributions are as follows.
1). In 2D single-view multi-player tracking,
a. We propose an approach that combines the deep learning-based player detection/tracking results with their trajectory-based predictions to decrease the bias and errors in player detection and improve multi-player tracking performance.
b. We propose an assignment cost that integrates the L2-distance and the IoU between the tracking query region and the detected object region, together with an identity assignment method that combines the Hungarian and greedy methods, to improve the robustness of data association.
2). We integrate multi-view analysis information to dramatically reduce player occlusions and significantly improve the overall tracking performance.
3). Experimental results on soccer video datasets show that our approach achieves significant improvements over previous approaches in terms of MOT performance metrics such as MOTA, IDS and MOTP.
3. Player Candidate Detection and Verification
In this paper, we transfer the human detection model of YOLOv8n to the player detection setting in soccer videos and use it to detect all player candidates in the video frames from each camera view.
Object detection algorithms typically generate multiple bounding boxes with different confidence scores around the same object. A post-processing step, the Non-Maximum Suppression (NMS) algorithm, then filters out the redundant and irrelevant bounding boxes, keeping only the most accurate one with the highest confidence score [5].
The standard NMS algorithm works as follows. First, bounding boxes whose confidence score is below the confidence threshold are discarded, and the remaining ones are stored in the predicted bounding box list. The predicted bounding boxes are then sorted by confidence score in descending order. Until the list is empty, the following steps are repeated: the bounding box with the highest confidence score is selected, moved to the final bounding box list, and removed from the predicted list; then all bounding boxes in the predicted list whose IoU with the selected box exceeds the IoU threshold are removed. Here, the IoU threshold characterizes a cluster of bounding boxes covering the same object, the sorting by confidence score selects the centre of each cluster, and the confidence threshold keeps only the most reliable cluster centres. The standard NMS algorithm uses fixed confidence and IoU thresholds, which reduces the flexibility of object detection [5].
To address this problem, we modify the standard NMS algorithm as follows. In addition to the standard thresholds, we use a lower-bound threshold on the confidence scores to keep all potential player entities with low confidence scores, and we use them to correct some of the player detection errors described in the following sections. Verification of such detection errors is carried out by selecting, among the candidate bounding boxes in the local error region, the one that is most similar to the prediction obtained from the trajectory.
The algorithm is as follows.
[Algorithm 1: Modified NMS Algorithm]
Require: B, set of predicted bounding boxes;
S, confidence scores;
τ_IoU, IoU threshold;
τ_conf, confidence threshold;
τ_low, lower-bound threshold of confidences (τ_low ≤ τ_conf).
Ensure: D, set of filtered bounding boxes;
C, set of filtered candidate bounding boxes.
1: Initialize: D ← ∅, C ← ∅.
2: Filter the boxes in B with τ_conf: B′ ← {b ∈ B | S(b) ≥ τ_conf}.
3: Sort all b ∈ B′ by their confidence scores in descending order.
4: while B′ ≠ ∅ do
5: Select the box b* ∈ B′ with the highest confidence score.
6: Add b* to D: D ← D ∪ {b*}.
7: Remove b* from B′: B′ ← B′ \ {b*}.
8: for all remaining boxes b in B′ do
9: Calculate the IoU between b* and b: IoU(b*, b).
10: if IoU(b*, b) > τ_IoU then
11: Remove b from B′: B′ ← B′ \ {b}.
12: end if
13: end for
14: end while
15: Filter the boxes in B with τ_low: C ← {b ∈ B | S(b) ≥ τ_low}.
16: Remove D from C: C ← C \ D.
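For concreteness, the following is a minimal Python sketch of Algorithm 1. The box format (x1, y1, x2, y2), the helper names, and our reading of steps 15-16 (the candidate set contains every box above the lower bound that is not kept as a final detection) are illustrative assumptions rather than a definitive implementation.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def modified_nms(boxes, scores, iou_thr, conf_thr, low_thr):
    """Modified NMS that returns (final_boxes, candidate_boxes).

    The candidate list keeps every detection whose confidence is at least the
    lower bound but that did not survive as a final detection; it is used later
    to verify missed and false detections along the trajectories.
    """
    assert low_thr <= conf_thr
    # Steps 2-3: keep boxes above the confidence threshold, sorted descending.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    final = []
    # Steps 4-14: standard greedy suppression on the high-confidence boxes.
    while order:
        best = order.pop(0)                       # highest remaining confidence
        final.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_thr]
    # Steps 15-16: candidates = boxes above the lower bound minus the final boxes.
    kept = set(final)
    candidates = [i for i, s in enumerate(scores) if s >= low_thr and i not in kept]
    return [boxes[i] for i in final], [boxes[i] for i in candidates]
```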
4. 2D Single-view Multi-player Tracking
2D single-view MOT, which is the basis for 3D multi-view multi-person tracking (3D MM-Tracking), is still challenging due to the large number of objects that need to be tracked, the occlusions between objects, and the changing appearance of objects over time.
In general, appearance consistency and geometric consistency are two important assumptions used in MOT. Appearance consistency means that the previous appearance of an object should be similar to its current appearance, and geometric consistency means that its previous location and shape, combined with its estimated motion, should approximate its current location and shape. While appearance-based MOT methods [9-11] have achieved promising performance, recent appearance-free MOT solutions [12, 13] show that using only geometric features can also provide robust tracking results on several difficult MOT datasets [14].
To achieve fast online processing, in this paper we mainly investigate appearance-free approaches based on geometric consistency, and we also show that adding some appearance features can improve the tracking performance.
4.1. Analysis of Multi-player Tracking Errors
Multi-object tracking by detection consists of iterations of object detection and data association. Therefore, multi-object tracking errors can be classified into object detection errors and data association errors.
From the viewpoint of whether an object actually appears in the image and whether it is detected, the outputs of a detector can generally be classified into true positives (TPs), false positives (FPs), false negatives (FNs) and true negatives (TNs).
When the detector reports a detection, it is a true positive if an object is in fact present and a false positive if no object is present; when the detector reports no detection for an object, it is a true negative if the object is in fact absent and a false negative if it is present.
Data association errors come from wrong associations between the detected objects and the tracking ones. Owing to the characteristics of a soccer match, occlusions and separations of players occur frequently in multi-player tracking and their patterns are very complex. Multi-player tracking is a free-competition assignment problem because there is no constraint on the assignment between the tracking objects and the detected ones. An occlusion is a situation where the correspondence between the tracking objects and the newly detected ones is n-to-1, and a separation is a situation where the correspondence between them is 1-to-n. Owing to the frequent occlusions and separations of players during the match, multi-player tracking entails frequent new appearances and disappearances of the tracking objects over time in videos.
Similarly to object detection, we can classify the new appearances and disappearances of tracking players in videos into the following categories.
First, we classify the newly appeared tracking players into two classes: TPs and FPs. TPs are newly appeared tracking players that are detected by the player detector and in fact appear in the image; they comprise players that enter the camera view from outside and players that separate from occluded player groups. FPs are newly appeared tracking players that are reported by the player detector but do not in fact appear in the image; they are player detection errors, and comprise detections where no player actually appears in the image and cases where the detector produces two or more bounding boxes around the same player.
Next, we classify the disappeared tracking players into two classes: TNs and FNs. TNs are disappeared tracking players that are not detected by the detector and in fact no longer appear in the image; they comprise players that leave the camera view and players that are occluded by other tracking players. (Occlusion by background objects can also occur, but we exclude it here.) FNs are disappeared tracking players that are not detected by the detector although they in fact still appear in the image; they are also player detection errors.
We can classify all the player detection errors that occur during multi-player tracking as FNs or FPs, and find them by distinguishing them from TNs and TPs, respectively.
FNs: errors where the detector fails to detect players that have been tracked so far and in fact appear in the current frame image (see Figure 1(a)). They are also called missed detections and must be distinguished from TNs.
FPs: they can be divided into two classes and must be distinguished from TPs.
a) False detection errors: detections reported by the detector although no player actually appears at that location in the image.
b) Overlapped detection errors: cases where the detector produces two or more bounding boxes around the same player (see Figure 1(b)), or false detections in regions where several players are adjacent (see Figure 1(c)), caused for example by the NMS processing of the detector. In general, it is very difficult to identify overlapped detection errors correctly.
In multi-player tracking, such detection errors are the main causes of trajectory fragmentation and growth in the number of identities, and they appear to some extent even in deep learning-based player detection results with high detection performance.
4.2. Framework of Our Multi-player Tracking Module
Figure 4 shows the framework of our 2D single-view multi-player tracking module.
Figure 4. The framework of our 2D single-view multi-player tracking module.
Our framework can be outlined as follows.
First, using the YOLOv8n-based player detection model, we detect all players appearing in the current frame image and then apply the Hungarian method [8] to perform a 1-to-1 assignment between the current detection results and the tracking results up to the previous frame. Next, we use formula (10) to find and eliminate only the obvious overlapped detection errors mentioned above. Then, for the detected objects to which no identity has been assigned and the tracking objects that are not assigned to any detected object, we apply the greedy method to perform the occlusion/separation processing of the tracking objects. Finally, based on the tracking history up to the previous frame, the prediction derived from it, and the player detection result in the current frame, we find all player detection errors in the current frame. For each of them, we select, among the candidate bounding boxes in the candidate list that also lie in its target region, the one most similar to the prediction from the tracking history, and correct the error accordingly.
4.3. 1-to-1 Assignment Between Tracking and Detected Objects
For notational convenience, hereafter we denote the sets of tracking objects and detected objects at time t as T_t = {T_i} and D_t = {D_j}, respectively.
Applying the Hungarian method, we assign each detected object D_j in the current frame to a tracking object T_i maintained up to the previous frame t − 1. The assignment cost combines the L2-distance between the bounding-box centres,
c_d(T_i, D_j) = ‖p(T_i) − p(D_j)‖_2, (1)
and the IoU between the prediction region of the tracking object and the detection region of the detected object,
c_o(T_i, D_j) = 1 − A(R̂(T_i) ∩ R(D_j)) / (A(R̂(T_i)) + A(R(D_j)) − A(R̂(T_i) ∩ R(D_j))), (4)
as
c(T_i, D_j) = c_d(T_i, D_j) · c_o(T_i, D_j), (5)
where p(·) denotes the centre of the bounding box of an object, ‖·‖_2 is the L2-norm between the two centres, R̂(T_i) is the prediction region of the tracking object T_i, A(R(D_j)) denotes the area of the detection region of the object D_j, and A(R̂(T_i) ∩ R(D_j)) represents the area of the occlusion (overlap) region of the two regions.
We assign the detected object D_j to the tracking object T_i, and give D_j the identity of T_i, only if the minimum assignment cost is below a threshold:
c(T_i, D_j) = min_k c(T_i, D_k) < τ_c, (7)
where τ_c is the assignment cost threshold, whose value is determined experimentally.
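As an illustration, the following sketch performs the gated 1-to-1 assignment of Eqs. (5) and (7) using SciPy's Hungarian solver. The box format, helper names and the way the cost is combined in `assignment_cost` follow the formulation given above and are illustrative assumptions, not the exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assignment_cost(pred_box, det_box):
    """Cost of Eq. (5): centre L2-distance weighted by (1 - IoU) of the regions."""
    def centre(b):
        return np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    return float(np.linalg.norm(centre(pred_box) - centre(det_box))
                 * (1.0 - iou(pred_box, det_box)))


def assign_detections(pred_boxes, det_boxes, cost_thr):
    """1-to-1 assignment with the Hungarian method, gated by the threshold of Eq. (7)."""
    if not pred_boxes or not det_boxes:
        return [], list(range(len(pred_boxes))), list(range(len(det_boxes)))
    cost = np.array([[assignment_cost(p, d) for d in det_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < cost_thr]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(pred_boxes)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_boxes)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```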
To improve the detection accuracy of a tracking object, its detection information in the image coordinate system of each camera view is expressed as a 12-D feature vector comprising the centre position (x, y), width w, height h, direction θ and confidence score s of the bounding box, together with the rates of change of each of these quantities.
The rate of change of each quantity q is obtained as
Δq_t = (q_t − q_{t−n}) / n, n = min(N_p, L), (8)
where N_p is the number of prediction frames used to represent the temporal variability of the quantities of interest, its value is estimated experimentally, and L is the total number of frames of the trajectory of the tracking object.
The final detection information of the tracking object is then adjusted as
q̂_t = ((n − 1)(q̂_{t−1} + Δq_{t−1}) + q_t) / n. (9)
According to Eq. (9), the object detection bias of the tracking object gradually decreases as the trajectory grows longer.
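The following is a minimal sketch of how the per-track state could be updated under this scheme. It follows Eqs. (8)-(9) as given above; the vector layout and the placeholder value of the prediction-frame count are assumptions rather than the exact implementation.

```python
import numpy as np


def adjust_detection(history, det, n_pred=5):
    """History-weighted adjustment of a track's 6-D detection vector (sketch of Eqs. (8)-(9)).

    `history` holds the previously adjusted vectors (x, y, w, h, direction, score)
    of one track, `det` is the raw detection of the current frame, and `n_pred`
    is the number of prediction frames (the value 5 is only a placeholder).
    """
    det = np.asarray(det, dtype=float)
    hist = [np.asarray(h, dtype=float) for h in history]
    if not hist:
        return det
    n = min(len(hist), n_pred)
    # Eq. (8): rate of change of each quantity over the last n frames.
    rate = (hist[-1] - hist[-n]) / max(n - 1, 1)
    predicted = hist[-1] + rate                     # trajectory-based prediction
    # Eq. (9): the raw detection gets weight 1/n, so a single noisy or biased
    # detection matters less as the trajectory grows.
    return ((n - 1) * predicted + det) / n
```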
4.4. Exploring and Elimination of the Overlapped Detection Errors
Since it is difficult to identify overlapped detection errors correctly, we find and remove only the obvious ones.
For each newly appeared detected object within the camera view, we assign an occlusion label to the pixels it shares with the neighbouring detected objects to which a tracking identity has already been assigned.
Let w and h be the width and height of the bounding box of the detected object of interest, respectively, and let N_occ be the number of pixels inside that box that carry the occlusion label. If
N_occ / (w · h) ≥ τ_o, (10)
we judge the detection to be an overlapped detection error and remove it from the detected object list, where τ_o is a threshold determined experimentally; we set its value to 0.95 through statistical analysis experiments.
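For illustration, a rasterized check of Eq. (10) could look as follows; the box format and the pixel-mask realisation are assumptions made for the sketch.

```python
import numpy as np


def is_overlapped_error(new_box, assigned_boxes, tau_occ=0.95):
    """Check whether a newly appeared detection is an overlapped detection error.

    A pixel mask of the new box is marked wherever it is covered by a neighbouring
    detection that already carries a tracking identity; if the marked fraction
    reaches tau_occ (Eq. (10)), the detection is discarded. Boxes are
    (x1, y1, x2, y2) in integer pixel coordinates.
    """
    x1, y1, x2, y2 = map(int, new_box)
    w, h = max(x2 - x1, 1), max(y2 - y1, 1)
    occ = np.zeros((h, w), dtype=bool)
    for bx1, by1, bx2, by2 in assigned_boxes:
        ox1, oy1 = max(x1, int(bx1)), max(y1, int(by1))
        ox2, oy2 = min(x2, int(bx2)), min(y2, int(by2))
        if ox2 > ox1 and oy2 > oy1:
            occ[oy1 - y1:oy2 - y1, ox1 - x1:ox2 - x1] = True
    return occ.sum() / float(w * h) >= tau_occ
```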
4.5. Occlusion / Separation of the Tracking Objects
At this stage, we first find the detected objects to which no tracking identity has been assigned and the tracking objects to which no detected object has been assigned, and then apply the greedy method to them to perform the occlusion/separation processing of the tracking objects.
1) Occlusion of the tracking objects
First, we find a tracking object with no detected object assigned. If one exists, we check whether there are any detected objects that have already been assigned a tracking identity within its appropriate neighbourhood. If so, we find among them the detected object with the minimum assignment cost and add the identity of the tracking object of interest to that detection's identity list; we then delete the tracking object of interest. We repeat these steps until no such tracking object remains.
2) Separation of the tracking objects
First, we find a detected object with no tracking identity assigned. If one exists, we check whether there are any detected objects that have already been assigned a tracking identity within its appropriate neighbourhood. If so, we find among them the detected object with the minimum assignment cost and, from its identity list, select the identity of the tracking object whose appearance characteristics are most similar to the detected object of interest, and set it as the tracking identity of that detected object; the detected object of interest is then removed from the set of unassigned detections. We repeat these steps until no such detected object remains.
Although not described explicitly above, the team information of a tracking object with a single identity plays an important role in analysing the appearance characteristics of a separated object. "Its appropriate neighbourhood" means the neighbourhood determined by formula (7).
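As an illustration of the occlusion step 1) above, the following sketch lets the minimum-cost neighbouring detection absorb the identity of an unmatched track; the data structures and helper names are illustrative assumptions, not our exact implementation.

```python
def handle_occlusions(unmatched_tracks, assigned_dets, cost_fn, cost_thr):
    """Greedy occlusion handling (sketch with illustrative data structures).

    For every tracking object left without a detection, the already-assigned
    detection with the minimum assignment cost inside its neighbourhood
    (cost below `cost_thr`, cf. formula (7)) absorbs the track's identity.
    `unmatched_tracks` and `assigned_dets` are lists of dicts; `cost_fn`
    computes the assignment cost between a track and a detection.
    """
    for track in list(unmatched_tracks):
        best, best_cost = None, cost_thr
        for det in assigned_dets:                 # detections that already carry identities
            c = cost_fn(track, det)
            if c < best_cost:
                best, best_cost = det, c
        if best is not None:
            # The detection now covers several players: record the extra identity.
            best["identities"].append(track["identity"])
            unmatched_tracks.remove(track)
```

The separation step works symmetrically on detections with no identity assigned, using the appearance (including team) information to pick which identity from the neighbour's identity list to hand over.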
4.6. Exploring and Correction of the Player Detection Errors Using the Tracking Results
1) Exploring and Verification of FNs
a. Among the tracking objects, we find all those with no detected object assigned, i.e., those that have disappeared in the current frame.
b. Among them, we find and exclude all TNs.
First, we find the tracking objects that disappeared because they left the camera view, by examining whether their vanishing positions are near the boundaries of the frame image, and exclude them if so.
Next, we find the tracking objects that disappeared because they are occluded by other tracking objects, by looking for detected objects that are simultaneously assigned to two or more different tracking objects, and exclude them if so.
c. For each of the remaining tracking objects, we obtain the minimal bounding rectangle containing both its detection region (i.e., the object detection region in the last frame of its trajectory) and its prediction region obtained from the trajectory information and an extended Kalman filter [2], and then extend it slightly to obtain an exploring region for the verification of FNs.
d. In the candidate bounding box list produced by the modified NMS, we find the candidate bounding boxes that lie within the exploring region. If any exist, we choose among them the candidate bounding box that is most similar to the result predicted from the trajectory information, use it as the new detected object of the target tracking object in the current frame, and continue tracking it. Otherwise, we terminate its trajectory.
2) Exploring and Elimination of the False Detection Errors
a. First, we find all the detected objects with no identity assigned, i.e., the newly appeared detections in the current frame.
b. Among them, we find the detected objects that are near the boundaries of the image by examining their positions, and exclude them.
c. Among the remaining ones, we find the detections that newly appeared owing to separation from an occlusion with other tracking objects, i.e., those handled in the separation processing above, and exclude them.
d. For each region of the remaining detections, we examine whether an object in fact exists there by searching the candidate bounding box list as above; if not, we remove the detection from the detected object list.
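As an illustration of the candidate-list search used in step d of part 1), the following sketch selects, inside the exploring region, the candidate box closest to the trajectory prediction. The data structures, helper names and the expansion factor are illustrative assumptions.

```python
def recover_missed_detection(track, candidate_boxes, predict_fn, expand=1.2):
    """Verify a disappeared track against the low-confidence candidate boxes.

    The exploring region is the minimal rectangle containing the track's last
    detection and its trajectory/Kalman prediction, slightly expanded; the
    candidate box whose centre is closest to the prediction is reused as the
    track's detection in the current frame. Boxes are (x1, y1, x2, y2).
    """
    last = track["last_box"]
    pred = predict_fn(track)                      # predicted box from the trajectory
    # Minimal bounding rectangle of last detection and prediction, expanded.
    x1, y1 = min(last[0], pred[0]), min(last[1], pred[1])
    x2, y2 = max(last[2], pred[2]), max(last[3], pred[3])
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * expand, (y2 - y1) * expand
    region = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

    def centre(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

    def inside(b):
        bx, by = centre(b)
        return region[0] <= bx <= region[2] and region[1] <= by <= region[3]

    pcx, pcy = centre(pred)
    in_region = [b for b in candidate_boxes if inside(b)]
    if not in_region:
        return None                               # no evidence: terminate the trajectory
    return min(in_region,
               key=lambda b: (centre(b)[0] - pcx) ** 2 + (centre(b)[1] - pcy) ** 2)
```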
5. Multi-View Analysis
The player positions obtained from single-view multi-player detection/tracking are expressed in the coordinates of each camera view, but what we would like to obtain from soccer video analysis are overall match analysis results expressed in the coordinates of a realistic soccer field model. Therefore, we integrate the information from all cameras and represent the results in a common model coordinate system of the soccer field.
We detect all player candidates in the input video of each camera view, run the player tracking module, synchronize the results across views according to their shooting times, and map them onto a common model of the soccer field.
Given the projections of all players onto the field model, we establish correspondences between the detected players across cameras to integrate them, and analyse the results to reduce player occlusions as much as possible and obtain the overall match analysis results.
5.1. Player Registration Onto the Field Model
A general mathematical expression describing the relationship between a 3D point of the captured scene and its 2D image point is the camera matrix. If we assume the scene to be planar, i.e., all 3D points lie on a plane, the camera matrix can be reduced. This reduction is known as a homography, or planar projective transformation [2].
Using the image coordinates of all players and the homography matrices of all cameras, we project all players onto the field model, as shown in Figure 5. The colour of a circle indicates the player's team, while the number inside the circle indicates the camera that was tracking the player.
Since the dimensions of the field model are proportional to the realistic soccer field, we can apply a scaling operation to convert the coordinates of all projections from pixels to meters.
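A short sketch of this registration step is given below, assuming OpenCV and that the bottom-centre (foot point) of each bounding box is the point projected; the homography is assumed to map image pixels to field-model pixels. Both the foot-point choice and the data layout are illustrative assumptions.

```python
import numpy as np
import cv2


def project_to_field(boxes, homography):
    """Project player bounding boxes from one camera view onto the field model.

    The bottom-centre of each box (the player's foot point) is mapped with the
    camera's 3x3 homography matrix; boxes are (x1, y1, x2, y2).
    Returns an (N, 2) array of field-model coordinates in pixels.
    """
    feet = np.array([[((x1 + x2) / 2.0, y2)] for x1, _, x2, y2 in boxes],
                    dtype=np.float32)                      # shape (N, 1, 2)
    return cv2.perspectiveTransform(feet, homography).reshape(-1, 2)
```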
5.2. Player Fusion on the Field Model
As seen in Figure 5, registering the players from multiple cameras produces multiple objects on the field model for each real player. Therefore, we need to identify the pairs of objects that belong to the same player in the real world and integrate them.
We use the nearest neighbour method to identify such pairs. For each player, we compute the L2-distance to every other player of the same team and declare the pair with the smallest distance to be the same object if that distance does not exceed a certain threshold (Figure 6).
Figure 5. Field Model Registration of All Players.
Figure 6. Player Fusion Result on the field model.
Players that do not meet the above criterion remain on the field model without a pair. In this case, the player is assumed to be tracked by a single camera and thus has no closest counterpart. Such a scenario can also occur when occlusion handling fails.
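A minimal sketch of this nearest-neighbour fusion is given below; the dictionary layout, the pairwise averaging of merged positions, and the default threshold value are illustrative assumptions.

```python
import numpy as np


def fuse_players(projections, dist_thr=30.0):
    """Fuse same-team projections from different cameras on the field model.

    `projections` is a list of dicts with keys 'pos' (2-D field coordinates),
    'team' and 'camera'. Two projections from different cameras and the same
    team are declared the same player when their L2-distance is the smallest
    one and below `dist_thr` (here 30 field-model pixels, i.e. about 5 m).
    """
    fused, used = [], set()
    for i, p in enumerate(projections):
        if i in used:
            continue
        best, best_d = None, dist_thr
        for j, q in enumerate(projections):
            if j <= i or j in used:
                continue
            if q["team"] != p["team"] or q["camera"] == p["camera"]:
                continue
            d = float(np.linalg.norm(np.asarray(p["pos"]) - np.asarray(q["pos"])))
            if d < best_d:
                best, best_d = j, d
        if best is None:
            fused.append(list(p["pos"]))            # tracked by a single camera
        else:
            used.add(best)                           # merge the pair by averaging
            fused.append(((np.asarray(p["pos"]) +
                           np.asarray(projections[best]["pos"])) / 2.0).tolist())
    return fused
```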
5.3. Occlusion Handling Using Multi-view Information
Uncertainty during the occlusion and separation of the players is an important issue that leads to uncertainty in the tracking identity management. The use of player tracking information in different camera views can easily solve this problem.
If the players occluded in one camera view appear as separate players in another camera view, we treat them as separate tracking objects on the field model. In this way, many occluded objects can be split into individual players and tracked separately and accurately. This approach is relatively simple but very effective.
6. Experiment Results
6.1. Evaluation Data Sets
We compare our approach with other tracking methods using the publicly available Institute of Intelligent Systems for Automation (ISSIA) dataset [2]. This dataset consists of 3000 frames captured at 25 frames/s by six cameras placed around a stadium in a multi-view configuration. To evaluate our method on a larger scale, we also experiment on a video dataset captured over 45 minutes by eight cameras at a full-length soccer stadium, as shown in Figure 2. The evaluation metrics for this dataset are approximated.
6.2. Evaluation Metrics
In our experiments, to evaluate the tracking results from various perspectives, we evaluate multi-player tracking performance strictly following the official CLEAR metrics and Identity metrics [20, 21]. In detail, IDS (number of ID switches) indicates the number of identity jumps, IDF1 (ID F1 score) accounts for identity-match performance, and MOTA (Multi-Object Tracking Accuracy) combines false positives, missed targets and identity switches. MT (number of mostly tracked trajectories) counts the trajectories whose target is tracked for more than 80% of its length, whereas ML (number of mostly lost trajectories) counts the trajectories whose target is tracked for less than 20%. Among them, MOTA is the dominant metric used to measure overall tracking performance. MOTP (Multi-Object Tracking Precision) summarizes overall tracking precision in terms of the bounding-box overlap between ground truth and predicted locations, using the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes.
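For reference, these metrics can be computed with the py-motmetrics package; the sketch below assumes per-frame ground truth and tracker output with boxes in (x, y, w, h) format and an IoU matching gate of 0.5. Note that the package reports MOTP as an average matching distance (1 − IoU) rather than the overlap percentage used in our tables.

```python
import motmetrics as mm


def evaluate(frames):
    """Compute CLEAR/Identity metrics with py-motmetrics.

    `frames` yields (gt_ids, gt_boxes, hyp_ids, hyp_boxes) per frame, with
    boxes in (x, y, w, h) format as expected by iou_matrix.
    """
    acc = mm.MOTAccumulator(auto_id=True)
    for gt_ids, gt_boxes, hyp_ids, hyp_boxes in frames:
        # 1 - IoU distances; pairs with IoU below 0.5 cannot be matched.
        dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
        acc.update(gt_ids, hyp_ids, dists)
    mh = mm.metrics.create()
    return mh.compute(acc,
                      metrics=["idf1", "mota", "motp", "num_switches",
                               "mostly_tracked", "mostly_lost"],
                      name="sequence")
```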
6.3. Experiment Setup
For player detection, we choose the YOLOv8n object detection model [5] and follow its default settings to train it and infer bounding boxes on our experimental datasets. We run it on an NVIDIA GeForce RTX 2070 GPU.
For NMS, we fix the parameters of the modified NMS algorithm, namely the IoU threshold τ_IoU, the confidence threshold τ_conf and the lower-bound confidence threshold τ_low. For the one-to-one assignment between the tracking objects and the detected ones, we set the assignment cost threshold τ_c of formula (7) to an experimentally determined value.
To identify the pairs of objects on the field model that belong to the same player in the real world, we set the L2-distance threshold of the nearest neighbour method to 5 m in terms of realistic distance. As our soccer field model is 660×450 pixels, we set it to 30 pixels.
6.4. Experiment Results
We explore the following aspects of our framework.
1) Efficacy of our assignment cost calculation method and identity assignment method.
In Table 1, we compare the multi-player tracking performance when the L2-distance (Eq. (1)), the IoU (Eq. (4)) and our method (Eq. (5)) are used as the assignment cost between the tracking objects and the detected ones. In Table 2, we show the multi-player tracking performance when the Hungarian method, the greedy method and our combined approach are used for the assignment between the tracking objects and the detected ones. In these experiments, we use the YOLOv8n object detection model and the simple online real-time tracking (SORT) method.
Table 1. MOT performances using different assignment cost calculation methods (Hungarian method).

approach | IDF1↑ | MOTA↑ | MT↑ | ML↓ | IDs↓
Eq. (1) | 86.4% | 80.2% | 79.6% | 12.1% | 188
Eq. (4) [17] | 87.1% | 80.9% | 78.5% | 12.8% | 186
Eq. (5) (ours) | 89.3% | 81.2% | 82.6% | 11.3% | 178
Table 2. MOT performances using different assignment methods.

approach | IDF1↑ | MOTA↑ | MT↑ | ML↓ | IDs↓
Hungarian [8] | 89.3% | 81.2% | 82.6% | 11.3% | 178
Greedy | 88.6% | 80.7% | 78.3% | 13.7% | 191
Combined (ours) | 92.4% | 83.6% | 84.5% | 9.5% | 166
In Table 1, our approach achieves the best result, with an improvement on every metric: MOTA, IDF1 and IDS are improved by more than 0.3%, 2.2% and 8, respectively, over the previous methods, and all other metrics also improve slightly. In Table 2, our combined approach likewise achieves the best result, with significant improvements on all metrics: MOTA, IDF1 and IDS are improved by more than 2.4%, 3.1% and 12, respectively, over the other approaches, and the remaining metrics are also improved. These results indicate the strong association ability of our algorithm.
2) Efficiency of exploring and correcting the player detection errors using the tracking results.
In this part, we analyse the efficiency of our method for exploring and correcting player detection errors using the tracking results, in terms of both multi-player tracking and player detection performance.
In Table 3, we show the multi-player tracking and detection performance when we apply our approach to several object detection methods. In Table 3, both YOLOv5 and YOLOv8n [5] are deep learning-based object detection methods. In the experiment, we apply our approach to the detection/tracking results of these methods and analyse its effectiveness.
Table 3. Results of applying our approach.

approach | IDF1↑ | MOTA↑ | MOTP↑ | IDs↓
YOLOv5 | 89.5% | 81.7% | 75.8% | 187
YOLOv8n [5] | 92.4% | 83.6% | 73.9% | 166
YOLOv5 + ours | 94.7% | 93.2% | 87.3% | 169
YOLOv8n + ours | 97.2% | 94.4% | 86.6% | 151
As shown in Table 3, applying this approach increases MOTA and IDF1 by 11.15% and 5.0% on average and reduces IDS by 16.5 on average compared with the corresponding baseline methods. These improvements are considerably larger than those obtained by the approaches above. In Table 3, MOTP also increases by 12.1% on average; this improvement mainly comes from applying Eq. (9). Such significant improvements confirm the necessity and effectiveness of our approach for multi-player tracking.
3) Efficiency of multi-view analysis.
In Table 4, we show the multi-player tracking performance with and without occlusion handling using multi-view analysis information. From Table 4, it can be seen that the multi-player tracking performance on the field model is significantly improved by occlusion handling using multi-view analysis information, and that the tracking performance with eight cameras is better than that with six cameras.
Table 4. Efficiency of multi-view analysis.

Multi-view analysis | IDF1↑ | MOTA↑ | MT↑ | ML↓ | IDs↓
without | 97.2% | 94.4% | 96.2% | 4.6% | 151
with (6 cameras) | 98.7% | 98.9% | 98.1% | 2.2% | 35
with (8 cameras) | 99.3% | 99.4% | 98.3% | 1.7% | 24