
Zero-Shot Player Tracking in Tennis with Kalman Filtering | by Derek Austin | January 2025

Automatic tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography

Zero-shot tracking for every tennis point. Video provided under the MIT license, with animation created by the author.

With the recent proliferation of sports tracking projects, many inspired by Skalski's famous football tracking project, there has been a significant shift toward automated player tracking for sports enthusiasts. Many of these methods follow a standard workflow: collect labeled data, train a YOLO model, project player detections onto an overhead view of the field or court, and use the resulting tracking data to generate advanced analytics for competitive insights. In this project, however, we provide tools to bypass the need for labeled data, relying on GroundingDINO's zero-shot detection capabilities in conjunction with a Kalman filter to overcome its noisy outputs.

Our data comes from a collection of broadcast videos made publicly available under the MIT license thanks to Hayden Faulkner and team.¹ The data includes footage from various tennis matches at the 2012 Olympics held at Wimbledon, and we focus on the match between Serena Williams and Victoria Azarenka.

A point between Serena Williams and Victoria Azarenka. Video made public under the MIT license.

GroundingDINO, for those unfamiliar, is an open-set object detector that accepts natural-language prompts: provide an input such as “tennis player” and the model returns candidate detection boxes that fit the description. RoboFlow has a great tutorial here for those interested in using it, but I've attached the basic code below as well. As seen below, you can even ask the model to identify highly unusual objects that are unlikely to be labeled in any object detection dataset, such as a dog's tongue!

from groundingdino.util.inference import load_model, load_image, predict, annotate

# load the model weights and config (the paths below are placeholders for your local files)
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")

BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

# processes the image to GroundingDINO standards
image_source, image = load_image("dog.jpg")

TEXT_PROMPT = "dog tongue, dog"

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)

GroundingDINO output when prompted for “dog” and “dog tongue.” Photo owned by the author.

However, classifying players on a professional tennis court is not as simple as prompting for “tennis player.” The model often latches onto other people on the court, such as line judges, ball people, and the chair umpire, causing jumpy and inconsistent detections. Additionally, the model sometimes fails to recognize the players at all in certain frames, resulting in gaps where boxes don't appear reliably from frame to frame.

Tracking latches onto a line judge in the first example and a ball person in the second. Image made by the author.

To address these challenges, we use several targeted approaches. First, we narrow the candidate boxes down to the top three by confidence score. Line judges can score higher than the players, which is why we do not simply keep the top two boxes. However, this raises a new question: how can we automatically distinguish the players from the line judges in each frame?
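(Before answering that, here is a rough sketch of the pruning step itself; the helper name is ours, and it assumes the boxes and logits returned by predict above have been converted to NumPy arrays.)

import numpy as np

def keep_top_k(boxes: np.ndarray, logits: np.ndarray, k: int = 3):
    """Keep only the k highest-confidence detections for the prompt."""
    order = np.argsort(logits)[::-1][:k]  # indices sorted by descending confidence
    return boxes[order], logits[order]

# e.g. top_boxes, top_scores = keep_top_k(boxes.numpy(), logits.numpy())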

We noticed that detections of line judges and ball people are usually short-lived, lasting only a few frames. Based on this, we reason that by associating boxes across consecutive frames, we can filter out the people who appear only briefly and thereby isolate the players.

So how do we achieve this kind of association between objects across frames? Fortunately, the field of multi-object tracking has studied this problem extensively. Kalman filters are a mainstay of multi-object tracking, often combined with other identification metrics, such as color. For our purposes, a basic Kalman filter implementation is sufficient. In simple terms (for a deeper dive, check out this article), the Kalman filter is a method for probabilistically estimating an object's location based on previous measurements. It excels with noisy data and also works well for associating objects over time in video, even when detections are inconsistent, such as when a player is missed in some frames. We implement the full Kalman filter here, and we will walk through some of the important steps in the following sections.

The state for a 2-dimensional Kalman filter is simple, as shown below. All we have to do is track the object's x and y position and its velocity in both directions (we ignore acceleration).

from dataclasses import dataclass

@dataclass
class KalmanStateVector2D:
    x: float
    y: float
    vx: float
    vy: float

The Kalman filter works in two steps: it first predicts the object's position in the next frame, and then updates this prediction based on a new measurement, in our case from the object detector. However, in our setting a new frame may contain many new detections, or may drop detections that were present in the previous frame, which raises the question of how to associate previously seen boxes with those seen now.
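For reference, these two steps correspond to the standard Kalman filter equations below (F is the state transition matrix, H the measurement matrix, Q and R the process and measurement noise covariances, and $z_k$ the new measurement); the code that follows implements exactly these updates.

Predict: $\hat{x}_k = F x_{k-1}$, $\hat{P}_k = F P_{k-1} F^\top + Q$

Update: $K = \hat{P}_k H^\top (H \hat{P}_k H^\top + R)^{-1}$, $x_k = \hat{x}_k + K (z_k - H \hat{x}_k)$, $P_k = (I - K H) \hat{P}_k$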

We choose to do this by using the Mahalanobis distance, which pairs naturally with a chi-squared test, to assess the probability that a current detection matches a previous one. Additionally, we keep a queue of past objects so that we have a 'memory' longer than a single frame. Specifically, our memory stores the trajectory of every object seen over the past 30 frames. Then, for each object we find in the new frame, we iterate over our memory and pick the previous object that is most likely to match the current one, as scored by the Mahalanobis distance. However, it is also possible that we are seeing an entirely new object, in which case we need to add it to our memory: if a detection has less than a 30% probability of being associated with any box in our memory, we add it as a new object. (A sketch of this matching step follows the filter code below.)

We provide our full Kalman filter below for those who prefer the code.

from dataclasses import dataclass

import numpy as np
from scipy import stats

class KalmanStateVectorNDAdaptiveQ:
states: np.ndarray # for 2 dimensions these are [x, y, vx, vy]
cov: np.ndarray # 4x4 covariance matrix

def __init__(self, states: np.ndarray) -> None:
self.state_matrix = states
self.q = np.eye(self.state_matrix.shape[0])
self.cov = None
# assumes a single step transition
self.f = np.eye(self.state_matrix.shape[0])

# divide by 2 as we have a velocity for each state
index = self.state_matrix.shape[0] // 2
self.f[:index, index:] = np.eye(index)

def initialize_covariance(self, noise_std: float) -> None:
self.cov = np.eye(self.state_matrix.shape[0]) * noise_std**2

def predict_next_state(self, dt: float) -> None:
self.state_matrix = self.f @ self.state_matrix
self.predict_next_covariance(dt)

def predict_next_covariance(self, dt: float) -> None:
self.cov = self.f @ self.cov @ self.f.T + self.q

def __add__(self, other: np.ndarray) -> np.ndarray:
return self.state_matrix + other

def update_q(
self, innovation: np.ndarray, kalman_gain: np.ndarray, alpha: float = 0.98
) -> None:
innovation = innovation.reshape(-1, 1)
self.q = (
alpha * self.q
+ (1 - alpha) * kalman_gain @ innovation @ innovation.T @ kalman_gain.T
)

class KalmanNDTrackerAdaptiveQ:

def __init__(
self,
state: KalmanStateVectorNDAdaptiveQ,
        R: float,  # measurement noise standard deviation
        Q: float,  # standard deviation used to initialize the state covariance
h: np.ndarray = None,
) -> None:
self.state = state
self.state.initialize_covariance(Q)
self.predicted_state = None
self.previous_states = []
self.h = np.eye(self.state.state_matrix.shape[0]) if h is None else h
self.R = np.eye(self.h.shape[0]) * R**2
self.previous_measurements = []
self.previous_measurements.append(
(self.h @ self.state.state_matrix).reshape(-1, 1)
)

def predict(self, dt: float) -> None:
self.previous_states.append(self.state)
self.state.predict_next_state(dt)

def update_covariance(self, gain: np.ndarray) -> None:
self.state.cov -= gain @ self.h @ self.state.cov

def update(
self, measurement: np.ndarray, dt: float = 1, predict: bool = True
) -> None:
"""Measurement will be a x, y position"""
self.previous_measurements.append(measurement)
assert dt == 1, "Only single step transitions are supported due to F matrix"
if predict:
self.predict(dt=dt)
innovation = measurement - self.h @ self.state.state_matrix
gain_invertible = self.h @ self.state.cov @ self.h.T + self.R
gain_inverse = np.linalg.inv(gain_invertible)
gain = self.state.cov @ self.h.T @ gain_inverse

new_state = self.state.state_matrix + gain @ innovation

self.update_covariance(gain)
self.state.update_q(innovation, gain)
self.state.state_matrix = new_state

def compute_mahalanobis_distance(self, measurement: np.ndarray) -> float:
innovation = measurement - self.h @ self.state.state_matrix
return np.sqrt(
innovation.T
@ np.linalg.inv(
self.h @ self.state.cov @ self.h.T + self.R
)
@ innovation
)

    def compute_p_value(self, distance: float) -> float:
        # the squared Mahalanobis distance follows a chi-squared distribution
        return 1 - stats.chi2.cdf(distance**2, df=self.h.shape[0])

def compute_p_value_from_measurement(self, measurement: np.ndarray) -> float:
"""Returns the probability that the measurement is consistent with the predicted state"""
distance = self.compute_mahalanobis_distance(measurement)
return self.compute_p_value(distance)
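
To connect this back to the matching logic described earlier, here is a minimal sketch of the association loop using the classes above. The helper name, the noise values, and the 2x4 measurement matrix h are illustrative assumptions rather than part of the original implementation.

import numpy as np

def associate_detections(trackers, detections, min_p_value=0.3):
    """Match each new detection to its most likely tracker, or start a new track.

    trackers: list of KalmanNDTrackerAdaptiveQ instances (our rolling memory)
    detections: list of 2-d measurements, e.g. box centers from GroundingDINO
    """
    for measurement in detections:
        measurement = np.asarray(measurement, dtype=float)
        p_values = [t.compute_p_value_from_measurement(measurement) for t in trackers]
        if p_values and max(p_values) >= min_p_value:
            # the detection is consistent with an existing track, so update it
            best = trackers[int(np.argmax(p_values))]
            best.update(measurement)
        else:
            # nothing in memory explains this detection, so start a new track
            state = KalmanStateVectorNDAdaptiveQ(
                np.array([measurement[0], measurement[1], 0.0, 0.0])
            )
            trackers.append(
                KalmanNDTrackerAdaptiveQ(
                    state,
                    R=5.0,   # measurement noise std, a guess in pixels
                    Q=10.0,  # initial state uncertainty std, also a guess
                    h=np.array([[1.0, 0.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0, 0.0]]),
                )
            )
    return trackers

In practice we would also drop trackers that have not matched anything within the last 30 frames so the memory stays bounded.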

After tracking everything detected over the past 30 frames, we can now build heuristics to identify which boxes most likely represent the players. We tested two methods: choosing the boxes closest to the center of the baseline, and choosing the boxes with the longest history in our memory. In practice, the first strategy often flags a line judge as a player whenever an actual player leaves the baseline, making it less reliable. Meanwhile, we noticed that GroundingDINO tends to "blink" between different line judges and ball people, while the real players maintain a stable presence. As a result, our final rule is to choose the boxes with the longest track record in memory as the real players. As you can see in the first video, it works amazingly well for such a simple rule!
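As a rough sketch of that final rule (the helper name is ours; previous_measurements is the history each tracker accumulates on every update):

def pick_players(trackers, n_players=2):
    """Treat the trackers with the longest measurement history as the players."""
    ranked = sorted(trackers, key=lambda t: len(t.previous_measurements), reverse=True)
    return ranked[:n_players]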

With our tracking system established in image space, we can move on to more traditional analysis by following the players from a bird's-eye view. This view enables the evaluation of key metrics, such as total distance traveled, player speed, and court positioning trends. For example, we can analyze whether a player tends to target their opponent's backhand based on positions during a point. To achieve this, we need to map the players' coordinates from the image onto a standard court template viewed from above, putting the spatial analysis into a common frame of reference.

This is where homography comes into play. A homography describes a projective mapping between two planes, which in our case means mapping points in our original image to an overhead view of the court. By identifying a few key points in the original image, such as line intersections on the court, we can compute a homography matrix that translates any point into the bird's-eye view. To create this homography matrix, we first need to identify these 'key points.' Various openly licensed models on platforms such as RoboFlow can help locate them, or we can label them ourselves on a reference image to use for the transformation.

As you can see, the predicted key points are not perfect, but the small errors do not meaningfully affect the final transformation matrix.

After labeling these key points, the next step is to match them with the corresponding points in the reference court image to generate a homography matrix. Using OpenCV, we can create this transformation matrix with a few simple lines of code!

import numpy as np
import cv2

# order of the points matters
source = np.array(keypoints) # (n, 2) matrix
target = np.array(court_coords) # (n, 2) matrix
m, _ = cv2.findHomography(source, target)

With the homography matrix in hand, we can map any point from our image onto the reference court. In this project, our focus is each player's position on the court. To determine it, we take the center point at the bottom of each player's bounding box and use that as their position on the court in the bird's-eye view.

We use the center point at the bottom of the box to map where each player is on the court. The figure shows a key point rendered onto a bird's-eye view of the tennis court using our homography matrix.
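To make that mapping concrete, here is a minimal sketch, assuming m is the homography computed above and box is a pixel-space bounding box in (x1, y1, x2, y2) format (the helper name is ours):

import numpy as np
import cv2

def project_player_to_court(box, m):
    """Map the bottom-center of a bounding box into reference-court coordinates."""
    x1, y1, x2, y2 = box
    foot_point = np.array([[[(x1 + x2) / 2.0, y2]]], dtype=np.float32)  # shape (1, 1, 2)
    court_point = cv2.perspectiveTransform(foot_point, m)
    return court_point[0, 0]  # (x, y) on the reference court

From a sequence of these court positions, distance traveled and average speed follow directly by summing frame-to-frame displacements.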

In summary, this project shows how we can use GroundingDINO's zero-shot capabilities to track tennis players without relying on labeled data, turning raw object detections into actionable player tracking. By addressing key challenges, such as separating the players from other people on the court, keeping tracks consistent from frame to frame, and mapping player movements onto a bird's-eye view of the court, we have laid the foundation for a robust tracking pipeline without the need for explicit labels.

This approach not only unlocks metrics such as distance traveled, speed, and positioning, but also opens the door to deeper analysis, such as shot placement and strategic court positioning. With further refinements, including training a YOLO or RT-DETR model on GroundingDINO's outputs, we could develop a real-time tracking system that rivals existing commercial solutions, providing a powerful tool for both coaching and fan engagement in the world of tennis.
