Detect or Track: Towards Cost-Effective Video Object Detection/Tracking

State-of-the-art object detectors and trackers are developing fast. Trackers are in general more efficient than detectors but bear the risk of drifting. A question is hence raised – how to improve the accuracy of video object detection/tracking by utilizing the existing detectors and trackers within a given time budget? A baseline is frame skipping – detecting every N-th frame and tracking for the frames in between. This baseline, however, is suboptimal since the detection frequency should depend on the tracking quality. To this end, we propose a scheduler network, which determines to detect or track at a certain frame, as a generalization of Siamese trackers. Although being light-weight and simple in structure, the scheduler network is more effective than the frame skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset in video object detection/tracking.

Hao Luo1∗, Wenxuan Xie2, Xinggang Wang1, Wenjun Zeng2
1 School of Electronic Information and Communications, Huazhong University of Science and Technology
2 Microsoft Research Asia
{luohao, xgwang}, {wenxie, wezeng}
arXiv:1811.05340v1 [cs.CV] 13 Nov 2018
∗ This work was done when Hao Luo was an intern at Microsoft Research Asia.
Copyright © 2019, Association for the Advancement of Artificial Intelligence. All rights reserved.

Introduction

Convolutional neural network (CNN)-based methods have achieved significant progress in computer vision tasks such as object detection (Ren et al. 2015; Liu et al. 2016; Dai et al. 2016; Tang et al. 2018b) and tracking (Held, Thrun, and Savarese 2016; Bertinetto et al. 2016; Nam and Han 2016; Bhat et al. 2018). Following the tracking-by-detection paradigm, most state-of-the-art trackers can be viewed as a local detector of a specified object. Consequently, trackers are generally more efficient than detectors and can obtain precise bounding boxes in subsequent frames if the specified bounding box is accurate. However, as evaluated commonly on benchmark datasets such as OTB (Wu, Lim, and Yang 2015) and VOT (Kristan et al. 2017), trackers are encouraged to track as long as possible. It is non-trivial for trackers to be stopped once they are not confident, although heuristics, such as a threshold of the maximum response value, can be applied. Therefore, trackers bear the risk of drifting.

Besides object detection and tracking, there has recently been a series of studies on video object detection (Kang et al. 2016; Kang et al. 2017; Feichtenhofer, Pinz, and Zisserman 2017; Zhu et al. 2017b; Zhu et al. 2017a; Zhu et al. 2018; Chen et al. 2018). Beyond the baseline to detect each frame individually, state-of-the-art approaches consider the temporal consistency of the detection results via tubelet proposals (Kang et al. 2016; Kang et al. 2017), optical flow (Zhu et al. 2017b; Zhu et al. 2017a; Zhu et al. 2018) and regression-based trackers (Feichtenhofer, Pinz, and Zisserman 2017). These approaches, however, are optimized for the detection accuracy of each individual frame. They either do not associate the presence of an object in different frames as a tracklet, or associate after performing object detection on each frame, which is time-consuming.

This paper is motivated by the constraints from practical video analytics scenarios such as autonomous driving and video surveillance. We argue that algorithms applied to these scenarios should be:
• capable of associating an object appearing in different frames, such that the trajectory or velocity of the object can be further inferred.
• in realtime (e.g., over 30 fps) and as fast as possible, such that the deployment cost can be further reduced.
• with low latency, which means to produce results once a frame in a video stream has been processed.

Considering these constraints, we focus in this paper on the task of video object detection/tracking (Russakovsky et al. 2017). The task is to detect objects in each frame (similar to the goal of video object detection), with an additional goal of associating an object appearing in different frames.

In order to handle this task under the realtime and low latency constraint, we propose a detect or track (DorT) framework. In this framework, object detection/tracking of a video sequence is formulated as a sequential decision problem – a scheduler network makes a detection/tracking decision for every incoming frame, and then these frames are processed with the detector/tracker accordingly. The architecture is illustrated in Figure 1.

The scheduler network is the most distinctive part of our framework. It should be light-weight but be able to determine to detect or track. Rather than using heuristic rules (e.g., thresholds of tracking confidence values), we formulate the scheduler as a small CNN by assessing the tracking quality. It is shown to be a generalization of Siamese trackers and a special case of reinforcement learning (RL).

The contributions are summarized as follows:
• We propose the DorT framework, in which the object detection/tracking of a video sequence is formulated as a sequential decision problem, while being in realtime and with low latency.
• We propose a light-weight but effective scheduler network, which is shown to be a generalization of Siamese trackers and a special case of RL.
• The proposed DorT framework is more effective than the frame skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset (Russakovsky et al. 2015) in video object detection/tracking.

Figure 1: Detect or track (DorT) framework. The scheduler network compares the current frame t + τ with the keyframe t by evaluating the tracking quality, and determines to detect or track frame t + τ: either frame t + τ is detected by a single-frame detector, or bounding boxes are tracked to frame t + τ from the keyframe t. If detect is chosen, frame t + τ is assigned as the new keyframe, and the boxes in frame t + τ and frame t + τ − 1 are associated by the widely-used Hungarian algorithm (not shown in the figure for conciseness).

Related Work

To our knowledge, we are the first to formulate video object detection/tracking as a sequential decision problem and there is no existing similar work to directly compare with. However, it is related to existing work in multiple aspects.

Video Object Detection/Tracking

Video object detection/tracking is a task in ILSVRC 2017 (Russakovsky et al. 2017), where the winning entries are optimized for accuracy rather than speed. (Deng et al. 2017) adopts flow aggregation (Zhu et al. 2017a) to improve the detection accuracy. (Wei et al. 2017) combines flow-based (Ilg et al. 2017) and object tracking-based (Nam and Han 2016) tubelet generation (Kang et al. 2017). THU-CAS (Russakovsky et al. 2017) considers flow-based tracking (Kang et al. 2016), object tracking (Held, Thrun, and Savarese 2016) and data association (Yu et al. 2016).

Nevertheless, these methods combine multiple cues (e.g., flow aggregation in detection, and flow-based and object tracking-based tubelet generation) which are complementary but time-consuming. Moreover, they apply global post-processing such as seq-NMS (Han et al. 2016) and tubelet NMS (Tang et al. 2018a) which greatly improve the accuracy but are not suitable for a realtime and low latency scenario.

Video Object Detection

Approaches to video object detection have been developed rapidly since the introduction of the ImageNet VID dataset (Russakovsky et al. 2015). (Kang et al. 2016; Kang et al. 2017) propose a framework that consists of per-frame proposal generation, bounding box tracking and tubelet re-scoring. (Zhu et al. 2017b) proposes to detect frames sparsely and propagates features with optical flow. (Zhu et al. 2017a) proposes to aggregate features in nearby frames along the motion path to improve the feature quality. Furthermore, (Zhu et al. 2018) proposes a high-performance approach by considering feature aggregation, partial feature updating and adaptive keyframe scheduling based on optical flow. Besides, (Feichtenhofer, Pinz, and Zisserman 2017) proposes to learn detection and tracking using a single network with a multi-task objective. (Chen et al. 2018) proposes to propagate the sparsely detected results through a space-time lattice. All the methods above focus on the accuracy of each individual frame. They either do not associate the presence of an object in different frames as a tracklet, or associate after performing object detection on each frame, which is time-consuming.

Multiple Object Tracking

Multiple object tracking (MOT) focuses on data association: finding the set of trajectories that best explains the given detections (Leal-Taixé et al. 2014). Existing approaches to MOT fall into two categories: batch and online mode. Batch mode approaches pose data association as a global optimization problem, which can be a min-cost max-flow problem (Zhang, Li, and Nevatia 2008; Pirsiavash, Ramanan, and Fowlkes 2011), a continuous energy minimization problem (Milan, Roth, and Schindler 2014) or a graph cut problem (Tang et al. 2016; Tang et al. 2017). In contrast, online mode approaches are only allowed to solve the data association problem with the present and past frames. (Xiang, Alahi, and Savarese 2015) formulates data association as a Markov decision process. (Milan et al. 2017; Sadeghian, Alahi, and Savarese 2017) employ recurrent neural networks (RNNs) for feature representation and data association.

State-of-the-art MOT approaches aim to improve the data association performance given publicly-available detections since the introduction of the MOT challenge (Leal-Taixé et al. 2015). However, we focus on the sequential decision problem of detection or tracking. Although the widely-used Hungarian algorithm is adopted for simplicity and fairness in the experiments, we believe the incorporation of existing MOT approaches can further enhance the accuracy.

Keyframe Scheduler

Researchers have proposed approaches to adaptive keyframe scheduling beyond regular frame skipping in video analytics. (Zhu et al. 2018) proposes to estimate the quality of optical flow, which relies on the time-consuming flow network. (Chen et al. 2018) proposes an easiness measure to consider the size and motion of small objects, which is hand-crafted and, more importantly, follows a detect-then-schedule paradigm that cannot determine whether to detect or track. (Li, Shi, and Lin 2018; Xu et al. 2018) learn to predict the discrepancy between the segmentation map of the current frame and the keyframe, which are only applicable to segmentation tasks.

All the methods above, however, solve an auxiliary task (e.g., flow quality, or discrepancy of segmentation maps) but do not answer the question directly in a classification perspective – is the current frame a keyframe or not? In contrast, we pose video object detection/tracking as a sequential decision problem, and learn directly whether the current frame is a keyframe by assessing the tracking quality. Our formulation is further shown as a generalization of Siamese trackers and a special case of RL.

The DorT Framework

Video object detection/tracking is formulated as follows. Given a sequence of video frames $F = \{f_1, f_2, \ldots, f_N\}$, the aim is to obtain bounding boxes $B = \{b_1, b_2, \ldots, b_M\}$, where $b_i = \{rect_i, fid_i, score_i, id_i\}$, $rect_i$ denotes the 4-dim bounding box coordinates and $fid_i$, $score_i$ and $id_i$ are scalars denoting respectively the frame ID, the confidence score and the object ID.

Considering the realtime and low latency constraint, we formulate video object detection/tracking as a sequential decision problem, which consists of four modules: single-frame detector, multi-box tracker, scheduler network and data association. An algorithm summary follows the introduction of the four modules.

Single-Frame Detector

We adopt R-FCN (Dai et al. 2016) as the detector following deep feature flow (DFF) (Zhu et al. 2017b). Our framework, however, is compatible with all single-frame detectors.

Efficient Multi-Box Tracker via RoI Convolution

The SiamFC tracker (Bertinetto et al. 2016) is adopted in our framework. It learns a deep feature extractor during training such that an object is similar to its deformations but different from the background. During testing, the nearby patch with the highest confidence is selected as the tracking result. The tracker is reported to run at 86 fps in the original paper.

Despite its efficiency, there are usually 30 to 50 detected boxes in a frame output by R-FCN. It is a natural idea to track only the high-confidence ones and ignore the rest. Such an approach, however, results in a drastic decrease in mAP since R-FCN detection is not perfect and many true positives with low confidence scores are discarded. We therefore need to track all the detected boxes.

It is time-consuming to track 50 boxes without optimization (about 3 fps). In order to speed up the tracking process, we propose to share the feature extraction network of multiple boxes and propose an RoI convolution layer in place of the original cross-correlation layer in SiamFC. Figure 2 is an illustration. Through cropping and convolving on the feature maps, the proposed tracker is over 10x faster than the time-consuming baseline while obtaining comparable accuracy. Notably, there is no learnable parameter in the RoI convolution layer, and thus we can train the SiamFC tracker following the original settings in (Bertinetto et al. 2016).

Figure 2: RoI convolution. Given targets in keyframe t and search regions in frame t + τ, the corresponding RoIs are cropped from the feature maps and convolved to obtain the response maps. Solid boxes denote detected objects in keyframe t and dashed boxes denote the corresponding search region in frame t + τ. A star denotes the center of its corresponding bounding box. The center of a dashed box is copied from the tracking result in frame t + τ − 1.

Scheduler Network

The scheduler network is the core of DorT, as our task is formulated as a sequential decision problem. It takes as input the current frame $f_{t+\tau}$ and its keyframe $f_t$, and determines to detect or track, denoted as $Scheduler(f_t, f_{t+\tau})$. We will elaborate this module in the next section.

Data Association

Once the scheduler network determines to detect the current frame, there is a need to associate the previous tracked boxes and the current detected boxes. Hence, a data association algorithm is required. For simplicity and fairness in the paper, the widely-used Hungarian algorithm is adopted. Although it is possible to improve the accuracy by incorporating more advanced data association techniques (Xiang, Alahi, and Savarese 2015; Sadeghian, Alahi, and Savarese 2017), it is not the focus in the paper. The overall architecture of the DorT framework is shown in Figure 1. More details are summarized in Algorithm 1.

Algorithm 1 The Detect or Track (DorT) Framework
Input: A sequence of video frames $F = \{f_1, f_2, \ldots, f_N\}$.
Output: Bounding boxes $B = \{b_1, b_2, \ldots, b_M\}$ with ID, where $b_i = \{rect_i, fid_i, score_i, id_i\}$.
  B ← {}
  t ← 1                                  ▷ t is the index of keyframe
  Detect f_1 with the single-frame detector.
  Assign new ID to the detected boxes.
  Add the detected boxes in f_1 to B.
  for i ← 2 to N do
    d ← Scheduler(f_t, f_i)              ▷ decision of scheduler
    if d = detect then
      Detect f_i with single-frame detector.
      Match boxes in f_i and f_{i−1} using Hungarian.
      Assign new ID to unmatched boxes in f_i.
      Assign corresponding ID to matched boxes in f_i.
      t ← i                              ▷ update keyframe
    else                                 ▷ the decision is to track
      Track boxes from f_t to f_i.
      Assign corresponding ID to tracked boxes in f_i.
      Assign corresponding detection score to tracked boxes in f_i.
    end if
    Add the bounding boxes in f_i to B.
  end for

The Scheduler Network in DorT

The scheduler network in DorT aims to determine to detect or track given a new frame by estimating the quality of the tracked boxes. It should be efficient itself. Rather than training a network from scratch, we propose to reuse part of the tracking network. Firstly, the l-th layer convolutional feature maps of frame t and frame t + τ, denoted respectively as $x_l^t$ and $x_l^{t+\tau}$, are fed into a correlation layer which performs point-wise feature comparison

$$x_{corr}^{t,t+\tau}(i, j, p, q) = \left\langle x_l^t(i, j),\; x_l^{t+\tau}(i + p, j + q) \right\rangle \tag{1}$$

where $-d \le p \le d$ and $-d \le q \le d$ are offsets to compare features in a neighbourhood around the locations (i, j) in the feature map, defined by the maximum displacement d. Hence, the output of the correlation layer is a feature map of size $x_{corr} \in \mathbb{R}^{H_l \times W_l \times (2d+1)^2}$, where $H_l$ and $W_l$ denote respectively the height and width of the l-th layer feature map. The correlation feature map $x_{corr}$ is then passed through two convolutional layers and a fully-connected layer with a 2-way softmax. The final output of the network is a classification score indicating the probability to detect the current frame. Figure 3 is an illustration of the scheduler network.

Figure 3: Scheduler network. The output feature map of the correlation layer is followed by two convolutional layers and a fully-connected layer with a 2-way softmax. As discussed later, this structure is a generalization of the SiamFC tracker.

Training Data Preparation

Existing groundtruth in the ImageNet VID dataset (Russakovsky et al. 2015) does not contain an indicator of the tracking quality. In this paper, we simulate the tracking process between two sampled frames and label it as detect (0) or track (1) in a strict protocol.

As we have sampled frame t and frame t + τ from the same sequence, we track all the groundtruth bounding boxes using SiamFC from frame t to frame t + τ. If all the groundtruth boxes in frame t + τ are matched with the tracked boxes (e.g., IOU over 0.8), the frame is labeled as track; otherwise, it is labeled as detect. Any emerging or disappearing object indicates a detect. Several examples are shown in Figure 4.

We have also tried to learn a scheduler for each tracker, but found it difficult to handle high-confidence false detections and non-trivial to merge the decisions of all the trackers. In contrast, the proposed approach to learning a single scheduler is an elegant solution which directly learns the decision rather than an auxiliary target such as the fraction of pixels at which the semantic segmentation labels differ (Li, Shi, and Lin 2018), or the fraction of low-quality flow estimation (Zhu et al. 2018).

Relation to the SiamFC Tracker

The proposed scheduler network can be seen as a generalization of the original SiamFC (Bertinetto et al. 2016). In the correlation layer of SiamFC, the target feature (6 × 6 × 128) is convolved with the search region feature (22 × 22 × 128) and obtains the response map (17 × 17 × 1, which can be equivalently written as $1 \times 1 \times 17^2$). Similarly, we can view the correlation layer of the proposed scheduler network (see Eq. 1) as convolutions between multiple target features in the keyframe and their respective nearby search regions in the current frame. The size of a target equals the receptive field of the input feature map of our scheduler. Figure 5 shows several examples of targets. Actually, however, targets include all possible patches in a sliding window manner.

In this sense, the output feature map of the correlation layer $x_{corr} \in \mathbb{R}^{H_l \times W_l \times (2d+1)^2}$ can be regarded as a set of $H_l \times W_l$ SiamFC tracking tasks, where the response map of each is $1 \times 1 \times (2d+1)^2$. The correlation feature map is then fed into a small CNN consisting of two convolutional layers and a fully-connected layer.

In summary, the generalization of the proposed scheduler network over SiamFC is twofold:
• SiamFC correlates a target feature with its nearby search region, while our scheduler extends the number of tasks from one to many.
• SiamFC directly picks the highest value in the correlation feature map as the result, whereas the proposed scheduler fuses the multiple response maps with a CNN.

The validity of the proposed scheduler network is hence clear – it first convolves patches in frame t (examples shown in Figure 5) with their respective nearby regions in frame t + τ, and then fuses the response maps with a CNN, in order to measure the difference between the two frames, and more importantly, to assess the tracking quality. The scheduler is also resistant to small perturbations by inheriting SiamFC's robustness to object deformation.

Relation to Reinforcement Learning

The sequential decision problem can also be formulated in an RL framework, where the action, state, state transition function and reward need to be defined.

Action. The action space A contains two types of actions: {detect, track}. If the decision is detect, the object detector is applied to the current frame; otherwise, boxes tracked from the keyframe are taken as the results.

State. The state $s_{t,\tau}$ is defined as a tuple $(x_l^t, x_l^{t+\tau})$, where $x_l^t$ and $x_l^{t+\tau}$ denote the l-th layer convolutional feature map of frame t and frame t + τ, respectively. Frame t is the keyframe on which the object detector is applied, and frame t + τ is the current frame on which actions are to be determined.

State transition function. After the decision of action $a_{t,\tau}$ in state $s_{t,\tau}$, the next state is obtained upon the action:
• detect. The next state is $s_{t+\tau,1} = (x_l^{t+\tau}, x_l^{t+\tau+1})$. Frame t + τ is fed to the object detector and is set as the new keyframe.
• track. The next state is $s_{t,\tau+1} = (x_l^t, x_l^{t+\tau+1})$. Bounding boxes tracked from the keyframe are taken as the results in frame t + τ. The keyframe t remains unchanged.
As shown above, no matter whether the keyframe is t or t + τ, the task in the next state is to determine the action in frame t + τ + 1.

Reward. The reward function is defined as r(s, a) since it is determined by both the state s and the action a. As illustrated in Figure 4, a labeling mechanism is proposed to obtain the groundtruth label of the tracking quality between two frames (i.e., a certain state s). We denote the groundtruth label as GT(s), which is either detect or track. Hence, the reward function can be defined as follows:

$$r(s, a) = \begin{cases} 1, & GT(s) = a \\ 0, & GT(s) \neq a \end{cases} \tag{2}$$

which is based on the consistency between the groundtruth label and the action taken.

After defining all the above, the RL problem can be solved via a deep Q network (DQN) (Mnih et al. 2015) with a discount factor γ, penalizing the reward from future time steps. However, training stability is always an issue in RL algorithms (Anschel, Baram, and Shimkin 2017). In this paper, we set γ = 0 such that the agent only cares about the reward from the next time step. Therefore, the DQN becomes a regression network – pushing the predicted action to be the same as the GT action, and the scheduler network is a special case of RL. We empirically observe that the training procedure becomes easier and more stable by setting γ = 0.

Figure 4: Examples of labeled data for training the scheduler network. Red and green boxes denote groundtruth and tracked results, respectively. (a) Positive examples, where the IOU of each groundtruth box and its corresponding tracked box is over a threshold; (b) Negative examples, where at least one such IOU is below a threshold or there are emerging/disappearing objects.

Figure 5: Examples of targets on keyframes. The size of a target equals the receptive field of the input feature map of the scheduler. As shown, a target patch might be an object, a part of an object, or totally background. The "tracking" results of these targets will be fused later. It should be noted that targets include all possible patches in a sliding window manner, but not just the three boxes shown above.

Experiments

The DorT framework is evaluated on the ImageNet VID dataset (Russakovsky et al. 2015) in the task of video object detection/tracking. For completeness, we also report results in video object detection.

Experimental Setup

Dataset description. All experiments are conducted on the ImageNet VID dataset (Russakovsky et al. 2015). ImageNet VID is split into a training set of 3862 videos and a test set of 555 videos. There are per-frame bounding box annotations for each video. Furthermore, the presences of a certain target across different frames in a video are assigned with the same ID.

Evaluation metric. The evaluation metric for video object detection is the extensively used mean average precision (mAP), which is based on a sorted list of bounding boxes in descending order of their scores. A predicted bounding box is considered correct if its IOU with a groundtruth box is over a threshold (e.g., 0.5).
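The IOU test used here, and in the labeling protocol for the scheduler (IOU over 0.8 for every groundtruth box), can be illustrated with a minimal sketch. It is not the authors' evaluation code: the corner box format `(x1, y1, x2, y2)`, the function names, and the assumption that tracked boxes are index-aligned with groundtruth boxes are all hypothetical, and a full mAP evaluation would additionally enforce class agreement and one-to-one matching.

```python
from typing import Sequence, Tuple

# Assumed box format (not specified in the paper): (x1, y1, x2, y2) corners.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Box, gt_boxes: Sequence[Box], thresh: float = 0.5) -> bool:
    """Simplified correctness test: some groundtruth box overlaps pred by more than thresh."""
    return any(box_iou(pred, g) > thresh for g in gt_boxes)


def label_frame_pair(gt_boxes: Sequence[Box],
                     tracked_boxes: Sequence[Box],
                     thresh: float = 0.8) -> str:
    """Label a (keyframe, current frame) pair as 'track' or 'detect'.

    Assumes tracked_boxes[i] is the tracked result for the object whose
    groundtruth box in the current frame is gt_boxes[i]. A length mismatch
    stands in for an emerging or disappearing object.
    """
    if len(gt_boxes) != len(tracked_boxes):
        return "detect"
    if all(box_iou(g, t) > thresh for g, t in zip(gt_boxes, tracked_boxes)):
        return "track"
    return "detect"
```

A single poorly tracked box is enough to label the pair as detect, which mirrors the strictness of the protocol.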

6. In contrast to the standard mAP which is based on bound- 58 ing boxes, the mAP for video object detection/tracking is 57 based on a sorted list of tracklets (Russakovsky et al. 2017). 56 A tracklet is a set of bounding boxes with the same ID. 55 Similarly, a tracklet is considered correct if its IOU with a 54 mAP(%) groundtruth tracklet is over a threshold. Typical choices of 53 IOU thresholds for tracklet matching and per-frame bound- 52 51 ing box matching are both 0.5. The score of a tracklet is the Deep feature flow w/ Hungarian Crop-and-resize SiamFC 50 average score of all its bounding boxes. Fixed scheduler w/ RoI convolution Scheduler network w/ RoI convolution 49 Oracle scheduler Implementation details. Following the settings in (Zhu et 48 0 10 20 30 40 50 60 70 80 90 100 al. 2017b), R-FCN (Dai et al. 2016) is trained with a ResNet- Frame rate (fps) 101 backbone (He et al. 2016) on the training set. SiamFC is trained following the original paper (Bertinetto Figure 6: Comparison between different methods in video et al. 2016). Instead of training from scratch, however, we object detection/tracking in terms of mAP. The detector (for initialize the first four convolutional layers with the pre- deep feature flow and fixed scheduler) or the scheduler (for trained parameters from AlexNet (Krizhevsky, Sutskever, scheduler network and oracle scheduler) can be applied ev- and Hinton 2012) and change Conv5 from 3 × 3 to 1 × 1 ery σ frames to obtain different results. with the Xavier initializer. Parameters of the first four con- volutional layers are fixed during training (He et al. 2018). We only search for one scale and discard the upsampling step in the original SiamFC for efficiency. All images being Track 0.236 0.040 Track 0.118 0.019 fed into SiamFC are resized to 300 × 500. Moreover, the confidence score of a tracked box (for evaluation) is equal to Detect 0.598 0.125 Detect 0.658 0.205 its corresponding detected box in the keyframe. 
Track Detect Track Detect The scheduler network takes as input the Conv5 feature of our trained SiamFC. The SGD optimizer is adopted with (a) σ = 1 (b) σ = 10 a learning rate 1e-2, momentum 0.9 and weight decay 5e- 4. The batch size is set to 32. During testing, we raise the Figure 7: Confusion matrix of the scheduler network. The decision threshold of track to δ = 0.97 (i.e., the scheduler horizontal axis is the groundtruth and the vertical axis is the outputs track if the predicted confidence of track is over δ) to predicted label. The scheduler is applied every σ frames. ensure conservativeness of the scheduler. Furthermore, since nearby frames look similar, the scheduler is applied every σ frames (where σ is a tunable parameter) to reduce unneces- We compare our DorT framework with a frame skip- sary computation. ping baseline, namely a “fixed scheduler” – R-FCN is per- All experiments are conducted on a workstation with an formed every σ frames and SiamFC is adopted to track for Intel Core i7-4790k CPU and a Titan X GPU. We em- the frames in between. As aforementioned, our scheduler pirically observe that the detection network and the track- can also be applied every σ frames to improve efficiency. ing/scheduler network run at 8.33 fps and 100fps, respec- Moreover, there could be an oracle scheduler – predicting tively. This is because the ResNet-101 backbone is much the groundtruth label (detect or track) as shown in Figure heavier than AlexNet. Moreover, the speed of the Hungarian 4 during testing. The oracle scheduler is a 100% accurate algorithm is as high as 667 fps. scheduler in our setting. The results are shown in Figure 6. Video Object Detection/Tracking We can observe that the frame rate and mAP vary as σ changes. Interestingly, the curves are not monotonic – as To our knowledge, the most closely related work to ours the frame rate decreases, the accuracy in mAP is not neces- is (Lan et al. 2016), which handles cost-effective face de- sarily higher. 
tection/tracking. Since faces are much easier to track and deform less, the paper succeeds with non-deep-learning detectors and trackers. However, we aim at general object detection/tracking in video, which is much more challenging. We demonstrate the effectiveness of the proposed DorT framework against several strong baselines.

Effectiveness of scheduler. The scheduler network is a core component of our DorT framework. Since SiamFC tracking is more efficient than R-FCN detection, the scheduler should predict track when it is safe for the trackers and be conservative enough to predict detect when there is sufficient change, so as to avoid track drift.

In particular, detectors are applied frequently when σ = 1 (the leftmost point of each curve). Associating boxes using the Hungarian algorithm is generally less reliable (given missed detections and false detections) than tracking boxes between two frames. This is also a benefit of the scheduler network: tracking is applied only when the scheduler is confident, and thus most boxes are reliably associated. Hence, the curve of the scheduler network is on the upper-right side of that of the fixed scheduler, as shown in Figure 6.

However, it can also be observed that there is a certain distance between the curve of the scheduler network and that of the oracle scheduler. Given that the oracle scheduler is a 100% accurate classifier, we analyze the classification accuracy of the scheduler network in Figure 7. Let us take the σ = 10 case as an example. Although the classification accuracy is only 32.3%, the false positive rate (i.e., misclassifying a detect case as track) is as low as 1.9%. Because we empirically find that the mAP drops drastically if the scheduler mistakenly predicts track, our scheduler network is made conservative: it tracks only when confident and detects if unsure. Figure 8 shows some qualitative results.

[Figure 8 image panels omitted. IoU labels shown on the images: 0.843, 0.899, 0.712, 0.915, 0.490, 0.886.]
Figure 8: Qualitative results of the scheduler network. Red, blue and green boxes denote groundtruth, detected boxes and tracked boxes, respectively. The first row: R-FCN is applied in the keyframe. The second row: the scheduler determines to track since it is confident. The third row: the scheduler predicts to track in the first image although the red panda moves; however, the scheduler determines to detect in the second image as the cat moves significantly and is unable to be tracked.

Effectiveness of RoI convolution. Trackers are optimized for the crop-and-resize case (Bertinetto et al. 2016): the target and search region are cropped and resized to a fixed size before matching. This is a sensible choice since the tracking algorithm is then not affected by the original size of the target. It is, however, slow in the multi-box case, and we propose RoI convolution as an efficient approximation. As shown in Figure 6, crop-and-resize SiamFC is even slower than detection: the overall speed is 3 fps. Notably, its mAP is 56.5%, which is roughly the same as that of our DorT framework empowered with RoI convolution. Our DorT framework, however, runs at 54 fps when σ = 10. RoI convolution thus obtains over 10x speed boost while retaining mAP.

Comparison with existing methods. Deep feature flow (Zhu et al. 2017b) focuses on video object detection without tracking. We can, however, associate its predicted bounding boxes with per-frame data association using the Hungarian algorithm. The results are shown in Figure 6. It can be observed that our framework performs significantly better than deep feature flow in video object detection/tracking.

Concurrent works that deal with video object detection/tracking are the submitted entries in ILSVRC 2017 (Deng et al. 2017; Wei et al. 2017; Russakovsky et al. 2017). As discussed in the Related Work section, these methods aim only to improve the mAP by adopting complicated methods and post-processing, leading to inefficient solutions without guaranteeing low latency. Their reported results on the test set range from 51% to 65% mAP. Our proposed DorT, notably, achieves 57% mAP on the validation set, which is comparable to the existing methods in magnitude, but is much more principled and efficient.

Video Object Detection

We also evaluate our DorT framework in video object detection for completeness, by removing the predicted object ID. Our DorT framework is compared against deep feature flow (Zhu et al. 2017b), D&T (Feichtenhofer, Pinz, and Zisserman 2017), high performance video object detection (VOD) (Zhu et al. 2018) and ST-Lattice (Chen et al. 2018). The results are shown in Figure 9. It can be observed that D&T and high performance VOD manage to achieve a speed-accuracy balance. They obtain higher results but cannot fit into realtime (over 30 fps) scenarios. ST-Lattice, although fast and accurate, adopts detection results in future frames and is thus not suitable in a low latency scenario. As compared with deep feature flow, our DorT framework performs significantly faster with comparable performance (no more than 1% mAP loss). Although our aim is not the video object detection task, the results in Figure 9 demonstrate the effectiveness of our approach.

[Figure 9 plot omitted. Axes: frame rate (fps), 0 to 80, versus mAP (%), 68 to 80. Legend: D&T, High performance VOD, ST-Lattice, Deep feature flow, DorT (ours).]
Figure 9: Comparison between different methods in video object detection in terms of mAP. Results of D&T, High performance VOD and ST-Lattice are copied from the original papers. The detector (for deep feature flow) or the scheduler (for scheduler network) can be applied every σ frames to obtain different results.

Conclusion and Future Work

We propose a DorT framework for cost-effective video object detection/tracking, which runs in realtime and with low latency. Object detection/tracking of a video sequence is formulated as a sequential decision problem in the framework. Notably, a light-weight but effective scheduler network is proposed, which is shown to be a generalization of Siamese trackers and a special case of RL. The DorT framework turns out to be effective and strikes a good balance between speed and accuracy.

The framework can still be improved in several aspects.
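As an illustration of the sequential decision formulation summarized above, the following Python sketch shows a conservative detect-or-track loop. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the callables `detect`, `track` and `track_score`, the `threshold` parameter, the box format, and the policy of consulting the scheduler every `sigma` frames while tracking in between are all hypothetical stand-ins for R-FCN, SiamFC and the scheduler network.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def detect_or_track(
    frames: Sequence[object],
    detect: Callable[[object], List[Box]],
    track: Callable[[object, object, List[Box]], List[Box]],
    track_score: Callable[[object, object, List[Box]], float],
    sigma: int = 10,
    threshold: float = 0.5,
) -> Tuple[List[List[Box]], List[str]]:
    """Run a detect-or-track loop and return per-frame boxes and actions.

    Assumption: the scheduler (track_score) is consulted every `sigma`
    frames; on all other frames the boxes are simply tracked.
    """
    if sigma < 1:
        raise ValueError("sigma must be a positive integer")

    outputs: List[List[Box]] = []
    actions: List[str] = []
    prev_frame = None
    boxes: List[Box] = []

    for t, frame in enumerate(frames):
        if t == 0:
            # Nothing to track yet, so the first frame is always detected.
            action = "detect"
        elif t % sigma == 0:
            # Conservative rule: track only when the score is high enough,
            # otherwise fall back to the (slower) detector.
            score = track_score(prev_frame, frame, boxes)
            action = "track" if score >= threshold else "detect"
        else:
            action = "track"

        if action == "detect":
            boxes = detect(frame)
        else:
            boxes = track(prev_frame, frame, boxes)

        outputs.append(list(boxes))
        actions.append(action)
        prev_frame = frame

    return outputs, actions
```

In this sketch, raising `threshold` makes the loop more conservative: it tracks less often and detects more often, which trades speed for a lower risk of track drift.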

The SiamFC tracker can search for multiple scales to improve performance as in the original paper. More advanced data association methods can be applied by resorting to the state-of-the-art MOT algorithms. Furthermore, there is room to improve the training of the scheduler network to approach the oracle scheduler. These are left as future work.

Acknowledgment. This work was partly supported by NSFC (No. 61876212 & 61733007). The authors would like to thank Chong Luo and Anfeng He for fruitful discussions.

References

[Anschel, Baram, and Shimkin 2017] Anschel, O.; Baram, N.; and Shimkin, N. 2017. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In ICML.
[Bertinetto et al. 2016] Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In ECCVw.
[Bhat et al. 2018] Bhat, G.; Johnander, J.; Danelljan, M.; Khan, F. S.; and Felsberg, M. 2018. Unveiling the power of deep tracking. In ECCV.
[Chen et al. 2018] Chen, K.; Wang, J.; Yang, S.; Zhang, X.; Xiong, Y.; Loy, C. C.; and Lin, D. 2018. Optimizing video object detection via a scale-time lattice.
[Dai et al. 2016] Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-fcn: Object detection via region-based fully convolutional networks. In NIPS.
[Deng et al. 2017] Deng, J.; Zhou, Y.; Yu, B.; Chen, Z.; Zafeiriou, S.; and Tao, D. 2017. Speed/accuracy trade-offs for object detection from video. talks_2017/Imagenet%202017%20VID.pdf.
[Feichtenhofer, Pinz, and Zisserman 2017] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2017. Detect to track and track to detect. In ICCV.
[Han et al. 2016] Han, W.; Khorrami, P.; Paine, T. L.; Ramachandran, P.; Babaeizadeh, M.; Shi, H.; Li, J.; Yan, S.; and Huang, T. S. 2016. Seq-nms for video object detection. arXiv.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
[He et al. 2018] He, A.; Luo, C.; Tian, X.; and Zeng, W. 2018. A twofold siamese network for real-time object tracking. In CVPR.
[Held, Thrun, and Savarese 2016] Held, D.; Thrun, S.; and Savarese, S. 2016. Learning to track at 100 fps with deep regression networks. In ECCV.
[Ilg et al. 2017] Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; and Brox, T. 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.
[Kang et al. 2016] Kang, K.; Ouyang, W.; Li, H.; and Wang, X. 2016. Object detection from video tubelets with convolutional neural networks. In CVPR.
[Kang et al. 2017] Kang, K.; Li, H.; Xiao, T.; Ouyang, W.; Yan, J.; Liu, X.; and Wang, X. 2017. Object detection in videos with tubelet proposal networks. In CVPR.
[Kristan et al. 2017] Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Hager, G.; Lukezic, A.; Eldesokey, A.; and Fernandez, G. 2017. The visual object tracking vot2017 challenge results. In ICCVw.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
[Lan et al. 2016] Lan, X.; Xiong, Z.; Zhang, W.; Li, S.; Chang, H.; and Zeng, W. 2016. A super-fast online face tracking system for video surveillance. In ISCAS.
[Leal-Taixé et al. 2014] Leal-Taixé, L.; Fenzi, M.; Kuznetsova, A.; Rosenhahn, B.; and Savarese, S. 2014. Learning an image-based motion context for multiple people tracking. In CVPR.
[Leal-Taixé et al. 2015] Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; and Schindler, K. 2015. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv.
[Li, Shi, and Lin 2018] Li, Y.; Shi, J.; and Lin, D. 2018. Low-latency video semantic segmentation. In CVPR.
[Liu et al. 2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. Ssd: Single shot multibox detector. In ECCV.
[Milan et al. 2017] Milan, A.; Rezatofighi, S. H.; Dick, A. R.; Reid, I. D.; and Schindler, K. 2017. Online multi-target tracking using recurrent neural networks. In AAAI.
[Milan, Roth, and Schindler 2014] Milan, A.; Roth, S.; and Schindler, K. 2014. Continuous energy minimization for multitarget tracking. TPAMI.
[Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
[Nam and Han 2016] Nam, H., and Han, B. 2016. Learning multi-domain convolutional neural networks for visual tracking. In CVPR.
[Pirsiavash, Ramanan, and Fowlkes 2011] Pirsiavash, H.; Ramanan, D.; and Fowlkes, C. C. 2011. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR.
[Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.
[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV.
[Russakovsky et al. 2017] Russakovsky, O.; Park, E.; Liu, W.; Deng, J.; Li, F.-F.; and Berg, A. 2017. Beyond imagenet large scale visual recognition challenge. challenges/beyond_ilsvrc.
[Sadeghian, Alahi, and Savarese 2017] Sadeghian, A.; Alahi, A.; and Savarese, S. 2017. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV.
[Tang et al. 2016] Tang, S.; Andres, B.; Andriluka, M.; and Schiele, B. 2016. Multi-person tracking by multicut and deep matching. In ECCV.
[Tang et al. 2017] Tang, S.; Andriluka, M.; Andres, B.; and Schiele, B. 2017. Multiple people tracking by lifted multicut and person reidentification. In CVPR.
[Tang et al. 2018a] Tang, P.; Wang, C.; Wang, X.; Liu, W.; Zeng, W.; and Wang, J. 2018a. Object detection in videos by high quality object linking. arXiv.
[Tang et al. 2018b] Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; and Yuille, A. L. 2018b. Pcl: Proposal cluster learning for weakly supervised object detection. TPAMI.
[Wei et al. 2017] Wei, Y.; Zhang, M.; Li, J.; Chen, Y.; Feng, J.; Dong, J.; Yan, S.; and Shi, H. 2017. Improving context modeling for video object detection and tracking. 2017/ilsvrc2017_short(poster).pdf.
[Wu, Lim, and Yang 2015] Wu, Y.; Lim, J.; and Yang, M.-H. 2015. Object tracking benchmark. TPAMI.
[Xiang, Alahi, and Savarese 2015] Xiang, Y.; Alahi, A.; and Savarese, S. 2015. Learning to track: Online multi-object tracking by decision making. In ICCV.
[Xu et al. 2018] Xu, Y.-S.; Fu, T.-J.; Yang, H.-K.; and Lee, C.-Y. 2018. Dynamic video segmentation network. In CVPR.
[Yu et al. 2016] Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; and Yan, J. 2016. Poi: Multiple object tracking with high performance detection and appearance feature. In ECCVw.
[Zhang, Li, and Nevatia 2008] Zhang, L.; Li, Y.; and Nevatia, R. 2008. Global data association for multi-object tracking using network flows. In CVPR.
[Zhu et al. 2017a] Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; and Wei, Y. 2017a. Flow-guided feature aggregation for video object detection. In ICCV.
[Zhu et al. 2017b] Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; and Wei, Y. 2017b. Deep feature flow for video recognition. In CVPR.
[Zhu et al. 2018] Zhu, X.; Dai, J.; Yuan, L.; and Wei, Y. 2018. Towards high performance video object detection. In CVPR.