Object recognition and tracking are two of the most dynamic research areas in computer vision, a field that lies at the intersection of deep learning and machine vision. This paper presents an efficient ensemble algorithm for the recognition and tracking of fixed shape moving objects that accommodates the shifts and scale changes the object may undergo. The first part uses the Maximum Average Correlation Height (MACH) filter for object recognition and determines the bounding box coordinates. If the correlation based MACH filter fails, the algorithm switches to a more reliable but computationally more complex feature based recognition technique, the affine scale invariant feature transform (ASIFT). ASIFT extracts features from the object of interest and provides invariance in up to six affine parameters, namely translation (two parameters), zoom, rotation and two camera axis orientations. In this paper, however, only the shift and scale invariances are used. The second part of the algorithm uses a particle filter based Approximate Proximal Gradient (APG) technique to periodically update the coordinates of the bounding box that encloses the object. Finally, a comparison with other state-of-the-art tracking algorithms is presented, which demonstrates the effectiveness of the proposed algorithm in minimizing tracking errors.

The problem of estimating the position of fixed shape moving objects still persists in a remote scene environment because of ever-changing environmental conditions and change in the dimensions and physical attributes of the object [

In recent years, image detectors such as ASIFT have been introduced which can improve the recognition process by providing shift and scale invariances. These detectors are normally classified based on the properties relating to incremental invariance. The Harris point detector was one of the earliest ones which was rotationally invariant [

Similar to recognition, multiple tracking algorithms have been proposed over the years. Tracking refers to a complex problem of estimating the approximate path of an object of interest in the image plane as it starts to move. It can also be referred to as dynamic object identification. The primary motive behind tracking is to find a route for the object in all the frames of a certain video [

The novel contribution of this paper is that MACH and ASIFT are used in combination with particle filters for object tracking for the first time. The proposed algorithm involves two major steps, i.e., object recognition and object tracking. Object recognition uses either the MACH filter or ASIFT. The MACH filter generates a correlation peak that is maximal relative to the noise it produces, while minimizing a metric commonly known as the Average Similarity Measure (ASM). MACH is employed first to recognize the object in the first frame; the object's coordinates are identified and a bounding box is constructed. If the object changes its position drastically, it becomes important to recognize the feature points of the object, i.e., the points that best characterize it, and ASIFT is used for this purpose. ASIFT is an upgraded version of the SIFT algorithm that provides invariance in up to six parameters (in comparison to SIFT, which provides invariance in four). The six parameters are translation (two parameters), zoom, rotation and two camera axis orientations.

Once the recognition part is completed, the bounding box coordinates are forwarded to the second part of the algorithm, which performs object tracking. The tracking portion employs particle filters to periodically update the coordinates of the bounding box that encloses the object of interest. The particle filters use a probability density function to estimate the position of the object. This probabilistic estimation allows the tracker to follow an object under complex conditions, such as when the object is occluded by another object. The particle filters are then improved using the proximal gradient technique to obtain better precision. In the end, performance comparisons are made with recently proposed algorithms to demonstrate the effectiveness and speed of the proposed algorithm.

This paper proposes a tracker that first uses an ensemble of correlation based and feature based filters to recognize the object of interest and then tracks it efficiently.

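The first preprocessing step is the Difference of Gaussians (DOG) operation, which begins by smoothing the input image. The smoothing is performed by convolving a Gaussian kernel with the input image, and the DOG is obtained by taking the difference of two such smoothed images computed with Gaussian kernels of different widths σ_{1} and σ_{2}. For an input image f(x, y), the Gaussian kernel and the first smoothed image g_{1}(x, y) are expressed as

$$G_{\sigma}(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right), \qquad g_{1}(x, y) = G_{\sigma_{1}}(x, y) * f(x, y)$$

Here * represents the convolution. By employing a different width σ_{2}, a second smoothed image is obtained using

$$g_{2}(x, y) = G_{\sigma_{2}}(x, y) * f(x, y)$$

The DOG is then the difference of the two smoothed images:

$$\mathrm{DOG}(x, y) = g_{1}(x, y) - g_{2}(x, y) = \left(G_{\sigma_{1}} - G_{\sigma_{2}}\right) * f(x, y)$$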

Since the DOG is the difference of two low-pass filters, it effectively acts as a bandpass filter: it removes most of the high-frequency components and some of the low-frequency components. After preprocessing, the next step is to perform the recognition, which employs a MACH filter. The first step in using the MACH filter is to train it.
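Before moving on to training, the DOG preprocessing described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the widths σ_{1} = 1 and σ_{2} = 2, the kernel radius of 3σ and the circular boundary handling (a side effect of convolving through the FFT) are all assumptions made for illustration, and the image is assumed to be larger than the kernel.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Return a normalized 2-D Gaussian kernel (sums to 1)."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))  # assumed truncation at 3 sigma
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def smooth(image, sigma):
    """Convolve the image with a Gaussian (circular convolution via FFT)."""
    k = gaussian_kernel(sigma)
    kh, kw = k.shape
    padded = np.zeros_like(image, dtype=float)
    padded[:kh, :kw] = k
    # Shift the kernel so that its center sits at index (0, 0).
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def difference_of_gaussians(image, sigma1=1.0, sigma2=2.0):
    """Bandpass-filter the image as g1 - g2 with two Gaussian widths."""
    return smooth(image, sigma1) - smooth(image, sigma2)
```

Because both kernels sum to one, a constant image maps to zero, which is consistent with the bandpass behaviour noted above.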

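To perform this step, a temporal derivative is computed for each pixel of the input image, which results in a volume for every sequence involved in the training process. Afterwards, each volume is represented in the frequency domain by performing a 2-D Discrete Fourier Transform (DFT) operation using

$$F(u, v) = \sum_{x=0}^{L-1} \sum_{y=0}^{M-1} f(x, y) \exp\left(-j 2\pi \left(\frac{ux}{L} + \frac{vy}{M}\right)\right)$$

In this equation, f(x, y) is the volume corresponding to the temporal derivative of the input sequence, F(u, v) is its frequency domain representation, and L and M are the numbers of rows and columns. The columns of each transformed volume are concatenated into a single column vector; let x_{i} denote the resulting column of dimension d obtained after concatenation, where d = L * M. After obtaining the column vectors, the MACH filter (which minimizes the ASM and the average correlation energy while maximizing a metric called the average correlation height) can be synthesized as

$$h = \left(\alpha C + \beta D_{x} + \gamma S_{x}\right)^{-1} m_{x}$$

Here, h is the frequency response of the filter, m_{x} is the arithmetic mean of all the input vectors x_{i}, C is the d * d diagonal covariance matrix, and d is the total number of elements. D_{x} represents the average spectral density of the training videos and is also a d * d diagonal matrix. D_{x} is calculated using

$$D_{x} = \frac{1}{N} \sum_{i=1}^{N} X_{i}^{*} X_{i}$$

where X_{i} is the diagonal matrix whose diagonal holds the elements of x_{i}, * denotes the complex conjugate and N is the number of training vectors. The similarity matrix S_{x} is calculated using

$$S_{x} = \frac{1}{N} \sum_{i=1}^{N} \left(X_{i} - M_{x}\right)^{*} \left(X_{i} - M_{x}\right)$$

M_{x} can be considered similar to m_{x}, with the same values arranged in a diagonal array. The values of the tradeoff parameters α, β and γ can be set appropriately [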

In this paper, the MACH parameters have been estimated using Particle Swarm Optimization (PSO), with the PSO settings [X_{min}, X_{max}] = [−1, 1], [V_{min}, V_{max}] = [−0.1, 0.1], W = 0.9 and C_{1} = C_{2} = 2. The optimized values of α, β and γ enable much sharper correlation peaks, which leads to better object recognition.
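As an illustration of how these settings enter the search, the following is a minimal sketch of a standard global-best PSO loop in plain Python. The position bounds, velocity bounds, inertia weight W and acceleration constants C_{1}, C_{2} are taken from the text; the swarm size, iteration count, random seed and the `objective` callable are hypothetical, since the paper does not state them here. In the tracker, `objective` would presumably score the sharpness of the correlation peak for a candidate (α, β, γ), negated so that lower is better.

```python
import random

def pso(objective, dim, n_particles=20, n_iter=100,
        x_bounds=(-1.0, 1.0), v_bounds=(-0.1, 0.1),
        w=0.9, c1=2.0, c2=2.0, seed=0):
    """Minimize `objective` over `dim` variables with global-best PSO."""
    rng = random.Random(seed)

    def clamp(value, lo, hi):
        return max(lo, min(hi, value))

    pos = [[rng.uniform(*x_bounds) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(*v_bounds) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # global best so far

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = clamp(v, *v_bounds)
                pos[i][d] = clamp(pos[i][d] + vel[i][d], *x_bounds)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

A call such as `pso(score, dim=3)` would return a candidate (α, β, γ) triple and its score, where `score` is whatever peak-sharpness criterion is chosen.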

Values of α, β and γ estimated using the Particle Swarm Optimization technique

| Data Set | α | β | γ |
|---|---|---|---|
| Car-1 | 0.0001 | 0.4327 | 0.3169 |
| Car-2 | 0.00012 | 0.4739 | 0.3170 |
| Blur Body | 0.0001 | 0.4238 | 0.2266 |
| Singer | 0.00012 | 0.5103 | 0.4424 |
| Skating | 0.0002 | 0.5355 | 0.4180 |
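To make the role of the trade-off parameters concrete, the following NumPy sketch synthesizes a MACH filter in the frequency domain and correlates it with a test image. It is a simplified illustration rather than the authors' code: it operates on plain 2-D images instead of temporal-derivative volumes, it assumes a white-noise covariance C equal to `noise_var` times the identity (the `noise_var` parameter is hypothetical), and it exploits the fact that C, D_{x} and S_{x} are diagonal, so the matrix inverse reduces to an element-wise division.

```python
import numpy as np

def synthesize_mach(train_images, alpha, beta, gamma, noise_var=1.0):
    """Return the MACH frequency response h as an array shaped like one image."""
    X = np.stack([np.fft.fft2(img) for img in train_images])  # (N, L, M)
    m_x = X.mean(axis=0)                        # mean training spectrum
    D_x = (np.abs(X) ** 2).mean(axis=0)         # average power spectral density
    S_x = (np.abs(X - m_x) ** 2).mean(axis=0)   # average similarity term
    C = noise_var * np.ones_like(D_x)           # assumed white-noise covariance
    # Diagonal matrices: the inverse is an element-wise division.
    return m_x / (alpha * C + beta * D_x + gamma * S_x)

def correlate(test_image, h):
    """Cross-correlate a test image with the filter h (circular, via FFT)."""
    return np.real(np.fft.ifft2(np.fft.fft2(test_image) * np.conj(h)))
```

With the Car-1 values from the table, a call would look like `h = synthesize_mach(frames[:8], 0.0001, 0.4327, 0.3169)`, where `frames` is a hypothetical list of equally sized grayscale training frames; the location of the maximum of `correlate(test, h)` then indicates the object position.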

This correlation filter is employed as part of the proposed algorithm to recognize the object in an image. Training the filter requires a few preliminary frames (eight) containing the object. To detect and recognize the object in a test image, the image is cross-correlated with the filter synthesized from the training images, and the correlation output is assessed using two parameters, i.e., Peak Correlation Energy (PCE) and Correlation Output Peak Intensity (COPI).
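The two peak measures can be sketched as follows. The exact formulas are not given in this section, so the definitions used here are assumptions: COPI is taken as the maximum of the squared magnitude of the correlation plane, and PCE as the ratio of that peak to the total energy of the plane, which is one common convention. The `bounding_box` helper and its `(top, left, height, width)` layout are likewise hypothetical.

```python
import numpy as np

def peak_metrics(corr_plane):
    """Return (peak_location, COPI, PCE) for a correlation plane."""
    energy = np.abs(corr_plane) ** 2
    idx = np.unravel_index(np.argmax(energy), energy.shape)
    copi = float(energy[idx])                  # assumed: max |C(x, y)|^2
    total = float(energy.sum())
    pce = copi / total if total > 0 else 0.0   # assumed: peak / plane energy
    return (int(idx[0]), int(idx[1])), copi, pce

def bounding_box(peak, obj_height, obj_width):
    """Return (top, left, height, width) of a box centered on the peak."""
    r, c = peak
    return (r - obj_height // 2, c - obj_width // 2, obj_height, obj_width)
```

A sharp, isolated peak gives a PCE close to one under this definition, whereas a flat or multi-peaked plane gives a much smaller value, which could serve as the trigger for falling back to ASIFT.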

The peak depicts the identified object coordinates. The coordinates of the object are used to construct the bounding box used for tracking of the object. The bounding box can be seen in

Once the bounding box is established, the next task is to track the recognized object once it starts to move. For that, the particle filter-based algorithm is used. In cases where the object changes its dimensions or path drastically, the MACH-based recognition filter sometimes does not give accurate results. In order to improve the efficiency of the tracker, an ensemble of ASIFT and MACH is proposed. This ensemble is proposed here for the first time for efficient detection of the object of interest.

If the object undergoes drastic changes in shift or scale, the MACH filter tends to give inaccurate results. In addition, when neighboring objects are too close to the object of interest, multiple correlation peaks tend to appear.

Step 1. Each image is transformed by simulating the affine distortions caused by a change in the orientation of the camera from the frontal position. The distortions depend mainly on two parameters: (i) the latitude θ and (ii) the longitude Φ. The image first undergoes a rotation by Φ, and then a tilt is applied with parameter t = 1/cos θ. For digital images, the tilt is performed by directional sub-sampling.

Step 2. The tilts and rotations are performed for a finite number of latitude and longitude angles. The sampling steps are chosen so that the simulated images stay close to any other view generated by other values of the latitude and longitude angles.

Step 3. All simulated images produced by steps 1 and 2 are then compared using SIFT.

The sampling rate of all the involved parameters to perform tilt is very critical. Object recognition is possible in any slanted case irrespective of the source producing it only if the object is perfectly planar. For this paper, a practical physical upper bound is enforced, i.e., t_{max} is physically obtained by using image pairs of the object both in the original position and in the slanting position. Multiple examples are presented here.
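The sampling of tilts and rotations can be sketched as a small enumeration. The specific values below (tilts in a geometric series with ratio √2 and a rotation step of 72°/t) are the ones commonly quoted in the ASIFT literature; they are assumptions with respect to this paper, which does not list its sampling steps or its value of t_{max} in this section. The function names are hypothetical.

```python
import math

def asift_view_samples(n_tilts=5, tilt_ratio=math.sqrt(2.0), phi_step_deg=72.0):
    """List (tilt t, longitude phi in degrees) pairs to simulate."""
    samples = [(1.0, 0.0)]              # frontal view: no tilt, no rotation
    for k in range(1, n_tilts + 1):
        t = tilt_ratio ** k             # tilt t = 1 / cos(theta)
        step = phi_step_deg / t         # finer rotation sampling for larger tilts
        phi = 0.0
        while phi < 180.0:
            samples.append((t, phi))
            phi += step
    return samples

def latitude_deg(t):
    """Invert t = 1 / cos(theta) to recover the latitude in degrees."""
    return math.degrees(math.acos(1.0 / t))
```

Each pair would then be applied to the image as a rotation by Φ followed by directional sub-sampling by a factor t, and the resulting views compared with SIFT. Limiting `n_tilts` plays the role of the upper bound t_{max} mentioned above.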

After the object has been recognized using MACH or ASIFT, the next step is to track it efficiently in successive frames. The tracking algorithm runs the recognition algorithm on the first frame and then employs a modified particle filter to update the position of the object in successive frames. The MACH filter constructs the bounding box by performing the object recognition, while the APG based technique updates the coordinates of the bounding box once the object is in motion. ASIFT is called if the longitudinal or latitudinal coordinates of the object change drastically or if the object starts to tilt, as described in the previous section. The tracking routine used in this paper is the APG approach, which combines the particle filter with a gradient descent technique.
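The particle filter part of this routine can be illustrated with a generic bootstrap filter step. This is a minimal sketch under stated assumptions, not the modified filter used in the paper: the state is only the 2-D center of the bounding box, the motion model is a Gaussian random walk with standard deviation `motion_std`, resampling is multinomial, and `likelihood` is a hypothetical callable that scores how well a candidate position matches the object (in the paper this role is played by the APG-based template matching).

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std, rng):
    """One predict / update / estimate / resample cycle.

    particles: (N, 2) array of candidate box centers
    weights:   (N,) array of particle weights summing to 1
    """
    # Predict: diffuse the particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: reweight by the observation likelihood, then normalize.
    scores = np.array([likelihood(p) for p in particles])
    weights = weights * scores + 1e-300   # guard against an all-zero sum
    weights = weights / weights.sum()
    # Estimate: weighted mean of the particle positions.
    estimate = weights @ particles
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights, estimate
```

Calling this once per frame, with the bounding box from the recognition stage used to initialize the particles, yields an updated center estimate for each new frame.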

i Set |

ii For k = 0, 1,… until converge |

_{t}, I] pertains to target templates coefficients. The matrix used for representing non target template coefficients is shown as a = [a_{T}, a_{I}]. For defining the energy pertaining to non-target templates, a crucial parameter, μ_{t}, is defined. Now, the APG method is applied on both F(a) and G(a) using

The APG-based tracking method is shown in Algorithm-3 that uses the particle filters and the APG approach defined in ALGORITHM. Implementation of the APG-based tracker is performed in MATLAB.
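Since the algorithm listing is incomplete here, the following sketch shows a generic accelerated proximal gradient iteration of the FISTA type for a non-negative sparse coding problem, min over a ≥ 0 of ½‖y − Aa‖² + λ‖a‖₁. It is a stand-in chosen for illustration, not the paper's Algorithm: the μ_{t} energy term on the non-target template coefficients is omitted, and the names `A`, `y`, `lam` and `n_iter` are hypothetical.

```python
import numpy as np

def apg_nonneg_lasso(A, y, lam=0.01, n_iter=200):
    """Accelerated proximal gradient for non-negative L1-regularized least squares."""
    lipschitz = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    a = np.zeros(A.shape[1])
    a_prev = a.copy()
    t, t_prev = 1.0, 1.0
    for _ in range(n_iter):
        # Extrapolation (momentum) step.
        beta = a + ((t_prev - 1.0) / t) * (a - a_prev)
        # Gradient of the smooth term 0.5 * ||y - A beta||^2.
        grad = A.T @ (A @ beta - y)
        a_prev = a
        # Proximal step: for a >= 0 the L1 term is linear, so the prox is a
        # shifted projection onto the non-negative orthant.
        a = np.maximum(0.0, beta - (grad + lam) / lipschitz)
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return a
```

In a tracker of this kind, `A` would hold the template set column by column and `y` the candidate image patch of one particle, so that the reconstruction error of the returned coefficients can be used to weight that particle.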

The following datasets are used to test the proposed method. The first is the Car-1 data set, which shows color images of a vehicle moving along a curved path on a road. The data set is obtained from a video sequence that also involves a cluttered scene environment. The algorithm is tested on different frames of the data set, and it can be observed that the coordinates of the red bounding box are periodically updated as the object changes its location along the curve.

The third dataset shows movement of a person in his office. The dataset is named as “Blur Body” [

The fourth dataset employed for testing is referred to as “Singer” [

The results on the data sets show that the bounding box consistently changes its position in step with the movement of the object. This demonstrates the correctness of the tracker and its efficiency when working in an environment with diverse conditions. The Car-1 data set shows a simple colored object, i.e., a vehicle, moving along a curved path. The Car-2 data set shows partial occlusion of a vehicle. The Blur Body data set shows occasional blurring of an object. The Skating and Singer data sets each show occasional changes in projection and lighting conditions.

The proposed algorithm is compared with similar state-of-the-art algorithms in terms of execution time and average tracking error. The average tracking error is measured as the Euclidean distance between the center of the tracked bounding box and the center of the ground truth, normalized by the size of the ground-truth target. The execution time reflects how quickly the algorithm can detect the object of interest precisely. The execution times of all algorithms are measured using the same language and machine, i.e., MATLAB 2019 on a Core i5 processor. The algorithm is compared with state-of-the-art algorithms such as the novel hybrid Local Multiple system (LM-CNN-SVM) based on Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs) [

Average Tracking Errors (Min. 100 Frames)

| Data Set | ODDL | MDCNN | ICTL | LM-CNN-SVM | Non APG Approach | Proposed Algorithm with APG Approach |
|---|---|---|---|---|---|---|
| Car-1 | 0.46 | 0.42 | 0.17 | 0.055 | 0.21 | 0.041 |
| Car-2 | 0.059 | 0.057 | 0.056 | 0.051 | 0.061 | 0.048 |
| Person | 0.09 | 0.088 | 0.094 | 0.071 | 0.041 | 0.012 |
| Blur Body | 0.10 | 0.101 | 0.089 | 0.09 | 0.088 | 0.079 |
| Singer | 0.14 | 0.14 | 0.1328 | 0.129 | 0.144 | 0.127 |

Execution Time (s) [Min. 100 Frames]

| Data Set | Frames | ODDL | MDCNN | ICTL | LM-CNN-SVM | Non APG Approach | Proposed Algorithm with APG Approach |
|---|---|---|---|---|---|---|---|
| Car-1 | 234 | 14.2 | 13.2 | 10.2 | 10 | 11.4 | 9.9 |
| Car-2 | 300 | 12.1 | 11.1 | 9.1 | 9.4 | 11.0 | 9.0 |
| Person | 277 | 17.1 | 14.1 | 11.1 | 11.3 | 12.6 | 10.2 |
| Blur Body | 277 | 18.2 | 17.2 | 12.2 | 12.7 | 13.8 | 12.0 |
| Singer | 301 | 14.7 | 16.7 | 10.7 | 10.4 | 12.1 | 9.8 |
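The average tracking error reported in the comparison can be computed along the following lines. This is a sketch under assumptions: boxes are given as `(x, y, width, height)` with `(x, y)` the top-left corner, and the "size of the target" used for normalization is taken to be the diagonal of the ground-truth box, since the exact normalization is not specified in this section.

```python
import math

def center(box):
    """Center point of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def normalized_center_error(pred_box, gt_box):
    """Center distance divided by the ground-truth box diagonal (assumed size)."""
    (px, py), (gx, gy) = center(pred_box), center(gt_box)
    _, _, w, h = gt_box
    return math.hypot(px - gx, py - gy) / math.hypot(w, h)

def average_tracking_error(pred_boxes, gt_boxes):
    """Mean normalized center error over all frames of a sequence."""
    errors = [normalized_center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(errors) / len(errors)
```

For example, a predicted box offset by (3, 4) pixels from a 30 by 40 ground-truth box has a center distance of 5 and a diagonal of 50, giving a normalized error of 0.1.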

This paper proposes an efficient technique that utilizes an ensemble of two recognition techniques and a novel tracking routine for tracking fixed shape moving objects. First, MACH is used to detect the object of interest by maximizing the average correlation height while minimizing the average similarity measure. The detected coordinates are used to construct a bounding box that indicates the presence of the object. In each subsequent frame, the coordinates of the bounding box are updated using the APG approach. The analysis shows that the proposed algorithm is not only less error prone than the previous methods but also has lower computational complexity than its predecessors due to the APG approach. The APG method eliminates the use of trivial templates, resulting in lower complexity and a faster tracking procedure. The proposed algorithm can be improved in the future by training the tracker to work on objects that have been occluded in a remote scene environment. Once the object becomes occluded, the MACH filter would be used to predict the coordinates instead of the particle filter, which may provide more accurate results.