I am a researcher in Azure AI Frameworks, on rotation from Microsoft Research. I am currently focused exclusively on the intersection of reinforcement learning and automatic graph/kernel scheduling for novel accelerators (GPUs, NPUs).
Previously I was a researcher in the Machine Learning area at Microsoft Research, Redmond for 8 years.
I conduct not only fundamental research but also love to build high-quality software.
I completed my PhD at the Robotics Institute, Carnegie Mellon University. I do fundamental as well as applied research in machine learning, control, and computer vision, with applications to autonomous agents in general and robotics in particular. My interests include decision-making under uncertainty, reinforcement learning, artificial intelligence, and machine learning. My work was shortlisted for Best Paper of the Year at the International Journal of Robotics Research.
From 2015 to 2023 I regularly served as an area chair or reviewer for NeurIPS, ICLR, ICML, and AutoML, and occasionally for ICRA, IROS, IJRR, JFR, CVPR, ECCV, ICCV, and TMLR.
PhD in Robotics, 2015
Carnegie Mellon University
MS in Robotics, 2012
Carnegie Mellon University
Bachelor of Electrical Engineering, 2007
Delhi College of Engineering
Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information based on the pair-wise attention score, but this also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. [2021a] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieves significant gains over SoTA on several long-range tasks. Despite its empirical success, the design of S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom of several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure such that the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) SGConv surpasses S4 on the Long Range Arena and Speech Commands datasets while running faster. 2) When plugged into standard language and vision models, it shows the potential to improve both efficiency and performance.
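The two principles above can be illustrated with a small self-contained sketch (not the paper's actual SGConv parameterization): a kernel built from O(log L) parameters, where each scale's weights are upsampled to cover exponentially more positions and damped by a decay factor, applied as a causal global convolution via FFT.

```python
import numpy as np

def make_decaying_kernel(L, seed=0):
    """Build a length-L convolution kernel from O(log L) parameters.

    Each scale s owns a fixed-size block of weights, upsampled to cover
    2**s positions each and damped by alpha**s, so nearer positions get
    larger weights (principle 2) while the parameter count grows only
    logarithmically with L (principle 1)."""
    rng = np.random.default_rng(seed)
    block = 4                       # parameters per scale
    num_scales = int(np.ceil(np.log2(max(L // block, 1)))) + 1
    alpha = 0.5                     # decay factor per scale
    pieces = []
    for s in range(num_scales):
        w = rng.standard_normal(block) * (alpha ** s)
        pieces.append(np.repeat(w, 2 ** s))  # wider but weaker blocks
    kernel = np.concatenate(pieces)[:L]
    if kernel.size < L:
        kernel = np.pad(kernel, (0, L - kernel.size))
    return kernel

def global_conv_fft(x, kernel):
    """Causal global convolution in O(L log L) via the FFT."""
    L = x.shape[-1]
    n = 2 * L  # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(kernel, n), n)
    return y[..., :L]

x = np.random.default_rng(1).standard_normal(1024)
k = make_decaying_kernel(1024)
y = global_conv_fft(x, k)
```

Because the kernel spans the entire input, a single layer mixes information across all positions, yet the parameter count stays logarithmic in L.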
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices, from ARM CPUs to NVIDIA GPUs, and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be matched with up to 1.5x and 2.5x faster runtime, respectively, and 1.2x and 2.0x lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
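The training-free search loop can be sketched in a few lines. This is a toy illustration, not the LTS implementation: the parameter-count formula is a rough GPT-2-style approximation, and `mock_latency` is a hypothetical stand-in for a real on-device measurement.

```python
import random

def decoder_params(n_layer, d_model, d_inner):
    """Approximate decoder parameter count for a GPT-2-style block:
    attention (4 * d_model^2) plus feed-forward (2 * d_model * d_inner),
    ignoring embeddings, biases, and layer norms for simplicity."""
    return n_layer * (4 * d_model * d_model + 2 * d_model * d_inner)

def mock_latency(n_layer, d_model, d_inner):
    """Hypothetical stand-in for an on-target-device latency measurement."""
    return n_layer * (d_model * d_model + d_model * d_inner) * 1e-9

def pareto_frontier(points):
    """Keep configurations not dominated in (low latency, high proxy).
    points: list of (latency, params, config) tuples."""
    frontier, best = [], -1
    for lat, p, cfg in sorted(points):      # ascending latency
        if p > best:                        # better proxy than anything faster
            frontier.append((lat, p, cfg))
            best = p
    return frontier

rng = random.Random(0)
cands = []
for _ in range(200):  # random sampling stands in for the real search strategy
    cfg = dict(n_layer=rng.choice([4, 8, 12, 16]),
               d_model=rng.choice([256, 512, 768]),
               d_inner=rng.choice([1024, 2048, 3072]))
    cands.append((mock_latency(**cfg), decoder_params(**cfg),
                  tuple(sorted(cfg.items()))))
frontier = pareto_frontier(cands)
```

Because the proxy is just a parameter count, the whole loop needs no gradients and no GPU; only the latency probe touches the target device.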
We propose a neural architecture search (NAS) algorithm, Petridish, to iteratively add shortcut connections to existing network layers. The added shortcut connections effectively perform gradient boosting on the augmented layers. The proposed algorithm is motivated by the feature selection algorithm forward stage-wise linear regression, since we consider NAS as a generalization of feature selection for regression, where NAS selects shortcuts among layers instead of selecting features. In order to reduce the number of trials of possible connection combinations, we jointly train all possible connections at each stage of growth while leveraging feature selection techniques to choose a subset of them. We experimentally show this process to be an efficient forward architecture search algorithm that can find competitive models using few GPU days in both the search space of repeatable network modules (cell-search) and the space of general networks (macro-search). Petridish is particularly well-suited for warm-starting from existing models, which is crucial for lifelong-learning scenarios.
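The forward stage-wise linear regression that motivates Petridish is simple enough to sketch directly. Below is a minimal version of that classic algorithm (not Petridish itself): at each step, nudge the coefficient of the feature most correlated with the current residual, the same greedy, boosting-style growth that Petridish applies to shortcut connections.

```python
import numpy as np

def forward_stagewise(X, y, steps=400, eps=0.02):
    """Forward stage-wise linear regression: at each step, pick the
    feature most correlated with the current residual and move its
    coefficient by eps in that direction."""
    n, d = X.shape
    beta = np.zeros(d)
    resid = y.copy()
    for _ in range(steps):
        corr = X.T @ resid                    # correlation with residual
        j = int(np.argmax(np.abs(corr)))      # greediest feature
        beta[j] += eps * np.sign(corr[j])     # small boosting step
        resid = y - X @ beta
    return beta

# Toy data: only features 1 and 4 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_beta = np.zeros(10)
true_beta[[1, 4]] = [2.0, -1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(200)
beta = forward_stagewise(X, y)
```

The analogy in the abstract: replace "features" with candidate shortcut connections and "residual correlation" with the gradient-boosting signal on the augmented layer.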
Developing and testing algorithms for autonomous vehicles in the real world is an expensive and time-consuming process. Also, in order to utilize recent advances in machine intelligence and deep learning, we need to collect a large amount of annotated training data in a variety of conditions and environments. We present a new simulator built on Unreal Engine that offers physically and visually realistic simulations for both of these goals. Our simulator includes a physics engine that can operate at a high frequency for real-time hardware-in-the-loop (HITL) simulations with support for popular protocols (e.g., MavLink). The simulator is designed from the ground up to be extensible to accommodate new types of vehicles, hardware platforms, and software protocols. In addition, the modular design enables various components to be easily used independently in other projects. We demonstrate the simulator by first implementing a quadrotor as an autonomous vehicle and then experimentally comparing the software components with real-world flights.
Increasingly, real-world problems require multiple predictions, while traditional supervised learning techniques focus on making a single best prediction. For instance, in advertisement placement on the web, a list of advertisements is placed on a page with the objective of maximizing the click-through rate on that list. In this work, we build an efficient framework for making sets or lists of predictions where the objective is to optimize any utility function which is (monotone) submodular over a list of predictions. Other examples of tasks where multiple predictions are important include: grasp selection in robotic manipulation, where the robot arm must evaluate a list of grasps with the aim of finding a successful grasp as early in the list as possible; and trajectory selection for mobile ground robots, where, given computational time limits, the task is to select a list of trajectories from a much larger set of feasible trajectories so as to minimize the expected cost of traversal. In computer vision tasks like frame-to-frame target tracking in video, multiple hypotheses about the target location and pose must be considered by the tracking algorithm. For each of these cases, we optimize for the content and order of the list of predictions. Crucially, and in contrast with existing work on list prediction, our approach to predicting lists is based on very simple reductions of the problem of predicting lists to a series of simple classification/regression tasks. This provides powerful flexibility to use any existing prediction method while ensuring rigorous guarantees on prediction performance. We analyze these meta-algorithms for list prediction in both the online, no-regret and generalization settings. Furthermore, we extend the methods to make multiple predictions in structured output domains where even a single prediction is a combinatorial object, e.g., challenging vision tasks like semantic scene labeling and monocular pose estimation.
We conclude with case studies that demonstrate the power and flexibility of these reductions in problems ranging from document summarization and prediction of the pose of humans in images to predicting the best set of robotic grasps and purely vision-based autonomous flight in densely cluttered environments.
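The core idea, building a list slot by slot to maximize marginal utility under a monotone submodular objective, can be sketched with a toy coverage utility. This is an illustrative greedy oracle, not the paper's learned reduction, where each slot would instead be filled by a trained classifier or regressor.

```python
def greedy_list(items, marginal_gain, list_len):
    """Greedy list construction for a monotone submodular utility:
    fill each slot with the item of largest marginal gain given the
    prefix chosen so far (the classic (1 - 1/e)-approximation greedy)."""
    chosen = []
    for _ in range(list_len):
        rest = [it for it in items if it not in chosen]
        best = max(rest, key=lambda it: marginal_gain(chosen, it))
        chosen.append(best)
    return chosen

# Toy coverage utility: each item covers some targets; the utility of a
# list is the number of distinct targets covered (monotone submodular).
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def gain(prefix, item):
    covered = set().union(*(coverage[p] for p in prefix)) if prefix else set()
    return len(coverage[item] - covered)

lst = greedy_list(list(coverage), gain, 3)
```

In the reduction described above, each call to `marginal_gain` is replaced by a learned per-slot predictor, so the same greedy structure carries over to settings where the utility is only observed through training data.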
Sequence optimization, where the items in a list are ordered to maximize some reward, has many applications, such as web advertisement placement, search, and control libraries in robotics. Previous work in sequence optimization produces a static ordering that does not take any features of the item or the context of the problem into account. In this work, we propose a general approach to ordering the items within the sequence based on the context (e.g., perceptual information, environment description, and goals). We take a simple, efficient, reduction-based approach where the choice and order of the items is established by repeatedly learning simple classifiers or regressors for each "slot" in the sequence. Our approach leverages recent work on submodular function maximization to provide a formal regret reduction from submodular sequence optimization to simple cost-sensitive prediction. We apply our contextual sequence prediction algorithm to optimize control libraries and demonstrate results on two robotics problems: manipulator trajectory prediction and mobile robot path planning.
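A minimal sketch of the per-slot reduction at prediction time, under invented names and hand-set weights (the real system would train one cost-sensitive learner per slot): each slot scores every remaining item from context and item features and greedily takes the best.

```python
import numpy as np

def predict_sequence(context, slot_models, items):
    """Contextual sequence prediction sketch: slot_models[t] is a
    linear scorer (a hypothetical stand-in for a learned cost-sensitive
    predictor) for slot t. Each slot scores every not-yet-chosen item
    on concatenated (context, item) features and takes the argmax."""
    chosen = []
    for w in slot_models:
        scores = {it: float(w @ np.concatenate([context, feat]))
                  for it, feat in items.items() if it not in chosen}
        chosen.append(max(scores, key=scores.get))
    return chosen

# Toy setup: 2-dim context, 2-dim item features, 3 slots.
items = {"grasp_top":   np.array([1.0, 0.0]),
         "grasp_side":  np.array([0.0, 1.0]),
         "grasp_pinch": np.array([0.5, 0.5])}
slot_models = [np.array([1.0, 0.0, 2.0, 0.0]),  # slot 1 weights first item dim
               np.array([0.0, 1.0, 0.0, 2.0]),  # slot 2 weights second item dim
               np.array([0.0, 0.0, 1.0, 1.0])]
ctx = np.array([0.2, 0.8])
seq = predict_sequence(ctx, slot_models, items)
```

Unlike a static ordering, changing `ctx` or the item features can reorder the sequence, which is exactly the contextual behavior the abstract argues for.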