Zhe Huang

Software Engineer, Waymo

Skills

Python; C++; C; CUDA; HTML; Shell; LaTeX.
PyTorch; PyTorch3D; MXNet; TensorFlow; Keras; scikit-learn; MPI; OpenGL.
Slurm; Git; tmux; Vim; AWS; Jupyter; Docker; gcc; gdb; g++.

Education

Master of Science in Robotics

Aug 2020 - May 2022

Carnegie Mellon University

Advisor: Prof. Jeff Schneider; Fully funded research assistantship; GPA: 3.8/4.0; Research topics: DL & RL.

Bachelor of Science in Computer Sciences

Sep 2017 - May 2020

University of Wisconsin-Madison

On Dean’s List; GPA: 3.96/4.0; Graduated with Distinction.

Experiences

Software Engineer, LLM/VLM

Oct 2024 - Present

Waymo, Mountain View, California

Working on perception scene understanding.

Engineer, Deep Perception

Sep 2023 - Oct 2024

Bot Auto, Houston, Texas

Designed and implemented a single perception foundation model for end-to-end video 2D & 3D detection, LiDAR semantic segmentation, camera-LiDAR fusion, traffic light & vehicle light detection, tracking & prediction.
For efficiency, the model does not use a Bird’s-Eye-View (BEV) feature representation; it is fully sparse and can detect objects within a ±400 m range.
Conducted self-supervised pre-training of the unified multi-modal Transformer backbone using MAE and Voxel-MAE for image and LiDAR inputs, respectively. Adopted mixed-precision training to maintain numerical stability when handling 12-bit raw images. Used multi-stage fine-tuning to orchestrate multi-task learning.
Collaborated with experienced team members to implement custom TensorRT plugins and optimize TensorRT engine performance.

Research Engineer, Deep Perception

Jun 2022 - Sep 2023

TuSimple Inc., San Diego, California

Upgraded the monocular 3D object detection module (for vehicles and vulnerable road users) on the autonomous trucks.
Devised a novel attention mechanism that improved the yaw prediction accuracy of the existing monocular 3D pedestrian detection module by 20% while halving the model size and keeping inference time unchanged.
Designed and built a BEV 3D object detection and segmentation model with multi-modal inputs, including multi-camera raw images, perspective-view image segmentation results, LiDAR occlusion and occupancy grids, and LiDAR segmentation results.
Worked on the Transformer-based motion prediction project. Devised a novel multi-scale feature representation for map context and LiDAR voxels. Designed a velocity-based local attention mechanism that helps agents better understand their surrounding environment. Our model, MGTR, ranked 1st on the Waymo Open Dataset leaderboard in overall prediction mAP and soft mAP.

Machine Learning Engineer Intern, Deep Perception

May 2021 - Aug 2021

TuSimple Inc., San Diego, California

Initiated the Emergency Vehicle (EV) Detection project. Built the general codebase and baseline models.
Tackled the siren recognition task; designed pipelines for sound event detection and sound source localization.
Identified flaws in previous state-of-the-art models through Grad-CAM visualization and heatmap analysis.
With additional performance-boosting modules, the final model achieved 99% detection accuracy and a 10% gain in localization precision & recall over other state-of-the-art methods.

Computer Vision Research Intern

Sep 2018 - Jun 2019

SenseTime Group Ltd., Shenzhen & Hong Kong

Focused on a broad range of research topics, such as DL acceleration, data denoising, self-/weakly-supervised learning, AutoML, image restoration, and DNN black-box visualization & interpretation.
Improved the company’s PyTorch-based classification & detection frameworks by adding new features and by inspecting and refactoring the existing codebase.
Explored the limits of large-scale weakly-supervised training on the WebVision dataset (16 million samples), achieving 2nd place in the WebVision 2019 Challenge at CVPR ’19, hosted by the Computer Vision Laboratory, ETH Zurich.
Submitted two first-author papers: one to ACM Multimedia on image segmentation & super-resolution, and another to AAAI on NAS, detection & GCN.

Selected Projects

CARLA Self-driving via Reinforcement Learning

Auton Lab, The Robotics Institute, CMU

A research project sponsored by Argo AI and advised by Prof. Jeff Schneider.

Designed asynchronous multi-agent RL algorithms (distributed PPO & SAC) for autonomous driving in CARLA.
Sped up RL training by 40x while improving agent performance by 10%.
Achieved a total driving score of 80.64 on the CARLA Leaderboard evaluation routes, assuming access to a perfect detection model.

Apple E-Waste Recycling

Biorobotics Lab, The Robotics Institute, CMU

A research & engineering project sponsored by Apple Inc. Work in progress.

Designed the phone recognition pipeline and generated a photorealistic synthetic dataset from phone CAD models.
Improved top-1 phone classification accuracy by 6 points (from ~90% to 96%).

Learning-based Image Synthesis

16-726, The Robotics Institute, CMU

Published websites for 5 assignments and 1 final project:

Colorizing the Prokudin-Gorskii Photo Collection;
Gradient Domain Fusion;
When Cats meet GANs;
Neural Style Transfer;
GAN Photo Editing (best project finalist);
Image Generation via Independent Semantic Synthesis.

Learning for 3D Vision

16-889, The Robotics Institute, CMU

This course covers topics including explicit, implicit, and neural 3D representations, differentiable rendering, neural rendering, mesh and point cloud processing, radiance fields, multi-plane images, implicit surfaces, etc. Websites for 5 assignments and 1 final project will be published (2 completed so far):

Rendering Basics with PyTorch3D;
Single View to 3D;
Volume Rendering and Neural Radiance Fields;
Neural Surfaces;
Point Cloud Classification and Segmentation.

Parallelized Password Cracker via CUDA

CS759, College of Engineering, UW-Madison

A CUDA application that can crack both plain-text passwords & hashed passwords (e.g., MD5).

Human Pose Estimation via Synthetic Data

Department of Computer Science, UW-Madison

A research project advised by Prof. Yin Li.

Improved 2D & 3D human pose estimation in real/synthetic video using SMPL models.
Modeled real-world noise, variation & occlusion for better 3D joint predictions.

Large-scale YouTube Video Semantic Analysis

Department of Life Sciences Communication, UW-Madison

A research project funded by Prof. Kaiping Chen.

Applied large-scale data mining & deep learning to understand scientific videos on YouTube.
This work has been accepted at a top-tier conference in the communication & media area.

pytorl: PyTorch Toolbox for Reinforcement Learning

Personal Repository on GitHub

Implemented 4 DQN algorithms (DQN and its variants) & a distributed DQN learning algorithm similar to Google's Gorila, using a parameter-server architecture.

Publications

Multi-Granular Transformer for Motion Prediction with LiDAR

Yiqian Gan, Hao Xiao, Yizhe Zhao, Ethan Zhang, Zhe Huang, Xin Ye, Lingting Ge

In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA 2024)

Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

Adam Villaflor, Zhe Huang, Swapnil Pande, John Dolan, Jeff Schneider

In Proceedings of the 39th International Conference on Machine Learning (ICML 2022)

Distributed Reinforcement Learning for Autonomous Driving

Zhe Huang

Tech. Report, CMU-RI-TR-22-09, The Robotics Institute, CMU (Master's Thesis)

Gradual Network for Single Image De-raining

Zhe Huang*, Weijiang Yu*, Wayne Zhang, Litong Feng, Nong Xiao

In Proceedings of the 27th ACM International Conference on Multimedia (MM 2019) [Oral]