Below we show examples of camera-to-robot pose and robot joint state estimation from an uncalibrated camera, without any motor information (the robot is not even plugged in!).
TL;DR:
• Reformulation of robot state estimation and control to use vision instead of just motor encoders
• Link pose estimation made easy via per-link 3D-printed mounts with fiducial markers (sketched below)
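To make the per-link idea concrete, here is a minimal sketch of mapping detected fiducial markers to link poses. It assumes OpenCV's ArUco module (4.7+ API); the intrinsics, marker size, marker IDs, and marker-to-link transforms below are placeholders for illustration. In practice, the marker-to-link transform is fixed by the 3D-printed mount's geometry.

```python
import cv2
import numpy as np

# Placeholder intrinsics and marker size; in practice these come from
# calibration (or are refined jointly with the robot state).
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)
MARKER_SIZE = 0.03  # marker edge length in meters (assumed)

# Hypothetical mapping: marker id -> (link name, 4x4 marker-to-link transform).
# With a 3D-printed mount, this transform is known from the mount design.
MARKER_TO_LINK = {
    0: ("link_1", np.eye(4)),
    1: ("link_2", np.eye(4)),
}

def detect_link_poses(image):
    """Return {link_name: 4x4 camera-to-link pose} from a single RGB image."""
    detector = cv2.aruco.ArucoDetector(
        cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50),
        cv2.aruco.DetectorParameters(),
    )
    corners, ids, _ = detector.detectMarkers(image)
    link_poses = {}
    if ids is None:
        return link_poses

    # 3D marker corners in the marker frame (marker lies in the z = 0 plane).
    s = MARKER_SIZE / 2.0
    obj_pts = np.array(
        [[-s, s, 0], [s, s, 0], [s, -s, 0], [-s, -s, 0]], dtype=np.float32
    )

    for marker_corners, marker_id in zip(corners, ids.flatten()):
        if int(marker_id) not in MARKER_TO_LINK:
            continue
        ok, rvec, tvec = cv2.solvePnP(obj_pts, marker_corners.reshape(4, 2), K, dist)
        if not ok:
            continue
        # Camera-to-marker pose from the PnP solution.
        T_cam_marker = np.eye(4)
        T_cam_marker[:3, :3], _ = cv2.Rodrigues(rvec)
        T_cam_marker[:3, 3] = tvec.flatten()
        # Chain with the known mount geometry to get the link pose.
        link_name, T_marker_link = MARKER_TO_LINK[int(marker_id)]
        link_poses[link_name] = T_cam_marker @ T_marker_link
    return link_poses
```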
We present a simple yet effective approach to robot state estimation that eliminates the need for expensive, high-precision actuators. Conventional methods rely on forward kinematics (FK) and precise motor encoders, but these degrade in accuracy due to backlash, thermal drift, and mismatched simulation–sensor dynamics. To improve accuracy, especially on lower-cost hardware, we reformulate state estimation as a visual pose-estimation problem. Our method estimates the 6-DoF pose of each robot link from a single RGB image and performs a global joint optimization, optionally warm-started with encoder readings, to recover the full robot configuration. This formulation supports accurate 3D tasks such as joint-angle inference, camera pose estimation, and robot calibration directly from monocular images. To enable robust visual pose estimation of each link, we introduce the fiducial exoskeleton: a 3D-printed frame with fiducial markers, mounted on each link. The known marker-link relationship allows precise per-link pose estimation without sequential calibration or specialized equipment. Experiments on state-estimation and robotic control tasks demonstrate substantial accuracy gains over standard FK-based methods, even on low-cost hardware.
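As a rough illustration of the global joint optimization, the sketch below fits the joint angles of a toy planar arm to observed link positions with a least-squares solve, optionally warm-started from encoder readings. The link lengths, planar FK, and noise level are assumptions for illustration only; the actual method works with full 6-DoF link poses and the robot's real kinematic model.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical 3-joint planar arm used purely for illustration.
LINK_LENGTHS = [0.12, 0.10, 0.08]

def fk_link_positions(q):
    """Forward kinematics: 2D position of each link's end for joint angles q."""
    positions, angle, point = [], 0.0, np.zeros(2)
    for theta, length in zip(q, LINK_LENGTHS):
        angle += theta
        point = point + length * np.array([np.cos(angle), np.sin(angle)])
        positions.append(point.copy())
    return np.array(positions)

def estimate_joint_state(observed_link_positions, q_encoder=None):
    """Recover joint angles by fitting FK to visually observed link poses.

    observed_link_positions: (n_links, 2) array from per-link pose estimates.
    q_encoder: optional encoder readings, used only as a warm start.
    """
    q0 = np.zeros(len(LINK_LENGTHS)) if q_encoder is None else np.asarray(q_encoder, float)

    def residuals(q):
        # Mismatch between the kinematic model and the visual observations.
        return (fk_link_positions(q) - observed_link_positions).ravel()

    return least_squares(residuals, q0).x

# Example: simulate noisy observations from a "true" state and recover it.
q_true = np.array([0.3, -0.5, 0.8])
observations = fk_link_positions(q_true) + 0.002 * np.random.randn(3, 2)
q_est = estimate_joint_state(observations, q_encoder=[0.2, -0.4, 0.7])
print("estimated joints:", q_est)
```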
We compare state estimation (the estimated robot state rendered on top of the original video) across three approaches: the traditional method of using forward kinematics alone, our method using the raw encoder readings as input, and our method without using any internal motor readings.
The gray rendered robot is the estimated robot state overlaid on top of the real robot.
We compare the accuracy of moving the robot to a target kinematic state using encoders only, our method without delta correction, and our method with delta correction.
The gray rendered robot is the target robot state overlaid on top of the robot moving to (and ideally reaching) that state, for a sequence of target states.
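One plausible reading of delta correction is sketched below: after commanding the target, visually estimate the achieved joint state and shift the command by the remaining error. The helper names (send_motor_command, estimate_joints_visually) and the iteration count are hypothetical; the exact correction scheme used here may differ.

```python
import numpy as np

def move_with_delta_correction(target_q, send_motor_command,
                               estimate_joints_visually, iterations=3):
    """Iteratively correct the motor command using the visually estimated state.

    send_motor_command(q): commands the motors (placeholder interface).
    estimate_joints_visually(): returns the current joint state from the camera,
        e.g. via the optimization sketched above.
    """
    target_q = np.asarray(target_q, dtype=float)
    command = target_q.copy()
    for _ in range(iterations):
        send_motor_command(command)
        q_visual = estimate_joints_visually()
        # The achieved state typically misses the target due to backlash and
        # imprecise encoders; shift the command by the remaining error.
        delta = target_q - np.asarray(q_visual, dtype=float)
        command = command + delta
    return command
```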
We demonstrate a simple pipeline in which a grasp for a known object is computed with IK and then executed on the robot.
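A minimal sketch of what such a pipeline could look like, assuming hypothetical helpers solve_ik and move_to_joint_state and a known object-to-grasp offset; the camera-to-base transform would come from the same visual calibration described above.

```python
import numpy as np

def grasp_known_object(T_base_cam, T_cam_object, T_object_grasp,
                       solve_ik, move_to_joint_state):
    """Sketch of the demo pipeline: known object pose -> grasp pose -> IK -> execute.

    T_base_cam: camera pose in the robot base frame (from visual calibration).
    T_cam_object: pose of the known object in the camera frame.
    T_object_grasp: gripper pose relative to the object, known for this object.
    solve_ik / move_to_joint_state: placeholders for the robot-specific IK
        solver and low-level controller (e.g. using delta correction as above).
    """
    # Chain the transforms to express the desired gripper pose in the base frame.
    T_base_grasp = T_base_cam @ T_cam_object @ T_object_grasp
    # Solve IK for joint angles that reach the grasp pose, then execute.
    q_grasp = solve_ik(T_base_grasp)
    move_to_joint_state(q_grasp)
    return q_grasp
```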