Abstract
Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning (SQIL), which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, SQIL preserves decision fidelity under low-bit precision. We validate SQIL's generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5x speedup and 2.5x energy savings on an edge GPU with minimal accuracy loss. These results underline SQIL's potential for efficiently deploying large IL-based policy models on resource-limited devices.
Method & Analysis
Quantization compresses policy parameters to low-bit precision, reducing compute and memory. Given full-precision weights \( w^{\text{FP}} \), we apply symmetric uniform quantization:
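A standard form of this operation with bit-width \(b\) (our reconstruction; the paper's exact clipping and scaling conventions may differ) is
\[
w^{Q} \;=\; s \cdot \operatorname{clip}\!\left(\Big\lfloor \tfrac{w^{\text{FP}}}{s} \Big\rceil,\; -2^{\,b-1},\; 2^{\,b-1}-1\right),
\qquad
s \;=\; \frac{\max \left| w^{\text{FP}} \right|}{2^{\,b-1}-1}.
\]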
This yields an efficient quantized policy \( \pi^Q_\theta \), but it incurs performance loss at high-sensitivity states, as illustrated in the introductory figure.
Saliency-based Importance Score (SIS)
To detect such mission-critical states, SQIL computes a Saliency-based Importance Score (SIS):
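One plausible form of the score (a sketch on our part; the exact perturbation set and aggregation may differ in the paper) is
\[
\text{SIS}(s_t) \;=\; \max_{k}\; \left\| \pi^{\text{FP}}_\theta(s_t) \;-\; \pi^{\text{FP}}_\theta\big(\phi(s_t, k)\big) \right\|,
\]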
where \( \phi(s_t, k) \) introduces a local state perturbation at location \(k\). High SIS indicates strong sensitivity in decision-making.
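As a concrete illustration, the following PyTorch sketch scores an image state by masking local patches and measuring the resulting shift in the policy's action output; the function names, patch-based perturbation, fill value, and max-aggregation are our assumptions rather than the paper's exact implementation.

import torch

def sis_score(policy, obs, patch=16, fill=0.0):
    # Sketch of a saliency-based importance score: perturb local image
    # patches (phi(s_t, k)) and track the largest change in the policy's
    # action output. All details here are illustrative assumptions.
    with torch.no_grad():
        base = policy(obs)                                         # reference action for state s_t
        _, _, H, W = obs.shape                                     # obs: (1, C, H, W) image observation
        score = 0.0
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                perturbed = obs.clone()
                perturbed[:, :, y:y + patch, x:x + patch] = fill   # local perturbation at location k
                delta = torch.norm(policy(perturbed) - base)       # action discrepancy
                score = max(score, delta.item())                   # aggregate over locations
    return score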
Saliency-aware Quantized Imitation Learning (SQIL)
SQIL enhances imitation learning under quantization by combining two complementary components: quantization-aware training (QAT) and quantization-robust action distillation (QRD). QAT aligns the quantized policy with expert actions, while QRD further reduces quantization errors by matching the output distribution of the quantized policy to that of the full-precision (FP) policy.
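In practice, QAT is commonly implemented by training through a fake-quantized forward pass with a straight-through estimator (STE); the snippet below is a minimal sketch of that standard recipe, not necessarily the paper's exact quantizer.

import torch

def fake_quant(w, bits=4):
    # Symmetric uniform fake quantization with a straight-through estimator:
    # the forward pass uses quantized weights, while gradients flow to the
    # full-precision weights. This is a standard-recipe sketch.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp_min(1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()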
To identify which states deserve more focus during distillation, we use the saliency-based importance score (SIS). QRD applies a selective weighting coefficient \(\alpha_t\), assigning larger weights to mission-critical states—those with high SIS values.
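A sketch of the resulting weighted distillation objective (the exact composition with the QAT term may differ in the paper) is
\[
\mathcal{L}_{\text{QRD}} \;=\; \sum_{t} \alpha_t \, D\!\left(\pi^{Q}_\theta(\cdot \mid s_t) \,\big\|\, \pi^{\text{FP}}(\cdot \mid s_t)\right).
\]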
Here, \( D(\cdot || \cdot) \) is a discrepancy metric such as the L2 norm, and \(\alpha_t = \beta\) for the top 20% highest SIS states (\( \text{SIS}(s_t) > T \)), otherwise 1. This weighting emphasizes learning from states most affected by quantization. As shown in experiments, this mechanism significantly reduces action discrepancies and improves control fidelity under 4-bit quantization.
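Putting the pieces together, the following PyTorch sketch combines a behavior-cloning (QAT) term with the SIS-weighted distillation (QRD) term; the function names, the MSE discrepancy, and the way the two terms are summed are illustrative assumptions.

import torch
import torch.nn.functional as F

def sqil_loss(q_policy, fp_policy, states, expert_actions, sis, beta=2.0):
    # Sketch of the SQIL objective: imitation loss for the (fake-)quantized
    # policy plus selectively weighted distillation toward the frozen FP policy.
    q_actions = q_policy(states)                            # quantized-policy actions
    with torch.no_grad():
        fp_actions = fp_policy(states)                      # full-precision teacher actions

    bc_loss = F.mse_loss(q_actions, expert_actions)         # QAT: match expert demonstrations

    threshold = torch.quantile(sis, 0.8)                    # T: top-20% SIS cutoff
    alpha = torch.ones_like(sis)
    alpha[sis > threshold] = beta                           # alpha_t = beta for mission-critical states
    per_state = ((q_actions - fp_actions) ** 2).mean(dim=-1)
    qrd_loss = (alpha * per_state).mean()                   # QRD: weighted discrepancy D

    return bc_loss + qrd_loss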

Keyframe (KF) methods identify coarse transitions (e.g., "drawer open") using object state or vision-language cues. SIS instead measures control sensitivity, capturing finer interaction moments such as grasping or releasing, and improves performance under quantization (+1.1% over KF).

Saliency visualization shows how quantization distorts the policy's attention. While the FP policy attends to meaningful regions (e.g., robot arm, bowl, plate), PTQ often misfocuses on irrelevant areas. SQIL successfully restores the focus pattern of the FP policy, producing saliency maps that align closely with expert behavior.

This figure compares the action distributions of FP, PTQ, QAT, QRD, and SQIL in a self-driving task.
• PTQ deviates significantly from FP due to quantization noise.
• QAT aligns peaks with expert actions but overly sharpens the distribution.
• QRD maintains FP-like shape but may underrepresent expert intent.
• SQIL combines both benefits—preserving the FP structure while prioritizing expert-like decisions.
Experiments
Despite operating under 4-bit quantization, SQIL outperforms other quantized baselines and matches full-precision performance across real-world and cross-domain tasks, demonstrating its robustness and generality.
In autonomous driving, our 4-bit model achieves up to 3.7× lower latency and 3.1× energy savings.
In robot manipulation, INT4 provides 2.5× speedup and 4× memory reduction, enabling efficient inference on edge devices.
Rollout Videos
Real-World Robot Manipulation: Quantized OpenVLA
Simulation-based Robot Manipulation: Quantized OpenVLA on LIBERO Benchmark
Autonomous Driving: Quantized CILRS on NoCrash-dense Benchmark
BibTeX
@article{park2025saliency,
  title={Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control},
  author={Park, Seongmin and Kim, Hyungmin and Kim, Sangwoo and Jeon, Wonseok and Yang, Juyoung and Jeon, Byeongwook and Oh, Yoonseon and Choi, Jungwook},
  journal={arXiv preprint arXiv:2505.15304},
  year={2025}
}