AI Research Team From Princeton, Berkeley and ETH Zurich Introduce ‘RLQP’ To Accelerate Quadratic Optimization With Deep Reinforcement Learning (RL)


Quadratic programs (QPs) arise throughout finance, robotics, operations research, and other fields, including large-scale machine learning and embedded optimal control, where large numbers of related problems must be solved quickly. Solving a QP, however, can require thousands of solver iterations, and real-time control applications impose tight latency constraints on solvers.

The Alternating Direction Method of Multipliers (ADMM) is a high-performance first-order optimization method that underpins the widely used, state-of-the-art Operator-Splitting QP (OSQP) solver. To compute a step direction, ADMM performs a linear solve on a matrix derived from the QP's optimality conditions and then projects the step onto the constraint bounds.
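The solve-then-project loop can be illustrated with a minimal numpy sketch for a box-constrained QP (minimize ½xᵀPx + qᵀx subject to l ≤ x ≤ u). This is a simplification for illustration: OSQP's actual splitting also handles general linear constraints Ax and factorizes the linear system once rather than re-solving it each iteration.

```python
import numpy as np

def admm_box_qp(P, q, l, u, rho=1.0, iters=500):
    """Minimal ADMM sketch for: minimize 0.5*x'Px + q'x  s.t.  l <= x <= u.

    Illustrative only -- OSQP's formulation is more general and more
    carefully engineered (caching factorizations, termination checks, etc.).
    """
    n = q.size
    x = z = y = np.zeros(n)
    K = P + rho * np.eye(n)  # matrix from the optimality conditions
    for _ in range(iters):
        # step direction: linear solve on the optimality-condition matrix
        x = np.linalg.solve(K, rho * z - y - q)
        # project the step onto the constraint bounds
        z = np.clip(x + y / rho, l, u)
        # dual variable update
        y = y + rho * (x - z)
    return z

P = np.array([[4.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, 1.0])
sol = admm_box_qp(P, q, l=np.zeros(2), u=np.ones(2))  # optimum is at the lower bound
```

Here ρ is the ADMM penalty (step-size) parameter; its value strongly affects how many iterations the loop needs, which is exactly the knob RLQP learns to set.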

This method, however, has several hyperparameters that must be tuned with hand-crafted heuristics to regularize and control the optimization. Developing heuristics that solve QPs in fewer iterations is therefore critical.
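A typical hand-crafted heuristic of this kind rescales the penalty parameter ρ to keep the primal and dual residual norms balanced; the rule below is a sketch similar in spirit to OSQP's adaptive-ρ update, not its exact formula.

```python
import numpy as np

def residual_balancing_rho(rho, r_prim, r_dual, eps=1e-10):
    """Hand-tuned heuristic sketch (similar in spirit to OSQP's adaptive rho):
    scale rho by the square root of the primal/dual residual-norm ratio so
    neither residual dominates. RLQP replaces rules like this with a learned
    policy."""
    ratio = np.linalg.norm(r_prim) / (np.linalg.norm(r_dual) + eps)
    return rho * np.sqrt(ratio)
```

When the primal residual is large relative to the dual residual, ρ grows to enforce feasibility more aggressively, and vice versa; the fixed square-root form is precisely the kind of one-size-fits-all choice a learned policy can improve on.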

Researchers from the University of California, Berkeley, Princeton University, and ETH Zurich have proposed RLQP, an accelerated QP solver built on the Operator-Splitting QP (OSQP) solver. RLQP uses reinforcement learning (RL) to adapt the internal parameters of the ADMM algorithm between iterations and thereby minimize solve times.


Because the dimensions of this internal parameter differ between QPs, the team proposes two approaches to handle the variation:

  1. Learning a policy that adapts a single scalar internal parameter
  2. Learning a policy that adapts the individual coefficients of the internal parameter vector
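The two options can be sketched as follows. This is an illustrative interface, not the authors' code: the feature vectors and the per-constraint policy are stand-ins, but the key idea — applying one shared policy to each constraint's features so the approach works regardless of the QP's dimensions — is shown directly.

```python
import numpy as np

def adapt_rho_scalar(rho, policy_scale):
    # Option 1: the policy emits a single scale factor applied to the
    # whole internal parameter vector
    return rho * policy_scale

def adapt_rho_vector(features, per_constraint_policy):
    # Option 2: the same policy is evaluated on each constraint's feature
    # vector (e.g. residuals, current coefficient), producing one new
    # coefficient per constraint -- independent of the QP's dimensions
    return np.array([per_constraint_policy(f) for f in features])
```

Option 1 keeps the action space fixed and tiny; option 2 gives the solver per-constraint control at the cost of a harder learning problem.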

The RLQP policy can be trained across all QPs or specialized to a particular class of QPs.

To compute the policy, the researchers employed Twin Delayed DDPG (TD3), an extension of deep deterministic policy gradients. They find that for some QP classes, the solver converges faster when all coefficients of the internal parameter are adapted (i.e., using a vector) rather than a single scalar.

The team trained RLQP on a set of randomized QPs and compared its convergence rates against non-adaptive and heuristic adaptive policies on benchmark problem classes such as control, Huber fitting, support vector machines (SVMs), and portfolio optimization.


They evaluated RLQP in settings where:

  1. The same class of problems is used for both the train and test sets of QPs
  2. The train set contains a superset of the classes in the test set
  3. The train set contains a subset of the classes in the test set
  4. The train and test sets come from disjoint classes.

The results show that RLQP outperforms OSQP, reducing convergence times by up to 3x, and that it generalizes across a broad range of problems. In the future, the team plans to explore additional RL policy options, such as hierarchical policies, meta-learning, and online learning, to speed up convergence further.