Publications

* indicates equal contribution. See also my Google Scholar.


Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

W. Deng, J. Huang, K. Ozkara, Y. Li, C. Thrampoulidis, X. Li, Y. Park.

Published in ICML 2026 Second Workshop on Agents in the Wild: Safety, Security, and Beyond, 2026.