Publications
You can also find my articles on my Google Scholar profile.
Published in ICML 2026, Spotlight, 2026
Recommended citation: Wang, Z., Dang, X., Lee, J. D., & Lyu, K. (2026). The Power of Power Law: Asymmetry Enables Compositional Reasoning. arXiv preprint arXiv:2604.22951. https://arxiv.org/abs/2604.22951
Published in arXiv:2503.15477, 2025
Recommended citation: Razin, N., Wang, Z., Strauss, H., Wei, S., Lee, J. D., & Arora, S. (2025). What Makes a Reward Model a Good Teacher? An Optimization Perspective. arXiv preprint arXiv:2503.15477. https://arxiv.org/abs/2503.15477
Published in arXiv:2502.21212, 2025
Recommended citation: Huang, J., Wang, Z., & Lee, J. D. (2025). Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought. arXiv preprint arXiv:2502.21212. https://arxiv.org/pdf/2502.21212
Published in arXiv:2410.23438, 2024
Recommended citation: Ren, Y., Wang, Z., & Lee, J. D. (2024). Learning and Transferring Sparse Contextual Bigrams with Linear Transformers. arXiv preprint arXiv:2410.23438. https://arxiv.org/pdf/2410.23438
Published in arXiv:2406.06893, 2024
Recommended citation: Wang, Z., Wei, S., Hsu, D., & Lee, J. D. (2024). Transformers provably learn sparse token selection while fully-connected nets cannot. arXiv preprint arXiv:2406.06893. https://arxiv.org/pdf/2406.06893
Published in 2024
Recommended citation: Wang, Z., Wei, S., Hsu, D., & Lee, J. D. (2024). Transformers provably learn sparse token selection while fully-connected nets cannot. arXiv preprint arXiv:2406.06893.
Published in arXiv:2210.03294, 2022
Recommended citation: Zhu, X., Wang, Z., Wang, X., Zhou, M., & Ge, R. (2022). Understanding Edge-of-Stability Training Dynamics with a Minimalist Example. arXiv preprint arXiv:2210.03294. http://wangzx19.github.io/files/EoS_tex_ICLRv2.pdf