Jindong Jiang

Research Scientist · NVIDIA Research

Email: jindongj at nvidia.com · Google Scholar · GitHub

About

I am a Research Scientist at NVIDIA Research. Prior to joining NVIDIA, I was a PhD student at Rutgers University under the supervision of Prof. Sungjin Ahn. My research interests lie at the intersection of representation learning and visual reasoning, with a strong interest in developing novel architectures that can improve agents' visual reasoning capabilities.

The long-term objective of my research is to develop artificial intelligence agents capable of human-like reasoning. This involves designing systems that can uncover the latent structure of the physical world, predict future scenarios based on current states, infer causality or correlation between events, and engage in logical planning to accomplish goals.

Currently, I am focusing on Multimodal LLMs, Vision Foundation Models, and their synergy.

Interests: Multimodal LLMs, Image/Video Perception, Diffusion Models, State Space Models

News

May 2025: I joined the Learning and Perception Research Group as a Research Scientist.
April 2025: I successfully completed my Ph.D. defense at Rutgers University.
Mar 2025: Preprint of our new work on long-video Multimodal LLMs, Token-Efficient Long Video Understanding for Multimodal LLMs, is out on arXiv.
Sep 2024: Our Slot State Space Models is accepted to NeurIPS 2024.
June 2024: Preprint of our new work, Slot State Space Models, for modular sequence modeling, is out on arXiv.
June 2024: I joined the Learning and Perception Research Group as a Research Intern in the Summer of 2024.
Feb 2024: Our Layout-Agnostic Scene Text Image Synthesis with Diffusion Models is accepted to CVPR 2024.
Sep 2023: Our Object-Centric Slot Diffusion is accepted to NeurIPS 2023 as a 🌟Spotlight🌟 paper.
Apr 2023: Our Object-Centric Slot Diffusion for Unsupervised Compositional Generation is accepted to CVPR 2023 GCV Workshop.
Feb 2023: I will be joining as a Research Scientist Intern in the Summer of 2023.

Publications

Token-Efficient Long Video Understanding for Multimodal LLMs
- { Jindong Jiang, Xiuyu Li }, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
- Preprint 2025 [paper] [webpage]
Slot State Space Models
- Jindong Jiang, Fei Deng, Gautam Singh, Minseung Lee, Sungjin Ahn
- NeurIPS 2024 [paper] [webpage]
SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
- Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan
- CVPR 2024 [paper]
Object-Centric Slot Diffusion
- Jindong Jiang, Fei Deng, Gautam Singh, Sungjin Ahn
- NeurIPS 2023 🌟Spotlight🌟 (top 3% = 378/12343) [paper] [webpage]
Generative Neurosymbolic Machines
- Jindong Jiang and Sungjin Ahn
- NeurIPS 2020 🌟Spotlight🌟 (top 4% = 395/9454) [paper] [code]
Improving Generative Imagination in Object-Centric World Models
- Zhixuan Lin, Yi-Fu Wu, Skand Peri, Bofeng Fu, Jindong Jiang, Sungjin Ahn
- ICML 2020 [paper] [webpage]
SCALOR: Generative World Models with Scalable Object Representations
- { Jindong Jiang, Sepehr Janghorbani }, Gerard de Melo, Sungjin Ahn
- ICLR 2020 [paper] [webpage]
SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition
- { Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri }, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, Sungjin Ahn
- ICLR 2020 [paper] [webpage]

My Name

My Chinese name is "江锦东". It is pronounced "Jiang Jindong" in Mandarin and "Gong Kam Dong" in the Guangdong dialect (also known as Cantonese). As ChatGPT once cleverly put it, "江" stands for river, giving a nice flow to the name, while "锦" jazzes things up, representing brocade or ornamental cloth. Then we've got "东", which means east, adding a sense of direction. So, when you put them all together, you end up with a quirky name that doesn't quite translate directly into English.