Saining Xie

Saining Xie is a computer-vision researcher, co-founder and Chief Science Officer of AMI, and an Assistant Professor of Computer Science at the Courant Institute of Mathematical Sciences at New York University. He is the first author of ResNeXt, the senior author of ConvNeXt, and the co-creator with William Peebles of the Diffusion Transformer architecture that underlies OpenAI's Sora and several frontier video-generation systems. As of May 2026, he is on academic leave from NYU and serves as AMI's Chief Science Officer following the company's $1.03 billion seed round announced March 9, 2026.

Origins

Xie was born in China and completed his undergraduate degree in computer science at Shanghai Jiao Tong University as a member of the ACM Honors Class. He worked as a research assistant in the Computational Intelligence Lab under Hongtao Lu, which gave him his earliest exposure to computer-vision and machine-learning research.

He moved to the United States for doctoral study, joining the University of California, San Diego computer science department in 2013. He was advised by Zhuowen Tu, whose group was active in deep representation learning and computer vision. His ICCV 2015 paper "Holistically-Nested Edge Detection" with Tu received the Marr Prize Honorable Mention. He completed his PhD in 2018 with a dissertation titled "Deep Representation Learning with Induced Structural Priors."

Career

Xie joined Facebook AI Research (FAIR) in Menlo Park as a research scientist in 2018, immediately after his PhD. He spent four years in FAIR's computer-vision group alongside Kaiming He, Ross Girshick, and Piotr Dollár. The FAIR period produced several of his most-cited papers: ResNeXt (CVPR 2017, first author, work begun during a FAIR internship late in his PhD), Momentum Contrast (MoCo, CVPR 2020), Masked Autoencoders (MAE, CVPR 2022), and ConvNeXt (CVPR 2022, senior corresponding author).

In 2022, while still affiliated with FAIR, Xie began the research project that produced Diffusion Transformers (DiT). The work with William Peebles, then a Berkeley PhD student interning at Meta, replaced the U-Net backbone of latent diffusion models with a transformer operating on latent patches. The paper was rejected at CVPR 2023, then accepted as an oral at ICCV 2023, and became the architectural foundation for OpenAI's Sora and Stable Diffusion 3. Peebles later joined OpenAI and co-led the Sora team.
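The core move in DiT is mechanically simple: a latent feature map is cut into non-overlapping patches, and each patch becomes one transformer token. A minimal sketch of that "patchify" step, assuming a NumPy array standing in for a VAE latent (the function name, shapes, and patch size of 2 are illustrative, not the paper's code):

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Cut a (C, H, W) latent into a (num_patches, C * patch_size**2) token sequence."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    ph, pw = h // patch_size, w // patch_size
    # split H and W into (grid, within-patch) axes
    x = latent.reshape(c, ph, patch_size, pw, patch_size)
    # bring the two grid axes to the front, then flatten each patch into one token
    x = x.transpose(1, 3, 0, 2, 4).reshape(ph * pw, c * patch_size * patch_size)
    return x

latent = np.random.randn(4, 32, 32)     # a 4-channel 32x32 latent
tokens = patchify(latent, patch_size=2)
print(tokens.shape)                      # (256, 16)
```

From here a standard transformer operates on the 256 tokens; in DiT the diffusion timestep and class conditioning enter through adaptive layer-norm rather than through the token sequence itself.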

Xie joined New York University as an Assistant Professor of Computer Science at the Courant Institute in 2023. He established the VISIONx group and continued publishing on multimodal learning and visual representation. The 2024 Cambrian-1 paper, senior-authored with Yann LeCun and led by NYU PhD student Shengbang Tong, presented a vision-centric exploration of multimodal large language models at NeurIPS 2024. He taught the graduate computer-vision course (CSCI-GA 2271) and a course on Learning with Large Language and Vision Models (CSCI-GA 3033-102).

In parallel, Xie joined Google DeepMind's GenAI team as a research scientist on the Nano Banana image and vision research line. The DeepMind period produced "Image Generators are Generalist Vision Learners" (Vision Banana), showing that a generalist multimodal model could match specialist computer-vision systems on segmentation, depth, and surface-normal tasks.

In late 2025, Xie joined AMI as co-founder and Chief Science Officer alongside Yann LeCun and Alexandre LeBrun. The AMI $1.03 billion seed round at a $3.5 billion pre-money valuation was announced March 9, 2026. His NYU page records academic leave through Spring and Summer 2026, with the AMI role taking primary focus.

Notable contributions

Xie's published work spans visual representation learning, generative models, and multimodal systems.

  • ResNeXt (CVPR 2017). First-author paper with Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Introduced "cardinality" as a third design dimension beyond depth and width by aggregating residual transformations across parallel branches. ResNeXt became one of the standard backbone families for image classification and downstream vision tasks.
  • Momentum Contrast (MoCo) (CVPR 2020). With Kaiming He, Haoqi Fan, Yuxin Wu, and Ross Girshick. Built a dynamic dictionary of negative examples with a queue and a momentum encoder, enabling large-scale contrastive self-supervised learning that surpassed supervised pre-training on several downstream benchmarks.
  • Masked Autoencoders (MAE) (CVPR 2022). With Kaiming He, Xinlei Chen, Yanghao Li, Piotr Dollár, and Ross Girshick. Applied masked-image modeling to vision transformers, masking 75 percent of patches and reconstructing them with an asymmetric encoder-decoder. Became the standard self-supervised pre-training recipe for ViT-family models.
  • ConvNeXt (CVPR 2022). Senior corresponding author with first author Zhuang Liu plus Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, and Trevor Darrell. Revisited convolutional design with the lessons of vision transformers, producing a pure ConvNet that matches or exceeds Swin Transformer accuracy on ImageNet, COCO, and ADE20K.
  • Diffusion Transformers (DiT) (ICCV 2023). With William Peebles. Replaced the U-Net backbone of latent diffusion models with a pure transformer operating on latent patches, demonstrating clean compute-versus-quality scaling for image generation. Became the foundation for OpenAI's Sora and Stable Diffusion 3. Xie has cited the paper's CVPR 2023 rejection publicly as a reminder that frontier-impact judgment is hard even for expert reviewers.
  • Cambrian-1 (NeurIPS 2024). Senior author with Yann LeCun and a team of NYU students. Introduced the Spatial Vision Aggregator connector and the CV-Bench evaluation. Places the vision representation, rather than the language model, at the center of multimodal-system design.
  • Vision Banana (Google DeepMind, 2025). With Kaiming He and others on the Nano Banana team. An instruction-tuned generalist multimodal model that outperforms specialist computer-vision systems on segmentation, depth estimation, and surface-normal prediction.
  • Public commentary. Xie's research-philosophy talk "Research as an Infinite Game" (CVPR 2025) and his TUM AI lecture "The Multimodal Future: Why Visual Representation Still Matters" (March 2025) are widely circulated as statements of his research outlook.
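The momentum-encoder idea at the heart of MoCo fits in a few lines. A sketch under stated assumptions — flat NumPy vectors stand in for the encoders' parameters, and 0.999 is the paper's default coefficient:

```python
import numpy as np

M = 0.999  # momentum coefficient (MoCo paper default)

def momentum_update(key_w: np.ndarray, query_w: np.ndarray, m: float = M) -> np.ndarray:
    # The key encoder is an exponential moving average of the query encoder,
    # so keys already sitting in the dictionary queue stay consistent with
    # newly encoded ones even as the query encoder is updated by backprop.
    return m * key_w + (1.0 - m) * query_w

query_w = np.ones(4)   # stand-in for the gradient-updated query encoder
key_w = np.zeros(4)    # stand-in for the slowly moving key encoder
key_w = momentum_update(key_w, query_w)
print(key_w)           # [0.001 0.001 0.001 0.001]
```

The slow drift (m close to 1) is the design choice that lets the negative-example queue grow far beyond the current mini-batch without the keys going stale.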

Investments and boards

  • AMI: Co-founder and Chief Science Officer, late 2025 to present. AMI announced a $1.03 billion seed round at a $3.5 billion pre-money valuation on March 9, 2026.

No public personal angel-investor activity is on record in AI, semiconductors, datacenters, software, or energy as of May 2026.

Network

Xie's strongest professional relationships sit in the computer-vision lineage at FAIR. Kaiming He is his most-frequent senior collaborator from the FAIR years and a co-author on ResNeXt, MoCo, MAE, and Vision Banana. Ross Girshick, Piotr Dollár, Yanghao Li, Xinlei Chen, and Haoqi Fan are recurring FAIR co-authors. Trevor Darrell at UC Berkeley collaborated on ConvNeXt. His PhD advisor Zhuowen Tu at UCSD remains a long-running co-author.

The mentee relationship with William Peebles, the Berkeley PhD student who interned at Meta during the DiT project and now co-leads Sora at OpenAI, is the most-watched of his research-supervision relationships. Other former students include Zhuang Liu (assistant professor at Princeton), Sanghyun Woo (Google DeepMind), Eric Mintun (Sora team at OpenAI), and Shengbang Tong (NYU PhD student and Cambrian-1 lead author).

The AMI founding cohort places Xie alongside Yann LeCun, Alexandre LeBrun, Pascale Fung (chief research and innovation officer), Michael Rabbat (VP, world models), and Laurent Solly (COO). The LeCun connection through both NYU and AMI is the structural anchor of his current career position.

Position in the field

Xie occupies a distinctive position among computer-vision researchers of his cohort. The combination of a frontier-tier publication record across architecture (ResNeXt, ConvNeXt), self-supervised representation learning (MoCo, MAE), and generative modeling (DiT) is unusual; most peers concentrate on one or two of those three lines. DiT in particular has had measurable downstream impact on commercial video-generation systems including Sora.

The AMI Chief Science Officer role places him in a senior research-leadership position at the most-watched non-LLM frontier-research bet of 2026. Industry coverage has placed Xie's recruitment alongside Pascale Fung's as one of the principal validating data points for AMI's $4.5 billion post-money seed valuation, given that the company is pre-product. His research record provides credibility on visual representation and multimodal architecture, areas central to the JEPA world-model thesis AMI is pursuing.

The dual NYU and AMI structure mirrors the model Yann LeCun has long used at NYU alongside Meta and now AMI, and that several other senior researchers follow in pairing academic appointments with frontier-lab leadership. Whether Xie returns to full-time NYU teaching after the AMI launch period is an open question.

Outlook

Open questions over the next 6 to 18 months:

  • First AMI publications. Whether AMI will publish papers under Xie's name, and whether those papers extend the JEPA family or produce new architectural directions.
  • NYU return. Whether Xie returns to NYU teaching in Fall 2026 or extends academic leave further into the AMI launch period.
  • Vision in world models. AMI's stated focus on world models built on continuous sensory input aligns directly with Xie's research on visual representation. Whether the company produces a vision-first world model artifact in 2026, and at what scale, is a watchable signal.
  • DiT-family commercial deployment. As Sora and other diffusion-transformer systems scale through 2026 and 2027, whether Xie continues to make architectural contributions to the line, or pivots fully to AMI's world-model direction, will indicate the breadth of his research focus.
  • Mentee trajectory. William Peebles, Zhuang Liu, and other former students continue to occupy senior research-leadership roles at frontier labs. The pattern of former Xie collaborators in senior positions is itself a watchable indicator of his research influence.
  • Public-commentary positioning. Xie's X account and conference talks have consistently emphasized the role of visual representation in multimodal AI. Whether the AMI period sharpens that public position or shifts it toward LeCun-style critique of the LLM paradigm is an open question.
