Accepted at ICLR 2026

Conjuring Semantic Similarity

Department of Computer Science, University of California, Los Angeles

Visualizing Semantic Divergence

[Figure: the prompts "A man with a jersey is dunking the ball" and "The ball is being dunked by a man" induce generative image distributions P₀ and P₁, respectively. The Jeffreys divergence DJ(P₀ || P₁) between them is computed via Monte-Carlo sampling of reverse-time SDEs.]

Abstract

The semantic similarity between sample expressions measures the distance between their latent 'meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or 'conjure.' We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
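As a rough sketch of why this quantity can be estimated by Monte-Carlo sampling, assume (this is not spelled out on this page) that the two prompts y_0 and y_1 induce reverse-time SDEs sharing the same drift term f, the same diffusion coefficient g(t), and the same Gaussian prior at t = T, and differing only through the learned conditional score s_θ. The symbols f, g, s_θ, and y_i are notation introduced here for illustration. Under these assumptions, Girsanov's theorem expresses each KL term as an expectation over sampled trajectories:

```latex
\begin{align}
D_J(P_0 \,\|\, P_1) &= D_{\mathrm{KL}}(P_0 \,\|\, P_1) + D_{\mathrm{KL}}(P_1 \,\|\, P_0), \\
\mathrm{d}x_t &= \left[ f(x_t, t) - g(t)^2\, s_\theta(x_t, t \mid y_i) \right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}_t, \qquad i \in \{0, 1\}, \\
D_{\mathrm{KL}}(P_0 \,\|\, P_1) &= \frac{1}{2}\, \mathbb{E}_{x \sim P_0} \int_0^T g(t)^2 \left\| s_\theta(x_t, t \mid y_0) - s_\theta(x_t, t \mid y_1) \right\|^2 \mathrm{d}t .
\end{align}
```

The reverse direction swaps the roles of y_0 and y_1, so both expectations can be approximated by sampling trajectories under each prompt and averaging the squared score differences. The paper's exact parameterization and weighting may differ from this sketch.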

Meanings from Generative Processes, not Generative Models

A "well-trained" (text/image) generative model equipped with a random (diffusion/next-token) sampler yields a generative process that evidently attributes "no meaning" to its inputs (initial conditions).

Liu et al. (2024) view meanings in autoregressive LLMs via distributions over their generated continuations, or trajectories. In this work, we show that even image-generation models can capture textual meaning through distributions over their generative trajectories in image space.

Crucially, meanings are a function not only of a trained model (weights, architecture, etc.) but of the generative process run on it. Different generative processes on the same model can yield completely different meaning representations for the same inputs. This key feature of our definitions differentiates our work from prior art.

Splash Figure

We illustrate the process of conjuring semantic similarity between the textual expressions “Snow Leopard” and “Bengal Tiger”. We denoise each sequence of noisy images (middle row of both halves of the figure) with both prompts (top and bottom rows of both halves of the figure). Our method can be interpreted as taking the Euclidean distance between the resulting images in the two rows. The sequences of noisy images are obtained with either of the two text expressions (top and bottom halves of the figure), starting from a Gaussian prior (t = T). Observing the cells highlighted in red, we see that the model converts pictures of Snow Leopards into Bengal Tigers by changing their characteristic spotted coats into stripes and adding striped textures to the animal’s face (top half of the figure), and conversely converts Bengal Tigers into Snow Leopards by changing their characteristic stripes into spotted coats (bottom half of the figure). This makes the semantic differences between the two expressions interpretable through changes in the imagery they evoke.
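The interpretation in the caption above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: `toy_denoise` is a hypothetical stand-in for a text-conditioned diffusion model, each prompt is reduced to a made-up "mean image" vector, and the time-dependent weighting a real diffusion SDE would apply is omitted. The sketch samples noisy sequences under each prompt from a Gaussian prior, denoises every noisy image with both prompts, and averages the squared Euclidean distance between the two results.

```python
import math
import random


def toy_denoise(x_t, t, prompt_mean):
    """Hypothetical denoiser: blend the noisy input with a prompt-specific mean image.

    At t = 1 (pure noise) the prediction is the prompt mean; near t = 0 it is x_t itself.
    """
    return [t * m + (1.0 - t) * x for x, m in zip(x_t, prompt_mean)]


def sample_noisy_sequence(prompt_mean, steps, rng, sigma=0.1):
    """Euler-Maruyama rollout from a Gaussian prior (t = 1) toward t = 0 under one prompt."""
    dt = 1.0 / steps
    x = [rng.gauss(0.0, 1.0) for _ in prompt_mean]
    sequence = []
    for k in range(steps):
        t = 1.0 - k * dt  # smallest value is dt, so dividing by t below is safe
        sequence.append((t, list(x)))
        x0_hat = toy_denoise(x, t, prompt_mean)
        x = [
            xi + (hi - xi) * dt / t + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi, hi in zip(x, x0_hat)
        ]
    return sequence


def one_sided_distance(source_mean, other_mean, n_samples, steps, rng):
    """Average squared distance between the two denoised 'rows' along source-prompt sequences."""
    total = 0.0
    for _ in range(n_samples):
        for t, x_t in sample_noisy_sequence(source_mean, steps, rng):
            row_a = toy_denoise(x_t, t, source_mean)
            row_b = toy_denoise(x_t, t, other_mean)
            total += sum((a - b) ** 2 for a, b in zip(row_a, row_b))
    return total / (n_samples * steps)


def conjured_distance(mean_a, mean_b, n_samples=8, steps=20, seed=0):
    """Symmetric Monte-Carlo estimate: sequences are sampled under each prompt in turn."""
    rng = random.Random(seed)
    return (
        one_sided_distance(mean_a, mean_b, n_samples, steps, rng)
        + one_sided_distance(mean_b, mean_a, n_samples, steps, rng)
    )
```

In this toy, identical prompts give a distance of exactly zero and prompts with more distant mean vectors give a larger value. With a real model, the two denoised rows would come from the same network conditioned on different text.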

ICLR 2024

Textual Trajectories

"Meaning Representations from Trajectories in Autoregressive Models"

  • Meanings in autoregressive LLMs are distributions over generated textual continuations, or trajectories.
  • Semantic similarity of different textual expressions is measured by the divergence between these induced distributions over the same support set of trajectories.
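The two points above can be illustrated with a small sketch. Everything here is an assumption for illustration: `toy_log_prob` is a hypothetical word-overlap score standing in for an LLM's log-likelihood of a continuation given a prompt, and the symmetric KL divergence is just one possible choice of divergence, not necessarily the one used in the 2024 paper. The key idea shown is that both prompts are scored on the same support set of continuations.

```python
import math


def toy_log_prob(prompt, continuation):
    """Hypothetical stand-in for log p(continuation | prompt) from a language model.

    Continuations that share more words with the prompt receive a higher score.
    """
    prompt_words = set(prompt.lower().split())
    continuation_words = set(continuation.lower().split())
    return float(len(prompt_words & continuation_words))


def trajectory_distribution(prompt, support):
    """Normalize the toy scores into a distribution over the shared support set."""
    scores = [toy_log_prob(prompt, c) for c in support]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [math.exp(s - log_z) for s in scores]


def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p), written as a single sum over the support."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def textual_distance(prompt_a, prompt_b, support):
    """Divergence between the trajectory distributions induced by two prompts."""
    p = trajectory_distribution(prompt_a, support)
    q = trajectory_distribution(prompt_b, support)
    return symmetric_kl(p, q)
```

For example, with the support set `["the cat sleeps", "a dog barks", "stocks fell sharply"]`, the prompts "the cat" and "a dog" induce different distributions and therefore a positive distance, while a prompt compared with itself gives zero.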
ICLR 2026

Visual Trajectories

"Conjuring Semantic Similarity"

  • Meanings in text-conditioned image generation models are distributions over generative trajectories in image space.
  • "Visually-grounded" semantic similarity, in the case of diffusion models, is measured by the Jeffreys divergence between reverse-time SDEs induced by the different textual inputs.

BibTeX Citation

Conjuring Semantic Similarity
@inproceedings{liu2026conjuring,
  title={Conjuring Semantic Similarity},
  author={Liu, Tian Yu and Soatto, Stefano},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
Meaning Representations from Trajectories
@inproceedings{liu2024meaning,
  title={Meaning Representations from Trajectories in Autoregressive Models},
  author={Liu, Tian Yu and Trager, Matthew and Achille, Alessandro and Perera, Pramuditha and Zancato, Luca and Soatto, Stefano},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}