The semantic similarity between textual expressions measures the distance between their latent 'meanings'. These meanings are themselves typically represented by other textual expressions. We propose a novel approach whereby the semantic similarity between textual expressions is based not on other expressions they can be rephrased as, but on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare the generated images, or their distribution, evoked by a textual prompt. We therefore characterize the semantic similarity between two textual expressions simply as the distance between the image distributions they induce, or 'conjure.' We show that, by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this distance can be computed directly via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
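The Jeffreys divergence between the two reverse-time SDEs reduces to an expectation, over noisy trajectories, of the squared difference between the model's conditional noise predictions, which a Monte-Carlo average over sampled trajectories can estimate. A minimal sketch of this idea, assuming a hypothetical noise predictor `eps_model(x, t, prompt)` (the function name, Euler-Maruyama step size, and toy dimensionality are illustrative, not the paper's implementation):

```python
import numpy as np

def conjured_similarity(eps_model, prompt_a, prompt_b,
                        n_trajectories=32, n_steps=10, dim=4, seed=0):
    """Monte-Carlo estimate of the Jeffreys divergence between the
    reverse-time diffusion SDEs conditioned on two prompts.

    `eps_model(x, t, prompt)` stands in for a trained noise predictor;
    only its call signature is assumed here.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    total, count = 0.0, 0
    # Symmetrize (Jeffreys): sample trajectories driven by each prompt in turn.
    for drive in (prompt_a, prompt_b):
        for _ in range(n_trajectories):
            x = rng.standard_normal(dim)  # Gaussian prior at t = T
            for t in np.linspace(1.0, dt, n_steps):
                # Squared difference of the two conditional noise predictions.
                diff = eps_model(x, t, prompt_a) - eps_model(x, t, prompt_b)
                total += float(diff @ diff)
                count += 1
                # Crude Euler-Maruyama denoising step under the driving prompt.
                x = x - dt * eps_model(x, t, drive) \
                    + np.sqrt(dt) * rng.standard_normal(dim)
    return total / count
```

With a toy predictor, identical prompts yield a similarity of zero, while distinct prompts yield a strictly positive score; in practice `eps_model` would be a text-conditioned diffusion model operating on images.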
Liu et al. (2024) view meanings in autoregressive LLMs via distributions over their generated continuations, or trajectories. In this work, we show that even image-generation models can capture textual meaning through distributions over their generative trajectories in image space.
Crucially, meanings are not only a function of a trained model (weights, architecture, etc.), but also a function of the generative process: different generative processes on the same model can yield completely different meaning representations for the same inputs. This key feature of our definition differentiates our work from prior art.
We illustrate the process of conjuring semantic similarity between the textual expressions “Snow Leopard” and “Bengal Tiger”. Starting from a Gaussian prior (t = T), we obtain a sequence of noisy images with each of the two expressions (top and bottom halves of the Figure), and denoise each sequence (middle row of each half) with both prompts (top and bottom rows of each half). Our method can be interpreted as taking the Euclidean distance between the resulting images in the two rows. Observing the cells highlighted in red, we see that the model converts pictures of Snow Leopards into Bengal Tigers by changing their characteristic spotted coats into stripes and adding striped textures to the animal’s face (top half of the Figure), and conversely converts Bengal Tigers into Snow Leopards by changing their characteristic stripes into spotted coats (bottom half of the Figure). This makes their semantic differences interpretable via changes in the imagery they evoke.
"Meaning Representations from Trajectories in Autoregressive Models"
"Conjuring Semantic Similarity"
@inproceedings{liu2026conjuring,
title={Conjuring Semantic Similarity},
author={Liu, Tian Yu and Soatto, Stefano},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}
@inproceedings{liu2024meaning,
title={Meaning Representations from Trajectories in Autoregressive Models},
author={Liu, Tian Yu and Trager, Matthew and Achille, Alessandro and Perera, Pramuditha and Zancato, Luca and Soatto, Stefano},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}