Despite decades of research, we still do not know how the brain integrates the many features of an object into a coherent whole, or whether artificial systems perform similar binding. In our first study, we find that large self-supervised vision transformers spontaneously develop a low-dimensional “same-object” representation that predicts whether two image patches belong to the same object with over 90% accuracy. Removing this signal disrupts segmentation, showing that object binding naturally emerges in deep networks trained on natural images. In our second study, we develop a mechanistic model of attention and binding that captures core neurobiological phenomena such as selective focus and inhibition of return. Together, these results suggest a shared computational principle: binding arises from structured interactions between distributed representations.
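The same-object readout described above can be illustrated with a minimal sketch. Everything here is hypothetical: the embeddings are synthetic stand-ins for ViT patch features (patches of one object share a latent direction plus noise), and the probe is a simple thresholded cosine similarity rather than the readout used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for patch embeddings: each "object" has a latent
# center, and its patches are noisy copies of that center (an assumption
# for illustration; real embeddings would come from a pretrained model).
n_objects, patches_per_object, dim = 20, 10, 64
centers = rng.normal(size=(n_objects, dim))
emb = centers[:, None, :] + 0.3 * rng.normal(
    size=(n_objects, patches_per_object, dim)
)
emb = emb.reshape(-1, dim)
labels = np.repeat(np.arange(n_objects), patches_per_object)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Sample random patch pairs and score them.
n_pairs = 2000
idx = rng.integers(0, len(emb), size=(n_pairs, 2))
scores = np.array([cosine(emb[i], emb[j]) for i, j in idx])
same = labels[idx[:, 0]] == labels[idx[:, 1]]

# Best accuracy of a thresholded same-object classifier on these pairs.
best = max(np.mean((scores > t) == same) for t in np.linspace(-1, 1, 201))
print(f"pairwise same-object accuracy: {best:.2f}")
```

With this synthetic geometry the threshold separates same-object from different-object pairs almost perfectly, mirroring the kind of high pairwise accuracy the text reports for the learned representation.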