Separating Knowledge and Perception with Procedural Data
- VLMOCL

We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory -- an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real-world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within on NIGHTS visual similarity, outperforms by and on CUB200 and Flowers102 fine-grained classification, and is within on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an on COCO within of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
View on arXiv