% !TeX root = ../main.tex
\begin{abstract}
% Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification.
% However, they often require large amounts of data and can exhibit biases, such as center or size bias, that limit their robustness and generalizability.
% This paper introduces \schemename, a novel data augmentation operation that addresses these challenges by explicitly imposing invariances into the training data, which are otherwise part of the neural network architecture.
% \schemename is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds.
% This recombination step enables us to take fine-grained control over object position and size, as well as background selection.
% We demonstrate that using \schemename significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet, which translates to 7.3 p.p. on downstream tasks.
% Importantly, \schemename not only improves accuracy but also opens new ways to analyze model behavior and quantify biases.
% Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that using \schemename during training substantially reduces these biases.
% In summary, \schemename provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models.
% Our code and dataset are publicly available at \code{}.
Large-scale image classification datasets exhibit strong compositional biases: objects tend to be centered, appear at characteristic scales, and co-occur with class-specific context.
% Models can exploit these biases to achieve high in-distribution accuracy, yet remain brittle under distribution shifts.
By exploiting such biases, models attain high in-distribution accuracy but remain fragile under distribution shifts.
To address this issue, we introduce \schemename, a controlled composition augmentation scheme that factorizes each training image into a \emph{foreground object} and a \emph{background} and recombines them to explicitly manipulate object position, object scale, and background identity.
\schemename uses off-the-shelf segmentation and inpainting models to (i) extract the foreground and synthesize a neutral background, and (ii) paste the foreground onto diverse neutral backgrounds before applying standard strong augmentation policies.
Compared to conventional augmentations and content-mixing methods, our factorization provides direct control knobs that break foreground-background correlations. % while preserving the label.
Across 10 architectures, \schemename improves ImageNet top-1 accuracy by up to 6 percentage points (p.p.) and yields gains of up to 7.3 p.p. on fine-grained downstream datasets.
Moreover, the same control knobs enable targeted diagnostic tests: we quantify background reliance, foreground focus, center bias, and size bias via controlled background swaps and position/scale sweeps, and show that training with \schemename substantially reduces these shortcut behaviors and significantly increases accuracy on standard distribution-shift benchmarks by up to $19$ p.p.
% Moreover, the same control knobs enable targeted diagnostic tests: we quantify background reliance, foreground focus, center bias, and size bias via controlled background swaps and position/scale sweeps, and show that training with \schemename substantially reduces these shortcut behaviors and significantly increases accuracy on standard distribution-shift benchmarks like ImageNet-A/-C/-R by up to $19$ p.p.
Our code and dataset are publicly available at \code{}.
\keywords{Data Augmentation \and Vision Transformer \and Robustness}
\end{abstract}