% !TeX root = ../main.tex

\begin{abstract}

    Large-scale image classification datasets exhibit strong compositional biases: objects tend to be centered, appear at characteristic scales, and co-occur with class-specific context.
    By exploiting such biases, models attain high in-distribution accuracy but remain fragile under distribution shifts.
    To address this issue, we introduce \schemename, a controlled-composition augmentation scheme that factorizes each training image into a \emph{foreground object} and a \emph{background}, then recombines them to explicitly manipulate object position, object scale, and background identity.
    \schemename uses off-the-shelf segmentation and inpainting models to (i) extract the foreground and synthesize a neutral background, and (ii) paste the foreground onto diverse neutral backgrounds before applying standard strong augmentation policies.
    Compared to conventional augmentations and content-mixing methods, our factorization provides direct control knobs that break foreground--background correlations. % while preserving the label.
    Across 10 architectures, \schemename improves ImageNet top-1 accuracy by up to 6 percentage points (p.p.) and yields gains of up to 7.3 p.p. on fine-grained downstream datasets.
    Moreover, the same control knobs enable targeted diagnostic tests: we quantify background reliance, foreground focus, center bias, and size bias via controlled background swaps and position/scale sweeps.
    Training with \schemename substantially reduces these shortcut behaviors and increases accuracy on standard distribution-shift benchmarks by up to 19 p.p.
    Our code and dataset are publicly available at \code{<url>}.

    \keywords{Data Augmentation \and Vision Transformer \and Robustness}
\end{abstract}