\section{Related Work}
\label{sec:related_work}

\paragraph{Data Augmentation for Image Classification}
Data augmentation is a crucial technique for improving the performance and generalization of image classification models.
Traditional augmentation strategies rely on simple geometric or color-space transformations like cropping, flipping, roatation, blurring, color jittering, or random erasing \cite{Zhong2017} to increase the diversity of the training data without changing their semantic meaning.
With the advent of Transformers, new data augmentation operations like PatchDropout \cite{Liu2022d} have been proposed.
Other transformations like Mixup \cite{Zhang2018a}, CutMix \cite{Yun2019}, or random cropping and patching \cite{Takahashi2018} combine multiple input images.
These simple transformations are usually bundled to form more complex augmentation policies like AutoAugment \cite{Cubuk2018} and RandAugment \cite{Cubuk2019}, which automatically search for optimal augmentation policies or 3-augment \cite{Touvron2022} which is optimized to train a ViT.
For a general overview of data augmentation techniques for image classification, we refer to \cite{Shorten2019, Xu2023d}.

We build upon these general augmentation techniques by introducing a novel approach to explicitly separate and recombine foregrounds and backgrounds for image classification.
Our approach is used in tandem with traditional data augmentation techniques to improve model performance and reduce biases.

\paragraph{Copy-Paste Augmentation}
The copy-paste augmentation \cite{Ghiasi2020}, which is used for object detection \cite{Shermaine2025,Ghiasi2020} and instance segmentation \cite{Werman2021,Ling2022}, involves copying segmented objects from one image and pasting them onto another.
While typically human-annotated segmentation masks are used to extract the foreground objects, other foregound sources have been explored, like 3D models \cite{Hinterstoisser2019} and pretrained object-detection models for use on objects on white background \cite{Dwibedi2017} or synthetic images \cite{Ge2023}.
DeePaste \cite{Werman2021} focuses on using inpainting for a more seamless integration of the pasted object.

Unlike these methods, \name focuses on image classification.
While for detection and segmentation, objects are pasted onto another image (with a different foreground) or on available or rendered background images of the target scene, we extract foreground objects and fill in the resulting holes in the background in a semantically neutral way.
This way, we can recombine any foreground object with a large variety of neutral backgrounds from natural images, enabling a controlled and diverse manipulation of image composition.


\paragraph{Model robustness evaluation}
Evaluating model robustness to various image variations is critical for understanding and improving model generalization.
Datasets like ImageNet-C \cite{Hendrycks2019} and ImageNet-P \cite{Hendrycks2019} introduce common corruptions and perturbations.
ImageNet-E \cite{Li2023e} evaluates model robustness against a collection of distribution shifts.
Other datasets, such as ImageNet-D \cite{Zhang2024f}, focus on varying background, texture, and material, but rely on synthetic data.
Stylized ImageNet \cite{Geirhos2018} investigates the impact of texture changes.
ImageNet-9 \cite{Xiao2020} explores background variations using segmented images, but the backgrounds are often artificial.

In contrast to these existing datasets, which are used only for evaluation, \name provides fine-grained control over foreground object placement, size, and background selection, enabling a precise and comprehensive analysis of specific model biases within the context of a large-scale, real-world image distribution.
As \name also provides controllable training set generation, it goes beyond simply measuring robustness to actively improving it through training.