cvpr submission

This commit is contained in:
Tobias Christian Nauen
2026-02-24 12:01:26 +01:00
parent 5c08f9d31a
commit e7c0b531d6
59 changed files with 7238 additions and 4939 deletions

\section{Related Work}
\label{sec:related_work}
\textbf{Data Augmentation for Image Classification.}
Data augmentation is a crucial technique for improving model performance and generalization.
Traditional augmentation strategies rely on simple geometric or color-space transformations like cropping, flipping, rotation, blurring, color jittering, or random erasing~\cite{Zhong2020} to increase training data diversity without changing the semantic meaning.
With the advent of ViTs~\cite{Dosovitskiy2021}, new data augmentation operations like PatchDropout~\cite{Liu2022d} have been proposed.
Other transformations like MixUp~\cite{Zhang2018a}, CutMix~\cite{Yun2019}, or random cropping and patching~\cite{Takahashi2018} combine multiple input images.
These simple transformations are usually bundled to form more complex augmentation policies like AutoAugment~\cite{Cubuk2019} and RandAugment~\cite{Cubuk2020}, or 3-Augment~\cite{Touvron2022}, which is optimized to train a ViT.
For a general overview of data augmentation for image classification, we refer to Shorten et al.~\cite{Shorten2019} and Xu et al.~\cite{Xu2023d}.
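To make the mixing transformations above concrete, MixUp forms a convex combination of two images and their labels. The following is a minimal NumPy sketch (not the original authors' implementation); in MixUp, the coefficient $\lambda$ is drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two images and their (one-hot) labels as in MixUp.

    x1, x2: float image arrays of equal shape; y1, y2: label vectors.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2     # pixel-wise convex combination
    y = lam * y1 + (1.0 - lam) * y2     # soft label with the same weights
    return x, y
```

CutMix replaces the global blend with a rectangular region cut from one image and pasted into the other, weighting the labels by the region's area fraction.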
We advance these general augmentations by introducing \schemename to explicitly separate objects and backgrounds for image classification, allowing us to move beyond image compositions from the dataset.
Thus, \schemename unlocks performance improvements and bias reduction not possible with traditional data augmentation.
\textbf{Copy-Paste Augmentation.}
The copy-paste augmentation~\cite{Ghiasi2021}, which has so far been used only for object detection~\cite{Shermaine2025,Ghiasi2021} and instance segmentation~\cite{Werman2022,Ling2022}, involves copying segmented objects from one image and pasting them onto another.
While typically human-annotated segmentation masks are used to extract the foreground objects, other foreground sources have been explored, like 3D models~\cite{Hinterstoisser2019} and pretrained object-detection models applied to objects on a white background~\cite{Dwibedi2017} or to synthetic images~\cite{Ge2023}.
Kang et al.~\cite{Kang2022} apply copy-paste as an alternative to CutMix in image classification, but they do not vary the size or position of the foregrounds and use unmodified dataset images (with their objects still present) as backgrounds.
Unlike prior copy-paste methods that overlay objects, \schemename extracts foregrounds and replaces their backgrounds with semantically neutral fills, thereby preserving label integrity while enabling controlled and diverse recombination.
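The compositing step shared by these copy-paste variants can be sketched as follows. This is a minimal NumPy illustration with hypothetical names, assuming a binary or soft segmentation mask; real pipelines additionally handle rescaling, boundary blending, and batching:

```python
import numpy as np

def paste(foreground, mask, background, top, left):
    """Alpha-composite a segmented object patch onto a background image.

    foreground: float array (h, w, C); mask: (h, w) with values in [0, 1];
    background: float array (H, W, C); top/left: paste position of the patch.
    """
    out = background.copy()             # keep the original background intact
    h, w = foreground.shape[:2]
    m = mask[..., None]                 # broadcast mask over color channels
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = m * foreground + (1.0 - m) * region
    return out
```

Varying `top`, `left`, and the patch size before pasting gives explicit control over object position and scale in the recombined image.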
\textbf{Generative Data Augmentation.}
Recent work uses generative models to synthesize additional training images, e.g., via GANs or diffusion models driven by text prompts or attribute labels~\cite{Lu2022,Trabucco2024,Islam2024}.
Concurrently with our work, AGA~\cite{Rahat2025} combines LLMs, diffusion models, and segmentation to generate fully synthetic backgrounds from text prompts, onto which real foregrounds are pasted.
These synthetic images are appended to the original training set.
\begin{figure*}[ht!]
\centering
\includegraphics[width=.9\textwidth]{img/fig-2.pdf}
\caption{Overview of \schemename. The data creation consists of two stages: Segmentation (offline, \Cref{sec:segmentation}), where we separate the foreground objects from the background and fill in the resulting holes; and Recombination (online, \Cref{sec:recombination}), where we combine the foreground objects with different backgrounds to create new samples. After recombination, we apply strong, commonly used augmentation policies.}
\label{fig:method}
\end{figure*}
While AGA focuses on increasing diversity via prompt-driven background synthesis, \schemename uses generative models differently:
We apply inpainting only to locally neutralize the original object region, yielding semi-synthetic backgrounds that preserve the global layout, style, and characteristics of real dataset images.
Fully synthetic, prompt-generated backgrounds are likely to shift the effective background distribution, especially when prompts or generators are biased~\cite{Zverev2025,Shumailov2024,Adamkiewicz2026}.
We then do online recombination of real foregrounds with these neutralized, dataset-consistent backgrounds under explicit control of object position and scale.
Thus, \schemename acts as a dynamic, large-scale augmentation method, while AGA statically expands small-scale training sets with synthetic data.
\textbf{Model Robustness Evaluation.}
Evaluating model robustness to various image variations is critical for understanding and improving model generalization.
Datasets like ImageNet-A~\cite{Hendrycks2021}, ImageNet-C~\cite{Hendrycks2019} and ImageNet-P~\cite{Hendrycks2019} introduce common corruptions and perturbations.
ImageNet-E~\cite{Li2023e} evaluates model robustness against a collection of distribution shifts.
Other datasets, such as ImageNet-D~\cite{Zhang2024f} and ImageNet-R~\cite{Hendrycks2021a}, focus on varying background, texture, and material, but rely on synthetic data.
Stylized ImageNet~\cite{Geirhos2019} investigates the impact of texture changes.
ImageNet-9~\cite{Xiao2020} explores background variations using segmented images for a 9-class subset of ImageNet with artificial backgrounds.
In contrast to these existing datasets, which are used only for evaluation, \schemename provides fine-grained control over foreground object placement, size, and background selection. This enables a precise and comprehensive analysis of specific model biases within a large-scale, real-world image distribution.
As \schemename also provides controllable training data generation, it goes beyond simply measuring robustness to actively improving it through training.