
% !TeX root = ../main.tex

\section{Introduction}
\label{sec:intro}

% \begin{itemize}
%   \item General Intro Image classification
%   \item ImageNet
%   \item CNNs $\to$ Transformers
%   \item Traditional Data Augmentation: CNNs
%   \item Problems with that: Other model properties of Transformers
%   \item Our approach: Recombining ImageNet foregrounds and backgrounds
% \end{itemize}

\begin{figure}
    \centering
    \includegraphics[width=.5\columnwidth]{img/fig-1.pdf}
    \caption{Comparison of traditional image classification training and training when using \schemename. \schemename recombines foreground objects with different backgrounds each epoch, thus creating a more diverse training set. We still apply strong traditional data augmentation afterwards.}
    \label{fig:fig-1}
\end{figure}

Image classification, a fundamental task in computer vision (CV), involves assigning labels to images from a set of categories.
It underpins a wide range of applications, such as medical diagnosis~\cite{Sanderson2022,Vezakis2024}, autonomous driving~\cite{Wang2023a}, and object recognition~\cite{Carion2020,He2017,Girshick2014}, and it facilitates large-scale pretraining~\cite{Dosovitskiy2021,Liu2021,Touvron2021b} and progress evaluation in CV~\cite{Khan2022, Rangel2024}.
% Furthermore, image classification is used for large-scale pretraining of vision models~\cite{Dosovitskiy2021,Liu2021,Touvron2021b} and to judge the progress of the field of CV \cite{Khan2022, Rangel2024}.
The advent of large-scale datasets, particularly ImageNet~\cite{Deng2009}, served as a catalyst for the rise of large-scale CV models~\cite{Krizhevsky2012, He2016}, and ImageNet has remained the most important CV benchmark for more than a decade~\cite{Krizhevsky2012,Touvron2022, Wortsman2022, He2016}.
% containing millions of labeled images across thousands of categories, has been instrumental in driving significant progress in this field.
% ImageNet served as a catalyst for the rise of large-scale CV models~\cite{Krizhevsky2012, He2016} and remains the most important CV benchmark for more than a decade \cite{Krizhevsky2012,Touvron2022, Wortsman2022, He2016}.
% It is used to train and evaluate the best models in the field.
While convolutional neural networks (CNNs) have traditionally been the go-to architecture in CV, Transformers~\cite{Vaswani2017}, particularly the Vision Transformer (ViT)~\cite{Dosovitskiy2021}, have emerged as a powerful alternative, demonstrating
% These attention-based models have demonstrated
superior performance in various vision tasks, including image classification~\cite{Wortsman2022,Yu2022,Carion2020,Zong2023,Wang2023b}.


Data augmentation is a key technique for training image classification models.
% A key technique for training image classification models, especially with limited data, is data augmentation.
Traditional augmentation methods, such as cropping, flipping, or color shifts, are commonly employed to increase data diversity~\cite{Xu2023d, Shorten2019}, but remain bound to existing image compositions.
While these preserve the images' semantic meaning, their ability to teach spatial invariances is limited.
% the diversity of the training data and improve the model's performance~\cite{Xu2023d, Shorten2019}.
% These basic transformations, originally designed for CNNs, change the input images in a way that preserves their semantic meaning~\cite{Alomar2023}, but are limited to existing image compositions.
While combinations of these data augmentations are still used today, they were originally proposed for CNNs.
However, the architectural differences between CNNs and Transformers suggest that the latter might benefit from different data augmentation strategies.
In particular, the self-attention mechanism, unlike convolution, is not translation equivariant~\cite{RojasGomez2023,Ding2023a}, meaning that the model has no built-in prior on the spatial relationships between pixels and must learn them from data.
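To make the term concrete, one common formalization (the notation here is illustrative and not used elsewhere in this paper) is the following: writing $T_\delta$ for a spatial shift by an offset $\delta$, a layer $g$ is translation equivariant if
\[
    g(T_\delta x) = T_\delta\, g(x) \quad \text{for all inputs } x \text{ and all shifts } \delta .
\]
Convolutions satisfy this property up to boundary and striding effects, so a shifted object yields correspondingly shifted features, whereas a model without this property has to observe objects at many positions during training to behave consistently.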
% This creates the need for novel data augmentation strategies tailored to the Transformer architecture.
% This fact opens a new design space for data augmentation strategies to help Transformers understand the basic invariances of image classification.
% Note that these traditional data augmentations are also limited by existing image compositions.

Recognizing that Transformers need to learn spatial relationships directly from data,
% and in general are usually trained on larger datasets~\cite{Kolesnikov2020},
we propose \schemename, a data augmentation method that makes these relationships explicit by recombining foreground objects with diverse backgrounds.
Thus, \schemename goes beyond existing image compositions and encodes desired invariances directly into the training data (see \Cref{fig:fig-1}).
% Inspired by this inductive bias of CNNs, that is not inherent to ViTs, we propose \schemename, a novel data augmentation scheme for image classification which makes the translation equivariance of CNNs explicit in the training data by recombining foreground objects at varying positions with different backgrounds.
% In this paper, we address the challenge of effectively training Transformers for image classification by proposing \schemename, a novel data augmentation scheme for image classification, which combines foreground objects with different backgrounds.
% Applying \schemename to ImageNet gives rise to \name, a novel dataset that enables this data augmentation with fine-grained control over the image composition.
Applying \schemename to a dataset like ImageNet is a two-step process:
(1)~We separate the foreground objects in ImageNet from their backgrounds using an open-world object detector~\cite{Ren2024} and fill in the background in a neutral way using an object removal model~\cite{Sun2025,Suvorov2022}.
(2)~We then recombine any foreground object with any background on the fly, creating a highly diverse training set.
% During recombination, we can control important parameters, like the size and position of the foreground object, to help the model learn the spatial invariances necessary for image classification.
By exploiting the control over foreground size and position during recombination, \schemename explicitly teaches spatial invariances that image classification models typically must learn implicitly.
We show that using \schemename in addition to strong traditional data augmentation increases the accuracy of Transformers by up to 4.5 p.p. on ImageNet and reduces the error rate by up to 7.3 p.p. on downstream tasks.
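
As a minimal sketch of the recombination in step~(2), and assuming plain alpha compositing (the symbols below are illustrative assumptions, not notation taken from the rest of this paper), let $f$ be a foreground object with soft mask $m \in [0,1]^{H \times W}$, let $b$ be a background, and let $T_{s,p}$ be an operator that rescales its input by a factor $s$ and places it at position $p$ on a canvas of the size of $b$, padding with zeros elsewhere.
A recombined training image could then be written as
\[
    \tilde{x} = T_{s,p}(m) \odot T_{s,p}(f) + \bigl(1 - T_{s,p}(m)\bigr) \odot b ,
\]
where $\odot$ denotes element-wise multiplication.
In this sketch, $\tilde{x}$ keeps the label of the foreground $f$, and drawing $b$, $s$, and $p$ anew in every epoch is what exposes the model to the same object on varying backgrounds and at varying sizes and positions.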

Beyond training, \schemename becomes a diagnostic tool for analyzing model behavior and biases when used during evaluation.
We use our control over the image distribution to measure a model's background robustness (by varying the choice of background), foreground focus (by leveraging our knowledge about the placement of the foreground object), center bias (by controlling position), and size bias (by controlling size).
These analyses provide insights into model behavior and biases, which are crucial for model deployment and future robustness optimizations.
We show that training using \schemename significantly reduces all of these biases.
We make our code for \schemename and the output of \schemename's segmentation phase on ImageNet publicly available\footnote{Link will go here.} to facilitate further research.

\subsection*{Contributions}
\begin{itemize}
    \item We propose \schemename, a novel data augmentation scheme that recombines objects and backgrounds. \schemename allows us to move beyond the (possibly biased) image compositions in the dataset while preserving label integrity.
    \item We show that training a standard ViT using \schemename improves accuracy by up to 4.5 p.p. on ImageNet-1k and by up to 7.3 p.p. on downstream tasks.
    \item We propose novel \schemename-based metrics to analyze and quantify fine-grained biases of trained models: Background Robustness, Foreground Focus, Center Bias, and Size Bias. We show that \schemename significantly reduces these biases by encoding invariances that benefit ViTs into the training data.
\end{itemize}