% Tobias Christian Nauen, AAAI Version, 2026-02-24

% !TeX root = ../main.tex
\section{Introduction}
\label{sec:intro}
\begin{figure}
\centering
\includegraphics[width=\columnwidth]{img/fig-1.pdf}
\caption{Comparison of \name and ImageNet. \name recombines foreground objects with different backgrounds each epoch, thus creating a more diverse training set. We still apply traditional data augmentation afterwards.}
\label{fig:fig-1}
\end{figure}
Image classification, a fundamental task in computer vision (CV), involves assigning labels to images from a set of categories.
It underpins a wide range of applications, such as medical diagnosis~\cite{Sanderson2022,Vezakis2024}, autonomous driving~\cite{Wang2022b}, and object recognition~\cite{Carion2020,He2017,Girshick2013}, and it facilitates large-scale pretraining~\cite{Dosovitskiy2021,Liu2021,Touvron2021b} as well as progress evaluation in CV~\cite{Khan2022, Rangel2024}.
The advent of large-scale datasets, particularly ImageNet~\cite{Deng2009}, catalyzed the rise of large-scale CV models~\cite{Krizhevsky2012, He2016}, and ImageNet has remained the most important CV benchmark for more than a decade~\cite{Krizhevsky2012,Touvron2022, Wortsman2022, He2016}.
While convolutional neural networks (CNNs) have traditionally been the go-to architecture in CV, Transformers~\cite{Vaswani2017}, particularly the Vision Transformer (ViT)~\cite{Dosovitskiy2021}, have emerged as a powerful alternative, demonstrating superior performance in various vision tasks, including image classification~\cite{Wortsman2022,Yu2022,Carion2020,Zong2022,Wang2022a}.
Data augmentation is a key technique for training image classification models.
Traditional methods, such as random cropping, flipping, and color changes, are commonly employed to increase the diversity of the training data and improve model performance~\cite{Xu2023d, Shorten2019}.
These basic transformations change the input images in a way that preserves their semantic meaning~\cite{Alomar2023}, but they are limited to the image compositions already present in the dataset.
While combinations of these augmentations are still used today, they were originally proposed to benefit CNNs.
However, the architectural differences between CNNs and Transformers suggest that the latter might benefit from different augmentation strategies.
In particular, the Transformer's self-attention mechanism, unlike the convolutions in a CNN, is not translation equivariant~\cite{RojasGomez2023,Ding2023a}, so the model does not capture spatial relationships between pixels by design.
Recognizing that Transformers need to learn these spatial relationships from data, we propose \schemename, a novel data augmentation scheme that recombines foreground objects with different backgrounds.
Thus, \schemename goes beyond existing image compositions and encodes desired invariances directly into the training data (see \Cref{fig:fig-1}).
Applying \schemename to ImageNet gives rise to \name, a novel dataset that enables this data augmentation with fine-grained control over the image composition.
We separate the foreground objects in ImageNet from their backgrounds using an open-world object detector~\cite{Ren2024} and fill in the backgrounds plausibly using an object removal model~\cite{Sun2024,Suvorov2021}.
This allows us to recombine any foreground object with any background on the fly, creating a highly diverse training set.
During recombination, we control important parameters, such as the size and position of the foreground object, to help the model learn the spatial invariances necessary for image classification.
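The on-the-fly recombination step amounts to an alpha composite with a controllable scale and placement. The following is a minimal NumPy sketch of that idea; the function name \texttt{recombine}, the nearest-neighbour resizing, and the border clipping are our own simplifications for illustration, not the paper's actual pipeline:

```python
import numpy as np

def recombine(fg, alpha, bg, scale=1.0, top_left=(0, 0)):
    """Alpha-composite a foreground object onto an inpainted background.

    fg:       (h, w, 3) float array, foreground pixels
    alpha:    (h, w) float array in [0, 1], soft object mask
    bg:       (H, W, 3) float array, background image
    scale:    resize factor for the foreground object
    top_left: (row, col) placement of the scaled foreground
    """
    h, w = alpha.shape
    nh = max(1, int(round(h * scale)))
    nw = max(1, int(round(w * scale)))
    # nearest-neighbour resize via index sampling (illustrative only)
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    fg_s, a_s = fg[rows][:, cols], alpha[rows][:, cols]

    out = bg.copy()
    r0, c0 = top_left
    r1 = min(r0 + nh, bg.shape[0])
    c1 = min(c0 + nw, bg.shape[1])
    vh, vw = r1 - r0, c1 - c0  # visible extent, clipped at the border
    a = a_s[:vh, :vw, None]
    out[r0:r1, c0:c1] = a * fg_s[:vh, :vw] + (1 - a) * out[r0:r1, c0:c1]
    return out
```

Sampling \texttt{scale} and \texttt{top\_left} freshly each epoch is what makes the composed training set effectively much larger than the underlying pool of foregrounds and backgrounds.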
We show that training on \name instead of ImageNet increases the accuracy of Transformers by up to 4.5 p.p. on ImageNet and reduces the error rate on downstream tasks by up to $39.3\%$.
On top of that, \schemename is a useful tool for analyzing model behavior and biases when used during evaluation.
We utilize our control over the image distribution to measure a model's background robustness (by varying the choice of background), foreground focus (by leveraging our knowledge of the placement of the foreground object), center bias (by controlling position), and size bias (by controlling size).
These analyses provide valuable insights into model behavior and biases, which is crucial for model deployment and future robustness optimizations.
We show that training on \name instead of ImageNet significantly reduces all of these biases.
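To make the evaluation idea concrete: a background-robustness score of this kind can be computed as the fraction of originally correct predictions that remain correct after the background is swapped. The sketch below, with hypothetical \texttt{predict} and \texttt{render} callables, is one possible instantiation under our own assumptions, not the paper's exact metric:

```python
import numpy as np

def background_robustness(predict, render, labels, seed=0):
    """Illustrative background-robustness score.

    predict: maps a rendered image to a predicted class id
    render:  render(i, j) composites foreground i onto background j
    labels:  labels[i] is the class of foreground i
    Returns the fraction of originally correct predictions that stay
    correct when each foreground is paired with a random background.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    swap = rng.permutation(n)  # random background assignment
    correct, kept = 0, 0
    for i in range(n):
        if predict(render(i, i)) == labels[i]:
            correct += 1
            if predict(render(i, swap[i])) == labels[i]:
                kept += 1
    return kept / max(correct, 1)
```

A model that truly attends to the foreground object scores close to 1, while a model that leans on background cues loses accuracy under the swap; the center- and size-bias analyses follow the same pattern with position and scale varied instead of the background.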
We make our code for \schemename and the \name-dataset publicly available\footnote{Link will go here.} to facilitate further research.
\subsection*{Contributions}
\begin{itemize}
\item We propose \schemename, a novel data augmentation scheme that recombines objects and backgrounds. \schemename allows us to move beyond the (possibly biased) image compositions in the dataset while preserving label integrity.
\item We show that training ViT on \name, the ImageNet instantiation of \schemename, leads to up to 4.5 p.p. improved accuracy on ImageNet and 7.3 p.p. on downstream tasks.
\item We propose novel \schemename-based metrics to analyze and quantify fine-grained biases of trained models: Background Robustness, Foreground Focus, Center Bias, and Size Bias. We show that training on \name, instead of ImageNet, significantly reduces these biases.
\end{itemize}