% ForAug/sec/experiments.tex -- Tobias Christian Nauen, iccv 2025 submission (commit 78765791be, 2026-02-24)
% !TeX root = ../main.tex
\section{Experiments}
\label{sec:experiments}
% \begin{itemize}
% \item [1.] Training on RecombiNet
% \item ImageNet results (large)
% \item Ablation (TinyImageNet): Foreground position
% \item Ablation (TinyImageNet): Which background (or part of other ablation table?)
% \item Ablation (TinyImageNet+ImageNet For edge blur): Design decisions: Which infill model, pruning threshold, p$\to$t /t$\to$p, foreground rotation range (?), edge blur, original image probability/schedule, Foreground size
% \item With other Data Augmentations
% \item [2.] More evalution metrics
% \item Background accuracy (how to frame/sell? Background bias?) / Background robustness (= foreground with all background)?
% \item Foreground focus
% \item Position bias
% \item Size bias
% \end{itemize}
We conduct a comprehensive suite of experiments to validate the effectiveness of our approach.
We compare training on \name, the ImageNet instantiation of \schemename, to training on ImageNet for 7 different models.
Furthermore, we assess the impact of using \name for pretraining on multiple fine-grained downstream datasets.
Additionally, we use \schemename's control over the image distribution to quantify some model behaviors and biases.
\subsection{Design Choices of \schemename}
\label{sec:ablation}
We start by ablating the design choices of \schemename.
For this, we use TinyImageNet \cite{Le2015}, a subset of ImageNet containing 200 categories with 500 images each, and Tiny\name, a version of \schemename derived from TinyImageNet.
\Cref{tab:ablation} presents the results of these ablations.
\begin{table*}[t]
\centering
\resizebox{\textwidth}{!}{
\begin{tabular}{lccccccccccccc}
\toprule
\multirow{2}{*}{Dataset} & Detect. & Infill & FG. & Augmentation & BG. & BG. & edge & original & \multicolumn{2}{c}{TinyImageNet Accuracy} \\
& prompt & Model & size & Order & strategy & pruning & smoothing & image mixing & ViT-Ti [\%] & ViT-S [\%] \\
\cmidrule(r){1-1} \cmidrule(lr){2-9} \cmidrule(l){10-11}
TinyImageNet & & & & & & & & & $66.1\pm0.5$ & $68.3\pm0.7$ \\
Tiny\name & specific & LaMa \cite{Suvorov2021} & mean & crop$\to$paste$\to$color & same & - & - & \gtxt{-} & $64.6\pm0.5$ & $70.0\pm0.6$ \\
\gtxt{Tiny\name} & \gtxt{specific} & \gtxt{LaMa \cite{Suvorov2021}} & range & \gtxt{crop$\to$paste$\to$color} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $65.5\pm0.4$ & $71.2\pm0.5$ \\
\gtxt{Tiny\name} & general & \gtxt{LaMa \cite{Suvorov2021}} & \gtxt{range} & \gtxt{crop$\to$paste$\to$color} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $66.4\pm0.6$ & $72.9\pm0.6$ \\
\gtxt{Tiny\name} & \gtxt{general} & Att. Eraser \cite{Sun2024} & \gtxt{range} & \gtxt{crop$\to$paste$\to$color} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $67.5\pm1.2$ & $72.4\pm0.5$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & paste$\to$crop$\to$color & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $67.1\pm1.2$ & $72.9\pm0.5$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & 1.0 & \gtxt{-} & \gtxt{-} & $67.0\pm1.2$ & $73.0\pm0.3$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & 0.8 & \gtxt{-} & \gtxt{-} & $67.2\pm1.2$ & $72.9\pm0.8$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & 0.6 & \gtxt{-} & \gtxt{-} & $67.5\pm1.0$ & $72.8\pm0.7$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 2.0$ & \gtxt{-} & $67.2\pm0.4$ & $72.9\pm0.5$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 4.0$ & \gtxt{-} & $65.9\pm0.5$ & $72.4\pm0.6$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & $p=0.2$ & $69.8\pm0.5$ & $75.0\pm0.3$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & $p=0.33$ & $69.5\pm0.4$ & $75.2\pm1.0$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & $p=0.5$ & $70.3\pm1.0$ & $74.2\pm0.2$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & linear & $70.1\pm0.7$ & $74.9\pm0.8$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & reverse lin. & $67.6\pm0.2$ & $73.2\pm0.3$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & cos & $71.3\pm1.0$ & $75.7\pm0.8$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 4.0$ & \gtxt{cos} & $70.0\pm0.8$ & $75.5\pm0.7$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & orig. & \gtxt{0.8} & \gtxt{$\sigma_\text{max} = 4.0$} & \gtxt{cos} & $67.2\pm0.9$ & $69.9\pm1.0$ \\
\gtxt{Tiny\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & all & \gtxt{0.8} & \gtxt{$\sigma_\text{max} = 4.0$} & \gtxt{cos} & $70.1\pm0.7$ & $77.5\pm0.6$ \\
\midrule
\name & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & \gtxt{-} & \gtxt{cos} & - & $80.5\pm0.1$ \\
\gtxt{\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 4.0$ & \gtxt{cos} & - & $80.7\pm0.1$ \\
\gtxt{\name} & \gtxt{general} & \gtxt{Att. Eraser \cite{Sun2024}} & \gtxt{range} & \gtxt{paste$\to$crop$\to$color} & all & \gtxt{0.8} & \gtxt{$\sigma_\text{max} = 4.0$} & \gtxt{cos} & - & $81.3\pm0.1$ \\
\bottomrule
\end{tabular}}
\caption{Ablation of design decisions of Tiny\name on TinyImageNet and \name on ImageNet.}
\label{tab:ablation}
\end{table*}
\textbf{Prompt.}
% We present the ablation of our main design decisions in \Cref{tab:ablation}.
First, we evaluate the type of prompt used to detect the foreground object.
Here, the \emph{general} prompt, which contains the class and the more general object category, outperforms only having the class name (\emph{specific}).
\textbf{Inpainting.} Attentive Eraser \cite{Sun2024} produces superior results compared to LaMa \cite{Suvorov2021} (see the supplementary for examples).
% When comparing the infill models, the GAN-based LaMa \cite{Suvorov2021} gets outperformed by the Attentive Eraser \cite{Sun2024}.
\textbf{Foreground size.}
% We observe that LaMa often infills unnatural textures compared to Attentive Eraser.
The size of foreground objects during training significantly impacts performance.
Employing a \emph{range} of sizes during recombination, rather than a fixed \emph{mean} size, boosts accuracy by approximately 1 p.p.
This suggests that the added variability is beneficial.
\textbf{Order of data augmentation.}
% (1) Applying the image crop related augmentations \emph{before} pasting the foreground object and the color-based ones \emph{after} pasting or (2) applying all data augmentations after pasting the foreground object.
% While results are ambiguous, we choose the second strategy, as it improves the performance of ViT-S, although not the one of ViT-Ti.
Applying all augmentations after foreground-background recombination (\emph{paste$\to$crop$\to$color}) slightly improves ViT-S's performance compared to applying crop-related augmentations before pasting (\emph{crop$\to$paste$\to$color}).
For ViT-Ti, the results are ambiguous.
\textbf{Background pruning.}
For the choice of backgrounds, we test three pruning thresholds ($t_\text{prune}$) to exclude backgrounds with excessive inpainting.
% and only use backgrounds with an relative size of the infilled region of at most $t_\text{prune}$ (exclusive).
A threshold of $t_\text{prune}=1.0$ means that we use all backgrounds that are not fully infilled.
% We find that the background pruning does not significantly impact the models' performance.
% We choose $t_\text{prune}=0.8$ for the following experiments to exclude backgrounds that are mostly artificial.
Varying $t_\text{prune}$ has minimal impact.
Therefore, we choose $t_\text{prune} = 0.8$ to exclude predominantly artificial backgrounds.
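The pruning rule described above amounts to a one-line predicate. The following is a minimal sketch (assuming the fraction of inpainted pixels per background has been precomputed; the function name is illustrative, not from the released code):

```python
def keep_background(infilled_fraction: float, t_prune: float = 0.8) -> bool:
    """Keep a background only if the relative size of its inpainted
    region is strictly below the pruning threshold t_prune.
    t_prune = 1.0 keeps every background that is not fully infilled."""
    return infilled_fraction < t_prune


# Filtering a candidate pool of backgrounds by their inpainted fraction:
fractions = [0.05, 0.35, 0.82, 0.95]
kept = [f for f in fractions if keep_background(f)]
```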
\textbf{Edge smoothing.}
Applying Gaussian blurring to the edges of the foreground masks hurts performance on Tiny\name, but slightly improves it on \name.
% One of the most important design decisions is the mixing of the original dataset with \name.
\textbf{Mixing} \name with the original ImageNet data proves crucial.
While constant and linear mixing schedules improve performance by $2-3$ p.p. over using only Tiny\name, the cosine annealing schedule yields the best results, adding another $0.5-1$ p.p.
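A per-epoch mixing probability along these lines can be sketched as follows. This is an illustration only: the exact parameterization, and in particular the direction in which the linear and cosine schedules anneal, are assumptions, not the published configuration.

```python
import math


def original_image_prob(epoch: int, total_epochs: int,
                        schedule: str = "cos", p_const: float = 0.33) -> float:
    """Probability of drawing an original (non-recombined) image at a
    given epoch. 'const' uses a fixed probability; 'linear' and 'cos'
    anneal from 0 to 1 over training (direction assumed for
    illustration), so recombined images dominate early epochs."""
    t = epoch / max(total_epochs - 1, 1)  # training progress in [0, 1]
    if schedule == "const":
        return p_const
    if schedule == "linear":
        return t
    if schedule == "cos":
        return 0.5 * (1.0 - math.cos(math.pi * t))
    raise ValueError(f"unknown schedule: {schedule}")
```

At each training step, the loader would flip a coin with this probability to decide between the original image and its recombined counterpart.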
\textbf{Background strategy.}
Another point is the allowed choice of background image for each foreground object.
% We evaluate three different strategies.
% (1) Picking the background from which that specific foreground was originally extracted.
% The major difference to ImageNet when using this setup is the variability in size and position of the foreground object.
% (2) Picking a background that originally had a foreground object of the same class in it.
% Here, we have backgrounds where objects of this type can typically appear while also creating a wider variety of samples due to pairing each foreground object with different backgrounds each time.
% (3) Picking any background.
% This choice has the largest variety of backgrounds, but the backgrounds are not semantically related to the foreground object anymore.
% We find in \Cref{fig:bg-strategy} that choosing only a foreground's original background is the worst choice.
We compare using the original background, a background from the same class, and any background.
These strategies range from low diversity with high shared information between foreground and background to high diversity with little shared information.
For \emph{ViT-Ti}, the latter two strategies perform comparably, while \emph{ViT-S} benefits from the added diversity of using any background.
The same is true when training on the full (ImageNet) version of \name.
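The three strategies can be sketched as a selection rule over background indices. The names mirror the table entries (orig., same, all); the flat list-based data layout is an illustrative assumption, as the actual pipeline may pair foregrounds and backgrounds differently.

```python
import random


def pick_background(fg_index: int, fg_class: int,
                    bg_classes: list[int], strategy: str = "all") -> int:
    """Pick a background index for a foreground object.

    'orig': the background the foreground was extracted from;
    'same': any background whose source image had the same class;
    'all':  any background, regardless of class.
    """
    if strategy == "orig":
        return fg_index
    if strategy == "same":
        candidates = [i for i, c in enumerate(bg_classes) if c == fg_class]
        return random.choice(candidates)
    if strategy == "all":
        return random.randrange(len(bg_classes))
    raise ValueError(f"unknown strategy: {strategy}")
```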
\begin{figure}
\centering
\includegraphics[width=.7\columnwidth]{img/bates.pdf}
\caption{Plot of the probability distribution function (PDF) of the extended Bates distribution for different parameters $\eta$. Higher values of $\eta$ concentrate the distribution around the center.}
\label{fig:bates-pdf}
\end{figure}
\begin{table}
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{ccccccc}
\toprule
\multirow{2.5}{*}{\makecell{Training Set/ \\ Bates Parameter}} & \multirow{2.5}{*}{TIN} & \multicolumn{5}{c}{Tiny\name} \\
\cmidrule(l){3-7}
& & $\eta=-3$ & $-2$ & $1/-1$ & $2$ & $3$ \\
\midrule
TinyImageNet & 68.9 & 60.5 & 60.2 & 60.8 & 62.6 & 63.1 \\
$\eta=-3$ & 71.3 & 79.3 & 79.5 & 79.1 & 79.3 & 79.1 \\
$\eta=-2$ & 71.5 & 80.0 & 78.7 & 79.3 & 79.1 & 78.8 \\
$\eta=1/-1$ & 72.3 & 79.5 & 78.9 & 80.2 & 79.7 & 80.4 \\
$\eta=2$ & 71.3 & 78.2 & 77.8 & 79.1 & 79.6 & 79.9 \\
$\eta=3$ & 71.4 & 77.2 & 76.9 & 78.6 & 79.6 & 79.7 \\
\bottomrule
\end{tabular}}
\caption{Accuracy of ViT-S trained on TinyImageNet (TIN) and Tiny\name with different foreground position distributions by varying the parameter of a Bates distribution $\eta$.
The best performance is achieved using the uniform distribution ($\eta=1$).}
\end{table}
\textbf{Foreground position.}
Finally, we analyze the foreground object's positioning in the image.
We utilize an extended Bates distribution to sample the position of the foreground object.
The Bates distribution~\cite{Bates1955} with parameter $\eta \geq 1$ is the mean of $\eta$ independent uniformly distributed random variables \cite{Jonhson1995}.
Therefore, the larger $\eta$, the more concentrated the distribution is around the center.
We extend this concept to $\eta \leq -1$ by defining ${X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)}$ for $\eta \leq -1$, with $s$ being the sawtooth function on $[0, 1]$:
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
Note that $s \circ s = \id$ on $[0, 1]$.
This way, distributions with $\eta \leq -1$ are more concentrated around the borders.
$\eta = 1$ and $\eta = -1$ both correspond to the uniform distribution.
The PDF of this extended Bates distribution is visualized in \Cref{fig:bates-pdf}.
Sampling foreground positions closer to the image center makes the task easier, which in turn reduces performance on TinyImageNet.
This is reflected in the performance when evaluating on Tiny\name with $\eta=2$ and $\eta=3$ compared to $\eta=-1/1$.
We observe a similar reduction for $\eta < -1$.
This experiment is conducted using the LaMa infill model.
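The extended Bates sampler defined above can be written in a few lines. This is a minimal sketch following the definition in the text, not the actual sampling code used for \schemename; a 2D foreground position would use one independent draw per coordinate.

```python
import random


def sawtooth(x: float) -> float:
    """Sawtooth map s on [0, 1]; it is its own inverse (up to endpoints)."""
    return x + 0.5 if x < 0.5 else x - 0.5


def sample_extended_bates(eta: int) -> float:
    """Sample the extended Bates distribution on [0, 1].

    eta >= 1:  mean of eta i.i.d. uniform variables, increasingly
               concentrated around the center for larger eta;
    eta <= -1: the sawtooth image of Bates(-eta), concentrated at the
               borders instead.  eta = 1 and eta = -1 are both uniform.
    """
    if abs(eta) < 1:
        raise ValueError("eta must satisfy |eta| >= 1")
    y = sum(random.random() for _ in range(abs(eta))) / abs(eta)
    return y if eta >= 1 else sawtooth(y)
```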
\begin{table}
\centering
\small
\begin{tabular}{lccc}
\toprule
Dataset & Classes & \makecell{Training \\ Images} & \makecell{Validation \\ Images} \\
\midrule
TinyImageNet & 200 & 100,000 & 10,000 \\
Tiny\name & 200 & 99,404 & 9,915 \\
ImageNet & 1,000 & 1,281,167 & 50,000 \\
\name & 1,000 & 1,274,557 & 49,751 \\
\bottomrule
\end{tabular}
\caption{Dataset statistics for TinyImageNet, Tiny\name, ImageNet, and \name. For \name and Tiny\name we report the number of foreground/background pairs.}
\label{tab:dataset-stats}
\end{table}
After fixing the optimal design parameters in \Cref{tab:ablation} (last row), we construct the full \name dataset using the entire ImageNet dataset.
\Cref{tab:dataset-stats} compares the dataset statistics of ImageNet and \name.
% The slightly lower number of images in \name is due to \emph{Grounded SAM} returning no or invalid detections for some images.
The slightly reduced image count in \name is due to instances where Grounded SAM failed to produce valid object detections.
\subsection{Image Classification Results}
\begin{table}
\centering
\begin{tabular}{lccc}
\toprule
\multirow{2.5}{*}{Model} & \multicolumn{2}{c}{\makecell{ImageNet Accuracy \\ when trained on}} & \multirow{2.5}{*}{Delta} \\
\cmidrule(lr){2-3}
& ImageNet & \name & \\
\midrule
ViT-S & $79.1\pm0.1$ & $81.4\pm0.1$ & \grntxt{+2.3} \\
ViT-B & $77.6\pm0.2$ & $81.1\pm0.4$ & \grntxt{+3.5} \\
ViT-L & $75.3\pm0.4$ & $79.8\pm0.1$ & \grntxt{+4.5} \\
\midrule
Swin-Ti & $77.9\pm0.2$ & $79.7\pm0.1$ & \grntxt{+1.8} \\
Swin-S & $79.4\pm0.1$ & $80.6\pm0.1$ & \grntxt{+1.2} \\
\midrule
ResNet-50 & $78.3\pm0.1$ & $78.8\pm0.1$ & \grntxt{+0.5} \\
ResNet-101 & $79.4\pm0.1$ & $80.4\pm0.1$ & \grntxt{+1.0} \\
\bottomrule
\end{tabular}
\caption{ImageNet results of models trained on \name and on ImageNet directly. \name improves the performance of all models in our test.}
\label{tab:imagenet-results}
\end{table}
\Cref{tab:imagenet-results} compares the ImageNet performance of models trained on \name and ones trained directly on ImageNet.
We adopt the training setup of \cite{Nauen2023} and \cite{Touvron2022} (details in the supplementary material) for training ViT \cite{Dosovitskiy2021}, Swin \cite{Liu2021} and ResNet \cite{He2016} models.
Notably, \name improves performance across all tested architectures, including the ResNet models (up to $1$ p.p.), demonstrating benefits beyond Transformers.
For Transformer models, we observe improvements from $1.2$ p.p. to $4.5$ p.p.
This improvement is more substantial for the larger models, with ViT-L gaining $4.5$ p.p. in accuracy.
\name's improvements mostly counteract the drop in performance due to overfitting for large models.
When training on ImageNet, this drop is $3.8$ p.p. from ViT-S to ViT-L, while for \name it is reduced to $1.6$ p.p.
\begin{table}
\centering
\resizebox{\columnwidth}{!}{\begin{tabular}{lccccc}
\toprule
Model & Aircraft & Cars & Flowers & Food & Pets \\
\midrule
ViT-S @ ImageNet & $72.4\pm1.0$ & $89.8\pm0.3$ & $94.5\pm0.2$ & $89.1\pm0.1$ & $93.8\pm0.2$ \\
ViT-S @ \name & $78.6\pm0.5$ & $92.2\pm0.2$ & $95.5\pm0.2$ & $89.6\pm0.1$ & $94.5\pm0.2$ \\
& \grntxt{+6.2} & \grntxt{+2.4} & \grntxt{+1.0} & \grntxt{+0.5} & \grntxt{+0.7} \\
\cmidrule(r){1-1}
ViT-B @ ImageNet & $71.7\pm0.5$ & $90.0\pm0.2$ & $94.8\pm0.4$ & $89.8\pm0.2$ & $94.1\pm0.4$ \\
ViT-B @ \name & $79.0\pm2.2$ & $93.3\pm0.1$ & $ 96.5\pm0.1$ & $90.9\pm0.1$ & $95.1\pm0.4$ \\
& \grntxt{+7.3} & \grntxt{+3.3} & \grntxt{+1.7} & \grntxt{+1.1} & \grntxt{+1.0} \\
\cmidrule(r){1-1}
ViT-L @ ImageNet & $72.1\pm1.0$ & $88.8\pm0.3$ & $94.4\pm0.3$ & $90.1\pm0.2$ & $94.2\pm0.4$ \\
ViT-L @ \name & $77.6\pm1.2$ & $89.1\pm0.2$ & $96.6\pm0.1$ & $91.3\pm0.1$ & $95.1\pm0.1$ \\
& \grntxt{+5.5} & \grntxt{+0.3} & \grntxt{+2.2} & \grntxt{+1.2} & \grntxt{+0.9} \\
\midrule
Swin-Ti @ ImageNet & $77.0\pm0.1$ & $91.3\pm0.6$ & $95.9\pm0.1$ & $90.0\pm0.2$ & $94.2\pm0.1$ \\
Swin-Ti @ \name & $81.1\pm0.8$ & $92.8\pm0.4$ & $96.2\pm0.1$ & $90.4\pm0.3$ & $94.8\pm0.5$ \\
& \grntxt{+4.1} & \grntxt{+2.5} & \grntxt{+0.3} & \grntxt{+0.4} & \grntxt{+0.6} \\
\cmidrule(r){1-1}
Swin-S @ ImageNet & $75.7\pm1.4$ & $91.0\pm0.3$ & $95.9\pm0.5$ & $91.1\pm0.2$ & $94.4\pm0.1$ \\
Swin-S @ \name & $81.4\pm0.2$ & $93.1\pm0.2$ & $96.3\pm0.3$ & $91.2\pm0.2$ & $94.9\pm0.3$ \\
& \grntxt{+5.7} & \grntxt{+2.1} & \grntxt{+1.4} & \grntxt{+0.1} & \grntxt{+0.5} \\
\midrule
ResNet-50 @ ImageNet & $78.2\pm0.5$ & $89.8\pm0.2$ & $91.7\pm0.4$ & $84.4\pm0.2$ & $93.7\pm0.3$ \\
ResNet-50 @ \name & $80.3\pm0.4$ & $90.4\pm0.2$ & $91.7\pm0.2$ & $84.5\pm0.2$ & $93.7\pm0.3$ \\
& \grntxt{+2.1} & \grntxt{+0.6} & \gtxt{$\pm$0} & \grntxt{+0.1} & \gtxt{$\pm$0} \\
\cmidrule(r){1-1}
ResNet-101 @ ImageNet & $78.4\pm0.6$ & $90.3\pm0.1$ & $91.2\pm0.5$ & $86.0\pm0.2$ & $94.3\pm0.2$ \\
ResNet-101 @ \name & $81.4\pm0.5$ & $91.3\pm0.1$ & $92.9\pm0.2$ & $86.3\pm0.1$ & $94.0\pm0.3$ \\
& \grntxt{+3.0} & \grntxt{+1.3} & \grntxt{+1.7} & \grntxt{+0.3} & \textcolor{red}{-0.3} \\
\bottomrule
\end{tabular}}
\caption{Downstream accuracy in percent when finetuning on other datasets. Models were pretrained on \name and ImageNet. Pretraining on \name increases Transformer downstream accuracy on all datasets.}
\end{table}
To assess the transferability of \name-trained models, we finetune models pretrained on ImageNet and \name on five fine-grained datasets:
FGVC-Aircraft \cite{Maji2013}, Stanford Cars~\cite{Dehghan2017}, Oxford Flowers \cite{Nilsback2008}, Food-101 \cite{Kaur2017}, and Oxford-IIIT Pets \cite{Parkhi2012}.
While for ResNets the performance is about the same for both pretraining datasets, for every Transformer we see accuracy improve on all downstream datasets by up to $7.3$ p.p., corresponding to an error-rate reduction of up to $39.3\%$.
In summary, these results demonstrate that the improved representation learning achieved by training on \name translates to superior performance not only on ImageNet, but also on a variety of fine-grained image classification tasks.
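The relative error-rate reduction quoted above follows directly from the accuracies (in percent):

```python
def error_rate_reduction(acc_base: float, acc_new: float) -> float:
    """Relative reduction of the error rate, in percent, when accuracy
    improves from acc_base to acc_new (both given in percent)."""
    return 100.0 * (acc_new - acc_base) / (100.0 - acc_base)


# ViT-L on Oxford Flowers from the table: 94.4% -> 96.6%
error_rate_reduction(94.4, 96.6)  # ~= 39.3
```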
\subsection{Further Model Evaluation}
% Additional to just using \name for training, its special properties and posibilities for adjustment of the data distribution make it a valuable tool for evaluating other model properties and biases.
Beyond its use for training, \name's unique properties and controlled data generation capabilities make it a powerful tool for analyzing model behavior and biases.
\paragraph*{Background Robustness}
\begin{table}
\centering
\begin{tabular}{lccc}
\toprule
\multirow{2.5}{*}{Model} & \multicolumn{2}{c}{\makecell{Background Robustness \\ when trained on}} & \multirow{2.5}{*}{Delta} \\
\cmidrule(lr){2-3}
& ImageNet & \name & \\
\midrule
ViT-S & $0.73\pm0.01$ & $0.99\pm0.01$ & \grntxt{+0.26} \\
ViT-B & $0.72\pm0.01$ & $1.00\pm0.01$ & \grntxt{+0.28} \\
ViT-L & $0.70\pm0.01$ & $1.00\pm0.01$ & \grntxt{+0.30} \\
\midrule
Swin-Ti & $0.72\pm0.01$ & $1.00\pm0.01$ & \grntxt{+0.28} \\
Swin-S & $0.72\pm0.01$ & $1.00\pm0.01$ & \grntxt{+0.28} \\
\midrule
ResNet-50 & $0.79\pm0.01$ & $0.99\pm0.01$ & \grntxt{+0.20} \\
ResNet-101 & $0.79\pm0.01$ & $1.00\pm0.01$ & \grntxt{+0.21} \\
\bottomrule
\end{tabular}
\caption{Evaluation of the background robustness of models trained on \name and on ImageNet directly. Training on \name improves the background robustness of all models to $\approx1.00$, meaning the model is indifferent to the choice of background.}
\label{tab:background-robustness}
\end{table}
% By adjusting the background distribution from using a background from an image of the same class as the foreground to using any background, we can evaluate the robustness of models to shifts in the background distribution.
% We assess background robustness by changing the background distribution, comparing accuracy with backgrounds of the same class as the foreground to using any background.
We assess the robustness of models to shifts in the background distribution from a class-related background to any background.
% We define the background robustness coefficient to be the accuracy of a model on \name when using the same class background divided by the accuracy when using any background:
Background robustness is defined to be the ratio of accuracy on \name with same-class backgrounds to accuracy with any background:
\begin{align}
\text{Background Robustness} = \frac{\text{Acc}(\name_\text{all})}{\text{Acc}(\name_\text{same})}
\end{align}
It quantifies the relative performance retained under a background distribution shift.
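As a worked example of the ratio above (the accuracy values here are hypothetical, chosen only to illustrate the computation):

```python
def background_robustness(acc_all: float, acc_same: float) -> float:
    """Ratio of accuracy with arbitrary backgrounds to accuracy with
    same-class backgrounds; 1.0 means the model is indifferent to the
    choice of background."""
    return acc_all / acc_same


# Hypothetical: 70% accuracy with any background, 72% with same-class.
background_robustness(0.70, 0.72)  # ~= 0.97
```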
\Cref{tab:background-robustness} presents the background robustness of various models.
When trained on ImageNet, smaller models generally exhibit greater robustness to changes in the background distribution than larger models, and ResNets are more robust than the tested Transformer models.
Crucially, training on \name instead of ImageNet improves the background robustness of all models to $\approx1.00$, meaning that these models are agnostic to the choice of background and only classify based on the foreground.
These findings highlight the generalization benefits of \name.
\paragraph*{Foreground Focus}
\begin{table}
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{lcccccc}
\toprule
\multirow{4}{*}{Model} & \multicolumn{6}{c}{Foreground Focus when trained on} \\
\cmidrule(l){2-7}
& \multicolumn{2}{c}{GradCam} & \multicolumn{2}{c}{GradCam++} & \multicolumn{2}{c}{IG} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(l){6-7}
& IN & FN & IN & FN & IN & FN \\
\midrule
ViT-S & $1.2\pm0.1$ & $2.3\pm0.3$ & $1.2\pm0.1$ & $2.1\pm0.4$ & $1.9\pm0.1$ & $2.7\pm0.1$ \\
ViT-B & $1.2\pm0.1$ & $2.4\pm0.7$ & $1.1\pm0.1$ & $2.1\pm0.1$ & $1.7\pm0.1$ & $2.7\pm0.1$ \\
ViT-L & $1.3\pm0.1$ & $1.6\pm0.1$ & $1.1\pm0.1$ & $1.3\pm0.1$ & $1.3\pm0.1$ & $2.6\pm0.1$ \\
\midrule
Swin-Ti & $0.9\pm0.1$ & $0.7\pm0.1$ & $1.0\pm0.3$ & $0.7\pm0.3$ & $2.5\pm0.1$ & $4.8\pm0.3$ \\
Swin-S & $0.8\pm0.1$ & $0.7\pm0.1$ & $0.7\pm0.1$ & $0.7\pm0.4$ & $2.4\pm0.1$ & $4.6\pm0.3$ \\
\midrule
ResNet-50 & $2.2\pm0.1$ & $2.7\pm0.1$ & $2.0\pm0.1$ & $2.9\pm0.1$ & $3.2\pm0.1$ & $4.9\pm0.2$ \\
ResNet-101 & $2.3\pm0.1$ & $2.8\pm0.1$ & $2.2\pm0.1$ & $3.0\pm0.1$ & $3.2\pm0.1$ & $4.8\pm0.1$ \\
\bottomrule
\end{tabular}}
\caption{Evaluation of the foreground focus using GradCam, GradCam++ and IntegratedGradients of models trained on \name (FN) and on ImageNet (IN) directly. Training on \name improves the foreground focus of almost all models.}
\label{tab:foreground-focus}
\end{table}
Leveraging our inherent knowledge of the foreground masks when using \name, as well as common XAI techniques~\cite{Selvaraju2016,Chattopadhay2018,Sundararajan2017}, we can evaluate a model's focus on the foreground object.
We can directly evaluate ImageNet-trained models, and this technique can also be extended to other datasets without relying on manually annotated foreground masks.
To evaluate the foreground focus, we employ Grad-CAM \cite{Selvaraju2016}, Grad-CAM++ \cite{Chattopadhay2018}, and Integrated Gradients (IG) \cite{Sundararajan2017} to compute the per-pixel importance of an image for the model's prediction.
The foreground focus is defined to be the ratio of the foreground's relative importance to its relative size in the image:
\begin{align}
\text{FG Focus}(\text{img}) = \frac{\text{Area}(\text{img}) \hspace{3pt} \text{Importance}(\text{fg})}{\text{Area}(\text{fg}) \hspace{3pt} \text{Importance}(\text{img})}
\end{align}
The foreground focus of a model is its average foreground focus over all test images.
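Given a per-pixel importance map and the known foreground mask, the per-image ratio can be computed as below. This is a sketch of the formula above; the attribution map itself would come from Grad-CAM, Grad-CAM++, or IG.

```python
import numpy as np


def foreground_focus(importance: np.ndarray, fg_mask: np.ndarray) -> float:
    """Ratio of the foreground's share of total importance to its share
    of the image area; 1.0 means no preference for the foreground.

    importance: non-negative per-pixel attribution map, shape (H, W)
    fg_mask:    boolean foreground mask, shape (H, W)
    """
    area_img = fg_mask.size
    area_fg = fg_mask.sum()
    imp_img = importance.sum()
    imp_fg = importance[fg_mask].sum()
    return (area_img * imp_fg) / (area_fg * imp_img)
```

The model-level score then averages this ratio over all test images.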
\Cref{tab:foreground-focus} presents our findings.
Training on \name significantly increases the foreground focus of ViT and ResNet across all metrics used.
For Swin, the foreground focus stagnates when measured using GradCam and GradCam++, but almost doubles when using IG.
% These differences might be due to the way GradCam is calculated for Swin \todo{cite package website where this is from} and the \todo{common critique of GradCam}.
\paragraph*{Center Bias}
\begin{table}
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{lccc}
\toprule
\multirow{2.5}{*}{Model} & \multicolumn{2}{c}{\makecell{Center Bias when trained on}} & \multirow{2.5}{*}{Delta} \\
\cmidrule(lr){2-3}
& ImageNet & \name \\
\midrule
ViT-S & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-S_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-S_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-S_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-S_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-S_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-S_RecombNet all_v3.pdf}} \\
& $0.255\pm0.008$ & $0.220\pm0.003$ & \grntxt{-0.035} \\
ViT-B & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-B_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-B_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-B_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-B_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-B_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-B_RecombNet all_v3.pdf}} \\
& $0.254\pm0.004$ & $0.190\pm0.002$ & \grntxt{-0.064} \\
ViT-L & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-L_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-L_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-L_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ViT-L_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-L_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ViT-L_RecombNet all_v3.pdf}} \\
& $0.243\pm0.011$ & $0.117\pm0.007$ & \grntxt{-0.126} \\
\midrule
Swin-Ti & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/Swin-Ti_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-Ti_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-Ti_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/Swin-Ti_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-Ti_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-Ti_RecombNet all_v3.pdf}} \\
& $0.250\pm0.007$ & $0.165\pm0.002$ & \grntxt{-0.085} \\
Swin-S & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/Swin-S_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-S_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-S_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/Swin-S_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-S_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/Swin-S_RecombNet all_v3.pdf}} \\
& $0.232\pm0.001$ & $0.156\pm0.002$ & \grntxt{-0.076} \\
\midrule
ResNet50 & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ResNet50_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet50_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet50_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ResNet50_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet50_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet50_RecombNet all_v3.pdf}} \\
& $0.263\pm0.003$ & $0.197\pm0.003$ & \grntxt{-0.066} \\
ResNet101 & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ResNet101_ImageNet_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet101_ImageNet_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet101_ImageNet_v3.pdf}} & \raisebox{-6pt}{\includegraphics[width=.08\columnwidth]{img/ResNet101_RecombNet all_v1.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet101_RecombNet all_v2.pdf} \includegraphics[width=.08\columnwidth]{img/ResNet101_RecombNet all_v3.pdf}} \\
& $0.230\pm0.003$ & $0.199\pm0.002$ & \grntxt{-0.031} \\
\bottomrule
\end{tabular} }
\includegraphics[width=.75\columnwidth]{img/colorbar_horizontal.pdf}
\caption{Evaluation of the position bias. We plot the accuracy relative to the center accuracy of multiple instantiations of the models when the foreground object is in different cells of a $3 \times 3$ grid.
Training on \name significantly reduces a model's center bias.}
\label{tab:center-bias}
\end{table}
With \name we have unique control over the position of the foreground object in the image.
This lets us quantify the center bias of ImageNet- and \name-trained models.
We divide the image into a $3 \times 3$ grid and evaluate model accuracy when the foreground object is in each of the $9$ grid cells.
Each cell's accuracy is divided by the accuracy in the center cell for normalization, which gives us the relative performance drop when the foreground is in each part of the image.
The center bias is calculated as one minus the average of the minimum performance of a corner cell and the minimum performance of a side cell:
\begin{align}
\begin{split}
& \text{Center Bias} = \\
& \hspace{7pt} 1 - \frac{\min\limits_{a, b \in \{0, 2\}} \text{Acc}(\text{cell}_{(a, b)}) + \min\limits_{\substack{a=1 \text{ or } b=1 \\ a \neq b}} \text{Acc}(\text{cell}_{(a, b)})}{2 \text{Acc}(\text{cell}_{(1, 1)})}
\end{split}
\end{align}
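The formula above can be evaluated directly on a $3 \times 3$ grid of per-cell accuracies (a straightforward transcription; the grid indexing convention $(row, col)$ is an assumption):

```python
import numpy as np


def center_bias(acc: np.ndarray) -> float:
    """Center bias from a 3x3 grid of accuracies acc[row, col],
    normalized by the center cell; 0 means no bias."""
    corners = [acc[0, 0], acc[0, 2], acc[2, 0], acc[2, 2]]
    sides = [acc[0, 1], acc[1, 0], acc[1, 2], acc[2, 1]]
    return 1.0 - (min(corners) + min(sides)) / (2.0 * acc[1, 1])
```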
\Cref{tab:center-bias} visualizes the center bias of three instantiations of each model.
Performance is generally highest in the center cell and in the four edge-center cells (top, bottom, left, and right of center), and lowest in the four corners.
Interestingly, ImageNet-trained models perform slightly better when the foreground object is on the right side of the image, compared to the left side, despite our use of random flipping with a probability of $0.5$ during training.
% Training on \name reduces the center bias of all models by at least half.
Training on \name significantly reduces center bias across all models.
This demonstrates that \name promotes a more uniform spatial attention distribution.
For \name-trained models, accuracy is higher in the left and right center cells than in the top and bottom ones, which is not the case for ImageNet-trained models.
\paragraph*{Size Bias}
\begin{figure}
\centering
\includegraphics[width=.9\columnwidth]{img/size_bias.pdf}
\caption{Evaluation of the size bias of models trained on \name. We plot the accuracy relative to the accuracy when using the mean foreground size.}
\label{fig:size-bias}
\end{figure}
Finally, we evaluate the impact of different-sized foreground objects on the accuracy.
For this evaluation, we use the \emph{mean} foreground size strategy.
We introduce a size factor $f_\text{size}$ by which we additionally scale the foreground object before pasting it onto the background.
Results are again normalized by the accuracy when using the mean foreground size ($f_\text{size} = 1.0$).
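The normalization step can be sketched as follows (the accuracy numbers in the usage example are hypothetical):

```python
def size_bias_curve(acc_by_factor: dict[float, float]) -> dict[float, float]:
    """Normalize accuracy at each foreground scale factor f_size by the
    accuracy at the mean foreground size (f_size = 1.0)."""
    base = acc_by_factor[1.0]
    return {f: acc / base for f, acc in acc_by_factor.items()}


# Hypothetical accuracies at three scale factors:
size_bias_curve({0.5: 70.0, 1.0: 80.0, 1.5: 78.0})
```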
\Cref{fig:size-bias} shows the size bias curves of ViT-S and ViT-B when trained on ImageNet and \name.
% When training on \name, the resulting model keeps it's good performance on smaller foreground objects, while models trained on ImageNet fall of faster and lower.
Models trained on \name maintain better performance with smaller foreground objects, whereas ImageNet-trained models exhibit a more rapid performance decline.
\name training thus improves robustness to variations in object scale.