% cvpr submission, Tobias Christian Nauen, 2026-02-24
% !TeX root = ../supplementary.tex
\section{Training Setup}
\label{sec:training_setup}
\begin{table*}[h!]
\centering
\caption{Training setup and hyperparameters for our ImageNet training.}
\label{tab:in-setup}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccc}
\toprule
Augmentation Pipeline: & Basic & 3-Augment~\cite{Touvron2022} & RandAugment~\cite{Touvron2021b} \\
\midrule
Image Resolution & \multicolumn{3}{c}{$224 \times 224$} \\
Epochs & \multicolumn{3}{c}{300} \\
Learning Rate & S/B: 1e-3, L: 5e-4 & 3e-3 & S/B: 1e-3, L: 5e-4 \\
Learning Rate Schedule & \multicolumn{3}{c}{cosine decay} \\
Batch Size & 1024 & 2048 & 1024 \\
GPUs & \multicolumn{3}{c}{$4\times$ NVIDIA A100/H100/H200} \\
Warmup Schedule & \multicolumn{3}{c}{linear} \\
Warmup Epochs & \multicolumn{3}{c}{3} \\
Weight Decay & 0.05 & 0.02 & 0.05 \\
Label Smoothing & \multicolumn{3}{c}{0.1} \\
Optimizer & AdamW & Lamb \cite{You2020} & AdamW \\
\midrule
Augmentations & \makecell{RandomResizedCrop \\ Horizontal Flip \\ ColorJitter} & \makecell{Resize \\ RandomCrop \\ Horizontal Flip \\ Grayscale \\ Solarize \\ Gaussian-Blur \\ Color Jitter} & \makecell{RandomResizedCrop \\ Horizontal Flip \\ RandomErase \cite{Zhong2020} \\ RandAugment \cite{Cubuk2020} \\ Color Jitter} \\
\bottomrule
\end{tabular}
}
\end{table*}
\begin{table}[h!]
\centering
\caption{Training setup for finetuning on different downstream datasets. Other settings are the same as in \Cref{tab:in-setup}. For finetuning, we always utilize 3-Augment and the corresponding parameters from the \emph{3-Augment} column of \Cref{tab:in-setup}.}
\label{tab:downstream-setup}
\begin{tabular}{lcccc}
\toprule
Dataset & Batch Size & Epochs & Learning Rate & Num. GPUs \\
\midrule
Aircraft & 512 & 500 & 3e-4 & 2 \\
Cars & 1024 & 500 & 3e-4 & 4 \\
Flowers & 256 & 500 & 3e-4 & 1 \\
Food & 2048 & 100 & 3e-4 & 4 \\
Pets & 512 & 500 & 3e-4 & 2 \\
\bottomrule
\end{tabular}
\end{table}
On ImageNet, we test three different data augmentation pipelines and hyperparameter settings, as shown in \Cref{tab:in-setup}: a basic pipeline, a RandAugment-based pipeline following the DeiT~\cite{Touvron2021b} setup, and 3-Augment, as used in \cite{Touvron2022,Nauen2025}.
When comparing different architectures, ViT, Swin, and ResNet are trained with the 3-Augment pipeline and DeiT is trained with the RandAugment pipeline.
% On ImageNet we use the same training setup as \cite{Nauen2025} and \cite{Touvron2022} without pretraining for ViT, Swin, and ResNet.
% For DeiT, we train the same ViT architecture but using the data augmentation scheme and hyperparameters from \cite{Touvron2021b}.
As our focus is on evaluating the changes in accuracy due to \schemename, we stick to one set of hyperparameters for all models, following \cite{Nauen2025}.
We list the settings used for training on ImageNet in \Cref{tab:in-setup} and the ones used for finetuning those weights on the downstream datasets in \Cref{tab:downstream-setup}.
Our implementation uses PyTorch \cite{Paszke2019} and the \emph{timm} library \cite{Wightman2019} for model architectures and basic functions.
\begin{table*}[ht!]
\centering
\caption{Hardware and Software specifics used for both training and evaluation.}
\label{tab:hw-sw-versions}
\begin{tabular}{ll}
\toprule
Parameter & Value \\
\midrule
GPU & $4 \times$ NVIDIA A100/H100/H200 \\
CPU & 24 CPU cores (Intel Xeon) per GPU \\
Memory & up to 120 GB per GPU \\
Operating System & Enroot container for SLURM based on Ubuntu 24.04 LTS \\
Python & 3.12.3 \\
PyTorch & 2.7.0 \\
TorchVision & 0.22.0 \\
Timm & 1.0.15 \\
\bottomrule
\end{tabular}
\end{table*}
\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as versions of the relevant software packages.
\section{Resource Usage of \schemename}
To utilize the proposed \schemename, specific computational resources are necessary, particularly for computing and storing the output of the segmentation stage and for the on-the-fly processing of the recombination stage.
\paragraph{Segmentation.}
% While calculating the segmentations and infills takes a lot of compute, this is effort that has to be spent only once per dataset.
\schemename involves a computationally expensive segmentation and infill stage, which is a one-time calculation per dataset.
Once computed, the segmentation and infill results can be perpetually reused, amortizing the initial cost over all subsequent experiments and applications.
On NVIDIA H100 GPUs, the segmentation stage runs at a rate of $374.3 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using Attentive Eraser or $5338.6 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using LaMa.
For ImageNet, this comes down to just under 9 days (Attentive Eraser) or 16 hours (LaMa) on two 8-GPU nodes.
To facilitate immediate use and reproduction of results, we publicly provide the precalculated segmentation stage output for the ImageNet dataset for download\footnote{Link will go here.}.
On ImageNet, the output of \schemename's segmentation stage requires 73 GB of additional disk space, separate from the 147 GB of the base dataset.
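As a sanity check on these figures, the wall-clock estimates follow directly from the per-GPU throughput. A minimal sketch (the image count is the standard ImageNet-1k training set size; the pure-compute LaMa estimate lands slightly below the quoted 16 hours, which we attribute to job overhead):

```python
# Back-of-the-envelope wall-clock estimate for the segmentation stage.
# Throughputs are the measured img/(GPU*h) rates quoted above.
IMAGENET_IMAGES = 1_281_167   # standard ImageNet-1k training set
GPUS = 2 * 8                  # two 8-GPU nodes

def wall_clock_hours(throughput_per_gpu_h: float, images: int = IMAGENET_IMAGES,
                     gpus: int = GPUS) -> float:
    """Hours needed to process `images` at the given per-GPU rate."""
    return images / (throughput_per_gpu_h * gpus)

print(f"Attentive Eraser: {wall_clock_hours(374.3) / 24:.1f} days")  # ~8.9 days
print(f"LaMa:             {wall_clock_hours(5338.6):.1f} hours")     # ~15.0 hours
```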
\paragraph{Recombination.}
The recombination step of \schemename is implemented as a data loader operation.
It is thus offloaded to the CPU, where it can be heavily parallelized, and therefore results in only a very minor increase in the training step time.
For example, using a ViT-B model on an NVIDIA A100 GPU, the average update step-time increased by $1\%$, from $528 \pm 2$ ms to $534 \pm 1$ ms.
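The recombination itself boils down to a per-pixel composite, which is why it parallelizes well across data-loader workers. A minimal sketch of the core operation on nested lists (the actual implementation operates on image tensors and may differ in detail):

```python
def composite(fg, bg, mask):
    """Paste a foreground onto a background using a binary (or soft) mask.

    fg, bg: H x W lists of pixel values; mask: H x W list of weights in [0, 1].
    Returns mask * fg + (1 - mask) * bg, the alpha-composite at the core of
    pasting a foreground object onto a background.
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(frow, brow, mrow)]
        for frow, brow, mrow in zip(fg, bg, mask)
    ]

fg = [[1.0, 1.0], [1.0, 1.0]]
bg = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]   # soft value at the lower-left edge
print(composite(fg, bg, mask))    # [[1.0, 0.0], [0.5, 0.0]]
```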
\section{Extended Bates Distribution}
\label{apdx:bates-distribution}
\begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{img/bates.pdf}
\caption{PDF of the extended Bates distribution for different values of the parameter $\eta$.}
\label{fig:bates-pdf}
\end{figure}
We introduce an extension of the Bates distribution~\cite{Bates1955} to negative parameters, enabling sampling of foreground object positions away from the image center.
The standard Bates distribution, for $\eta \in \N$, is defined as the mean of $\eta$ independent random variables drawn from a uniform distribution \cite{Jonhson1995}.
A larger $\eta$ value increases the concentration of samples around the distribution's mean, which in this case is the image center.
We extend this concept to $\eta \leq -1$ by defining
\begin{align*}
X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
\end{align*}
for $\eta \leq -1$, with $s$ being the sawtooth function on $[0, 1]$:
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
Note that $s \circ s = \id$ on $[0, 1]$.
This transformation inverts the distribution's concentration, shifting the probability mass from the image center towards the borders.
We visualize the distribution function of the extended Bates distribution in \Cref{fig:bates-pdf}.
Both $\eta = 1$ and $\eta = -1$ result in a uniform distribution across the image.
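Sampling from this extended distribution is straightforward: draw the mean of $|\eta|$ uniform variables and, for negative $\eta$, apply the sawtooth map. A minimal sketch:

```python
import random

def sawtooth(x: float) -> float:
    """The involution s on [0, 1] that swaps the two halves of the interval."""
    return x + 0.5 if x < 0.5 else x - 0.5

def sample_bates(eta: int, rng: random.Random) -> float:
    """Sample from the extended Bates distribution on [0, 1].

    eta >= 1: mean of eta uniforms (concentrated at the center for large eta).
    eta <= -1: sawtooth-transformed Bates(-eta) (concentrated at the borders).
    """
    assert abs(eta) >= 1
    n = abs(eta)
    x = sum(rng.random() for _ in range(n)) / n
    return sawtooth(x) if eta < 0 else x

rng = random.Random(0)
# Bates(3) is symmetric around 0.5 and concentrated at the center;
# Bates(-3) pushes the same mass towards the borders.
samples = [sample_bates(3, rng) for _ in range(10_000)]
```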
\section{Design Choices of \schemename}
\label{sec:ablation}
We start by ablating the design choices of \schemename on TinyImageNet~\cite{Le2015}, a subset of ImageNet containing 200 categories with 500 images each. %, and Tiny\name, the application of \schemename to TinyImageNet.
% \Cref{tab:ablation} presents the results of these ablations.
\Cref{tab:ablation-segment} presents ablations for segmentation and \Cref{tab:ablation-recombine} for recombination.
\begin{table}
\caption{Ablation of the design decisions in the segmentation phase of \schemename on TinyImageNet.
The first line is our baseline, while the other lines use \schemename.
We use basic settings with the \emph{same} background strategy during recombination for this experiment.
}
\label{tab:ablation-segment}
\centering
\small
\begin{tabular}{llcc}
\toprule
\multirow{2.5}{*}{\makecell{Detect. \\Prompt}} & \multirow{2.5}{*}{\makecell{Infill \\ Model}} & \multicolumn{2}{c}{TinyImageNet Accuracy [\%]} \\
\cmidrule{3-4}
& & ViT-Ti & ViT-S \\
\midrule
\multicolumn{2}{l}{\textbf{TinyImageNet}} & $66.1 \pm 0.5$ & $68.3 \pm 0.7$ \\
specific & LaMa \cite{Suvorov2022} & $65.5 \pm 0.4$ & $71.2 \pm 0.5$ \\
general & \gtxt{LaMa \cite{Suvorov2022}} & $66.4 \pm 0.6$ & $72.9 \pm 0.6$ \\
\gtxt{general} & Att. Eraser \cite{Sun2025} & $67.5 \pm 1.2$ & $72.4 \pm 0.5$ \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[t]
\caption{Ablation of the recombination phase of \schemename on TinyImageNet (top) and ImageNet (bottom). The first experiments use the initial segmentation settings with LaMa \cite{Suvorov2022}.}
\label{tab:ablation-recombine}
\centering
\small
\begin{tabular}{cccccccc}
\toprule
\multirow{2.5}{*}{\makecell{FG. \\Size}} & \multirow{2.5}{*}{\makecell{Augment.\\Order}} & \multirow{2.5}{*}{\makecell{BG.\\Strat.}} & \multirow{2.5}{*}{\makecell{BG.\\Prune}} & \multirow{2.5}{*}{\makecell{Original\\Mixing}} & \multirow{2.5}{*}{\makecell{Edge\\Smooth.}} & \multicolumn{2}{c}{Accuracy [\%]} \\
\cmidrule{7-8}
& & & & & & ViT-Ti & ViT-S \\
\midrule
% TinyImageNet & & & & & & & $66.1\pm0.5$ & $68.3\pm0.7$ \\
\multicolumn{6}{l}{\textbf{TinyImageNet}} & \gtxt{$66.1\pm0.5$} & \gtxt{$68.3\pm0.7$} \\
mean & crop$\to$paste & same & - & - & \gtxt{-} & $64.6\pm0.5$ & $70.0\pm0.6$ \\
range & \gtxt{crop$\to$paste} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $65.5\pm0.4$ & $71.2\pm0.5$ \\
\midrule
% \gtxt{range} & \gtxt{crop$\to$paste} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $66.4\pm0.6$ & $72.9\pm0.6$ \\
{range} & {crop$\to$paste} & {same} & {-} & {-} & {-} & $67.5\pm1.2$ & $72.4\pm0.5$ \\
\gtxt{range} & paste$\to$crop & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $67.1\pm1.2$ & $72.9\pm0.5$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 1.0 & \gtxt{-} & \gtxt{-} & $67.0\pm1.2$ & $73.0\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 0.8 & \gtxt{-} & \gtxt{-} & $67.2\pm1.2$ & $72.9\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 0.6 & \gtxt{-} & \gtxt{-} & $67.5\pm1.0$ & $72.8\pm0.7$ \\
% \gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 2.0$ & \gtxt{-} & $67.2\pm0.4$ & $72.9\pm0.5$ \\
% \gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 4.0$ & \gtxt{-} & $65.9\pm0.5$ & $72.4\pm0.6$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.2$ & \gtxt{-} & $69.8\pm0.5$ & $75.0\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.33$ & \gtxt{-} & $69.5\pm0.4$ & $75.2\pm1.0$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.5$ & \gtxt{-} & $70.3\pm1.0$ & $74.2\pm0.2$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & linear & \gtxt{-} & $70.1\pm0.7$ & $74.9\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & reverse lin. & \gtxt{-} & $67.6\pm0.2$ & $73.2\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & cos & \gtxt{-} & $71.3\pm1.0$ & $75.7\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & $\sigma_\text{max} = 4.0$ & $70.0\pm0.8$ & $75.5\pm0.7$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & orig. & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & $67.2\pm0.9$ & $69.9\pm1.0$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & all & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & $70.1\pm0.7$ & $77.5\pm0.6$ \\
\midrule
\multicolumn{6}{l}{\textbf{ImageNet}} & \gtxt{-} & \gtxt{$79.1\pm0.1$} \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & \gtxt{-} & - & $80.5\pm0.1$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & $\sigma_\text{max} = 4.0$ & - & $80.7\pm0.1$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & all & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & - & $81.4\pm0.1$ \\
\bottomrule
\end{tabular}
\end{table}
\textbf{Prompt.}
% We present the ablation of our main design decisions in \Cref{tab:ablation}.
First, we evaluate the type of prompt used to detect the foreground object.
Here, the \emph{general} prompt, which contains the class and the more general object category, outperforms only having the class name (\emph{specific}).
\textbf{Inpainting.} Among inpainting models, Attentive Eraser~\cite{Sun2025} produces slightly better results compared to LaMa~\cite{Suvorov2022} ($+0.5$ p.p. on average).
For inpainting examples, see the supplementary material.
% (see the supplementary material for examples).
% When comparing the infill models, the GAN-based LaMa \cite{Suvorov2022} gets outperformed by the Attentive Eraser \cite{Sun2025}.
\textbf{Foreground size} significantly impacts performance.
Employing a \emph{range} of sizes during recombination, rather than a fixed \emph{mean} size, boosts accuracy by approximately 1 p.p.
This suggests that the added variability is beneficial.
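A sketch of the two size strategies (the strategy names are from the ablation; the concrete relative-size bounds below are illustrative assumptions):

```python
import random

def sample_fg_size(strategy: str, lo: float, hi: float, rng: random.Random) -> float:
    """Relative foreground size for one recombined sample.

    'mean' always uses the midpoint of [lo, hi]; 'range' draws uniformly
    from [lo, hi], adding the size variability that the ablation rewards.
    """
    if strategy == "mean":
        return (lo + hi) / 2.0
    if strategy == "range":
        return rng.uniform(lo, hi)
    raise ValueError(strategy)

rng = random.Random(0)
varied = [sample_fg_size("range", 0.2, 0.8, rng) for _ in range(5)]
fixed = sample_fg_size("mean", 0.2, 0.8, rng)  # always the midpoint, 0.5
```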
\textbf{Order of data augmentation.}
% (1) Applying the image crop related augmentations \emph{before} pasting the foreground object and the color-based ones \emph{after} pasting or (2) applying all data augmentations after pasting the foreground object.
% While results are ambiguous, we choose the second strategy, as it improves the performance of ViT-S, although not the one of ViT-Ti.
Applying all augmentations after foreground-background recombination (\emph{paste$\to$crop$\to$color}) improves ViT-S's performance compared to applying crop-related augmentations before pasting (\emph{crop$\to$paste$\to$color}).
ViT-Ti results are ambiguous.
\textbf{Background pruning.}
When it comes to the backgrounds to use, we test different pruning thresholds ($t_\text{prune}$) to exclude backgrounds with large inpainted regions.
% and only use backgrounds with an relative size of the infilled region of at most $t_\text{prune}$ (exclusive).
A threshold of $t_\text{prune}=1.0$ means that we use all backgrounds that are not fully infilled.
% We find that the background pruning does not significantly impact the models' performance.
% We choose $t_\text{prune}=0.8$ for the following experiments to exclude backgrounds that are mostly artificial.
Varying $t_\text{prune}$ has minimal impact.
We choose $t_\text{prune} = 0.8$ to exclude predominantly artificial backgrounds.
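Background pruning then reduces to a filter over the precomputed infill fractions. A sketch (the pair representation is an illustrative assumption; the fraction is the relative area of the inpainted region):

```python
def prune_backgrounds(backgrounds, t_prune: float = 0.8):
    """Keep only backgrounds whose inpainted fraction is below t_prune.

    `backgrounds` is an iterable of (background_id, infill_fraction) pairs,
    where infill_fraction in [0, 1] is the relative size of the inpainted
    region. t_prune = 1.0 keeps everything that is not fully inpainted.
    """
    return [bg for bg, frac in backgrounds if frac < t_prune]

pool = [("a", 0.05), ("b", 0.45), ("c", 0.85), ("d", 1.0)]
print(prune_backgrounds(pool, t_prune=0.8))  # ['a', 'b']
```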
% One of the most important design decisions is the mixing of the original dataset with \name.
\textbf{Mixing} \schemename-augmented samples with the original ImageNet data proves crucial.
Constant and linear mixing schedules improve performance by 2--3 p.p. over training on augmented samples only, while the cosine annealing schedule proves optimal, boosting accuracy by 3--4 p.p.
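The schedules can be expressed as a probability $p(t)$ of drawing an original rather than a \schemename-augmented sample at training progress $t \in [0, 1]$. A sketch; the ramp direction below is an illustrative assumption (we only encode that linear and cosine anneal monotonically and that the reverse-linear schedule runs the ramp the other way):

```python
import math

def mix_probability(t: float, schedule: str, p_const: float = 0.33) -> float:
    """Probability of drawing an *original* sample at training progress t in [0, 1].

    'constant' uses a fixed p_const; 'linear' ramps 0 -> 1;
    'reverse_linear' ramps 1 -> 0; 'cosine' is the half-cosine ramp 0 -> 1.
    The ramp direction is an assumption for illustration.
    """
    if schedule == "constant":
        return p_const
    if schedule == "linear":
        return t
    if schedule == "reverse_linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1.0 - math.cos(math.pi * t))
    raise ValueError(schedule)

# All schedules stay inside [0, 1] over the whole run.
print([round(mix_probability(t, "cosine"), 3) for t in (0.0, 0.5, 1.0)])
```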
\textbf{Edge smoothing.}
We evaluate the impact of using Gaussian blurring to smooth the edges of the foreground masks.
% Similarly, applying edge smoothing to foreground masks with Gaussian blurring actually hurts performance on Tiny\name, but slightly improves it on \name.
For larger models, this gives us a slight performance boost on the full ImageNet (second to last line in \Cref{tab:ablation-recombine}).
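Edge smoothing replaces the hard 0/1 boundary of the foreground mask with a Gaussian ramp before compositing; $\sigma_\text{max}$ bounds the blur strength. A 1-D sketch with a truncated Gaussian kernel (the actual implementation blurs the 2-D mask; the kernel radius and border handling here are illustrative assumptions):

```python
import math

def gaussian_kernel(sigma: float, radius: int):
    """Normalized, truncated 1-D Gaussian kernel of the given radius."""
    w = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def smooth_mask(mask, sigma: float):
    """Blur a 1-D binary mask; pixels near an edge become soft alpha values."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    out = []
    for i in range(len(mask)):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(i + j - radius, 0), len(mask) - 1)  # clamp at borders
            acc += w * mask[idx]
        out.append(acc)
    return out

hard = [0, 0, 0, 1, 1, 1]
soft = smooth_mask(hard, sigma=1.0)
# Values far from the edge stay near 0 or 1; the boundary becomes a ramp.
```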
\textbf{Background strategy.}
Another point is the allowed choice of background image for each foreground object.
% We evaluate three different strategies.
% (1) Picking the background from which that specific foreground was originally extracted.
% The major difference to ImageNet when using this setup is the variability in size and position of the foreground object.
% (2) Picking a background that originally had a foreground object of the same class in it.
% Here, we have backgrounds where objects of this type can typically appear while also creating a wider variety of samples due to pairing each foreground object with different backgrounds each time.
% (3) Picking any background.
% This choice has the largest variety of backgrounds, but the backgrounds are not semantically related to the foreground object anymore.
% We find in \Cref{fig:bg-strategy} that choosing only a foreground's original background is the worst choice.
We compare using the original background, a background from the same class, and any background.
These strategies go from low diversity and high shared information content between the foreground and background to high diversity and low shared information content.
For \emph{ViT-Ti}, the latter two strategies perform comparably, while \emph{ViT-S} benefits from the added diversity of using any background.
The same is true when training on the full ImageNet.
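The three strategies differ only in the candidate pool a background is drawn from. A sketch (the dictionary representation of precomputed foregrounds and backgrounds is an illustrative assumption):

```python
import random

def pick_background(fg, backgrounds, strategy: str, rng: random.Random):
    """Pick a background for foreground `fg` (a dict with 'id' and 'cls').

    'orig': only the background the foreground was cut from;
    'same': any background whose source image held an object of the same class;
    'all':  any background in the dataset.
    """
    if strategy == "orig":
        pool = [b for b in backgrounds if b["id"] == fg["id"]]
    elif strategy == "same":
        pool = [b for b in backgrounds if b["cls"] == fg["cls"]]
    elif strategy == "all":
        pool = list(backgrounds)
    else:
        raise ValueError(strategy)
    return rng.choice(pool)

bgs = [{"id": 0, "cls": "dog"}, {"id": 1, "cls": "dog"}, {"id": 2, "cls": "cat"}]
fg = {"id": 0, "cls": "dog"}
rng = random.Random(0)
print(pick_background(fg, bgs, "orig", rng))  # always the source background
```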
\begin{table}
\caption{Accuracy of ViT-S on TinyImageNet (TIN) in percent using \schemename with different foreground position distributions by varying the Bates parameter $\eta$.
The best performance is achieved when using the uniform distribution ($\eta=1$) for training.}
\label{tbl:foreground-eta}
\centering
\small
\begin{tabular}{ccccccc}
\toprule
\multirow{2.5}{*}{\makecell{Bates Parameter \\during training}} & \multirow{2.5}{*}{\makecell{TIN \\w/o \schemename}} & \multicolumn{5}{c}{TIN w/ \schemename} \\
\cmidrule(l){3-7}
& & $\eta=-3$ & $-2$ & $1/-1$ & $2$ & $3$ \\
\midrule
Baseline & 68.9 & 60.5 & 60.2 & 60.8 & 62.6 & 63.1 \\
$\eta=-3$ & 71.3 & 79.3 & 79.5 & 79.1 & 79.3 & 79.1 \\
$\eta=-2$ & 71.5 & 80.0 & 78.7 & 79.3 & 79.1 & 78.8 \\
$\eta=1/-1$ & 72.3 & 79.5 & 78.9 & 80.2 & 79.7 & 80.4 \\
$\eta=2$ & 71.3 & 78.2 & 77.8 & 79.1 & 79.6 & 79.9 \\
$\eta=3$ & 71.4 & 77.2 & 76.9 & 78.6 & 79.6 & 79.7 \\
\bottomrule
\end{tabular}
\end{table}
\textbf{Foreground position.}
Finally, we analyze the foreground object's positioning in the image, using a
generalization of the Bates distribution~\cite{Bates1955} with parameter $\eta \in \Z$ (see \Cref{apdx:bates-distribution}).
The Bates distribution presents an easy way to sample from a bounded domain with just one hyperparameter that controls its concentration.
$\eta = 1/-1$ corresponds to the uniform distribution; $\eta > 1$ concentrates the distribution around the center; and for $\eta < -1$, the distribution is concentrated at the borders (see supplementary material for details).
% We utilize an extended Bates distribution to sample the position of the foreground object.
% The Bates distribution with parameter $\eta \geq 1$ is the mean of $\eta$ independent uniformly distributed random variables \cite{Jonhson1995}.
% The larger $\eta$, the more concentrated the distribution is at the center, $\eta < -1$ concentrates the distribution at the edges.
% We extend this concept to $\eta \leq -1$, shifting the distribution away from the center and towards the edges.
Sampling positions closer to the image center reduces the difficulty of the task, which in turn reduces performance on TinyImageNet (\Cref{tbl:foreground-eta}).
This is reflected in the performance when evaluating with \schemename using $\eta=2$ and $\eta=3$ compared to $\eta=1/-1$.
We observe a similar reduction for $\eta < -1$.
% This experiment is conducted using the LaMa infill model.
\begin{table}[t]
\caption{Dataset statistics for TinyImageNet and ImageNet with and without \schemename. For \schemename we report the number of foreground/background pairs.}
\label{tab:dataset-stats}
\centering
% \resizebox{.5\columnwidth}{!}{
\begin{tabular}{l S[table-format=4.0] S[table-format=7.0] S[table-format=5.0]}
\toprule
Dataset & {Classes} & {\makecell{Training \\ Images}} & {\makecell{Validation \\ Images}} \\
\midrule
TinyImageNet & 200 & 100000 & 10000 \\
TinyImageNet + \schemename & 200 & 99404 & 9915 \\
ImageNet & 1000 & 1281167 & 50000 \\
ImageNet + \schemename & 1000 & 1274557 & 49751 \\
\bottomrule
\end{tabular}
% }
\end{table}
After fixing the optimal design parameters in \Cref{tab:ablation-segment,tab:ablation-recombine} (last rows), we run \schemename's segmentation step on the entire ImageNet dataset.
\Cref{tab:dataset-stats} shows the resulting dataset statistics.
% The slightly lower number of images in \name is due to \emph{Grounded SAM} returning no or invalid detections for some images.
The slightly reduced image count for \schemename is due to instances where Grounded SAM fails to produce valid segmentation masks.
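The failure rate implied by the counts in \Cref{tab:dataset-stats} is small; a quick check on the reported numbers:

```python
# Images dropped because the segmentation stage returned no valid mask
# (training-split counts from the dataset statistics table).
orig = {"tin_train": 100_000, "in_train": 1_281_167}
kept = {"tin_train": 99_404, "in_train": 1_274_557}

for split in orig:
    drop = 1.0 - kept[split] / orig[split]
    print(f"{split}: {drop:.2%} dropped")
# Both training splits lose well under 1% of their images.
```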
\section{Robustness Evaluation on Corner-Cases}
\begin{table}[t]
\centering
\caption{Evaluation on the Corner-Cases dataset. Objects cut from ImageNet evaluation bounding boxes are pasted onto infilled backgrounds. Objects have three sizes: $56$px, $84$px, and $112$px, and are placed in the center (CeX) or corner (CoX) of an image, using either its original background (XxO) or a random background (XxR).}
\label{tab:corner-cases}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccccccc}
\toprule
\multirow{4}{*}{Model} & \multirow{4}{*}{w/ \schemename} & \multicolumn{12}{c}{Corner Cases Accuracy [\%]} \\
\cmidrule(l){3-14}
& & \multicolumn{4}{c}{56} & \multicolumn{4}{c}{84} & \multicolumn{4}{c}{112} \\
\cmidrule(lr){3-6} \cmidrule(lr){7-10} \cmidrule(l){11-14}
& & CeO & CoO & CeR & CoR & CeO & CoO & CeR & CoR & CeO & CoO & CeR & CoR \\
\midrule
ViT-S & \xmark & $40.5 \pm 2.0$ & $28.6 \pm 0.8$ & $10.3 \pm 0.9$ & $6.4 \pm 0.2$ & $56.8 \pm 1.2$ & $47.6 \pm 1.0$ & $31.3 \pm 0.7$ & $25.5 \pm 0.5$ & $70.9 \pm 0.1$ & $66.9 \pm 1.6$ & $55.2 \pm 0.2$ & $51.1 \pm 0.8$ \\
ViT-S & \cmark & $49.4 \pm 0.6$ & $39.9 \pm 0.5$ & $22.7 \pm 0.4$ & $17.6 \pm 0.3$ & $66.3 \pm 0.3$ & $60.0 \pm 0.3$ & $47.7 \pm 0.7$ & $43.2 \pm 0.2$ & $76.5 \pm 0.2$ & $74.9 \pm 0.4$ & $66.8 \pm 0.6$ & $64.9 \pm 0.1$ \\
& & \grntxt{$+8.9$} & \grntxt{$+11.3$} & \grntxt{$+12.4$} & \grntxt{$+11.2$} & \grntxt{$+9.4$} & \grntxt{$+12.4$} & \grntxt{$+16.4$} & \grntxt{$+17.7$} & \grntxt{$+5.6$} & \grntxt{$+8.0$} & \grntxt{$+11.6$} & \grntxt{$+13.7$} \\
\cmidrule(r){1-2}
ViT-B & \xmark & $37.9 \pm 1.4$ & $29.3 \pm 0.7$ & $14.0 \pm 1.7$ & $11.9 \pm 1.1$ & $51.5 \pm 0.7$ & $45.0 \pm 0.8$ & $27.3 \pm 0.8$ & $26.3 \pm 0.8$ & $64.7 \pm 0.3$ & $61.8 \pm 0.6$ & $46.3 \pm 0.3$ & $45.5 \pm 0.5$ \\
ViT-B & \cmark & $50.4 \pm 0.8$ & $42.4 \pm 0.6$ & $26.5 \pm 0.6$ & $22.8 \pm 0.8$ & $65.3 \pm 0.9$ & $60.9 \pm 0.6$ & $47.6 \pm 0.3$ & $45.6 \pm 0.1$ & $75.7 \pm 0.6$ & $74.0 \pm 0.6$ & $65.7 \pm 0.7$ & $64.3 \pm 0.5$ \\
& & \grntxt{$+12.5$} & \grntxt{$+13.1$} & \grntxt{$+12.4$} & \grntxt{$+10.9$} & \grntxt{$+13.8$} & \grntxt{$+15.9$} & \grntxt{$+20.2$} & \grntxt{$+19.3$} & \grntxt{$+11.0$} & \grntxt{$+12.2$} & \grntxt{$+19.3$} & \grntxt{$+18.8$} \\
\cmidrule(r){1-2}
ViT-L & \xmark & $32.8 \pm 1.6$ & $24.8 \pm 1.1$ & $14.8 \pm 2.2$ & $9.7 \pm 1.2$ & $42.7 \pm 0.9$ & $33.8 \pm 0.7$ & $21.3 \pm 1.5$ & $16.3 \pm 1.0$ & $55.7 \pm 0.7$ & $49.7 \pm 0.7$ & $36.0 \pm 1.3$ & $32.5 \pm 0.9$ \\
ViT-L & \cmark & $45.7 \pm 0.6$ & $39.0 \pm 0.5$ & $25.6 \pm 0.6$ & $24.1 \pm 0.8$ & $59.1 \pm 0.3$ & $55.2 \pm 0.4$ & $41.9 \pm 1.0$ & $42.7 \pm 0.6$ & $71.4 \pm 0.3$ & $69.0 \pm 0.4$ & $60.7 \pm 1.0$ & $60.3 \pm 0.8$ \\
& & \grntxt{$+12.9$} & \grntxt{$+14.2$} & \grntxt{$+10.8$} & \grntxt{$+14.4$} & \grntxt{$+16.3$} & \grntxt{$+21.5$} & \grntxt{$+20.5$} & \grntxt{$+26.4$} & \grntxt{$+15.7$} & \grntxt{$+19.3$} & \grntxt{$+24.7$} & \grntxt{$+27.8$} \\
\cmidrule(r){1-2}
DeiT-S & \xmark & $46.3 \pm 0.7$ & $38.1 \pm 0.3$ & $13.1 \pm 0.5$ & $9.9 \pm 0.1$ & $62.8 \pm 0.4$ & $58.2 \pm 0.2$ & $37.1 \pm 0.7$ & $34.3 \pm 0.5$ & $73.3 \pm 0.2$ & $73.9 \pm 0.4$ & $58.8 \pm 0.4$ & $59.4 \pm 0.6$ \\
DeiT-S & \cmark & $44.7 \pm 1.4$ & $37.1 \pm 1.4$ & $15.6 \pm 1.3$ & $12.1 \pm 0.9$ & $62.1 \pm 1.2$ & $57.8 \pm 1.1$ & $41.6 \pm 1.1$ & $37.9 \pm 1.2$ & $73.2 \pm 0.7$ & $73.3 \pm 0.4$ & $62.3 \pm 0.7$ & $61.4 \pm 0.9$ \\
& & \rdtxt{$-1.6$} & \rdtxt{$-1.1$} & \grntxt{$+2.4$} & \grntxt{$+2.2$} & \rdtxt{$-0.7$} & \rdtxt{$-0.4$} & \grntxt{$+4.4$} & \grntxt{$+3.5$} & \gtxt{$-0.1$} & \rdtxt{$-0.6$} & \grntxt{$+3.5$} & \grntxt{$+2.0$} \\
\cmidrule(r){1-2}
DeiT-B & \xmark & $48.1 \pm 0.9$ & $40.4 \pm 2.0$ & $15.8 \pm 0.2$ & $12.9 \pm 0.6$ & $64.0 \pm 0.9$ & $59.5 \pm 1.3$ & $39.0 \pm 0.9$ & $37.2 \pm 0.8$ & $74.1 \pm 0.7$ & $74.8 \pm 0.7$ & $59.1 \pm 0.8$ & $60.0 \pm 0.6$ \\
DeiT-B & \cmark & $50.7 \pm 0.1$ & $44.0 \pm 0.4$ & $19.3 \pm 0.2$ & $16.3 \pm 0.2$ & $66.0 \pm 0.2$ & $62.0 \pm 0.3$ & $43.4 \pm 0.3$ & $40.9 \pm 0.4$ & $75.4 \pm 0.1$ & $76.4 \pm 0.3$ & $62.8 \pm 0.2$ & $63.9 \pm 0.2$ \\
& & \grntxt{$+2.6$} & \grntxt{$+3.6$} & \grntxt{$+3.5$} & \grntxt{$+3.5$} & \grntxt{$+2.0$} & \grntxt{$+2.5$} & \grntxt{$+4.4$} & \grntxt{$+3.8$} & \grntxt{$+1.3$} & \grntxt{$+1.6$} & \grntxt{$+3.8$} & \grntxt{$+3.9$} \\
\cmidrule(r){1-2}
DeiT-L & \xmark & $39.2 \pm 2.6$ & $32.6 \pm 1.5$ & $10.5 \pm 2.8$ & $9.1 \pm 2.3$ & $55.7 \pm 2.5$ & $51.0 \pm 2.7$ & $30.3 \pm 4.0$ & $29.5 \pm 3.9$ & $68.5 \pm 2.1$ & $68.1 \pm 1.7$ & $51.7 \pm 3.1$ & $52.1 \pm 2.7$ \\
DeiT-L & \cmark & $51.9 \pm 0.7$ & $46.6 \pm 0.5$ & $21.5 \pm 1.3$ & $19.0 \pm 1.2$ & $66.6 \pm 0.6$ & $64.1 \pm 0.7$ & $45.3 \pm 1.3$ & $43.6 \pm 1.1$ & $75.6 \pm 0.4$ & $77.3 \pm 0.4$ & $63.8 \pm 0.8$ & $65.4 \pm 0.6$ \\
& & \grntxt{$+12.8$} & \grntxt{$+14.0$} & \grntxt{$+11.0$} & \grntxt{$+9.9$} & \grntxt{$+11.0$} & \grntxt{$+13.1$} & \grntxt{$+15.0$} & \grntxt{$+14.1$} & \grntxt{$+7.1$} & \grntxt{$+9.2$} & \grntxt{$+12.1$} & \grntxt{$+13.4$} \\
\cmidrule(r){1-2}
Swin-Ti & \xmark & $41.2 \pm 1.8$ & $32.5 \pm 0.3$ & $17.4 \pm 2.6$ & $12.2 \pm 0.2$ & $60.0 \pm 1.6$ & $51.4 \pm 0.2$ & $39.6 \pm 2.6$ & $34.8 \pm 0.9$ & $71.7 \pm 0.8$ & $66.1 \pm 0.7$ & $58.2 \pm 1.1$ & $53.6 \pm 1.2$ \\
Swin-Ti & \cmark & $49.8 \pm 0.6$ & $42.8 \pm 0.7$ & $24.2 \pm 0.7$ & $21.4 \pm 0.9$ & $66.4 \pm 0.6$ & $60.5 \pm 0.2$ & $47.8 \pm 0.5$ & $44.6 \pm 0.5$ & $76.0 \pm 0.3$ & $72.7 \pm 0.2$ & $65.7 \pm 0.5$ & $62.1 \pm 0.3$ \\
& & \grntxt{$+8.5$} & \grntxt{$+10.3$} & \grntxt{$+6.8$} & \grntxt{$+9.2$} & \grntxt{$+6.4$} & \grntxt{$+9.2$} & \grntxt{$+8.2$} & \grntxt{$+9.8$} & \grntxt{$+4.3$} & \grntxt{$+6.5$} & \grntxt{$+7.5$} & \grntxt{$+8.5$} \\
\cmidrule(r){1-2}
Swin-S & \xmark & $41.3 \pm 0.6$ & $33.0 \pm 0.1$ & $18.4 \pm 0.7$ & $13.3 \pm 0.5$ & $59.2 \pm 0.1$ & $51.2 \pm 0.5$ & $39.1 \pm 0.2$ & $35.9 \pm 0.3$ & $71.5 \pm 0.2$ & $65.6 \pm 0.1$ & $56.8 \pm 0.5$ & $53.2 \pm 0.2$ \\
Swin-S & \cmark & $48.6 \pm 0.7$ & $39.9 \pm 1.6$ & $22.2 \pm 0.9$ & $16.8 \pm 1.1$ & $64.4 \pm 0.9$ & $57.9 \pm 1.5$ & $43.8 \pm 1.1$ & $42.3 \pm 1.0$ & $75.7 \pm 0.2$ & $71.8 \pm 0.8$ & $63.2 \pm 0.4$ & $60.6 \pm 0.6$ \\
& & \grntxt{$+7.3$} & \grntxt{$+7.0$} & \grntxt{$+3.8$} & \grntxt{$+3.6$} & \grntxt{$+5.1$} & \grntxt{$+6.7$} & \grntxt{$+4.7$} & \grntxt{$+6.4$} & \grntxt{$+4.2$} & \grntxt{$+6.2$} & \grntxt{$+6.4$} & \grntxt{$+7.4$} \\
\cmidrule(r){1-2}
ResNet50 & \xmark & $48.6 \pm 0.6$ & $35.1 \pm 0.4$ & $23.0 \pm 0.7$ & $13.0 \pm 0.3$ & $65.8 \pm 0.4$ & $58.2 \pm 0.3$ & $44.4 \pm 0.6$ & $38.1 \pm 0.5$ & $73.2 \pm 0.2$ & $69.9 \pm 0.2$ & $56.9 \pm 0.1$ & $56.9 \pm 0.1$ \\
ResNet50 & \cmark & $52.3 \pm 0.6$ & $39.5 \pm 0.1$ & $27.4 \pm 0.6$ & $17.6 \pm 0.1$ & $68.5 \pm 0.3$ & $61.9 \pm 0.1$ & $48.5 \pm 0.4$ & $43.7 \pm 0.3$ & $75.2 \pm 0.1$ & $72.4 \pm 0.1$ & $61.7 \pm 0.3$ & $61.7 \pm 0.3$ \\
& & \grntxt{$+3.7$} & \grntxt{$+4.4$} & \grntxt{$+4.4$} & \grntxt{$+4.6$} & \grntxt{$+2.8$} & \grntxt{$+3.8$} & \grntxt{$+4.2$} & \grntxt{$+5.5$} & \grntxt{$+2.0$} & \grntxt{$+2.5$} & \grntxt{$+4.8$} & \grntxt{$+4.8$} \\
\cmidrule(r){1-2}
ResNet101 & \xmark & $47.8 \pm 0.7$ & $37.2 \pm 0.5$ & $20.4 \pm 1.2$ & $14.2 \pm 0.3$ & $64.9 \pm 0.2$ & $58.6 \pm 0.5$ & $41.1 \pm 0.5$ & $38.3 \pm 0.7$ & $73.6 \pm 0.3$ & $70.5 \pm 0.3$ & $56.2 \pm 0.4$ & $57.0 \pm 0.5$ \\
ResNet101 & \cmark & $52.3 \pm 0.1$ & $42.2 \pm 0.1$ & $24.7 \pm 0.1$ & $19.2 \pm 0.4$ & $68.8 \pm 0.6$ & $62.9 \pm 0.3$ & $46.4 \pm 1.5$ & $44.3 \pm 0.9$ & $76.0 \pm 0.4$ & $73.7 \pm 0.3$ & $61.0 \pm 1.2$ & $62.6 \pm 0.5$ \\
& & \grntxt{$+4.4$} & \grntxt{$+5.0$} & \grntxt{$+4.3$} & \grntxt{$+5.0$} & \grntxt{$+3.9$} & \grntxt{$+4.3$} & \grntxt{$+5.3$} & \grntxt{$+6.0$} & \grntxt{$+2.4$} & \grntxt{$+3.2$} & \grntxt{$+4.7$} & \grntxt{$+5.7$} \\
\bottomrule
\end{tabular}
}
\end{table}
\Cref{tab:corner-cases} reports accuracy on the corner-cases dataset~\cite{Fatima2025} for models trained with and without \schemename.
The dataset is constructed by pasting objects cropped by their full bounding boxes (which are available for the ImageNet validation set) onto 224$\times$224 infilled backgrounds.
The dataset has three factors: foreground size (56, 84, 112 pixels), spatial position (center, CeX, vs.\ corner, CoX), and background type (original image background, XxO, vs.\ a random background, XxR), yielding $3 \times 2 \times 2$ controlled configurations per model.
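A minimal sketch of how one such evaluation image can be composed, assuming NumPy arrays and nearest-neighbor resizing (the function name `compose` and the top-left corner placement are illustrative assumptions; the dataset's actual construction in~\cite{Fatima2025} may differ in detail):

```python
import numpy as np

def compose(background, foreground, size, position):
    """Paste a resized foreground crop onto a 224x224 background.

    background: (224, 224, 3) array (infilled original or random background).
    foreground: (h, w, 3) array cropped by its full bounding box.
    size: target foreground edge length in pixels (56, 84, or 112).
    position: "center" or "corner" (top-left corner in this sketch).
    """
    # Nearest-neighbor resize of the foreground to (size, size).
    h, w = foreground.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    fg = foreground[rows][:, cols]
    y0, x0 = ((224 - size) // 2,) * 2 if position == "center" else (0, 0)
    out = background.copy()
    out[y0:y0 + size, x0:x0 + size] = fg
    return out

# The 3 x 2 x 2 factor grid evaluated per model.
configs = [(s, p, b) for s in (56, 84, 112)
           for p in ("center", "corner")
           for b in ("original", "random")]
```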
Across all architectures, training with \schemename consistently improves robustness to these composition shifts.
For ViT-S/B/L, gains range from roughly $+8$ to over $+27$ percentage points, with the largest improvements in the most challenging random-background settings (CeR and CoR), particularly when foregrounds are placed in corners.
Swin and ResNet models also benefit across all configurations, with increases typically between $+3$ and $+10$ points.
DeiT-S shows small drops on some original-background cases (CeO/CoO), but still improves notably under random-background conditions (XxR), while DeiT-B/L gain across nearly all settings.
Three trends are apparent.
First, all baselines perform substantially worse when moving from original to random backgrounds and from centered to corner placements, indicating strong background and center biases.
Second, \schemename reduces this sensitivity: the absolute gap between center and corner, and between original and random backgrounds, shrinks for almost all models and sizes.
Third, the relative improvements are especially pronounced for smaller objects and off-center placements, suggesting that \schemename makes models more foreground-focused and less reliant on canonical object scale and position.
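The center and background biases above can be quantified as marginal accuracy gaps, marginalizing over the other factor. A minimal sketch (the accuracy values below are illustrative placeholders, not numbers from the table):

```python
# Accuracies (percent) indexed by (position, background); illustrative only.
acc = {("center", "original"): 64.0, ("corner", "original"): 59.5,
       ("center", "random"): 48.1, ("corner", "random"): 40.4}

def bias_gaps(acc):
    """Return (center bias, background bias): the mean accuracy drop from
    center to corner placement, and from original to random backgrounds,
    each averaged over the other factor."""
    center = (acc[("center", "original")] + acc[("center", "random")]) / 2
    corner = (acc[("corner", "original")] + acc[("corner", "random")]) / 2
    original = (acc[("center", "original")] + acc[("corner", "original")]) / 2
    random_bg = (acc[("center", "random")] + acc[("corner", "random")]) / 2
    return center - corner, original - random_bg

center_bias, background_bias = bias_gaps(acc)  # smaller means less biased
```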
\section{\schemename Segmentation Samples}
\begin{figure}[t!]
\centering
\begin{subfigure}{.49\textwidth}
\includegraphics[width=\textwidth]{img/masked_image_examples_train.pdf}
\end{subfigure}
\hfill
\begin{subfigure}{.49\textwidth}
\includegraphics[width=\textwidth]{img/masked_image_examples.pdf}
\end{subfigure}
\caption{ImageNet training samples (left) and validation samples (right) of our segmentation masks with annotated bounding boxes.}
\label{fig:mask-examples}
\end{figure}
We show examples of the automatically generated segmentation masks for a diverse subset of object categories (``ant,'' ``busby,'' ``bell cote,'' ``pickelhaube,'' ``snorkel,'' ``stove,'' ``tennis ball,'' and ``volleyball'').
Note that ``busby,'' ``bell cote,'' ``pickelhaube,'' and ``snorkel'' are the four classes with the \textbf{worst} mean box precision and box-to-box IoU on the validation set.
\Cref{fig:mask-examples} (right) illustrates masks from the validation split, while \Cref{fig:mask-examples} (left) shows examples from the training split.
Across both sets, the masks accurately isolate foreground objects with clean boundaries, despite large variations in object scale, shape, and appearance, supporting their use for background removal and resampling in our training pipeline.
We find that the main failure cases are:
(\textit{i}) When the ground-truth annotation corresponds to only a part of an object, the predicted mask often expands to cover the entire object rather than the annotated region.
See for example ``busby'' or ``bell cote''.
(\textit{ii}) In images containing multiple instances, some objects may be missed, resulting in incomplete foreground coverage.
This is especially visible for ``busby'' and ``pickelhaube''.
Note, however, that for ``pickelhaube'' in particular, the training distribution differs noticeably from the validation distribution: many training images show just the head of a single wearer instead of groups of people wearing the helmet.
(\textit{iii}) In rare cases, the predicted mask degenerates and covers nearly the entire image, effectively eliminating the background.
This happens in $<10\%$ of all training images, and we do not use the resulting backgrounds for recombination (see \Cref{apdx:infill-ratio}).
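Filtering such degenerate masks can be sketched as a simple area-ratio check; the threshold and function name below are hypothetical placeholders, not the paper's exact cutoff:

```python
import numpy as np

# Hypothetical threshold: masks covering almost the whole image are treated
# as degenerate (failure case iii); the actual cutoff may differ.
MAX_FOREGROUND_RATIO = 0.9

def background_is_usable(mask):
    """mask: boolean (H, W) array, True on predicted foreground pixels.
    Returns False when the mask leaves almost no background to reuse."""
    return float(mask.mean()) < MAX_FOREGROUND_RATIO
```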
\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as versions of the relevant software packages.
\section{\schemename Sample Images}
\begin{table*}[h!]
\centering
\caption{Sample images produced by applying \schemename to ImageNet.}
\label{tbl:example-images}
\end{tabular}
}
\end{table*}
We visualize example infilled images for both LaMa \cite{Suvorov2022} and Attentive Eraser \cite{Sun2025} in \Cref{tab:infill-examples}.
The side-by-side examples show that both methods generally produce visually consistent infills, with many pairs appearing nearly identical at a glance.
We qualitatively find that Attentive Eraser yields slightly sharper textures and more coherent local structure, while LaMa sometimes produces smoother, more homogenized regions.
Across the table, fine-detail areas such as foliage, bark, and ground textures reveal the most noticeable differences between the two methods.
\FloatBarrier
\newpage
\section{Image Infill Ratio}
\label{apdx:infill-ratio}
\begin{table*}[h!]
\centering
\caption{Example infills where a large fraction of the image area is infilled (high infill ratio).}