% !TeX root = ../supplementary.tex
\section{Training Setup}
\label{sec:training_setup}
\begin{table*}[h!]
\centering
\caption{Training setup and hyperparameters for our ImageNet training.}
\label{tab:in-setup}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccc}
\toprule
Augmentation Pipeline: & Basic & 3-Augment~\cite{Touvron2022} & RandAugment~\cite{Touvron2021b} \\
\midrule
Image Resolution & \multicolumn{3}{c}{$224 \times 224$} \\
Epochs & \multicolumn{3}{c}{300} \\
Learning Rate & S/B: 1e-3, L: 5e-4 & 3e-3 & S/B: 1e-3, L: 5e-4 \\
Learning Rate Schedule & \multicolumn{3}{c}{cosine decay} \\
Batch Size & 1024 & 2048 & 1024 \\
GPUs & \multicolumn{3}{c}{$4\times$ NVIDIA A100/H100/H200} \\
Warmup Schedule & \multicolumn{3}{c}{linear} \\
Warmup Epochs & \multicolumn{3}{c}{3} \\
Weight Decay & 0.05 & 0.02 & 0.05 \\
Label Smoothing & \multicolumn{3}{c}{0.1} \\
Optimizer & AdamW & Lamb \cite{You2020} & AdamW \\
\midrule
Augmentations & \makecell{RandomResizedCrop \\ Horizontal Flip \\ ColorJitter} & \makecell{Resize \\ RandomCrop \\ Horizontal Flip \\ Grayscale \\ Solarize \\ Gaussian-Blur \\ Color Jitter} & \makecell{RandomResizedCrop \\ Horizontal Flip \\ RandomErase \cite{Zhong2020} \\ RandAugment \cite{Cubuk2020} \\ Color Jitter} \\
\bottomrule
\end{tabular}
}
\end{table*}
\begin{table}[h!]
\centering
\caption{Training setup for finetuning on different downstream datasets. Other settings are the same as in \Cref{tab:in-setup}. For finetuning, we always use 3-Augment and the related parameters from the \emph{3-Augment} column of \Cref{tab:in-setup}.}
\label{tab:downstream-setup}
\begin{tabular}{lcccc}
\toprule
Dataset & Batch Size & Epochs & Learning Rate & Num. GPUs \\
\midrule
Aircraft & 512 & 500 & 3e-4 & 2 \\
Cars & 1024 & 500 & 3e-4 & 4 \\
Flowers & 256 & 500 & 3e-4 & 1 \\
Food & 2048 & 100 & 3e-4 & 4 \\
Pets & 512 & 500 & 3e-4 & 2 \\
\bottomrule
\end{tabular}
\end{table}
On ImageNet, we test three different data augmentation pipelines and hyperparameter settings, as shown in \Cref{tab:in-setup}: a basic pipeline, a RandAugment pipeline based on the DeiT~\cite{Touvron2021b} setup, and 3-Augment, as used in \cite{Touvron2022,Nauen2025}.
When comparing different architectures, ViT, Swin, and ResNet are trained with the 3-Augment pipeline and DeiT is trained with the RandAugment pipeline.
% On ImageNet we use the same training setup as \cite{Nauen2025} and \cite{Touvron2022} without pretraining for ViT, Swin, and ResNet.
% For DeiT, we train the same ViT architecture but using the data augmentation scheme and hyperparameters from \cite{Touvron2021b}.
As our focus is on evaluating the changes in accuracy due to \schemename, we follow \cite{Nauen2025} and use one set of hyperparameters for all models.
We list the settings used for training on ImageNet in \Cref{tab:in-setup} and the ones used for finetuning those weights on the downstream datasets in \Cref{tab:downstream-setup}.
Our implementation uses PyTorch \cite{Paszke2019} and the \emph{timm} library \cite{Wightman2019} for model architectures and basic functions.
\begin{table*}[ht!]
\centering
\caption{Hardware and software specifications used for both training and evaluation.}
\label{tab:hw-sw-versions}
\begin{tabular}{ll}
\toprule
Parameter & Value \\
\midrule
GPU & $4 \times$ NVIDIA A100/H100/H200 \\
CPU & 24 CPU cores (Intel Xeon) per GPU \\
Memory & up to 120 GB per GPU \\
Operating System & Enroot container for SLURM based on Ubuntu 24.04 LTS \\
Python & 3.12.3 \\
PyTorch & 2.7.0 \\
TorchVision & 0.22.0 \\
Timm & 1.0.15 \\
\bottomrule
\end{tabular}
\end{table*}
\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as versions of the relevant software packages.
\section{Resource Usage of \schemename}
Using the proposed \schemename requires computational resources, particularly for computing and storing the output of the segmentation stage and for the on-the-fly processing of the recombination stage.
\paragraph{Segmentation.}
% While calculating the segmentations and infills takes a lot of compute, this is effort that has to be spent only once per dataset.
\schemename involves a computationally expensive segmentation and infill stage, which is a one-time calculation per dataset.
Once computed, the segmentation and infill results can be perpetually reused, amortizing the initial cost over all subsequent experiments and applications.
On NVIDIA H100 GPUs, the segmentation stage runs at a rate of $374.3\,\frac{\text{img}}{\text{GPU} \times \text{h}}$ when using Attentive Eraser or $5\,338.6\,\frac{\text{img}}{\text{GPU} \times \text{h}}$ when using LaMa.
For ImageNet, this comes down to just under 9 days (Attentive Eraser) or 16 hours (LaMa) on two 8-GPU nodes.
To facilitate immediate use and reproduction of results, we publicly provide the precalculated segmentation stage output for the ImageNet dataset for download\footnote{Link will go here.}.
On ImageNet, the output of \schemename's segmentation step requires 73 GB of additional disk space, on top of the 147 GB of the base ImageNet dataset.
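The wall-clock estimates above can be reproduced approximately from the reported throughput.
As a rough sketch, we assume (this is not stated explicitly above) that only the $1\,281\,167$ ImageNet training images from \Cref{tab:dataset-stats} are processed and that throughput scales perfectly across the $2 \times 8 = 16$ GPUs:
\begin{align*}
\text{Attentive Eraser:} \quad & \frac{1\,281\,167}{374.3 \times 16}\,\text{h} \approx 214\,\text{h} \approx 8.9\,\text{days} \\
\text{LaMa:} \quad & \frac{1\,281\,167}{5\,338.6 \times 16}\,\text{h} \approx 15\,\text{h}
\end{align*}
Both values are close to the figures quoted above; the remaining difference for LaMa may stem from overheads that this idealized estimate ignores.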
\paragraph{Recombination.}
The recombination step of \schemename is implemented as a data loader operation.
It is thus offloaded to the CPU, where it can be heavily parallelized, and therefore results in only a very minor increase in the training step time.
For example, using a ViT-B model on an NVIDIA A100 GPU, the average update step-time increased by $1\%$, from $528 \pm 2$ ms to $534 \pm 1$ ms.
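The relative overhead follows directly from the two mean step times; the short calculation below uses only the means and ignores the reported standard deviations:
\begin{align*}
\frac{534\,\text{ms} - 528\,\text{ms}}{528\,\text{ms}} = \frac{6}{528} \approx 1.1\%
\end{align*}
That is, the recombination adds roughly $6$ ms per update step in this setting.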
\section{Extended Bates Distribution}
\label{apdx:bates-distribution}
\begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{img/bates.pdf}
\caption{Plot of the probability density function (PDF) of the extended Bates distribution for different values of the parameter $\eta$. Higher values of $\eta$ concentrate the distribution around the center.}
\label{fig:bates-pdf}
\end{figure}
We introduce an extension of the Bates distribution~\cite{Bates1955} to include negative parameters, enabling sampling of foreground object positions away from the image center.
The standard Bates distribution, for $\eta \in \N$, is defined as the mean of $\eta$ independent random variables drawn from a uniform distribution \cite{Jonhson1995}.
A larger $\eta$ value increases the concentration of samples around the distribution's mean, which in this case is the image center.
To achieve the opposite effect, i.e., concentrating samples at the image borders, we extend the distribution to $\eta \leq -1$:
\begin{align*}
X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
\end{align*}
This is accomplished by sampling from a standard Bates distribution with parameter $-\eta \geq 1$ and then applying a sawtooth function.
The sawtooth function on the interval $[0,1]$ is defined as
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
This function effectively maps the central portion of the interval to the edges and the edge portions to the center.
For example, a value of 0.3 (central-left) is mapped to 0.8 (edge-right), while 0.8 (edge-right) is mapped to 0.3 (central-left).
This transformation inverts the distribution's concentration, shifting the probability mass from the center to the borders.
We visualize the probability density function of the extended Bates distribution in \Cref{fig:bates-pdf}.
Both $\eta = 1$ and $\eta = -1$ result in a uniform distribution across the image.
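As a worked example of this construction, consider $\eta = -2$.
The following is a sketch under our reading of the definition; the symbols $Y$, $U_1$, $U_2$, $f_Y$, and $f_X$ are introduced here for illustration only.
Let $Y = \frac{1}{2}(U_1 + U_2)$ with $U_1, U_2$ independent and uniform on $[0,1]$, so that $Y \sim \text{Bates}(2)$ has a triangular density $f_Y$.
Since $s$ is its own inverse on $(0,1)$ and only shifts each half of the interval without rescaling it, $X = s(Y)$ has the density $f_X(x) = f_Y(s(x))$:
\begin{align*}
f_Y(y) &= \begin{cases}
4y & \text{if } 0 \leq y \leq 0.5 \\
4(1 - y) & \text{if } 0.5 < y \leq 1
\end{cases}
&
f_X(x) &= \begin{cases}
2 - 4x & \text{if } 0 \leq x < 0.5 \\
4x - 2 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align*}
The density of $X$ therefore vanishes at the center $x = 0.5$ and is largest at the borders $x = 0$ and $x = 1$, which is the mirror image of the standard case $\eta = 2$.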
\section{Design Choices of \schemename}
\label{sec:ablation}
We start by ablating the design choices of \schemename on TinyImageNet~\cite{Le2015}, a subset of ImageNet containing 200 categories with 500 images each. %, and Tiny\name, the application of \schemename to TinyImageNet.
% \Cref{tab:ablation} presents the results of these ablations.
\Cref{tab:ablation-segment} presents ablations for segmentation and \Cref{tab:ablation-recombine} for recombination.
\begin{table}
\caption{Ablation of the design decisions in the segmentation phase of \schemename on TinyImageNet.
The first line is our baseline, while the other lines are using \schemename.
We use basic settings with the \emph{same} background strategy during recombination for this experiment.
}
\label{tab:ablation-segment}
\centering
\small
% \resizebox{.9\columnwidth}{!}{
\begin{tabular}{llcc}
\toprule
\multirow{2.5}{*}{\makecell{Detect. \\Prompt}} & \multirow{2.5}{*}{\makecell{Infill \\ Model}} & \multicolumn{2}{c}{TinyImageNet Accuracy [\%]} \\
\cmidrule{3-4}
& & ViT-Ti & ViT-S \\
\midrule
\multicolumn{2}{l}{\textbf{TinyImageNet}} & $66.1 \pm 0.5$ & $68.3 \pm 0.7$ \\
specific & LaMa \cite{Suvorov2022} & $65.5 \pm 0.4$ & $71.2 \pm 0.5$ \\
general & \gtxt{LaMa \cite{Suvorov2022}} & $66.4 \pm 0.6$ & $72.9 \pm 0.6$ \\
\gtxt{general} & Att. Eraser \cite{Sun2025} & $67.5 \pm 1.2$ & $72.4 \pm 0.5$ \\
\bottomrule
\end{tabular}
% }
\end{table}
\begin{table}[t]
\caption{Ablation of the recombination phase of \schemename on TinyImageNet (top) and ImageNet (bottom). The first experiments use the initial segmentation settings with LaMa \cite{Suvorov2022}.}
\label{tab:ablation-recombine}
\centering
% \resizebox{.9\columnwidth}{!}{
\begin{tabular}{ccccccccccc}
\toprule
% FG. & Augment. & BG. & BG. & Edge & Original & \multicolumn{2}{c}{Accuracy [\%]} \\
% Size & Order & Strat. & Prune & Smoothing & Mixing & ViT-Ti & ViT-S \\
\multirow{2.5}{*}{\makecell{FG. \\size}} & \multirow{2.5}{*}{\makecell{Augment.\\Order}} & \multirow{2.5}{*}{\makecell{BG\\Strat.}} & \multirow{2.5}{*}{\makecell{BG.\\Prune}} & \multirow{2.5}{*}{\makecell{Original\\Mixing}} & \multirow{2.5}{*}{\makecell{Edge\\Smooth.}} & \multicolumn{2}{c}{Accuracy [\%]} \\
\cmidrule{7-8}
& & & & & & ViT-Ti & ViT-S \\
\midrule
% TinyImageNet & & & & & & & $66.1\pm0.5$ & $68.3\pm0.7$ \\
\multicolumn{6}{l}{\textbf{TinyImageNet}} & \gtxt{$66.1\pm0.5$} & \gtxt{$68.3\pm0.7$} \\
mean & crop$\to$paste & same & - & - & \gtxt{-} & $64.6\pm0.5$ & $70.0\pm0.6$ \\
range & \gtxt{crop$\to$paste} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $65.5\pm0.4$ & $71.2\pm0.5$ \\
\midrule
% \gtxt{range} & \gtxt{crop$\to$paste} & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $66.4\pm0.6$ & $72.9\pm0.6$ \\
{range} & {crop$\to$paste} & {same} & {-} & {-} & {-} & $67.5\pm1.2$ & $72.4\pm0.5$ \\
\gtxt{range} & paste$\to$crop & \gtxt{same} & \gtxt{-} & \gtxt{-} & \gtxt{-} & $67.1\pm1.2$ & $72.9\pm0.5$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 1.0 & \gtxt{-} & \gtxt{-} & $67.0\pm1.2$ & $73.0\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 0.8 & \gtxt{-} & \gtxt{-} & $67.2\pm1.2$ & $72.9\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & 0.6 & \gtxt{-} & \gtxt{-} & $67.5\pm1.0$ & $72.8\pm0.7$ \\
% \gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 2.0$ & \gtxt{-} & $67.2\pm0.4$ & $72.9\pm0.5$ \\
% \gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $\sigma_\text{max} = 4.0$ & \gtxt{-} & $65.9\pm0.5$ & $72.4\pm0.6$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.2$ & \gtxt{-} & $69.8\pm0.5$ & $75.0\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.33$ & \gtxt{-} & $69.5\pm0.4$ & $75.2\pm1.0$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & $p=0.5$ & \gtxt{-} & $70.3\pm1.0$ & $74.2\pm0.2$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & linear & \gtxt{-} & $70.1\pm0.7$ & $74.9\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & reverse lin. & \gtxt{-} & $67.6\pm0.2$ & $73.2\pm0.3$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & cos & \gtxt{-} & $71.3\pm1.0$ & $75.7\pm0.8$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & $\sigma_\text{max} = 4.0$ & $70.0\pm0.8$ & $75.5\pm0.7$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & orig. & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & $67.2\pm0.9$ & $69.9\pm1.0$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & all & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & $70.1\pm0.7$ & $77.5\pm0.6$ \\
\midrule
\multicolumn{6}{l}{\textbf{ImageNet}} & \gtxt{-} & \gtxt{$79.1\pm0.1$} \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & \gtxt{-} & - & $80.5\pm0.1$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & \gtxt{same} & \gtxt{0.8} & \gtxt{cos} & $\sigma_\text{max} = 4.0$ & - & $80.7\pm0.1$ \\
\gtxt{range} & \gtxt{paste$\to$crop} & all & \gtxt{0.8} & \gtxt{cos} & \gtxt{$\sigma_\text{max} = 4.0$} & - & $81.4\pm0.1$ \\
\bottomrule
\end{tabular}
% }
\end{table}
\textbf{Prompt.}
% We present the ablation of our main design decisions in \Cref{tab:ablation}.
First, we evaluate the type of prompt used to detect the foreground object.
Here, the \emph{general} prompt, which contains the class and the more general object category, outperforms using only the class name (\emph{specific}).
\textbf{Inpainting.} Among inpainting models, Attentive Eraser~\cite{Sun2025} produces slightly better results compared to LaMa~\cite{Suvorov2022} ($+0.5$ p.p. on average).
For inpainting examples, see the supplementary material.
% (see the supplementary material for examples).
% When comparing the infill models, the GAN-based LaMa \cite{Suvorov2022} gets outperformed by the Attentive Eraser \cite{Sun2025}.
\textbf{Foreground size}
% We observe that LaMa's often infills unnatural textures compared to Attentive Eraser.
% The size of foreground objects during training has a significant impact on the performance.
% Here, using the greater variability of the \emph{range} strategy increases the performance by $\approx 1\%$ compared to the \emph{mean} strategy.
significantly impacts performance.
Employing a \emph{range} of sizes during recombination, rather than a fixed \emph{mean} size, boosts accuracy by approximately 1 p.p.
This suggests that the added variability is beneficial.
\textbf{Order of data augmentation.}
% (1) Applying the image crop related augmentations \emph{before} pasting the foreground object and the color-based ones \emph{after} pasting or (2) applying all data augmentations after pasting the foreground object.
% While results are ambiguous, we choose the second strategy, as it improves the performance of ViT-S, although not the one of ViT-Ti.
Applying all augmentations after foreground-background recombination (\emph{paste$\to$crop$\to$color}) improves ViT-S's performance compared to applying crop-related augmentations before pasting (\emph{crop$\to$paste$\to$color}).
ViT-Ti results are ambiguous.
\textbf{Background pruning.}
For the choice of backgrounds, we test different pruning thresholds ($t_\text{prune}$) to exclude backgrounds in which a large fraction of the image is inpainted.
% and only use backgrounds with an relative size of the infilled region of at most $t_\text{prune}$ (exclusive).
A threshold of $t_\text{prune}=1.0$ means that we use all backgrounds that are not fully infilled.
% We find that the background pruning does not significantly impact the models' performance.
% We choose $t_\text{prune}=0.8$ for the following experiments to exclude backgrounds that are mostly artificial.
Varying $t_\text{prune}$ has minimal impact.
We choose $t_\text{prune} = 0.8$ to exclude predominantly artificial backgrounds.
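The pruning rule can be written compactly.
The notation below is our own and serves only as an illustration: $M_b \in \{0,1\}^{H \times W}$ denotes the binary mask of the infilled region of a background $b$, and $r_b$ its relative infilled area.
\begin{align*}
r_b = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} M_b[i,j], \qquad b \text{ is kept} \iff r_b < t_\text{prune}
\end{align*}
With $t_\text{prune} = 1.0$, only fully infilled backgrounds ($r_b = 1$) are dropped, whereas $t_\text{prune} = 0.8$ additionally drops every background in which at least $80\%$ of the pixels are infilled.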
% One of the most important design decisions is the mixing of the original dataset with \name.
\textbf{Mixing} \schemename-augmented samples with the original ImageNet data proves crucial.
While constant and linear mixing schedules improve performance by $2$--$3$ p.p. compared to training only on augmented samples, the cosine annealing schedule proves optimal, boosting accuracy by $3$--$4$ p.p.
\textbf{Edge smoothing.}
We evaluate the impact of using Gaussian blurring to smooth the edges of the foreground masks.
% Similarly, applying edge smoothing to foreground masks with Gaussian blurring actually hurts performance on Tiny\name, but slightly improves it on \name.
For larger models, this gives us a slight performance boost on the full ImageNet (second to last line in \Cref{tab:ablation-recombine}).
\textbf{Background strategy.}
Another point is the allowed choice of background image for each foreground object.
% We evaluate three different strategies.
% (1) Picking the background from which that specific foreground was originally extracted.
% The major difference to ImageNet when using this setup is the variability in size and position of the foreground object.
% (2) Picking a background that originally had a foreground object of the same class in it.
% Here, we have backgrounds where objects of this type can typically appear while also creating a wider variety of samples due to pairing each foreground object with different backgrounds each time.
% (3) Picking any background.
% This choice has the largest variety of backgrounds, but the backgrounds are not semantically related to the foreground object anymore.
% We find in \Cref{fig:bg-strategy} that choosing only a foreground's original background is the worst choice.
We compare using the original background, a background from the same class, and any background.
These strategies go from low diversity and high shared information content between the foreground and background to high diversity and low shared information content.
For \emph{ViT-Ti}, the latter two strategies perform comparably, while \emph{ViT-S} benefits from the added diversity of using any background.
The same is true when training on the full ImageNet.
\begin{table}
\caption{Accuracy of ViT-S on TinyImageNet (TIN) in percent using \schemename with different foreground position distributions by varying the Bates parameter $\eta$.
The best performance is achieved when using the uniform distribution ($\eta=1$) for training.}
\label{tbl:foreground-eta}
\centering
\small
% \resizebox{.9\columnwidth}{!}{
\begin{tabular}{ccccccc}
\toprule
\multirow{2.5}{*}{\makecell{Bates Parameter \\during training}} & \multirow{2.5}{*}{\makecell{TIN \\w/o \schemename}} & \multicolumn{5}{c}{TIN w/ \schemename} \\
\cmidrule(l){3-7}
& & $\eta=-3$ & $-2$ & $1/-1$ & $2$ & $3$ \\
\midrule
Baseline & 68.9 & 60.5 & 60.2 & 60.8 & 62.6 & 63.1 \\
$\eta=-3$ & 71.3 & 79.3 & 79.5 & 79.1 & 79.3 & 79.1 \\
$\eta=-2$ & 71.5 & 80.0 & 78.7 & 79.3 & 79.1 & 78.8 \\
$\eta=1/-1$ & 72.3 & 79.5 & 78.9 & 80.2 & 79.7 & 80.4 \\
$\eta=2$ & 71.3 & 78.2 & 77.8 & 79.1 & 79.6 & 79.9 \\
$\eta=3$ & 71.4 & 77.2 & 76.9 & 78.6 & 79.6 & 79.7 \\
\bottomrule
\end{tabular}
% }
\end{table}
\textbf{Foreground position.}
Finally, we analyze the foreground object's positioning in the image, using a
generalization of the Bates distribution~\cite{Bates1955} with parameter $\eta \in \Z$ (see \Cref{apdx:bates-distribution}).
The Bates distribution presents an easy way to sample from a bounded domain with just one hyperparameter that controls its concentration.
$\eta = 1/-1$ corresponds to the uniform distribution; $\eta > 1$ concentrates the distribution around the center; and for $\eta < -1$, the distribution is concentrated at the borders (see supplementary material for details).
% We utilize an extended Bates distribution to sample the position of the foreground object.
% The Bates distribution with parameter $\eta \geq 1$ is the mean of $\eta$ independent uniformly distributed random variables \cite{Jonhson1995}.
% The larger $\eta$, the more concentrated the distribution is at the center, $\eta < -1$ concentrates the distribution at the edges.
% We extend this concept to $\eta \leq -1$, shifting the distribution away from the center and towards the edges.
When sampling more towards the center of the image, the difficulty of the task is reduced, which reduces performance on TinyImageNet (\Cref{tbl:foreground-eta}).
This is reflected in the performance when evaluating using \schemename with $\eta=2$ and $\eta=3$ compared to $\eta=1/-1$.
We observe a similar reduction for $\eta < -1$.
% This experiment is conducted using the LaMa infill model.
\begin{table}[t]
\caption{Dataset statistics for TinyImageNet and ImageNet with and without \schemename. For \schemename we report the number of foreground/background pairs.}
\label{tab:dataset-stats}
\centering
% \resizebox{.5\columnwidth}{!}{
\begin{tabular}{l S[table-format=4.0] S[table-format=7.0] S[table-format=5.0]}
\toprule
Dataset & {Classes} & {\makecell{Training \\ Images}} & {\makecell{Validation \\ Images}} \\
\midrule
TinyImageNet & 200 & 100000 & 10000 \\
TinyImageNet + \schemename & 200 & 99404 & 9915 \\
ImageNet & 1000 & 1281167 & 50000 \\
ImageNet + \schemename & 1000 & 1274557 & 49751 \\
\bottomrule
\end{tabular}
% }
\end{table}
After fixing the optimal design parameters in \Cref{tab:ablation-segment,tab:ablation-recombine} (last rows), we run \schemename's segmentation step on the entire ImageNet dataset.
\Cref{tab:dataset-stats} shows the resulting dataset statistics.
% The slightly lower number of images in \name is due to \emph{Grounded SAM} returning no or invalid detections for some images.
The slightly reduced image count for \schemename is due to instances where Grounded SAM fails to produce valid segmentation masks.
\section{Robustness Evaluation on Corner-Cases}
\begin{table}[t]
\centering
\caption{Evaluation on the Corner-Cases dataset. Objects cut from ImageNet evaluation bounding boxes are pasted onto infilled backgrounds. Objects have three sizes: $56$px, $84$px, and $112$px. Objects are placed in the center (CeX) or a corner (CoX) of either their original background (XxO) or a random background (XxR).}
\label{tab:corner-cases}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccccccc}
\toprule
\multirow{4}{*}{Model} & \multirow{4}{*}{w/ \schemename} & \multicolumn{12}{c}{Corner Cases Accuracy [\%]} \\
\cmidrule(l){3-14}
& & \multicolumn{4}{c}{56} & \multicolumn{4}{c}{84} & \multicolumn{4}{c}{112} \\
\cmidrule(lr){3-6} \cmidrule(lr){7-10} \cmidrule(l){11-14}
& & CeO & CoO & CeR & CoR & CeO & CoO & CeR & CoR & CeO & CoO & CeR & CoR \\
\midrule
ViT-S & \xmark & $40.5 \pm 2.0$ & $28.6 \pm 0.8$ & $10.3 \pm 0.9$ & $6.4 \pm 0.2$ & $56.8 \pm 1.2$ & $47.6 \pm 1.0$ & $31.3 \pm 0.7$ & $25.5 \pm 0.5$ & $70.9 \pm 0.1$ & $66.9 \pm 1.6$ & $55.2 \pm 0.2$ & $51.1 \pm 0.8$ \\
ViT-S & \cmark & $49.4 \pm 0.6$ & $39.9 \pm 0.5$ & $22.7 \pm 0.4$ & $17.6 \pm 0.3$ & $66.3 \pm 0.3$ & $60.0 \pm 0.3$ & $47.7 \pm 0.7$ & $43.2 \pm 0.2$ & $76.5 \pm 0.2$ & $74.9 \pm 0.4$ & $66.8 \pm 0.6$ & $64.9 \pm 0.1$ \\
& & \grntxt{$+8.9$} & \grntxt{$+11.3$} & \grntxt{$+12.4$} & \grntxt{$+11.2$} & \grntxt{$+9.4$} & \grntxt{$+12.4$} & \grntxt{$+16.4$} & \grntxt{$+17.7$} & \grntxt{$+5.6$} & \grntxt{$+8.0$} & \grntxt{$+11.6$} & \grntxt{$+13.7$} \\
\cmidrule(r){1-2}
ViT-B & \xmark & $37.9 \pm 1.4$ & $29.3 \pm 0.7$ & $14.0 \pm 1.7$ & $11.9 \pm 1.1$ & $51.5 \pm 0.7$ & $45.0 \pm 0.8$ & $27.3 \pm 0.8$ & $26.3 \pm 0.8$ & $64.7 \pm 0.3$ & $61.8 \pm 0.6$ & $46.3 \pm 0.3$ & $45.5 \pm 0.5$ \\
ViT-B & \cmark & $50.4 \pm 0.8$ & $42.4 \pm 0.6$ & $26.5 \pm 0.6$ & $22.8 \pm 0.8$ & $65.3 \pm 0.9$ & $60.9 \pm 0.6$ & $47.6 \pm 0.3$ & $45.6 \pm 0.1$ & $75.7 \pm 0.6$ & $74.0 \pm 0.6$ & $65.7 \pm 0.7$ & $64.3 \pm 0.5$ \\
& & \grntxt{$+12.5$} & \grntxt{$+13.1$} & \grntxt{$+12.4$} & \grntxt{$+10.9$} & \grntxt{$+13.8$} & \grntxt{$+15.9$} & \grntxt{$+20.2$} & \grntxt{$+19.3$} & \grntxt{$+11.0$} & \grntxt{$+12.2$} & \grntxt{$+19.3$} & \grntxt{$+18.8$} \\
\cmidrule(r){1-2}
ViT-L & \xmark & $32.8 \pm 1.6$ & $24.8 \pm 1.1$ & $14.8 \pm 2.2$ & $9.7 \pm 1.2$ & $42.7 \pm 0.9$ & $33.8 \pm 0.7$ & $21.3 \pm 1.5$ & $16.3 \pm 1.0$ & $55.7 \pm 0.7$ & $49.7 \pm 0.7$ & $36.0 \pm 1.3$ & $32.5 \pm 0.9$ \\
ViT-L & \cmark & $45.7 \pm 0.6$ & $39.0 \pm 0.5$ & $25.6 \pm 0.6$ & $24.1 \pm 0.8$ & $59.1 \pm 0.3$ & $55.2 \pm 0.4$ & $41.9 \pm 1.0$ & $42.7 \pm 0.6$ & $71.4 \pm 0.3$ & $69.0 \pm 0.4$ & $60.7 \pm 1.0$ & $60.3 \pm 0.8$ \\
& & \grntxt{$+12.9$} & \grntxt{$+14.2$} & \grntxt{$+10.8$} & \grntxt{$+14.4$} & \grntxt{$+16.3$} & \grntxt{$+21.5$} & \grntxt{$+20.5$} & \grntxt{$+26.4$} & \grntxt{$+15.7$} & \grntxt{$+19.3$} & \grntxt{$+24.7$} & \grntxt{$+27.8$} \\
\cmidrule(r){1-2}
DeiT-S & \xmark & $46.3 \pm 0.7$ & $38.1 \pm 0.3$ & $13.1 \pm 0.5$ & $9.9 \pm 0.1$ & $62.8 \pm 0.4$ & $58.2 \pm 0.2$ & $37.1 \pm 0.7$ & $34.3 \pm 0.5$ & $73.3 \pm 0.2$ & $73.9 \pm 0.4$ & $58.8 \pm 0.4$ & $59.4 \pm 0.6$ \\
DeiT-S & \cmark & $44.7 \pm 1.4$ & $37.1 \pm 1.4$ & $15.6 \pm 1.3$ & $12.1 \pm 0.9$ & $62.1 \pm 1.2$ & $57.8 \pm 1.1$ & $41.6 \pm 1.1$ & $37.9 \pm 1.2$ & $73.2 \pm 0.7$ & $73.3 \pm 0.4$ & $62.3 \pm 0.7$ & $61.4 \pm 0.9$ \\
& & \rdtxt{$-1.6$} & \rdtxt{$-1.1$} & \grntxt{$+2.4$} & \grntxt{$+2.2$} & \rdtxt{$-0.7$} & \rdtxt{$-0.4$} & \grntxt{$+4.4$} & \grntxt{$+3.5$} & \gtxt{$-0.1$} & \rdtxt{$-0.6$} & \grntxt{$+3.5$} & \grntxt{$+2.0$} \\
\cmidrule(r){1-2}
DeiT-B & \xmark & $48.1 \pm 0.9$ & $40.4 \pm 2.0$ & $15.8 \pm 0.2$ & $12.9 \pm 0.6$ & $64.0 \pm 0.9$ & $59.5 \pm 1.3$ & $39.0 \pm 0.9$ & $37.2 \pm 0.8$ & $74.1 \pm 0.7$ & $74.8 \pm 0.7$ & $59.1 \pm 0.8$ & $60.0 \pm 0.6$ \\
DeiT-B & \cmark & $50.7 \pm 0.1$ & $44.0 \pm 0.4$ & $19.3 \pm 0.2$ & $16.3 \pm 0.2$ & $66.0 \pm 0.2$ & $62.0 \pm 0.3$ & $43.4 \pm 0.3$ & $40.9 \pm 0.4$ & $75.4 \pm 0.1$ & $76.4 \pm 0.3$ & $62.8 \pm 0.2$ & $63.9 \pm 0.2$ \\
& & \grntxt{$+2.6$} & \grntxt{$+3.6$} & \grntxt{$+3.5$} & \grntxt{$+3.5$} & \grntxt{$+2.0$} & \grntxt{$+2.5$} & \grntxt{$+4.4$} & \grntxt{$+3.8$} & \grntxt{$+1.3$} & \grntxt{$+1.6$} & \grntxt{$+3.8$} & \grntxt{$+3.9$} \\
\cmidrule(r){1-2}
DeiT-L & \xmark & $39.2 \pm 2.6$ & $32.6 \pm 1.5$ & $10.5 \pm 2.8$ & $9.1 \pm 2.3$ & $55.7 \pm 2.5$ & $51.0 \pm 2.7$ & $30.3 \pm 4.0$ & $29.5 \pm 3.9$ & $68.5 \pm 2.1$ & $68.1 \pm 1.7$ & $51.7 \pm 3.1$ & $52.1 \pm 2.7$ \\
DeiT-L & \cmark & $51.9 \pm 0.7$ & $46.6 \pm 0.5$ & $21.5 \pm 1.3$ & $19.0 \pm 1.2$ & $66.6 \pm 0.6$ & $64.1 \pm 0.7$ & $45.3 \pm 1.3$ & $43.6 \pm 1.1$ & $75.6 \pm 0.4$ & $77.3 \pm 0.4$ & $63.8 \pm 0.8$ & $65.4 \pm 0.6$ \\
& & \grntxt{$+12.8$} & \grntxt{$+14.0$} & \grntxt{$+11.0$} & \grntxt{$+9.9$} & \grntxt{$+11.0$} & \grntxt{$+13.1$} & \grntxt{$+15.0$} & \grntxt{$+14.1$} & \grntxt{$+7.1$} & \grntxt{$+9.2$} & \grntxt{$+12.1$} & \grntxt{$+13.4$} \\
\cmidrule(r){1-2}
Swin-Ti & \xmark & $41.2 \pm 1.8$ & $32.5 \pm 0.3$ & $17.4 \pm 2.6$ & $12.2 \pm 0.2$ & $60.0 \pm 1.6$ & $51.4 \pm 0.2$ & $39.6 \pm 2.6$ & $34.8 \pm 0.9$ & $71.7 \pm 0.8$ & $66.1 \pm 0.7$ & $58.2 \pm 1.1$ & $53.6 \pm 1.2$ \\
Swin-Ti & \cmark & $49.8 \pm 0.6$ & $42.8 \pm 0.7$ & $24.2 \pm 0.7$ & $21.4 \pm 0.9$ & $66.4 \pm 0.6$ & $60.5 \pm 0.2$ & $47.8 \pm 0.5$ & $44.6 \pm 0.5$ & $76.0 \pm 0.3$ & $72.7 \pm 0.2$ & $65.7 \pm 0.5$ & $62.1 \pm 0.3$ \\
& & \grntxt{$+8.5$} & \grntxt{$+10.3$} & \grntxt{$+6.8$} & \grntxt{$+9.2$} & \grntxt{$+6.4$} & \grntxt{$+9.2$} & \grntxt{$+8.2$} & \grntxt{$+9.8$} & \grntxt{$+4.3$} & \grntxt{$+6.5$} & \grntxt{$+7.5$} & \grntxt{$+8.5$} \\
\cmidrule(r){1-2}
Swin-S & \xmark & $41.3 \pm 0.6$ & $33.0 \pm 0.1$ & $18.4 \pm 0.7$ & $13.3 \pm 0.5$ & $59.2 \pm 0.1$ & $51.2 \pm 0.5$ & $39.1 \pm 0.2$ & $35.9 \pm 0.3$ & $71.5 \pm 0.2$ & $65.6 \pm 0.1$ & $56.8 \pm 0.5$ & $53.2 \pm 0.2$ \\
Swin-S & \cmark & $48.6 \pm 0.7$ & $39.9 \pm 1.6$ & $22.2 \pm 0.9$ & $16.8 \pm 1.1$ & $64.4 \pm 0.9$ & $57.9 \pm 1.5$ & $43.8 \pm 1.1$ & $42.3 \pm 1.0$ & $75.7 \pm 0.2$ & $71.8 \pm 0.8$ & $63.2 \pm 0.4$ & $60.6 \pm 0.6$ \\
& & \grntxt{$+7.3$} & \grntxt{$+7.0$} & \grntxt{$+3.8$} & \grntxt{$+3.6$} & \grntxt{$+5.1$} & \grntxt{$+6.7$} & \grntxt{$+4.7$} & \grntxt{$+6.4$} & \grntxt{$+4.2$} & \grntxt{$+6.2$} & \grntxt{$+6.4$} & \grntxt{$+7.4$} \\
\cmidrule(r){1-2}
ResNet50 & \xmark & $48.6 \pm 0.6$ & $35.1 \pm 0.4$ & $23.0 \pm 0.7$ & $13.0 \pm 0.3$ & $65.8 \pm 0.4$ & $58.2 \pm 0.3$ & $44.4 \pm 0.6$ & $38.1 \pm 0.5$ & $73.2 \pm 0.2$ & $69.9 \pm 0.2$ & $56.9 \pm 0.1$ & $56.9 \pm 0.1$ \\
ResNet50 & \cmark & $52.3 \pm 0.6$ & $39.5 \pm 0.1$ & $27.4 \pm 0.6$ & $17.6 \pm 0.1$ & $68.5 \pm 0.3$ & $61.9 \pm 0.1$ & $48.5 \pm 0.4$ & $43.7 \pm 0.3$ & $75.2 \pm 0.1$ & $72.4 \pm 0.1$ & $61.7 \pm 0.3$ & $61.7 \pm 0.3$ \\
& & \grntxt{$+3.7$} & \grntxt{$+4.4$} & \grntxt{$+4.4$} & \grntxt{$+4.6$} & \grntxt{$+2.8$} & \grntxt{$+3.8$} & \grntxt{$+4.2$} & \grntxt{$+5.5$} & \grntxt{$+2.0$} & \grntxt{$+2.5$} & \grntxt{$+4.8$} & \grntxt{$+4.8$} \\
\cmidrule(r){1-2}
ResNet101 & \xmark & $47.8 \pm 0.7$ & $37.2 \pm 0.5$ & $20.4 \pm 1.2$ & $14.2 \pm 0.3$ & $64.9 \pm 0.2$ & $58.6 \pm 0.5$ & $41.1 \pm 0.5$ & $38.3 \pm 0.7$ & $73.6 \pm 0.3$ & $70.5 \pm 0.3$ & $56.2 \pm 0.4$ & $57.0 \pm 0.5$ \\
ResNet101 & \cmark & $52.3 \pm 0.1$ & $42.2 \pm 0.1$ & $24.7 \pm 0.1$ & $19.2 \pm 0.4$ & $68.8 \pm 0.6$ & $62.9 \pm 0.3$ & $46.4 \pm 1.5$ & $44.3 \pm 0.9$ & $76.0 \pm 0.4$ & $73.7 \pm 0.3$ & $61.0 \pm 1.2$ & $62.6 \pm 0.5$ \\
& & \grntxt{$+4.4$} & \grntxt{$+5.0$} & \grntxt{$+4.3$} & \grntxt{$+5.0$} & \grntxt{$+3.9$} & \grntxt{$+4.3$} & \grntxt{$+5.3$} & \grntxt{$+6.0$} & \grntxt{$+2.4$} & \grntxt{$+3.2$} & \grntxt{$+4.7$} & \grntxt{$+5.7$} \\
\bottomrule
\end{tabular}
}
\end{table}
\Cref{tab:corner-cases} reports accuracy on the corner-cases dataset~\cite{Fatima2025} for models trained with and without \schemename.
The dataset is constructed by pasting objects cropped by their full bounding boxes (which are available for the ImageNet validation set) onto 224$\times$224 infilled backgrounds.
The dataset varies three factors: foreground size (56, 84, or 112 pixels), spatial position (center, CeX, vs.\ corner, CoX), and background type (original image background, XxO, vs.\ a random background, XxR), yielding $3 \times 2 \times 2 = 12$ controlled configurations per model.
Across all architectures, training with \schemename consistently improves robustness to these composition shifts.
For ViT-S/B/L, gains range from roughly $+6$ to over $+27$ percentage points, with the largest improvements occurring in the most challenging settings, where foregrounds are placed on random backgrounds (CeR and CoR).
Swin and ResNet models also benefit across all configurations, with increases typically between $+3$ and $+10$ points.
DeiT-S shows small drops on the original-background cases (CeO/CoO), but still improves notably on the random-background conditions (XxR), while DeiT-B/L gain across all settings.
Three trends are apparent.
First, all baselines perform substantially worse when moving from original to random backgrounds and from centered to corner placements, indicating strong background and center biases.
Second, \schemename reduces this sensitivity: the absolute gap between center and corner, and between original and random backgrounds, shrinks for almost all models and sizes.
Third, the relative improvements are especially pronounced for smaller objects and off-center placements, suggesting that \schemename makes models more foreground-focused and less reliant on canonical object scale and position.
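As a concrete illustration of the second trend, take ViT-S at $84$px from \Cref{tab:corner-cases}.
The calculation below uses the mean accuracies only and ignores the standard deviations; each line shows the gap without \schemename followed by the gap with \schemename, in percentage points.
\begin{align*}
\text{center vs.\ corner (original background):} \quad & 56.8 - 47.6 = 9.2 \;\to\; 66.3 - 60.0 = 6.3 \\
\text{original vs.\ random background (center):} \quad & 56.8 - 31.3 = 25.5 \;\to\; 66.3 - 47.7 = 18.6
\end{align*}
Both gaps shrink in this example, although a substantial sensitivity to the background remains.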
\section{\schemename Segmentation Samples}
\begin{figure}[t!]
\centering
\begin{subfigure}{.49\textwidth}
\includegraphics[width=\textwidth]{img/masked_image_examples_train.pdf}
\end{subfigure}
\hfill
\begin{subfigure}{.49\textwidth}
\includegraphics[width=\textwidth]{img/masked_image_examples.pdf}
\end{subfigure}
\caption{ImageNet training samples (left) and validation samples (right) of our segmentation masks with annotated bounding boxes.}
\label{fig:mask-examples}
\end{figure}
We show examples of the automatically generated segmentation masks for a diverse subset of object categories (``ant,'' ``busby,'' ``bell cote,'' ``pickelhaube,'' ``snorkel,'' ``stove,'' ``tennis ball,'' and ``volleyball'').
Note that ``busby,'' ``bell cote,'' ``pickelhaube,'' and ``snorkel'' are the four classes with the \textbf{worst} mean box precision and box-to-box IoU on the validation set.
\Cref{fig:mask-examples} (right) illustrates masks from the validation split, while \Cref{fig:mask-examples} (left) shows examples from the training split.
Across both sets, the masks accurately isolate foreground objects with clean boundaries, despite large variations in object scale, shape, and appearance, supporting their use for background removal and resampling in our training pipeline.
We find that the main failure cases are:
(\textit{i}) When the ground-truth annotation corresponds to only a part of an object, the predicted mask often expands to cover the entire object rather than the annotated region.
See, for example, ``busby'' or ``bell cote''.
(\textit{ii}) In images containing multiple instances, some objects may be missed, resulting in incomplete foreground coverage.
This is especially visible for ``busby'' and ``pickelhaube''.
Note, however, that for ``pickelhaube'' in particular the training distribution differs noticeably from the validation distribution: many training images show just a head rather than a group of people wearing the helmet.
(\textit{iii}) In rare cases, the predicted mask degenerates and covers nearly the entire image, effectively eliminating the background.
This happens in $<10\%$ of all training images, and we do not use the resulting backgrounds for recombination (see \Cref{apdx:infill-ratio}).
\section{\schemename Sample Images}
\begin{table*}[t!]
\centering
\caption{Sample images produced by applying \schemename to ImageNet.}
\label{tbl:example-images}
\resizebox{.93\textwidth}{!}{
\begin{tabular}{ccccc}
\toprule
Class & \makecell{Original \\Image} & \makecell{Extracted \\Foreground} & \makecell{Infilled \\Background} & \schemename's Recombinations \\
\midrule
\makecell{n01531178 \\Goldfinch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v26.JPEG} \\
\makecell{n01818515 \\Macaw} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v28.JPEG} \\
\makecell{n01943899 \\Conch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v15.JPEG} \\
\makecell{n01986214 \\ Hermit Crab} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v21.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v9.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v8.JPEG} \\
\makecell{n02190166 \\Fly} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v23.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v7.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v9.JPEG} \\
\makecell{n02229544 \\Cricket} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v5.JPEG} \\
\makecell{n02443484 \\Black-Footed \\Ferret} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v3.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v4.JPEG} \\
\makecell{n03201208 \\Dining Table} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v21.JPEG} \\
\makecell{n03424325 \\Gasmask} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v26.JPEG} \\
\makecell{n03642806 \\Laptop} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v29.JPEG} \\
\makecell{n04141975 \\Scale} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v23.JPEG}\includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v25.JPEG} \\
\makecell{n07714990 \\Broccoli} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v29.JPEG} \\
\makecell{n07749582 \\Lemon} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v26.JPEG} \\
\makecell{n09332890 \\Lakeside} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v20.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We show some example images of \schemename's recombinations for 14 random classes of ImageNet \cite{Deng2009} in \Cref{tbl:example-images}.
% \schemename visibly varies the background, size, and position of the objects.
The recombined samples display substantial visual diversity, with each extracted foreground appearing in multiple, clearly different background contexts.
Foreground objects remain sharp and well-preserved across recombinations, while backgrounds vary in texture, color, and scene type.
Images show a broad range of spatial placements and scales for the same object, resulting in noticeably different overall layouts.
\FloatBarrier
\section{Infill Model Comparison}
\begin{table*}[h!]
\centering
\caption{Example infills of LaMa and Attentive Eraser.}
\label{tab:infill-examples}
\resizebox{.9\textwidth}{!}{
\begin{tabular}{cc@{\hskip 0.3in}cc}
\toprule
LaMa & Att. Eraser & LaMa & Att. Eraser \\
\midrule
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000090.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000090.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000890.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000890.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002106.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002106.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00005045.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00005045.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00008542.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00008542.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002743.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002743.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00011629.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00011629.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00025256.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00025256.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We visualize example infilled images for both LaMa \cite{Suvorov2022} and Attentive Eraser \cite{Sun2025} in \Cref{tab:infill-examples}.
The side-by-side examples show that both methods generally produce visually consistent infills, with many pairs appearing extremely similar at a glance.
We qualitatively find that Attentive Eraser yields slightly sharper textures or more coherent local structure, while LaMa sometimes produces smoother or more homogenized regions.
Across the table, fine-detail areas such as foliage, bark, and ground textures reveal the most noticeable differences between the two methods.
% We qualitatively find that while LaMa often leaves repeated textures of blurry spots where the object was erased, Attentive Eraser produces slightly cleaner and more coherent infills of the background.
\FloatBarrier
\newpage
\section{Image Infill Ratio}
\label{apdx:infill-ratio}
\begin{table*}[h!]
\centering
\caption{Example infills for images in which the foreground object that has to be infilled covers a large share of the image area (infill ratio, in \%).}
\label{tbl:high-rat}
\resizebox{.8\textwidth}{!}{
\begin{tabular}{ccc}
\toprule
Infill Ratio (\%) & LaMa & Att. Eraser \\
\midrule
83.7 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} \\ \\
88.2 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} \\ \\
93.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} \\ \\
95.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00012151.JPEG}}
\end{tabular}}
\end{table*}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{img/infill_distr.pdf}
\caption{Distribution of the relative size of the detected foreground object that is infilled in the segmentation step of \schemename on ImageNet.
While most images contain objects of smaller size, there is a peak where Grounded~SAM~\cite{Ren2024} detects almost the whole image as the foreground object. For examples of such large infills, see \Cref{tbl:high-rat}.
}
\label{fig:infill-distr}
\end{figure}
\Cref{tbl:high-rat} shows infills for images where Grounded~SAM~\cite{Ren2024} marks a large percentage of the image as the foreground object (the infill ratio), which then has to be erased by the infill models.
The examples show that when the infilled region becomes large, both methods begin to lose coherent global structure, with outputs dominated by repetitive or texture-like patterns.
LaMa tends to produce smoother, more uniform surfaces, as already seen in \Cref{tab:infill-examples}, while Attentive Eraser often generates denser, more regular texture patterns.
Across the rows, increasing infill ratio corresponds to increasingly homogeneous results, with only faint hints of original scene cues remaining.
% While LaMa tends to fill those spots with mostly black or gray and textures similar to what we saw in \Cref{tab:infill-examples}, Attentive Eraser tends to create novel patterns by copying what is left of the background all over the rest of the image.
% We filter out such mostly infilled background using our background pruning hyperparameter $t_\text{prune} = 0.8$.
\Cref{fig:infill-distr} plots the distribution of infill ratios in \schemename.
The number of detections decreases smoothly with the infill ratio up to $\approx 90\%$, but there is an additional peak at an infill ratio of $\approx 100\%$.
We hypothesize that this peak is made up of failure cases of Grounded~SAM.
We filter out all backgrounds that have an infill ratio larger than our pruning threshold $t_\text{prune} = 0.8$, which translates to $10\%$ of backgrounds.
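As a sketch of this filtering rule, assume that the infill ratio of an image is the fraction of its pixels covered by the predicted foreground mask. The symbols $M$, $H$, $W$, and $r$ are notation introduced here for illustration only:
\[
    r = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{ij}, \qquad M \in \{0, 1\}^{H \times W},
\]
and a background is kept for recombination only if $r \leq t_\text{prune}$.
For example, under this assumption, a $224 \times 224$ image whose mask covers $45{,}000$ pixels has $r = 45000 / 50176 \approx 0.897 > 0.8$, so its background would be pruned.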