ForAug/sec/appendix.tex
Tobias Christian Nauen e7c0b531d6 cvpr submission
2026-02-24 12:01:26 +01:00

% !TeX root = ../supplementary.tex
\section{Extended Bates Distribution}
\begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{img/bates.pdf}
\caption{Plot of the probability density function (PDF) of the extended Bates distribution for different parameters $\eta$. Higher values of $\eta$ concentrate the distribution around the center.}
\label{fig:bates-pdf}
\end{figure}
% Finally, we analyze the foreground object's positioning in the image.
% We utilize an extended Bates distribution to sample the position of the foreground object.
% The Bates distribution~\cite{Bates1955} with parameter $\eta \geq 1$ is the mean of $\eta$ independent uniformly distributed random variables \cite{Jonhson1995}.
% Therefore, the larger $\eta$, the more concentrated the distribution is around the center.
% We extend this concept to $\eta \leq -1$ by shifting the distribution away from the center and towards the edges.
% We extend this concept to $\eta \leq -1$ by defining
% \begin{align*}
% X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
% \end{align*}
% for $\eta \leq 1$ with $s$ being the sawtooth function on $[0, 1]$:
% \begin{align}
% s(x) = \begin{cases}
% x + 0.5 & \text{if } 0 < x < 0.5 \\
% x - 0.5 & \text{if } 0.5 \leq x \leq 1
% \end{cases}
% \end{align}
% Note that $s \circ s = \id$ on $[0, 1]$.
% This way, distributions with $\eta \leq -1$ are more concentrated around the borders.
% $\eta = 1$ and $\eta = -1$ both correspond to the uniform distribution.
% The PDF of this extended Bates distribution is visualized in \Cref{fig:bates-pdf}.
We introduce an extension of the Bates distribution~\cite{Bates1955} to include negative parameters, enabling sampling of foreground object positions away from the image center.
The standard Bates distribution, for $\eta \in \N$, is defined as the mean of $\eta$ independent random variables drawn uniformly from $[0, 1]$ \cite{Jonhson1995}.
A larger $\eta$ value increases the concentration of samples around the distribution's mean, which in this case is the image center.
To achieve the opposite effect, concentrating samples at the image borders, we extend the distribution to $\eta \leq -1$:
\begin{align*}
X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
\end{align*}
This is accomplished by sampling from a standard Bates distribution with parameter $-\eta \geq 1$ and then applying a sawtooth function.
The sawtooth function on the interval $[0,1]$ is defined as
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
This function effectively maps the central portion of the interval to the edges and the edge portions to the center.
For example, a value of 0.3 (central-left) is mapped to 0.8 (edge-right), while 0.8 (edge-right) is mapped to 0.3 (central-left).
This transformation inverts the distribution's concentration, shifting the probability mass from the center to the borders.
We visualize the probability density function of the extended Bates distribution in \Cref{fig:bates-pdf}.
Both $\eta = 1$ and $\eta = -1$ result in a uniform distribution across the image.
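The definition above can be made explicit with a short sketch; the symbols $U_i$, $Y$, and $f_\eta$ are notation introduced here for illustration only.
For $\eta \geq 1$, a sample is $Y = \frac{1}{\eta} \sum_{i=1}^{\eta} U_i$ with $U_i$ independent and uniform on $[0, 1]$.
For $\eta \leq -1$, one can draw $Y \sim \text{Bates}(-\eta)$ and set $X = s(Y)$.
Since $s$ shifts each half of the interval by $0.5$ without stretching it, and $s(s(x)) = x$ for almost every $x \in [0, 1]$, the density $f_\eta$ of $X$ is
\begin{align*}
f_\eta(x) = f_{-\eta}(s(x)) = \begin{cases}
f_{-\eta}(x + 0.5) & \text{if } x < 0.5 \\
f_{-\eta}(x - 0.5) & \text{if } x \geq 0.5
\end{cases}
\end{align*}
For $\eta = -1$, we have $f_{1} \equiv 1$ on $[0, 1]$ and therefore $f_{-1} \equiv 1$ as well, which recovers the uniform case.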
\section{Resource Usage of \schemename}
Using \schemename requires computational resources for computing and storing the output of the segmentation stage and for the on-the-fly processing in the recombination stage.
\paragraph{Segmentation.}
% While calculating the segmentations and infills takes a lot of compute, this is effort that has to be spent only once per dataset.
\schemename involves a computationally expensive segmentation and infill stage, which is a one-time calculation per dataset.
Once computed, the segmentation and infill results can be perpetually reused, amortizing the initial cost over all subsequent experiments and applications.
On NVIDIA H100 GPUs, the segmentation stage runs at a rate of $374.3 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using Attentive Eraser or $5\,338.6 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using LaMa.
For ImageNet, this comes down to just under 9 days (Attentive Eraser) or 16 hours (LaMa) on two 8-GPU nodes.
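As a rough consistency check, the wall-clock time follows directly from the throughput. The symbols $T$, $N$, $G$, and $\rho$ are introduced here for illustration, where $N$ is the number of processed images (not stated in this section), $G$ the number of GPUs, and $\rho$ the per-GPU throughput:
\begin{align*}
T = \frac{N}{G \cdot \rho}, \qquad G = 2 \times 8 = 16
\end{align*}
The ratio of the two throughputs, $5338.6 / 374.3 \approx 14.3$, is consistent with the roughly one order of magnitude between the two reported durations.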
To facilitate immediate use and reproduction of results, we publicly provide the precalculated segmentation stage output for the ImageNet dataset for download\footnote{Link will go here.}.
On ImageNet, the output of \schemename's segmentation step requires 73 GB of additional disk space, on top of the 147 GB of the base ImageNet dataset.
\paragraph{Recombination.}
The recombination step of \schemename is implemented as a data loader operation.
It is thus offloaded to the CPU, where it can be heavily parallelized, and results in only a very minor increase in the training step time.
For example, using a ViT-B model on an NVIDIA A100 GPU, the average update step-time increased by $1\%$, from $528 \pm 2$ ms to $534 \pm 1$ ms.
\section{Training Setup}
\label{sec:training_setup}
\begin{table*}[h!]
\centering
\caption{Training setup and hyperparameters for our ImageNet training.}
\label{tab:in-setup}
\begin{tabular}{lcc}
\toprule
Parameter & ViT, Swin, ResNet & DeiT \\
\midrule
Image Resolution & $224 \times 224$ & $224 \times 224$ \\
Epochs & 300 & 300 \\
Learning Rate & 3e-3 & S/B: 1e-3, L: 5e-4 \\
Learning Rate Schedule & cosine decay & cosine decay \\
Batch Size & 2048 & 1024 \\
GPUs & $4\times$ NVIDIA A100/H100/H200 & $4\times$ NVIDIA A100/H100/H200 \\
Warmup Schedule & linear & linear \\
Warmup Epochs & 3 & 3 \\
Weight Decay & 0.02 & 0.05 \\
Label Smoothing & 0.1 & 0.1 \\
Optimizer & Lamb \cite{You2020} & AdamW \\
\cmidrule(r){1-1}
Data Augmentation Policy & \textbf{3-Augment \cite{Touvron2022}} & \textbf{DeiT \cite{Touvron2021b}} \\
Augmentations & \makecell{Resize \\ RandomCrop \\ HorizontalFlip \\ Grayscale \\ Solarize \\ GaussianBlur \\ ColorJitter \\ CutMix \cite{Yun2019}} & \makecell{RandomResizedCrop \\ HorizontalFlip \\ RandomErase \cite{Zhong2017} \\ RandAugment \cite{Cubuk2019} \\ ColorJitter \\ Mixup \cite{Zhang2018a} \\ CutMix \cite{Yun2019}} \\
\bottomrule
\end{tabular}
\end{table*}
\begin{table}[h!]
\centering
\caption{Training setup for finetuning on different downstream datasets. Other settings are the same as in \Cref{tab:in-setup}. For finetuning, we always utilize 3-Augment and the related parameters from the \emph{ViT, Swin, ResNet} column of \Cref{tab:in-setup}.}
\label{tab:downstream-setup}
\begin{tabular}{lcccc}
\toprule
Dataset & Batch Size & Epochs & Learning Rate & Num. GPUs \\
\midrule
Aircraft & 512 & 500 & 3e-4 & 2 \\
Cars & 1024 & 500 & 3e-4 & 4 \\
Flowers & 256 & 500 & 3e-4 & 1 \\
Food & 2048 & 100 & 3e-4 & 4 \\
Pets & 512 & 500 & 3e-4 & 2 \\
\bottomrule
\end{tabular}
\end{table}
On ImageNet we use the same training setup as \cite{Nauen2025} and \cite{Touvron2022} without pretraining for ViT, Swin, and ResNet.
For DeiT, we train the same ViT architecture but using the data augmentation scheme and hyperparameters from \cite{Touvron2021b}.
As our focus is on evaluating the changes in accuracy due to \schemename, like \cite{Nauen2025}, we stick to one set of hyperparameters for all models.
We list the settings used for training on ImageNet in \Cref{tab:in-setup} and the ones used for finetuning those weights on the downstream datasets in \Cref{tab:downstream-setup}.
Our implementation uses PyTorch \cite{Paszke2019} and the \emph{timm} library \cite{Wightman2019} for model architectures and basic functions.
\begin{table*}[h!]
\centering
\caption{Hardware and software specifics used for both training and evaluation.}
\label{tab:hw-sw-versions}
\begin{tabular}{ll}
\toprule
Parameter & Value \\
\midrule
GPU & NVIDIA A100/H100/H200 \\
CPU & 24 CPU cores (Intel Xeon) per GPU \\
Memory & up to 120GB per GPU \\
Operating System & Enroot container for SLURM based on Ubuntu 24.04 LTS \\
Python & 3.12.3 \\
PyTorch & 2.7.0 \\
TorchVision & 0.22.0 \\
Timm & 1.0.15 \\
\bottomrule
\end{tabular}
\end{table*}
\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as versions of the relevant software packages.
\section{\schemename Sample Images}
\begin{table*}[h!]
\centering
\caption{Sample images produced by \schemename on ImageNet.}
\label{tbl:example-images}
\resizebox{.93\textwidth}{!}{
\begin{tabular}{ccccc}
\toprule
Class & \makecell{Original \\Image} & \makecell{Extracted \\Foreground} & \makecell{Infilled \\Background} & \schemename's Recombinations \\
\midrule
\makecell{n01531178 \\Goldfinch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v26.JPEG} \\
\makecell{n01818515 \\Macaw} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v28.JPEG} \\
\makecell{n01943899 \\Conch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v15.JPEG} \\
\makecell{n01986214 \\ Hermit Crab} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v21.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v9.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v8.JPEG} \\
\makecell{n02190166 \\Fly} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v23.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v7.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v9.JPEG} \\
\makecell{n02229544 \\Cricket} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v5.JPEG} \\
\makecell{n02443484 \\Black-Footed \\Ferret} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v3.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v4.JPEG} \\
\makecell{n03201208 \\Dining Table} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v21.JPEG} \\
\makecell{n03424325 \\Gasmask} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v26.JPEG} \\
\makecell{n03642806 \\Laptop} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v29.JPEG} \\
\makecell{n04141975 \\Scale} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v23.JPEG}\includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v25.JPEG} \\
\makecell{n07714990 \\Broccoli} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v29.JPEG} \\
\makecell{n07749582 \\Lemon} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v26.JPEG} \\
\makecell{n09332890 \\Lakeside} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v20.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We show some example images of \schemename's recombinations for 14 random classes of ImageNet \cite{Deng2009} in \Cref{tbl:example-images}.
% \schemename visibly varies the background, size, and position of the objects.
The recombined samples display substantial visual diversity, with each extracted foreground appearing in multiple, clearly different background contexts.
Foreground objects remain sharp and well-preserved across recombinations, while backgrounds vary in texture, color, and scene type.
Images show a broad range of spatial placements and scales for the same object, resulting in noticeably different overall layouts.
\FloatBarrier
\section{Infill Model Comparison}
\begin{table*}[h!]
\centering
\caption{Example infills of LaMa and Attentive Eraser.}
\label{tab:infill-examples}
\resizebox{.9\textwidth}{!}{
\begin{tabular}{cc@{\hskip 0.3in}cc}
\toprule
LaMa & Att. Eraser & LaMa & Att. Eraser \\
\midrule
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000090.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000090.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000890.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000890.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002106.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002106.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00005045.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00005045.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00008542.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00008542.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002743.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002743.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00011629.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00011629.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00025256.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00025256.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We visualize example infilled images for both LaMa \cite{Suvorov2021} and Attentive Eraser \cite{Sun2024} in \Cref{tab:infill-examples}.
The side-by-side examples show that both methods generally produce visually consistent infills, with many pairs appearing extremely similar at a glance.
We qualitatively find that Attentive Eraser yields slightly sharper textures or more coherent local structure, while LaMa sometimes produces smoother or more homogenized regions.
Across the table, fine-detail areas such as foliage, bark, and ground textures reveal the most noticeable differences between the two methods.
% We qualitatively find that while LaMa often leaves repeated textures of blurry spots where the object was erased, Attentive Eraser produces slightly cleaner and more coherent infills of the background.
\FloatBarrier
\newpage
\section{Image Infill Ratio}
\begin{table*}[h!]
\centering
\caption{Example infills for images in which a large fraction of the image area is detected as foreground and therefore infilled (infill ratio).}
\label{tbl:high-rat}
\resizebox{.8\textwidth}{!}{
\begin{tabular}{ccc}
\toprule
Infill Ratio (\%) & LaMa & Att. Eraser \\
\midrule
83.7 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} \\ \\
88.2 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} \\ \\
93.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} \\ \\
95.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} \\
\bottomrule
\end{tabular}}
\end{table*}
\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{img/infill_distr.pdf}
\caption{We plot the distribution of the relative size of the detected foreground object that is infilled in the segmentation step of \schemename on ImageNet.
While most images contain objects of smaller size, there is a peak where Grounded~SAM~\cite{Ren2024} detects almost the whole image as the foreground object. For examples of such large infills, see \Cref{tbl:high-rat}.
}
\label{fig:infill-distr}
\end{figure}
\Cref{tbl:high-rat} shows infills for images where Grounded SAM \cite{Ren2024} marks a high percentage of the image as the foreground object (infill ratio), which has to be erased by the infill models.
The examples show that when the infilled region becomes large, both methods begin to lose coherent global structure, with outputs dominated by repetitive or texture-like patterns.
LaMa tends to produce smoother, more uniform surfaces, as seen in \Cref{tab:infill-examples}, while Attentive Eraser often generates denser, more regular texture patterns.
Across the rows, increasing infill ratio corresponds to increasingly homogeneous results, with only faint hints of original scene cues remaining.
% While LaMa tends to fill those spots with mostly black or gray and textures similar to what we saw in \Cref{tab:infill-examples}, Attentive Eraser tends to create novel patterns by copying what is left of the background all over the rest of the image.
% We filter out such mostly infilled background using our background pruning hyperparameter $t_\text{prune} = 0.8$.
\Cref{fig:infill-distr} plots the distribution of infill ratios in \schemename.
While the number of detections decreases smoothly with increasing infill ratio up to $\approx 90\%$, there is an additional peak at $\approx 100\%$ infill ratio.
We hypothesize that this peak is made up of failure cases of Grounded~SAM.
We filter out all backgrounds that have an infill ratio larger than our pruning threshold $t_\text{prune} = 0.8$, which corresponds to $10\%$ of all backgrounds.
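The pruning rule can be written compactly; the symbols $M$, $H$, $W$, and $r$ are notation assumed here for illustration.
Let $M \in \{0, 1\}^{H \times W}$ denote the foreground mask of an image of height $H$ and width $W$.
Assuming the infill ratio is the fraction of masked pixels, we have
\begin{align*}
r = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{ij}, \qquad \text{keep the background} \Leftrightarrow r \leq t_\text{prune}
\end{align*}
Reading the ratios listed in \Cref{tbl:high-rat} as percentages, all four examples ($83.7\%$ to $95.7\%$) exceed $t_\text{prune} = 0.8$ and would therefore be pruned.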