% !TeX root = ../supplementary.tex

\section{Extended Bates Distribution}
\begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{img/bates.pdf}
\caption{Probability density function (PDF) of the extended Bates distribution for different values of the parameter $\eta$. Higher values of $\eta$ concentrate the distribution around the center.}
\label{fig:bates-pdf}
\end{figure}

% Finally, we analyze the foreground object's positioning in the image.
% We utilize an extended Bates distribution to sample the position of the foreground object.
% The Bates distribution~\cite{Bates1955} with parameter $\eta \geq 1$ is the mean of $\eta$ independent uniformly distributed random variables \cite{Jonhson1995}.
% Therefore, the larger $\eta$, the more concentrated the distribution is around the center.
% We extend this concept to $\eta \leq -1$ by shifting the distribution away from the center and towards the edges.
% We extend this concept to $\eta \leq -1$ by defining
% \begin{align*}
% X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
% \end{align*}
% for $\eta \leq 1$ with $s$ being the sawtooth function on $[0, 1]$:
% \begin{align}
% s(x) = \begin{cases}
% x + 0.5 & \text{if } 0 < x < 0.5 \\
% x - 0.5 & \text{if } 0.5 \leq x \leq 1
% \end{cases}
% \end{align}
% Note that $s \circ s = \id$ on $[0, 1]$.
% This way, distributions with $\eta \leq -1$ are more concentrated around the borders.
% $\eta = 1$ and $\eta = -1$ both correspond to the uniform distribution.
% The PDF of this extended Bates distribution is visualized in \Cref{fig:bates-pdf}.

We introduce an extension of the Bates distribution~\cite{Bates1955} to negative parameters, which enables sampling foreground object positions away from the image center.
The standard Bates distribution with parameter $\eta \in \N$ is the distribution of the mean of $\eta$ independent random variables drawn uniformly from $[0, 1]$~\cite{Jonhson1995}.
A larger $\eta$ concentrates samples more strongly around the distribution's mean, which in our case is the image center.

To achieve the opposite effect, concentrating samples at the image borders, we extend the distribution to $\eta \leq -1$ by defining
\begin{align*}
X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta).
\end{align*}
In practice, we sample from a standard Bates distribution with parameter $-\eta \geq 1$ and then apply the sawtooth function $s$.
The sawtooth function on the interval $[0,1]$ is defined as
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
This function maps the central portion of the interval to the edges and the edge portions to the center.
For example, a value of 0.3 (central-left) is mapped to 0.8 (edge-right), while 0.8 (edge-right) is mapped to 0.3 (central-left).
This transformation inverts the distribution's concentration, shifting the probability mass from the center to the borders.
We visualize the PDF of the extended Bates distribution in \Cref{fig:bates-pdf}.
Both $\eta = 1$ and $\eta = -1$ result in a uniform distribution across the image.

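As a short worked consequence of this definition (a derivation added for illustration, not a result stated elsewhere in this document), the PDF for negative parameters can be expressed through the standard Bates PDF. Here $f_\eta$ denotes the PDF of $\text{Bates}(\eta)$, a symbol introduced only for this sketch. Since $s$ is a piecewise translation with $s(s(x)) = x$ for all $x \in (0, 1)$, it preserves lengths, and for $\eta \leq -1$ we obtain
\begin{align*}
f_\eta(x) = f_{-\eta}(s(x)).
\end{align*}
For instance, $\text{Bates}(2)$ is the triangular distribution with $f_2(y) = 2 - \lvert 4y - 2 \rvert$ on $[0, 1]$, so that
\begin{align*}
f_{-2}(x) = 2 - \lvert 4 s(x) - 2 \rvert = \lvert 4x - 2 \rvert ,
\end{align*}
which vanishes at the center $x = 0.5$ and is largest at the borders $x = 0$ and $x = 1$.
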
\section{Resource Usage of \schemename}
Using \schemename requires computational resources for computing and storing the output of the segmentation stage and for the on-the-fly processing of the recombination stage.

\paragraph{Segmentation.}
% While calculating the segmentations and infills takes a lot of compute, this is effort that has to be spent only once per dataset.
\schemename involves a computationally expensive segmentation and infill stage, which has to be computed only once per dataset.
Once computed, the segmentation and infill results can be reused indefinitely, amortizing the initial cost over all subsequent experiments and applications.
On NVIDIA H100 GPUs, the segmentation stage runs at a rate of $374.3 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using Attentive Eraser or $5\,338.6 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using LaMa.
For ImageNet, this comes down to just under 9 days (Attentive Eraser) or 16 hours (LaMa) on two 8-GPU nodes.
To facilitate immediate use and reproduction of results, we publicly provide the precalculated segmentation stage output for the ImageNet dataset for download\footnote{Link will go here.}.
On ImageNet, the output of \schemename's segmentation stage requires 73 GB of disk space in addition to the 147 GB of the base dataset.
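The wall-clock estimates follow from dividing the number of images by the aggregate throughput. As a rough check (arithmetic added for illustration, assuming $N \approx 1.28 \times 10^6$ ImageNet training images, a number not stated above, and perfect scaling over $G = 16$ GPUs), the time $T$ at a per-GPU rate $r$ is
\begin{align*}
T &= \frac{N}{r \cdot G}, \\
T_\text{Attentive Eraser} &\approx \frac{1.28 \times 10^6}{374.3 \times 16}\,\text{h} \approx 214\,\text{h} \approx 8.9\,\text{days}, \\
T_\text{LaMa} &\approx \frac{1.28 \times 10^6}{5\,338.6 \times 16}\,\text{h} \approx 15\,\text{h}.
\end{align*}
This is roughly in line with the reported figures; the exact values depend on the image count and on scheduling overhead. By the same kind of estimate, the storage overhead is $73 / 147 \approx 50\%$ of the base dataset size.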

\paragraph{Recombination.}
The recombination step of \schemename is implemented as a data-loader-based operation.
It is therefore offloaded to the CPU, where it can be heavily parallelized, and results in only a very minor increase in the training step time.
For example, using a ViT-B model on an NVIDIA A100 GPU, the average update step time increased by about $1\%$, from $528 \pm 2$ ms to $534 \pm 1$ ms.

\section{Training Setup}
\label{sec:training_setup}

\begin{table*}[h!]
\centering
\caption{Training setup and hyperparameters for our ImageNet training.}
\label{tab:in-setup}
\begin{tabular}{lcc}
\toprule
Parameter & ViT, Swin, ResNet & DeiT \\
\midrule
Image Resolution & $224 \times 224$ & $224 \times 224$ \\
Epochs & 300 & 300 \\
Learning Rate & 3e-3 & S/B: 1e-3, L: 5e-4 \\
Learning Rate Schedule & cosine decay & cosine decay \\
Batch Size & 2048 & 1024 \\
GPUs & $4\times$ NVIDIA A100/H100/H200 & $4\times$ NVIDIA A100/H100/H200 \\
Warmup Schedule & linear & linear \\
Warmup Epochs & 3 & 3 \\
Weight Decay & 0.02 & 0.05 \\
Label Smoothing & 0.1 & 0.1 \\
Optimizer & Lamb \cite{You2020} & AdamW \\
\cmidrule(r){1-1}
Data Augmentation Policy & \textbf{3-Augment \cite{Touvron2022}} & \textbf{DeiT \cite{Touvron2021b}} \\
Augmentations & \makecell{Resize \\ RandomCrop \\ HorizontalFlip \\ Grayscale \\ Solarize \\ GaussianBlur \\ ColorJitter \\ CutMix \cite{Yun2019}} & \makecell{RandomResizedCrop \\ HorizontalFlip \\ RandomErase \cite{Zhong2017} \\ RandAugment \cite{Cubuk2019} \\ ColorJitter \\ Mixup \cite{Zhang2018a} \\ CutMix \cite{Yun2019}} \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table}[h!]
\centering
\caption{Training setup for finetuning on different downstream datasets. Other settings are the same as in \Cref{tab:in-setup}. For finetuning, we always utilize 3-Augment and the related parameters from the \emph{ViT, Swin, ResNet} column of \Cref{tab:in-setup}.}
\label{tab:downstream-setup}
\begin{tabular}{lcccc}
\toprule
Dataset & Batch Size & Epochs & Learning Rate & Num. GPUs \\
\midrule
Aircraft & 512 & 500 & 3e-4 & 2 \\
Cars & 1024 & 500 & 3e-4 & 4 \\
Flowers & 256 & 500 & 3e-4 & 1 \\
Food & 2048 & 100 & 3e-4 & 4 \\
Pets & 512 & 500 & 3e-4 & 2 \\
\bottomrule
\end{tabular}
\end{table}
On ImageNet, we use the same training setup as \cite{Nauen2025} and \cite{Touvron2022} without pretraining for ViT, Swin, and ResNet.
For DeiT, we train the same ViT architecture but use the data augmentation scheme and hyperparameters from \cite{Touvron2021b}.
As our focus is on evaluating the changes in accuracy due to \schemename, we follow \cite{Nauen2025} and stick to one set of hyperparameters for all models.
We list the settings used for training on ImageNet in \Cref{tab:in-setup} and those used for finetuning the resulting weights on the downstream datasets in \Cref{tab:downstream-setup}.
Our implementation uses PyTorch \cite{Paszke2019} and the \emph{timm} library \cite{Wightman2019} for model architectures and basic functions.

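For illustration, a common form of a linear warmup followed by cosine decay, as listed in \Cref{tab:in-setup}, is sketched below. The exact variant written here is an assumption: $\lambda_\text{max}$ (peak learning rate), $\lambda_\text{min}$ (final learning rate), $E$ (total epochs), and $E_w$ (warmup epochs) are symbols introduced only for this sketch, and details such as the warmup start value or per-step versus per-epoch updates may differ in the implementation.
\begin{align*}
\lambda(e) = \begin{cases}
\lambda_\text{max} \cdot \frac{e}{E_w} & \text{if } 0 \leq e < E_w \\
\lambda_\text{min} + \frac{1}{2} (\lambda_\text{max} - \lambda_\text{min}) \left(1 + \cos\left(\pi \frac{e - E_w}{E - E_w}\right)\right) & \text{if } E_w \leq e \leq E
\end{cases}
\end{align*}
With the \emph{ViT, Swin, ResNet} column of \Cref{tab:in-setup}, this corresponds to $\lambda_\text{max} = 3 \times 10^{-3}$, $E = 300$, and $E_w = 3$.
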
\begin{table*}[h!]
\centering
\caption{Hardware and software used for both training and evaluation.}
\label{tab:hw-sw-versions}
\begin{tabular}{ll}
\toprule
Parameter & Value \\
\midrule
GPU & NVIDIA A100/H100/H200 \\
CPU & 24 CPU cores (Intel Xeon) per GPU \\
Memory & up to 120 GB per GPU \\
Operating System & Enroot container for SLURM based on Ubuntu 24.04 LTS \\
Python & 3.12.3 \\
PyTorch & 2.7.0 \\
TorchVision & 0.22.0 \\
Timm & 1.0.15 \\
\bottomrule
\end{tabular}
\end{table*}
\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as the versions of the relevant software packages.

\section{\schemename Sample Images}
\begin{table*}[h!]
\centering
\caption{Sample images from applying \schemename to ImageNet.}
\label{tbl:example-images}
\resizebox{.93\textwidth}{!}{
\begin{tabular}{ccccc}
\toprule
Class & \makecell{Original \\Image} & \makecell{Extracted \\Foreground} & \makecell{Infilled \\Background} & \schemename's Recombinations \\
\midrule
\makecell{n01531178 \\Goldfinch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_v0_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01531178_4963_recombined_v26.JPEG} \\
\makecell{n01818515 \\Macaw} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_v1_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01818515_31507_recombined_v28.JPEG} \\
\makecell{n01943899 \\Conch} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01943899_20070_recombined_v15.JPEG} \\
\makecell{n01986214 \\ Hermit Crab} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v21.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v9.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n01986214_4117_recombined_v8.JPEG} \\
\makecell{n02190166 \\Fly} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v23.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v7.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02190166_1208_recombined_v9.JPEG} \\
\makecell{n02229544 \\Cricket} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02229544_6170_recombined_v5.JPEG} \\
\makecell{n02443484 \\Black-Footed \\Ferret} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v16.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v3.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n02443484_5430_recombined_v4.JPEG} \\
\makecell{n03201208 \\Dining Table} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v19.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03201208_21000_recombined_v21.JPEG} \\
\makecell{n03424325 \\Gasmask} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03424325_21435_recombined_v26.JPEG} \\
\makecell{n03642806 \\Laptop} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v11.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v25.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n03642806_3615_recombined_v29.JPEG} \\
\makecell{n04141975 \\Scale} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v10.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v23.JPEG}\includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n04141975_11426_recombined_v25.JPEG} \\
\makecell{n07714990 \\Broccoli} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v27.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07714990_7596_recombined_v29.JPEG} \\
\makecell{n07749582 \\Lemon} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v1.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v15.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v17.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v20.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v24.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n07749582_17601_recombined_v26.JPEG} \\
\makecell{n09332890 \\Lakeside} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_fg.PNG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_bg.JPEG} & \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v0.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v12.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v13.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v14.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v18.JPEG} \includegraphics[max width=.1\columnwidth, max height=2cm, valign=c]{img/appendix_examples/n09332890_27898_recombined_v20.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We show example images of \schemename's recombinations for 14 random classes of ImageNet \cite{Deng2009} in \Cref{tbl:example-images}.
% \schemename visibly varies the background, size, and position of the objects.
The recombined samples display substantial visual diversity, with each extracted foreground appearing in multiple, clearly different background contexts.
Foreground objects remain sharp and well-preserved across recombinations, while backgrounds vary in texture, color, and scene type.
The same object appears at a broad range of spatial positions and scales, resulting in noticeably different overall layouts.

\FloatBarrier
\section{Infill Model Comparison}
\begin{table*}[h!]
\centering
\caption{Example infills of LaMa and Attentive Eraser.}
\label{tab:infill-examples}
\resizebox{.9\textwidth}{!}{
\begin{tabular}{cc@{\hskip 0.3in}cc}
\toprule
LaMa & Att. Eraser & LaMa & Att. Eraser \\
\midrule
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000090.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000090.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000890.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000890.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002106.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002106.JPEG} &
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00005045.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00005045.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00008542.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00008542.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00002743.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00002743.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00011629.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00011629.JPEG} \\
\includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/lama_infills/comp/ILSVRC2012_val_00025256.JPEG} & \includegraphics[width=.23\columnwidth, valign=c]{img/att_err_infills/comp/ILSVRC2012_val_00025256.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}
We visualize example infilled images for both LaMa \cite{Suvorov2021} and Attentive Eraser \cite{Sun2024} in \Cref{tab:infill-examples}.
The side-by-side examples show that both methods generally produce visually consistent infills, with many pairs appearing extremely similar at a glance.
Qualitatively, we find that Attentive Eraser yields slightly sharper textures or more coherent local structure, while LaMa sometimes produces smoother or more homogenized regions.
Across the table, fine-detail areas such as foliage, bark, and ground textures reveal the most noticeable differences between the two methods.
% We qualitatively find that while LaMa often leaves repeated textures of blurry spots where the object was erased, Attentive Eraser produces slightly cleaner and more coherent infills of the background.

\FloatBarrier
\newpage
\section{Image Infill Ratio}
\begin{table*}[h!]
\centering
\caption{Example infills for images in which a large fraction of the image area is detected as foreground and has to be infilled (infill ratio).}
\label{tbl:high-rat}
\resizebox{.8\textwidth}{!}{
\begin{tabular}{ccc}
\toprule
Infill Ratio (\%) & LaMa & Att. Eraser \\
\midrule
83.7 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} \\ \\
88.2 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} \\ \\
93.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} \\ \\
95.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} \\
\bottomrule
\end{tabular}}
\end{table*}

\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{img/infill_distr.pdf}
\caption{Distribution of the relative size of the detected foreground object that is infilled in the segmentation stage on ImageNet.
While most images contain smaller objects, there is a peak where Grounded~SAM~\cite{Ren2024} detects almost the whole image as the foreground object. For examples of such large infills, see \Cref{tbl:high-rat}.
}
\label{fig:infill-distr}
\end{figure}

\Cref{tbl:high-rat} shows infills for images where Grounded~SAM~\cite{Ren2024} marks a large percentage of the image as the foreground object (the infill ratio), which then has to be erased by the infill models.
The examples show that when the infilled region becomes large, both methods begin to lose coherent global structure, with outputs dominated by repetitive or texture-like patterns.
LaMa tends to produce smoother, more uniform surfaces, as in \Cref{tab:infill-examples}, while Attentive Eraser often generates denser, more regular texture patterns.
Across the rows, an increasing infill ratio corresponds to increasingly homogeneous results, with only faint hints of the original scene remaining.
% While LaMa tends to fill those spots with mostly black or gray and textures similar to what we saw in \Cref{tab:infill-examples}, Attentive Eraser tends to create novel patterns by copying what is left of the background all over the rest of the image.
% We filter out such mostly infilled background using our background pruning hyperparameter $t_\text{prune} = 0.8$.
\Cref{fig:infill-distr} plots the distribution of infill ratios in \schemename.
While the number of detections decreases smoothly with the infill ratio up to $\approx 90\%$, there is an additional peak at $\approx 100\%$ infill ratio.
We hypothesize that this peak is made up of failure cases of Grounded~SAM.

We filter out all backgrounds with an infill ratio larger than our pruning threshold $t_\text{prune} = 0.8$, which corresponds to $10\%$ of backgrounds.
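
To make the pruning rule explicit, the infill ratio can be written as the fraction of pixels covered by the foreground mask. The notation below is introduced only for this illustration: $M \in \{0, 1\}^{H \times W}$ denotes the binary foreground mask of an image of height $H$ and width $W$, and $\rho$ the infill ratio; we assume here that the ratio is measured in pixel area.
\begin{align*}
\rho = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{ij}, \qquad \text{keep the background} \iff \rho \leq t_\text{prune} = 0.8 .
\end{align*}
Under this rule, all four examples in \Cref{tbl:high-rat}, with infill ratios between $83.7\%$ and $95.7\%$, would be pruned.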