\section{Extended Bates Distribution}

\begin{figure}[h!]
\centering
\includegraphics[width=.5\columnwidth]{img/bates.pdf}
\caption{Plot of the probability density function (PDF) of the extended Bates distribution for different parameters $\eta$. Higher values of $\eta$ concentrate the distribution around the center.}
\label{fig:bates-pdf}
\end{figure}

We introduce an extension of the Bates distribution~\cite{Bates1955} to include negative parameters, enabling sampling of foreground object positions away from the image center.
The standard Bates distribution, for $\eta \in \N$, is defined as the mean of $\eta$ independent random variables drawn from a uniform distribution \cite{Jonhson1995}.
A larger $\eta$ value increases the concentration of samples around the distribution's mean, which in this case is the image center.

To achieve the opposite effect, concentrating samples at the image borders, we extend the distribution to $\eta \leq -1$:
\begin{align*}
X \sim \text{Bates}(\eta) :\Leftrightarrow s(X) \sim \text{Bates}(-\eta)
\end{align*}
This is accomplished by sampling from a standard Bates distribution with parameter $-\eta \geq 1$ and then applying a sawtooth function.
The sawtooth function on the interval $[0,1]$ is defined as
\begin{align}
s(x) = \begin{cases}
x + 0.5 & \text{if } 0 \leq x < 0.5 \\
x - 0.5 & \text{if } 0.5 \leq x \leq 1
\end{cases}
\end{align}
This function maps the central portion of the interval to the edges and the edge portions to the center.
For example, a value of 0.3 (center-left) is mapped to 0.8 (right edge), while 0.8 is mapped to 0.3.
This transformation inverts the distribution's concentration, shifting the probability mass from the center to the borders.
We visualize the probability density function of the extended Bates distribution in \Cref{fig:bates-pdf}.
Both $\eta = 1$ and $\eta = -1$ result in a uniform distribution across the image.
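The sampling procedure described above can be sketched as follows. This is a minimal illustration of the definition, not the paper's implementation; the function names are ours.

```python
import numpy as np

def sawtooth(x):
    # Map the central portion of [0, 1] to the edges and vice versa.
    return np.where(x < 0.5, x + 0.5, x - 0.5)

def sample_extended_bates(eta, size, rng=None):
    """Sample from the extended Bates distribution.

    eta >= 1 concentrates mass at the center (standard Bates);
    eta <= -1 concentrates mass at the borders via the sawtooth transform.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = abs(eta)
    # Bates(n): mean of n i.i.d. uniform variables on [0, 1].
    x = rng.uniform(0.0, 1.0, size=(n, size)).mean(axis=0)
    return sawtooth(x) if eta < 0 else x
```

For `eta = -8`, samples cluster near 0 and 1; for `eta = 8`, they cluster around 0.5, and `eta = 1` or `eta = -1` reduce to a uniform distribution.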
\section{Resource Usage of \schemename}
To utilize the proposed \schemename, specific computational resources are necessary, particularly for computing and storing the output of the segmentation stage and for the on-the-fly processing of the recombination stage.

\paragraph{Segmentation.}
\schemename involves a computationally expensive segmentation and infill stage, which is a one-time calculation per dataset.
Once computed, the segmentation and infill results can be reused indefinitely, amortizing the initial cost over all subsequent experiments and applications.
On NVIDIA H100 GPUs, the segmentation stage processes images at a rate of $374.3 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using Attentive Eraser or $5338.6 \frac{\text{img}}{\text{GPU} \times \text{h}}$ when using LaMa.
For ImageNet, this comes down to just under 9 days (Attentive Eraser) or 16 hours (LaMa) on two 8-GPU nodes.
To facilitate immediate use and reproduction of results, we publicly provide the precalculated segmentation stage output for the ImageNet dataset for download\footnote{Link will go here.}.
On ImageNet, the output of \schemename's segmentation step requires 73 GB of additional disk space, on top of the 147 GB of the base dataset.
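The wall-clock figures follow directly from the per-GPU rates; as a back-of-envelope check, assuming the ImageNet-1k training split of 1{,}281{,}167 images and 16 GPUs (two 8-GPU nodes):

```python
# Back-of-envelope check of the wall-clock estimates above.
N_IMAGES = 1_281_167  # ImageNet-1k training images (assumption)
N_GPUS = 16           # two 8-GPU nodes

for name, rate in [("Attentive Eraser", 374.3), ("LaMa", 5338.6)]:
    gpu_hours = N_IMAGES / rate          # total GPU-hours at rate img/(GPU*h)
    wall_hours = gpu_hours / N_GPUS      # wall-clock hours across all GPUs
    print(f"{name}: {gpu_hours:.0f} GPU-hours, {wall_hours:.1f} h wall clock")
```

This reproduces roughly 8.9 days for Attentive Eraser and 15 hours for LaMa, matching the numbers quoted above.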
\paragraph{Recombination.}
The recombination step of \schemename is implemented as a data loader operation.
It is thus offloaded to the CPU, where it can be heavily parallelized, resulting in only a minor increase in the training step time.
For example, using a ViT-B model on an NVIDIA A100 GPU, the average update step time increased by about $1\%$, from $528 \pm 2$ ms to $534 \pm 1$ ms.
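The core of such a CPU-side recombination is a masked paste of a segmented foreground onto an infilled background. The following sketch is illustrative only (function name and compositing logic are ours; the actual pipeline also samples the paste position, e.g. via the extended Bates distribution):

```python
import numpy as np

def recombine(foreground, mask, background):
    """Paste a segmented foreground onto an infilled background.

    foreground, background: HxWx3 uint8 images.
    mask: HxW boolean array marking foreground pixels.
    Returns a new image; inputs are left untouched.
    """
    out = background.copy()
    out[mask] = foreground[mask]  # copy only the masked foreground pixels
    return out
```

Running this inside the `__getitem__` of a PyTorch `Dataset` lets the data loader workers parallelize it across CPU cores, which is why the GPU step time barely changes.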
\section{Training Setup}
\label{sec:training_setup}

\begin{table*}[h!]
\centering
\caption{Training setup and hyperparameters for our ImageNet training.}
\label{tab:in-setup}
\begin{tabular}{lcc}
\toprule
Parameter & ViT, Swin, ResNet & DeiT \\
\midrule
Image Resolution & $224 \times 224$ & $224 \times 224$ \\
Epochs & 300 & 300 \\
Learning Rate & 3e-3 & S/B: 1e-3, L: 5e-4 \\
Learning Rate Schedule & cosine decay & cosine decay \\
Batch Size & 2048 & 1024 \\
GPUs & $4\times$ NVIDIA A100/H100/H200 & $4\times$ NVIDIA A100/H100/H200 \\
Warmup Schedule & linear & linear \\
Warmup Epochs & 3 & 3 \\
Weight Decay & 0.02 & 0.05 \\
Label Smoothing & 0.1 & 0.1 \\
Optimizer & Lamb \cite{You2020} & AdamW \\
\cmidrule(r){1-1}
Data Augmentation Policy & \textbf{3-Augment \cite{Touvron2022}} & \textbf{DeiT \cite{Touvron2021b}} \\
Augmentations & \makecell{Resize \\ RandomCrop \\ HorizontalFlip \\ Grayscale \\ Solarize \\ GaussianBlur \\ ColorJitter \\ CutMix \cite{Yun2019}} & \makecell{RandomResizedCrop \\ HorizontalFlip \\ RandomErase \cite{Zhong2017} \\ RandAugment \cite{Cubuk2019} \\ ColorJitter \\ Mixup \cite{Zhang2018a} \\ CutMix \cite{Yun2019}} \\
\bottomrule
\end{tabular}
\end{table*}

\begin{table}[h!]
\centering
\caption{Training setup for finetuning on different downstream datasets. Other settings are the same as in \Cref{tab:in-setup}. For finetuning, we always utilize 3-Augment and the related parameters from the \emph{ViT, Swin, ResNet} column of \Cref{tab:in-setup}.}
\label{tab:downstream-setup}
\begin{tabular}{lcccc}
\toprule
Dataset & Batch Size & Epochs & Learning Rate & Num. GPUs \\
\midrule
Aircraft & 512 & 500 & 3e-4 & 2 \\
Cars & 1024 & 500 & 3e-4 & 4 \\
Flowers & 256 & 500 & 3e-4 & 1 \\
Food & 2048 & 100 & 3e-4 & 4 \\
Pets & 512 & 500 & 3e-4 & 2 \\
\bottomrule
\end{tabular}
\end{table}

On ImageNet, we use the same training setup as \cite{Nauen2025} and \cite{Touvron2022} without pretraining for ViT, Swin, and ResNet.
For DeiT, we train the same ViT architecture but use the data augmentation scheme and hyperparameters from \cite{Touvron2021b}.
As our focus is on evaluating the changes in accuracy due to \schemename, we stick to one set of hyperparameters for all models, like \cite{Nauen2025}.
We list the settings used for training on ImageNet in \Cref{tab:in-setup} and the ones used for finetuning those weights on the downstream datasets in \Cref{tab:downstream-setup}.
Our implementation uses PyTorch \cite{Paszke2019} and the \emph{timm} library \cite{Wightman2019} for model architectures and basic functions.

\begin{table*}[h!]
\centering
\caption{Hardware and software specifics used for both training and evaluation.}
\label{tab:hw-sw-versions}
\begin{tabular}{ll}
\toprule
Parameter & Value \\
\midrule
GPU & NVIDIA A100/H100/H200 \\
CPU & 24 CPU cores (Intel Xeon) per GPU \\
Memory & up to 120 GB per GPU \\
Operating System & Enroot container for SLURM based on Ubuntu 24.04 LTS \\
Python & 3.12.3 \\
PyTorch & 2.7.0 \\
TorchVision & 0.22.0 \\
Timm & 1.0.15 \\
\bottomrule
\end{tabular}
\end{table*}

\Cref{tab:hw-sw-versions} lists the specific hardware we use, as well as the versions of the relevant software packages.

\section{Infill Model Comparison}

\begin{table*}[h!]
\centering
\caption{Example infills of LaMa and Attentive Eraser.}
\label{tab:infill-examples}
\resizebox{.9\textwidth}{!}{
\begin{tabular}{cc@{\hskip 0.3in}cc}
\toprule
LaMa & Att. Eraser & LaMa & Att. Eraser \\
\midrule
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00000090.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00000090.JPEG} &
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00000890.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00000890.JPEG} \\
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00002106.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00002106.JPEG} &
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00005045.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00005045.JPEG} \\
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00007437.JPEG} & \includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00008542.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00008542.JPEG} \\
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00009674.JPEG} & \includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00002743.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00002743.JPEG} \\
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00003097.JPEG} & \includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00011629.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00011629.JPEG} \\
\includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00000547.JPEG} & \includegraphics[width=.23\columnwidth]{img/lama_infills/comp/ILSVRC2012_val_00025256.JPEG} & \includegraphics[width=.23\columnwidth]{img/att_err_infills/comp/ILSVRC2012_val_00025256.JPEG} \\
\bottomrule
\end{tabular}
}
\end{table*}

We visualize example infilled images for both LaMa \cite{Suvorov2021} and Attentive Eraser \cite{Sun2024} in \Cref{tab:infill-examples}.
We qualitatively find that while LaMa often leaves repeated textures or blurry spots where the object was erased, Attentive Eraser produces slightly cleaner and more coherent infills of the background.

\newpage
\section{Image Infill Ratio}
\begin{table*}[h!]
\centering
\caption{Example infills for images where a large relative foreground area is infilled (high infill ratio).}
\label{tbl:high-rat}
\resizebox{.8\textwidth}{!}{
\begin{tabular}{ccc}
\toprule
Infill Ratio (\%) & LaMa & Att. Eraser \\
\midrule
93.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00003735.JPEG}} \\ \\
95.7 & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} & \raisebox{-60pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00012151.JPEG}} \\ \\
83.7 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00022522.JPEG}} \\ \\
88.2 & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/lama_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} & \raisebox{-50pt}{\includegraphics[width=.3\columnwidth]{img/att_err_infills/high_rat/ILSVRC2012_val_00026530.JPEG}} \\
\bottomrule
\end{tabular}}
\end{table*}

\begin{figure}
\centering
\includegraphics[width=.9\textwidth]{img/infill_distr.pdf}
\caption{Distribution of the relative size of the detected foreground object that is infilled in our segmentation step on ImageNet.
While most images contain objects of smaller size, there is a peak where Grounded~SAM~\cite{Ren2024} detects almost the whole image as the foreground object. For examples of such large infills, see \Cref{tbl:high-rat}.}
\label{fig:infill-distr}
\end{figure}

\Cref{tbl:high-rat} shows infills for images where Grounded SAM \cite{Ren2024} marks a large fraction of the image as the foreground object (infill ratio), which then has to be erased by the infill models.
While LaMa tends to fill those regions with mostly black or gray areas and textures similar to those in \Cref{tab:infill-examples}, Attentive Eraser tends to create novel patterns by copying what is left of the background across the rest of the image.
\Cref{fig:infill-distr} plots the distribution of infill ratios in \schemename.
While the number of detections decreases smoothly with the infill ratio up to $\approx 90\%$, there is an additional peak at $\approx 100\%$ infill ratio.
We believe that this peak is made up of failure cases of Grounded~SAM.

We filter out all backgrounds with an infill ratio larger than our pruning threshold $t_\text{prune} = 0.8$, which corresponds to $10\%$ of backgrounds.
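The pruning rule amounts to computing, per image, the fraction of pixels in the foreground mask and discarding backgrounds above the threshold. A minimal sketch (function names are illustrative; $t_\text{prune} = 0.8$ is from the text):

```python
import numpy as np

def infill_ratio(mask):
    # Fraction of image pixels marked as foreground (and thus infilled).
    return float(np.asarray(mask, dtype=float).mean())

def keep_background(mask, t_prune=0.8):
    # Prune backgrounds where almost the whole image was infilled,
    # which are likely Grounded SAM failure cases.
    return infill_ratio(mask) <= t_prune
```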