This commit is contained in:
Tobias Christian Nauen
2026-02-24 11:57:25 +01:00
parent 7e66c96a60
commit e8cc0ee8a6
275 changed files with 16336 additions and 836 deletions

% !TeX root = ../main.tex
\section{\schemename (Method)}
\label{sec:method}
\begin{figure*}
\centering
\includegraphics[width=\textwidth]{img/fig-2.pdf}
\caption{Overview of \name. The data creation consists of two stages: (1, offline) Segmentation, where we segment the foreground objects from the background and fill in the background. (2, online) Recombination, where we combine the foreground objects with different backgrounds to create new samples.}
\label{fig:method}
\end{figure*}
We introduce \schemename, a data augmentation scheme designed to enhance Transformer training by embedding spatial invariances, which Transformers would otherwise need to learn implicitly, directly into the training data.
It operates by explicitly segmenting and recombining foreground objects and backgrounds.
\schemename comprises two distinct stages: Segmentation and Recombination. Both stages are illustrated in \Cref{fig:method}.
\subsection{Segmentation}
\label{sec:segmentation}
The segmentation stage isolates the foreground objects and their corresponding backgrounds.
We then fill the background using a pretrained object-removal model~\cite{Sun2024}, producing visually plausible, neutral scenes ready for recombination.
This stage is computed once offline and the results are stored for the recombination stage.
First, foreground objects are detected and segmented from their backgrounds using a prompt-based segmentation model, exploiting the classification dataset's labels.
The \code{<object category>} guides the segmentation model towards the correct object.
This can be the case with prompts like ``sorrel'' or ``guenon'', where the more general name ``horse'' or ``monkey'' is more helpful.
We derive the \code{<object category>} from the WordNet hierarchy, using the immediate hypernym.
We iteratively extract up to $n$ foreground masks for each dataset image, constructing prompts by moving one hypernym up the WordNet tree at each step (e.g., ``a sorrel, a type of horse'', ``a horse, a type of equine'', ...).
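As a minimal sketch of this prompt construction (not the authors' code), the hypernym lookup below uses a small hypothetical dictionary standing in for the WordNet hierarchy; in practice the hypernyms would come from WordNet (e.g. via NLTK):

```python
# Hypothetical excerpt of the WordNet hypernym hierarchy (stand-in only).
WORDNET_HYPERNYM = {
    "sorrel": "horse",
    "horse": "equine",
    "equine": "ungulate",
}

def build_prompts(label, n=3, hypernym=WORDNET_HYPERNYM):
    """Return up to n prompts, moving one hypernym up the tree per step,
    following the template from the text: 'a <name>, a type of <hypernym>'."""
    prompts = []
    for _ in range(n):
        parent = hypernym.get(label)
        if parent is None:
            break
        prompts.append(f"a {label}, a type of {parent}")
        label = parent
    return prompts
```

Each prompt both names the fine-grained class and supplies an increasingly general category for the segmentation model.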
Masks that are very similar, with a pairwise IoU of at least $0.9$, are merged.
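The IoU-based merging can be sketched as follows; treating "merging" as taking the union of the boolean masks, and the greedy pairwise pass, are our assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def merge_similar(masks, thresh=0.9):
    """Greedily merge (union) masks whose pairwise IoU is >= thresh."""
    merged = []
    for m in masks:
        for i, kept in enumerate(merged):
            if iou(m, kept) >= thresh:
                merged[i] = np.logical_or(kept, m)
                break
        else:
            merged.append(m)
    return merged
```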
The output is a set of masks delineating the foreground objects and the backgrounds.
We select the best mask per image (according to \Cref{eq:filtering-score}) in a later filtering step, described below.
First, an inpainting model that is specifically optimized to remove objects from images, such as LaMa~\cite{Suvorov2021} or Attentive Eraser~\cite{Sun2024}, is used to inpaint the foreground regions in the backgrounds.
Then, to ensure the quality of the foregrounds and the neutral background images, we select a foreground/background pair (for each dataset image) from the $\leq n$ variants extracted and infilled in the previous steps.
Using an ensemble $E$ of six ViT, ResNet, and Swin Transformer models pretrained on the original dataset, we select the foreground/background pair that maximizes foreground performance while minimizing both the performance on the background and the size of the foreground.
For each model $m \in E$, we predict the probability of the ground-truth class $c$ on the foreground $\mathrm{fg}$ and the background $\mathrm{bg}$ and weigh these with the size $\operatorname{size}(\cdot)$ in number of pixels according to:
\begin{align} \begin{split} \label{eq:filtering-score}
\text{score}(\mathrm{fg}, \mathrm{bg}, c) &= \log \left( \frac{1}{\abs{E}} \sum_{m \in E} \P[m(\mathrm{fg}) = c] \right) \\
& + \log \left( 1 - \frac{1}{\abs E} \sum_{m \in E} \P[m(\mathrm{bg}) = c] \right) \\
& + \lambda \log \left( 1 - \abs{\frac{\operatorname{size}(\mathrm{fg})}{\operatorname{size}(\mathrm{bg})} - \eps} \right).
\end{split} \end{align}
The \textit{optimal foreground size} of $10\%$ of the full image balances the smallest possible foreground size that encompasses all the respective class information in the image with still conveying the foreground information after pasting it onto another background.
This filtering step ensures we segment all the relevant foreground objects.
% We use $E$ is the ensemble of models and $m$ is a pretrained model, $c$ is the correct foreground class, $\mathrm{fg}$, and $\mathrm{bg}$ are the foreground and background and $\operatorname{size}(\cdot)$ is the size in number of pixels.
We run a hyperparameter search using a manually annotated subset of foreground/background variants to find the factors in \Cref{eq:filtering-score}: $\lambda = 2$ and $\eps = 0.1$.
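The filtering score can be sketched in a few lines; `p_fg` and `p_bg` are the per-model ground-truth-class probabilities on the foreground and background (the function name and array layout are ours, not the authors'):

```python
import numpy as np

def filtering_score(p_fg, p_bg, size_fg, size_bg, lam=2.0, eps=0.1):
    """Score a foreground/background candidate pair: reward high ensemble
    confidence on the foreground, low confidence on the background, and a
    foreground whose relative size is close to eps."""
    p_fg = np.asarray(p_fg, dtype=float)
    p_bg = np.asarray(p_bg, dtype=float)
    return (np.log(p_fg.mean())
            + np.log(1.0 - p_bg.mean())
            + lam * np.log(1.0 - abs(size_fg / size_bg - eps)))
```

A pair with a confident foreground, an uninformative background, and a foreground near the target relative size scores highest.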
Finally, we filter out backgrounds that are more than $80\%$ infilled, as these tend to be overly synthetic and plain and carry little information (see \Cref{sec:high-infill-ratio}).
We ablate this choice in \Cref{sec:ablation}.
Although the segmentation stage adds computational overhead, it is a one-time cost, and its results can be reused across experiments (see the supplementary material for details).
In summary, we factorize the dataset into a set of foreground objects with a transparent background and a set of diverse backgrounds per class.
The next step is to recombine these before applying other common data augmentation operations during training.
\subsection{Recombination}
\label{sec:recombination}
The recombination stage, performed online during training, combines the foreground objects with different backgrounds to create new training samples.
For each object, the pipeline is as follows: pick an appropriate background, resize the object to a fitting size, and place it in the background image.
Through this step, we expose the model to variations beyond the image compositions of the dataset.
For each foreground object, we sample a background using one of the following strategies:
(1) the original image background, (2) the set of backgrounds from the same class, or (3) the set of all possible backgrounds.
These sets trade off the amount of information the model can learn from the backgrounds against the diversity of the recombined samples.
In each epoch, each foreground object is seen exactly once, but a background may appear multiple times.
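The three background-selection strategies can be sketched as below; the data layout (background ids keyed by class, a metadata dict per foreground) is our assumption:

```python
import random

def sample_background(fg_meta, backgrounds, strategy="class"):
    """Pick a background for one foreground object.
    fg_meta: dict with the object's 'class' and its 'original' background id;
    backgrounds: dict mapping class name -> list of background ids."""
    if strategy == "original":
        return fg_meta["original"]          # (1) original image background
    if strategy == "class":
        return random.choice(backgrounds[fg_meta["class"]])  # (2) same class
    if strategy == "all":
        pool = [b for bgs in backgrounds.values() for b in bgs]
        return random.choice(pool)          # (3) any background
    raise ValueError(f"unknown strategy: {strategy}")
```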
The selected foreground is resized based on its relative size within its original image and the relative size of the original foreground in the selected background image.
The final size is randomly selected from a 30\% range around upper and lower limits ($s_u$ and $s_l$), based on the original sizes:
\begin{align}
s \sim \mathcal U \left[ (1 - 0.3) s_l, (1 + 0.3) s_u \right].
\end{align}
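A sketch of this size sampling, including one reading (ours, an assumption) of how the \emph{mean} and \emph{range} strategies set the limits $s_l$ and $s_u$ from the two original sizes:

```python
import random

def sample_size(size_a, size_b, strategy="range", margin=0.3):
    """Draw the target foreground size s ~ U[(1 - 0.3) s_l, (1 + 0.3) s_u].
    size_a: relative size of the foreground in its original image;
    size_b: relative size of the background's original foreground.
    'mean': both limits collapse to the mean of the two sizes;
    'range': the limits span the two sizes (our interpretation)."""
    if strategy == "mean":
        s_l = s_u = (size_a + size_b) / 2
    else:  # "range"
        s_l, s_u = min(size_a, size_b), max(size_a, size_b)
    return random.uniform((1 - margin) * s_l, (1 + margin) * s_u)
```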
To balance the size of the foreground and that of the background's original foreground, the upper and lower limits $s_u$ and $s_l$ are set to either the mean or the range of both sizes, depending on the foreground-size strategy: \emph{mean} or \emph{range}.
The resized foreground is then placed at a random position within the background image.
This position is sampled from a generalization of the Bates distribution~\cite{Bates1955} with integer parameter $\eta$, visualized in \Cref{fig:bates-pdf}.
We choose the Bates distribution because it offers a simple way to sample from a bounded domain with a single hyperparameter that controls the concentration of the distribution.
$\eta = 1$ corresponds to the uniform distribution; $\eta > 1$ concentrates the distribution around the center; and for $\eta < -1$, it is concentrated at the borders.
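One possible construction of such a sampler (a sketch under our assumptions, not necessarily the paper's exact generalization): the classical Bates draw is the mean of $\abs{\eta}$ uniforms, and for negative $\eta$ we shift the center-peaked sample by $0.5$ modulo $1$ to move its mass to the borders:

```python
import random

def sample_position(eta):
    """Sample a coordinate in [0, 1) from a Bates-style distribution.
    eta = 1: uniform; eta > 1: concentrated at the center (mean of eta
    uniforms); eta < -1: concentrated at the borders (assumption: the
    center-peaked sample shifted by 0.5 modulo 1)."""
    x = sum(random.random() for _ in range(abs(eta))) / abs(eta)
    return x if eta > 0 else (x + 0.5) % 1.0
```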
To more seamlessly integrate the foreground, we apply a Gaussian blur with ${\sigma \in [\frac{\sigma_{\text{max}}}{10}, \sigma_{\text{max}}]}$, inspired by the standard range for the Gaussian blur operation in \cite{Touvron2022}, to the foreground's alpha-mask.
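The blurred-alpha compositing step can be sketched as follows (function name and array layout are ours; $\sigma$ would be sampled from the range above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def composite(fg_rgb, alpha, bg_rgb, sigma):
    """Alpha-composite a foreground onto a background after blurring the
    alpha mask (Gaussian, std sigma) to smooth the transition edge.
    fg_rgb, bg_rgb: HxWx3 float arrays in [0, 1]; alpha: HxW mask in [0, 1]."""
    a = gaussian_filter(alpha.astype(float), sigma)[..., None]
    return a * fg_rgb + (1.0 - a) * bg_rgb
```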
We can apply standard data augmentation techniques in two modes:
either we apply all augmentations to the recombined image, or we apply cropping and resizing to the background only and then apply the remaining augmentations after recombination.
The first mode mirrors standard augmentation practice, whereas the second one ensures the foreground object remains fully visible.
We experiment with a constant mixing ratio, as well as linear and cosine annealing schedules that increase the share of images from the original dataset over time.
The mixing ratio acts as a probability of selecting an image from the original dataset;
otherwise, an image with the same foreground is recombined using \schemename, ensuring each object is seen once per epoch.
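The per-sample decision and the three schedules can be sketched as below; the exact endpoints (ratio growing from $0$ to `final`) are our assumption:

```python
import math
import random

def mixing_ratio(t, T, schedule="cosine", final=1.0):
    """Probability of drawing an original image at epoch t of T
    (constant, linear, and cosine are the schedules from the text)."""
    if schedule == "constant":
        return final
    if schedule == "linear":
        return final * t / T
    if schedule == "cosine":
        return final * (1 - math.cos(math.pi * t / T)) / 2
    raise ValueError(f"unknown schedule: {schedule}")

def pick_source(t, T):
    """Per-sample decision: original image vs. recombined sample."""
    return "original" if random.random() < mixing_ratio(t, T) else "recombined"
```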
The recombination stage is designed to be parallelized on the CPU during training and thus does not impact training time (see supplementary material for details).