\section{\schemename (Method)}
\label{sec:method}

We introduce \schemename, a data augmentation method designed to enhance Transformer training by embedding spatial invariances, which Transformers would otherwise need to learn implicitly, directly into the training data.
\schemename comprises two distinct stages: Segmentation and Recombination. Both stages are illustrated in \Cref{fig:method}.

\subsection{Segmentation}
\label{sec:segmentation}

The segmentation stage isolates the foreground objects and their corresponding backgrounds.
We then fill the foreground regions of the backgrounds using a pretrained object-removal model~\cite{Sun2024}, producing visually plausible, neutral scenes ready for recombination.
This stage is computed once offline, and the results are stored for the recombination stage.

First, foreground objects are detected and segmented from their backgrounds using a prompt-based segmentation model, which allows us to exploit the classification dataset's labels.
We use the state-of-the-art Grounded SAM~\cite{Ren2024}, which is based on Grounding DINO~\cite{Liu2023e} and SAM~\cite{Kirillov2023}.
The prompt we use is ``\code{a <class name>, a type of <object category>}'', where \code{<class name>} is the specific name of the object's class as defined by the dataset and \code{<object category>} is the broader category of the object.
The \code{<object category>} guides the segmentation model towards the correct object in case the \code{<class name>} alone is too specific.
This can be the case with prompts like ``sorrel'' or ``guenon'', where the more general names ``horse'' or ``monkey'' are more helpful.
We derive the \code{<object category>} from the WordNet hierarchy, using the immediate hypernym.

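As an illustration, the prompt construction can be sketched as follows. The hypernym chain is hard-coded here; in practice it would be read from the WordNet hierarchy, and the helper name below is hypothetical:

```python
# Sketch: build segmentation prompts from a WordNet hypernym chain.
# The chain is hard-coded for illustration; in practice it would come
# from WordNet (e.g. via an interface such as nltk's wordnet module).

def build_prompts(hypernym_chain, n=3):
    """Build up to n prompts of the form
    'a <class name>, a type of <object category>',
    walking one hypernym up the chain per step."""
    prompts = []
    for i in range(min(n, len(hypernym_chain) - 1)):
        name, category = hypernym_chain[i], hypernym_chain[i + 1]
        prompts.append(f"a {name}, a type of {category}")
    return prompts

chain = ["sorrel", "horse", "equine", "odd-toed ungulate"]
prompts = build_prompts(chain, n=3)
```
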
We iteratively extract $n$ foreground masks for each dataset image, creating prompts by going one hypernym up the WordNet tree at each step (e.g., ``a sorrel, a type of horse'', ``a horse, a type of equine'', ...).
Masks that are very similar, i.e., with a pairwise IoU of at least $0.9$, are merged.
The output is a set of masks delineating the foreground objects and the backgrounds.
We select the best mask per image (according to \Cref{eq:filtering-score}) in a later filtering step, described below.

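The IoU-based merging can be sketched as follows; the greedy merge order and the union-based merging are illustrative assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def merge_similar_masks(masks, threshold=0.9):
    """Greedily merge (union) masks whose pairwise IoU >= threshold."""
    merged = []
    for m in masks:
        for i, kept in enumerate(merged):
            if iou(m, kept) >= threshold:
                merged[i] = np.logical_or(kept, m)
                break
        else:
            merged.append(m)
    return merged
```
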
Next, an inpainting model that is specifically optimized to remove objects from images, such as LaMa~\cite{Suvorov2021} or Attentive Eraser~\cite{Sun2024}, is used to inpaint the foreground regions in the backgrounds.
Then, to ensure the quality of the foregrounds and the neutral background images, we select a foreground/background pair (for each dataset image) from the $\leq n$ variants extracted and infilled in the previous steps.
Using an ensemble $E$ of six ViT, ResNet, and Swin Transformer models pretrained on the original dataset, we select the foreground/background pair that maximizes performance on the foreground while minimizing the performance on the background and the size of the foreground.
For each model $m \in E$, we predict the score of the ground-truth class $c$ on the foreground $\mathrm{fg}$ and background $\mathrm{bg}$ and weigh these with the size $\operatorname{size}(\cdot)$ in number of pixels according to:
\begin{align} \begin{split} \label{eq:filtering-score}
\text{score}(\mathrm{fg}, \mathrm{bg}, c) &= \log \left( \frac{1}{\abs{E}} \sum_{m \in E} \P[m(\mathrm{fg}) = c] \right) \\
& + \log \left( 1 - \frac{1}{\abs{E}} \sum_{m \in E} \P[m(\mathrm{bg}) = c] \right) \\
& + \lambda \log \left( 1 - \abs{\frac{\operatorname{size}(\mathrm{fg})}{\operatorname{size}(\mathrm{bg})} - \eps} \right).
\end{split} \end{align}
We run a hyperparameter search on a manually annotated subset of foreground/background variants to determine the factors in \Cref{eq:filtering-score}: $\lambda = 2$ and $\eps = 0.1$.

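For illustration, the selection score in \Cref{eq:filtering-score} transcribes directly into code; the function name is hypothetical, and the ensemble probabilities are passed in as plain lists:

```python
import math

def filtering_score(p_fg, p_bg, size_fg, size_bg, lam=2.0, eps=0.1):
    """Sketch of the foreground/background selection score.
    p_fg, p_bg: per-model ensemble probabilities of the ground-truth
    class on the foreground / background.
    size_fg, size_bg: sizes in number of pixels."""
    n = len(p_fg)
    term_fg = math.log(sum(p_fg) / n)                       # reward confident foreground
    term_bg = math.log(1.0 - sum(p_bg) / n)                 # penalize informative background
    term_size = lam * math.log(1.0 - abs(size_fg / size_bg - eps))  # keep fg/bg size ratio near eps
    return term_fg + term_bg + term_size
```
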
Finally, we filter out backgrounds that are largely infilled, as these tend to be overly synthetic and do not carry much information (see the supplementary material).
Although the segmentation stage adds computational overhead, it is a one-time cost, and its results can be reused across experiments (see the supplementary material for details).
In summary, we factorize the dataset into a set of foreground objects with transparent backgrounds and a set of diverse backgrounds per class.
The next step is to recombine these, before applying other common data augmentation operations during training.

\subsection{Recombination}
\label{sec:recombination}

The recombination stage, performed online during training, combines the foreground objects with different backgrounds to create new training samples.
For each object, we follow a three-step pipeline: pick an appropriate background, resize the foreground to a fitting size, and place it in the background image.
Through this step, we expose the model to variations beyond the image compositions of the original dataset.

For each foreground object, we sample a background using one of the following strategies:
(1) the original image background, (2) the set of backgrounds from the same class, or (3) the set of all possible backgrounds.
These strategies trade off the amount of information the model can learn from the background against the diversity of the newly created images.
In each epoch, each foreground object is seen exactly once, but a background may appear multiple times.

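The three sampling strategies can be sketched as a single dispatch function; the strategy names and data layout below are illustrative assumptions:

```python
import random

def sample_background(fg_class, original_bg, backgrounds_by_class, strategy):
    """Pick a background for a foreground of class fg_class.
    strategy: 'original' | 'same_class' | 'all' (names are illustrative)."""
    if strategy == "original":
        return original_bg
    if strategy == "same_class":
        return random.choice(backgrounds_by_class[fg_class])
    if strategy == "all":
        pool = [bg for bgs in backgrounds_by_class.values() for bg in bgs]
        return random.choice(pool)
    raise ValueError(f"unknown strategy: {strategy}")
```
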
The selected foreground is resized based on its relative size within its original image and the relative size of the original foreground in the selected background image.
The final size is randomly selected from a 30\% range around the upper and lower limits $s_u$ and $s_l$, which are based on the original sizes.
To balance the size of the foreground and that of the background's original foreground, the limits $s_u$ and $s_l$ are set to the mean or the range of both sizes, depending on the foreground size strategy: \emph{mean} or \emph{range}.

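One possible reading of the two size strategies is sketched below; the exact placement of the 30\% jitter around $s_l$ and $s_u$ is an assumption for illustration:

```python
import random

def sample_target_size(size_fg, size_orig_fg, strategy="range", jitter=0.3):
    """Sample a relative target size for the foreground (sketch).
    size_fg: relative size of the object in its original image.
    size_orig_fg: relative size of the background's original foreground.
    'mean': both limits s_l = s_u = mean of the two sizes;
    'range': limits span the range between the two sizes."""
    if strategy == "mean":
        s_l = s_u = (size_fg + size_orig_fg) / 2
    elif strategy == "range":
        s_l, s_u = min(size_fg, size_orig_fg), max(size_fg, size_orig_fg)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Draw from a 30% range around the limits (assumed interpretation).
    return random.uniform(s_l * (1 - jitter), s_u * (1 + jitter))
```
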
The resized foreground is then placed at a random position within the background image.
To more seamlessly integrate the foreground, we apply a Gaussian blur with ${\sigma \in [\frac{\sigma_{\text{max}}}{10}, \sigma_{\text{max}}]}$, inspired by the standard range for the Gaussian blur operation in \cite{Touvron2022}, to the foreground's alpha mask.

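Blurring the alpha mask and compositing can be sketched with a separable Gaussian filter; the helper names and the kernel radius of $3\sigma$ are illustrative choices:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_alpha(alpha, sigma):
    """Separable Gaussian blur of a 2-D alpha mask (values in [0, 1])."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius)
    pad = np.pad(alpha, radius, mode="edge")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, tmp)

def composite(fg, bg, alpha):
    """Alpha-blend a foreground onto a background (H x W x C, floats in [0, 1])."""
    a = alpha[..., None]
    return a * fg + (1 - a) * bg
```
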
We can apply standard data augmentation techniques in two modes:
either we apply all augmentations to the recombined image, or we apply the cropping and resizing to the background only and then apply the remaining augmentations after recombination.
The first mode mirrors standard augmentation practice, whereas the second ensures that the foreground object remains fully visible.

We experiment with a constant mixing ratio, as well as linear and cosine annealing schedules that increase the share of images from the original dataset over time.
The mixing ratio acts as the probability of selecting an image from the original dataset;
otherwise, an image with the same foreground is recombined using \schemename, ensuring each object is seen once per epoch.
The recombination stage is designed to be parallelized on the CPU during training and thus does not impact training time (see the supplementary material for details).
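The mixing schedules can be sketched as follows; the function names and the start/end fractions are illustrative assumptions, with the original-data share rising from 0 to 1 over training:

```python
import math
import random

def original_fraction(epoch, total_epochs, schedule="cosine", start=0.0, end=1.0):
    """Fraction of original (non-recombined) images at a given epoch (sketch).
    Increases from start to end over training for the annealing schedules."""
    t = epoch / max(1, total_epochs - 1)
    if schedule == "constant":
        return start
    if schedule == "linear":
        return start + (end - start) * t
    if schedule == "cosine":
        return start + (end - start) * (1 - math.cos(math.pi * t)) / 2
    raise ValueError(f"unknown schedule: {schedule}")

def use_original(epoch, total_epochs, schedule="cosine"):
    """Bernoulli draw: True -> take the original image, else recombine."""
    return random.random() < original_fraction(epoch, total_epochs, schedule)
```
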