% !TeX root = ../main.tex
\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{img/fig-2.pdf}
\caption{Overview of \schemename. We segment the foreground object and inpaint the removed region to obtain a neutral background (Offline, \Cref{sec:segmentation}). We then paste the foreground onto a sampled background while controlling position and scale, then apply standard strong traditional augmentations (Online, \Cref{sec:recombination}).}
\label{fig:method}
\end{figure*}

\section{\schemename}
\label{sec:method}
We introduce \schemename, a data augmentation scheme designed to enhance training by embedding spatial invariances, which Transformers would otherwise need to learn implicitly, directly into the training data.
\schemename comprises two distinct stages: Segmentation and Recombination. Both are illustrated in \Cref{fig:method}.

\subsection{Segmentation}
\label{sec:segmentation}
The offline segmentation stage produces reusable assets for recombination.
For each labeled training image, we create a pair $(\mathrm{fg},\mathrm{bg})$ consisting of (\textit{i}) a foreground cut-out $\mathrm{fg}$ with an alpha mask and (\textit{ii}) an inpainted background image $\mathrm{bg}$ where the foreground region has been removed.
This stage is computed once offline and the results are stored for the recombination stage.

\textbf{Generate candidate foreground masks.}
We obtain foreground candidates with Grounded SAM~\cite{Ren2024} (Grounding DINO~\cite{Liu2024a} + SAM~\cite{Kirillov2023}).
We leverage the dataset label by prompting the model with ``\code{a <class>, a type of <hypernym>}''.
Here \code{<hypernym>} is the immediate WordNet hypernym of the class (e.g., ``sorrel'' $\rightarrow$ ``horse''), which improves robustness when the class name is rare or overly specific, as with ``sorrel'' or ``guenon'', where the more general names ``horse'' and ``monkey'' are far more common.
To increase recall, we generate up to $N=3$ masks per image by iteratively moving one level up the hypernym chain (e.g., ``sorrel'' $\rightarrow$ ``horse'' $\rightarrow$ ``equine'' $\dots$).
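The prompt construction can be sketched as follows. This is a minimal illustration, not the actual pipeline: the real system queries WordNet, whereas here a tiny hand-coded hypernym chain (\code{HYPERNYMS}) stands in, and we assume one interpretation of the chain walk, namely that the class name is kept fixed while the hypernym slot moves up one level per prompt.

```python
# Hypothetical excerpt of a WordNet hypernym chain; the real
# pipeline would query WordNet instead of this hard-coded dict.
HYPERNYMS = {
    "sorrel": "horse",
    "horse": "equine",
    "equine": "animal",
}

def build_prompts(class_name: str, max_prompts: int = 3) -> list[str]:
    """Build up to `max_prompts` grounding prompts of the form
    'a <class>, a type of <hypernym>', moving one level up the
    hypernym chain for each additional prompt."""
    prompts = []
    hypernym = HYPERNYMS.get(class_name)
    while hypernym is not None and len(prompts) < max_prompts:
        prompts.append(f"a {class_name}, a type of {hypernym}")
        hypernym = HYPERNYMS.get(hypernym)
    return prompts

print(build_prompts("sorrel"))
# -> ['a sorrel, a type of horse', 'a sorrel, a type of equine',
#     'a sorrel, a type of animal']
```

Each returned prompt would be fed to the grounding model to produce one candidate mask, matching the up-to-$N{=}3$ masks per image described above.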
We merge near-duplicate masks with pairwise IoU $\ge 0.9$, yielding a small set of $n_i$ distinct candidate masks per image.