\begin{abstract}
Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification.
However, they often require large amounts of data and can exhibit biases, such as center or size bias, that limit their robustness and generalizability.
This paper introduces \schemename, a novel data augmentation operation that addresses these challenges by explicitly building into the training data invariances that are otherwise part of the neural network architecture.
\schemename is constructed by using pretrained foundation models to separate foreground objects from their original backgrounds and recombine them with different backgrounds.
This recombination step gives us fine-grained control over object position and size, as well as background selection.
We demonstrate that using \schemename significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet, which translates to 7.3 p.p. on downstream tasks.
Importantly, \schemename not only improves accuracy but also opens new ways to analyze model behavior and quantify biases.
Specifically, we introduce metrics for background robustness, foreground focus, center bias, and size bias, and show that training with \schemename substantially reduces these biases.
In summary, \schemename provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models.
Our code and dataset are publicly available at \code{https://github.com/tobna/ForAug}.
\end{abstract}