[DomainBed] In Search of Lost Domain Generalization
https://arxiv.org/pdf/2007.01434
https://github.com/facebookresearch/DomainBed?tab=readme-ov-file
How should the experimental setting be designed to demonstrate domain generalization ability?
A point that never quite sat right with me and that I had been glossing over.
It has always frustrated me that I could not clearly tell whether performance differences between models truly stem from their generalization capability, or from hyperparameter search and other experimental factors.
Is this really a fair comparison? Against what, and how, should we compare in order to claim an improvement over prior work?
Which sources of randomness, and how much of them, must be included before experimental results can be called robust?
Reading this paper, things finally feel like they are falling into place.
Abstract
The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions—datasets, architectures, and model selection criteria—render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DOMAINBED, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DOMAINBED and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DOMAINBED, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization.
1. Introduction
Machine learning systems often fail to generalize out-of-distribution, crashing in spectacular ways when tested outside the domain of training examples [Torralba and Efros, 2011]. The overreliance of learning systems on the training distribution manifests widely. For instance, self-driving car systems struggle to perform under conditions different to those of training, including variations in light [Dai and Van Gool, 2018], weather [Volk et al., 2019], and object poses [Alcorn et al., 2019]. As another example, systems trained on medical data collected in one hospital do not generalize to other health centers [Castro et al., 2019, AlBadawy et al., 2018, Perone et al., 2019, Heaven, 2020]. Arjovsky et al. [2019] suggest that failing to generalize out-of-distribution is failing to capture the causal factors of variation in data, clinging instead to easier-to-fit spurious correlations, which are prone to change from training to testing domains. Examples of spurious correlations commonly absorbed by learning machines include racial biases [Stock and Cisse, 2018], texture statistics [Geirhos et al., 2018], and object backgrounds [Beery et al., 2018]. Alas, the capricious behaviour of machine learning systems out-of-distribution is a roadblock to their deployment in critical applications.
Aware of this problem, the research community has spent significant effort during the last decade to develop algorithms able to generalize out-of-distribution. In particular, the literature in domain generalization assumes access to multiple datasets during training, each of them containing examples about the same task, but collected under a different domain or environment [Blanchard et al., 2011, Muandet et al., 2013]. The goal of domain generalization algorithms is to incorporate the invariances across these training datasets into a classifier, in hopes that such invariances also hold in novel test domains. Different domain generalization solutions assume different types of invariances and propose algorithms to estimate them from data.
Despite the enormous importance of domain generalization, the literature is scattered: a plethora of different algorithms appear yearly, and these are evaluated under different datasets and model selection criteria. Borrowing from the success of standard computer vision benchmarks such as ImageNet [Russakovsky et al., 2015], the purpose of this work is to perform a standardized, rigorous comparison of domain generalization algorithms. In particular, we ask: how useful are domain generalization algorithms in realistic settings? Towards answering this question, we first study model selection criteria for domain generalization methods, resulting in the recommendation:
A domain generalization algorithm should be responsible for specifying a model selection method.
We then carefully implement nine domain generalization algorithms on seven multi-domain datasets and three model selection criteria, leading us to the conclusion reflected in Tables 1 and 4:
When equipped with modern neural network architectures and data augmentation techniques, empirical risk minimization achieves state-of-the-art performance in domain generalization.
As a result of our research, we release DOMAINBED, a framework to streamline rigorous and reproducible experimentation in domain generalization. Using DOMAINBED, adding a new algorithm or dataset is a matter of a few lines of code; a single command runs all the experiments, performs all the model selections, and auto-generates all the tables included in this work. Moreover, our motivation is to keep DOMAINBED alive, welcoming pull requests from our fellow colleagues to update the available algorithms, datasets, model selection criteria, and result tables.
Section 2 kicks off our exposition with a review of the domain generalization setup. Section 3 discusses the difficulties of model selection in domain generalization and makes recommendations for a path forward. Section 4 introduces DOMAINBED, describing the algorithms and datasets contained in the initial release. Section 5 discusses the experimental results of running the entire DOMAINBED suite; these illustrate the strength of ERM and the importance of model selection criteria. Finally, Section 6 offers our view on future research directions in domain generalization. Our Appendices review one hundred articles spanning a decade of research in this topic, collecting the experimental performance of over thirty published algorithms.
2. The problem of domain generalization
3. Model selection as part of the learning problem
Here we discuss issues surrounding model selection (choosing hyperparameters, training checkpoints, architecture variants) in domain generalization and make specific recommendations for a path forward. Because we lack access to a validation set identically distributed to the test data, model selection in domain generalization is not as straightforward as in supervised learning. Some works adopt heuristic strategies whose behavior is not well-studied, while others simply omit a description of how to choose hyperparameters. This leaves open the possibility that hyperparameters were chosen using the test data, which is not methodologically sound. Differences in results arising from inconsistent tuning practices may be misattributed to the algorithms under study, complicating fair assessments.
We believe that much of the confusion surrounding model selection in domain generalization arises from treating it as a question of experimental design. In reality, selecting hyperparameters is a learning problem at least as hard as fitting the model (in as much as we may interpret any model parameter as a hyperparameter). Like all learning problems, model selection requires assumptions about how the test data relates to the training data. Different domain generalization algorithms make different assumptions, and it is not clear a priori what assumptions are correct, or how these assumptions influence the model selection criterion. Indeed, choosing reasonable assumptions is at the heart of domain generalization research. Therefore, a domain generalization algorithm without a strategy to choose its hyperparameters remains incomplete.
Recommendation 1 A domain generalization algorithm should be responsible for specifying a model selection method.
While algorithms without well-justified model selection methods are incomplete, they may be useful as stepping-stones in a research agenda. In this case, instead of using an ad-hoc model selection method, we can evaluate incomplete algorithms by considering an oracle model selection method, where we select hyperparameters on the test domain. Of course, it is important that we avoid invalid comparisons between oracle results and baselines tuned without an oracle method. Also, unless we restrict access to the test domain data somehow, we risk obtaining meaningless results. For instance, we could just train on such test domain data using supervised learning.
Recommendation 2 Researchers should disclaim any oracle-selection results as such and specify policies to limit access to the test domain.
3.1. Three model selection methods
Having made broad recommendations, we review and justify three methods for model selection in domain generalization, often used but rarely discerned.
Training-domain validation set
We split each training domain into training and validation subsets. Then, we pool the validation subsets of each training domain to create an overall validation set. Finally, we choose the model maximizing the accuracy on the overall validation set.
This strategy assumes that the training and test examples follow similar distributions. For example, Ben-David et al. [2010] bound the test domain error of a classifier by the training domain error, plus a divergence measure between the training and test domains.
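As a rough illustration of this selection rule, here is a minimal sketch in PyTorch-style Python. The helper names (`candidates`, `val_loaders`, `accuracy`) are placeholders for this post, not DomainBed's actual API.

```python
import torch

def accuracy(model, loader):
    """Fraction of correctly classified examples in a data loader."""
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

def select_by_training_domain_validation(candidates, val_loaders):
    """Pool the per-domain validation splits and keep the most accurate candidate."""
    def pooled_accuracy(model):
        # Weight each domain's accuracy by its number of validation examples,
        # which is equivalent to computing accuracy on the pooled validation set.
        pairs = [(accuracy(model, ld), len(ld.dataset)) for ld in val_loaders]
        return sum(a * n for a, n in pairs) / sum(n for _, n in pairs)
    return max(candidates, key=pooled_accuracy)
```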
Leave-one-domain-out cross-validation
Given d_tr training domains, we train d_tr models with equal hyperparameters, each holding one of the training domains out. We evaluate each model on its held-out domain, and average the accuracies of these models over their held-out domains. Finally, we choose the model maximizing this average accuracy, re-trained on all d_tr domains.
This strategy assumes that training and test domains are drawn from a meta-distribution over domains, and that our goal is to maximize the expected performance under this meta-distribution.
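A hedged sketch of this procedure for a single hyperparameter setting; `train_algorithm` and `accuracy` are placeholder helpers, not DomainBed's actual functions.

```python
def leave_one_domain_out_score(hparams, domain_loaders, train_algorithm, accuracy):
    """Average held-out accuracy over models that each leave one training domain out."""
    scores = []
    for held_out in range(len(domain_loaders)):
        train_loaders = [ld for i, ld in enumerate(domain_loaders) if i != held_out]
        model = train_algorithm(train_loaders, hparams)      # train without one domain
        scores.append(accuracy(model, domain_loaders[held_out]))
    return sum(scores) / len(scores)

# Selection: pick the hyperparameters with the highest score, then re-train a
# final model on all training domains before evaluating on the true test domain.
```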
Test-domain validation set (oracle)
We choose the model maximizing the accuracy on a validation set that follows the distribution of the test domain. Following our earlier recommendation to limit test domain access, we allow 20 queries per algorithm (one query per choice of hyperparameters in our random search). This means that we do not allow early stopping based on the validation set. Instead, we train all models for the same fixed number of steps and consider only the final checkpoint. Recall that we do not consider this a valid benchmarking methodology, since it requires access to the test domain. Oracle-selection results can be either optimistic, because we access the test distribution, or pessimistic, because the query limit reduces the number of considered hyperparameter combinations.
As an alternative to limiting the number of queries, we could borrow tools from differential privacy, previously applied to enable multiple re-uses of validation sets in standard supervised learning [Dwork et al., 2015]. In a nutshell, differential privacy tools add Laplace noise to the accuracy statistic of the algorithm before reporting it to the practitioner.
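A toy illustration of that idea, assuming the accuracy statistic is released with Laplace noise calibrated to its sensitivity; the epsilon value below is arbitrary and only for illustration.

```python
import numpy as np

def noisy_validation_accuracy(true_accuracy, n_examples, epsilon=1.0):
    # Changing one validation example moves accuracy by at most 1/n, so adding
    # Laplace noise with scale 1/(n * epsilon) yields an epsilon-differentially-
    # private release of the statistic, allowing more re-uses of the same set.
    noise = np.random.laplace(loc=0.0, scale=1.0 / (n_examples * epsilon))
    return true_accuracy + noise
```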
3.2. Considerations from the literature
Some references in prior work discuss additional strategies to choose hyperparameters in domain generalization problems. For instance, Krueger et al. [2020, Appendix B.1] suggest choosing hyperparameters to maximize the performance across all domains of an external dataset. The validity of this strategy depends on the relatedness between datasets. Albuquerque et al. [2019, Section 5.3.2] suggest performing model selection based on the loss function (which often incorporates an algorithm-specific regularizer), and D'Innocente and Caputo [2018, Section 3] derive a strategy specific to their algorithm.
4. DomainBed: A PyTorch testbed for domain generalization
At the heart of our large scale experimentation is DOMAINBED, a PyTorch [Paszke et al., 2019] testbed to streamline reproducible and rigorous research in domain generalization:
https://github.com/facebookresearch/DomainBed
The initial release comprises nine algorithms, seven datasets, and three model selection methods (described in Section 3), as well as the infrastructure to run all the experiments and generate all the LaTeX tables below with a single command. DOMAINBED is a living project: we expect to update the above repository with new results, algorithms, and datasets. Contributions via pull requests from fellow researchers are welcome. Adding a new algorithm or dataset to DOMAINBED is a matter of a few lines of code (see Appendix E for an example).
4.1. Datasets
DOMAINBED includes downloaders and loaders for seven multi-domain image classification tasks: Colored MNIST [Arjovsky et al., 2019], Rotated MNIST [Ghifary et al., 2015], PACS [Li et al., 2017], VLCS [Fang et al., 2013], Office-Home [Venkateswara et al., 2017], Terra Incognita [Beery et al., 2018], and DomainNet [Peng et al., 2019]. We list and show example images from each dataset in Table 3, and provide their full details in Appendix C.
The datasets differ in many ways but two are particularly important. The first difference is between synthetic and real datasets. In Rotated MNIST and Colored MNIST, domains are synthetically constructed such that we know what features will generalize a priori, so using too much prior knowledge (e.g. by augmenting with rotations) is off-limits, whereas the other datasets contain domains arising from natural processes, making it sensible to use prior knowledge. The second difference is about what changes across domains. On one hand, in datasets other than Colored MNIST, the domain changes the distribution of images, but likely bears no information about the true image-to-label mapping. On the other hand, in Colored MNIST, the domain influences the true image-to-label mapping, biasing algorithms that try to estimate this function directly.
4.2. Algorithms
The initial release of DOMAINBED includes implementations of nine baseline algorithms:
• Empirical Risk Minimization (ERM, Vapnik [1998]) minimizes the sum of errors across domains and examples.
• Group Distributionally Robust Optimization (DRO, Sagawa et al. [2019]) performs ERM while increasing the importance of domains with larger errors.
• Inter-domain Mixup (Mixup, Xu et al. [2019], Yan et al. [2020], Wang et al. [2020]) performs ERM on linear interpolations of examples from random pairs of domains and their labels.
• Meta-Learning for Domain Generalization (MLDG, Li et al. [2018a]) leverages MAML [Finn et al., 2017] to meta-learn how to generalize across domains.
• Different variants of the popular algorithm of Ganin et al. [2016] to learn features φ(X^d) with distributions matching across domains:
– Domain-Adversarial Neural Networks (DANN, Ganin et al. [2016]) employ an adversarial network to match feature distributions.
– Class-conditional DANN (C-DANN, Li et al. [2018d]) is a variant of DANN matching the conditional distributions P(φ(X^d)|Y^d = y) across domains, for all labels y.
– CORAL [Sun and Saenko, 2016] matches the mean and covariance of feature distributions.
– MMD [Li et al., 2018b] matches the MMD [Gretton et al., 2012] of feature distributions.
• Invariant Risk Minimization (IRM [Arjovsky et al., 2019]) learns a feature representation φ(X^d) such that the optimal linear classifier on top of that representation matches across domains.
Appendix D describes the network architectures and hyperparameter search spaces for all algorithms.
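As a concrete example of one of the feature-matching penalties listed above, here is a hedged sketch of a CORAL-style alignment term between two domains, matching feature means and covariances in the spirit of Sun and Saenko [2016]; it is not necessarily identical to DomainBed's implementation.

```python
import torch

def coral_penalty(feats_a, feats_b):
    """feats_*: (n, d) feature matrices phi(X) from two training domains."""
    mean_a = feats_a.mean(0, keepdim=True)
    mean_b = feats_b.mean(0, keepdim=True)
    cent_a, cent_b = feats_a - mean_a, feats_b - mean_b
    cov_a = cent_a.t() @ cent_a / (feats_a.shape[0] - 1)   # feature covariance, domain A
    cov_b = cent_b.t() @ cent_b / (feats_b.shape[0] - 1)   # feature covariance, domain B
    return (mean_a - mean_b).pow(2).sum() + (cov_a - cov_b).pow(2).sum()

# The training objective is then the ERM classification loss plus a weighted sum
# of this penalty over all pairs of training domains.
```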
4.3. Implementation choices for realistic evaluation
Our goal is a realistic evaluation of domain generalization algorithms. To that end, we make several implementation choices which depart from prior work, explained below.
Large models
Most prior work on VLCS and PACS borrows features from or finetunes ResNet-18 models [He et al., 2016]. Since larger ResNets are known to generalize better, we opt to finetune ResNet-50 models for all datasets except Rotated MNIST and Colored MNIST, where we use a smaller CNN architecture (see Appendix D).
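A minimal sketch of this choice with torchvision: load an ImageNet-pretrained ResNet-50 and replace its classification head. The `weights` argument follows recent torchvision versions (older releases use `pretrained=True`), and `num_classes` depends on the dataset.

```python
import torch.nn as nn
import torchvision.models as models

def build_resnet50_classifier(num_classes):
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    in_features = backbone.fc.in_features               # 2048 for ResNet-50
    backbone.fc = nn.Linear(in_features, num_classes)   # replace the ImageNet head
    return backbone
```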
Data augmentation
Data augmentation is a standard ingredient to train image classification models. In domain generalization, data augmentation can play an especially important role when augmentations can approximate some of the variations between domains. Therefore, for all non-MNIST datasets, we train using the following data augmentations: crops of random size and aspect ratio, resizing to 224 × 224 pixels, random horizontal flips, random color jitter, grayscaling the image with 10% probability, and normalization using the ImageNet channel means and standard deviations. For MNIST datasets, we use no data augmentation.
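The listed pipeline maps almost directly onto torchvision transforms; a sketch follows, where the exact jitter strengths are assumptions (only the operations themselves come from the text).

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),            # crop of random size/aspect ratio, resized to 224x224
    transforms.RandomHorizontalFlip(),            # random horizontal flip
    transforms.ColorJitter(0.3, 0.3, 0.3, 0.3),   # random color jitter (strengths assumed)
    transforms.RandomGrayscale(p=0.1),            # grayscale with 10% probability
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),  # ImageNet channel standard deviations
])
```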
Using all available data
In Rotated MNIST, whereas the usual version of the dataset constructs all domains from the same set of 1000 digits, we divide all the MNIST digits evenly among domains. We deviate from standard practice for two reasons: we believe that using the same digits across training and test domains amounts to leaking test data, and we believe that artificially restricting the available training domain data complicates the task in an unrealistic way.
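A hedged sketch of this construction: pool all 70,000 MNIST digits, shuffle them, and shard them disjointly and roughly evenly across the six rotation angles. The helper details are illustrative, not DomainBed's exact loader.

```python
import torch
from torchvision import datasets
from torchvision.transforms import functional as TF

def make_rotated_mnist(root, angles=(0, 15, 30, 45, 60, 75)):
    train = datasets.MNIST(root, train=True, download=True)
    test = datasets.MNIST(root, train=False, download=True)
    images = torch.cat([train.data, test.data])        # all 70,000 digits
    labels = torch.cat([train.targets, test.targets])
    perm = torch.randperm(len(images))                 # shuffle before sharding
    domains = []
    for i, angle in enumerate(angles):
        idx = perm[i::len(angles)]                     # disjoint, roughly equal shards
        shard = images[idx].unsqueeze(1).float()       # (n, 1, 28, 28)
        rotated = torch.stack([TF.rotate(img, angle) for img in shard])
        domains.append((rotated, labels[idx]))
    return domains
```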
5. Experiments
We run experiments for all algorithms (Section 4.2), datasets (Section 4.1), and model selection criteria (Section 3) shipped in DOMAINBED. We consider all configurations of a dataset where we hide one domain for testing and train on the remaining ones.
Hyperparameter search
For each algorithm and test environment, we conduct a random search [Bergstra and Bengio, 2012] of 20 trials over the hyperparameter distribution (see Appendix D). We use each model selection method from Section 3 to select amongst the 20 models from the random search. We split the data from each domain into 80% and 20% splits. We use the larger splits for training and final evaluation, and the smaller splits to select hyperparameters.
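A sketch of this search protocol, with placeholder helpers (`sample_hparams`, `train_algorithm`, `selection_score`) standing in for DomainBed's actual machinery; the example search ranges are assumptions.

```python
import random

def random_search(n_trials, sample_hparams, train_algorithm, selection_score, seed=0):
    """Train n_trials models with random hyperparameters and keep the best one
    according to the chosen model selection criterion."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        hparams = sample_hparams(rng)            # draw from the hyperparameter distribution
        model = train_algorithm(hparams)         # train on the 80% training splits
        trials.append((selection_score(model), hparams, model))  # score on the 20% splits
    return max(trials, key=lambda t: t[0])

def sample_hparams(rng):
    # Example search distribution (ranges are assumptions, not the paper's).
    return {"lr": 10 ** rng.uniform(-5, -3.5),
            "weight_decay": 10 ** rng.uniform(-6, -2)}
```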
Standard error bars
While some domain generalization literature reports error bars across seeds, randomness arising from model selection is often ignored. While this is acceptable if the goal is a best-versus-best comparison, it precludes more nuanced analyses. For instance, does method A outperform method B only because the random search for A got lucky? We therefore repeat our entire study three times, making every random choice anew: hyperparameters, weight initializations, and dataset splits. Every number we report is a mean over these repetitions, together with its estimated standard error.
This experimental protocol amounts to training a total of 45,900 neural networks.
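For reference, a small sketch of the reporting convention: each table entry is a mean over the three repetitions of the sweep together with its estimated standard error.

```python
import numpy as np

def mean_and_standard_error(accuracies):
    accs = np.asarray(accuracies, dtype=float)
    mean = accs.mean()
    stderr = accs.std(ddof=1) / np.sqrt(len(accs))   # sample standard deviation / sqrt(n)
    return mean, stderr

# e.g. mean_and_standard_error([85.7, 86.1, 85.4]) -> (approx. 85.7, approx. 0.2)
```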
5.1. Results
Table 4 summarizes the results of our experiments. For each dataset and model, we average the best results (according to each model selection criterion) across test domains. We then report the average of this number across three independent runs of the entire sweep, and its corresponding standard error. For results per dataset and domain, we refer the reader to Appendix B. We draw three main conclusions from our results:
Our ERM baseline outperforms all previously published results
Table 1 summarizes this result when model selection is performed using a training-domain validation set. What is responsible for this strong performance? We suspect four factors: a bigger network architecture (ResNet-50), strong data augmentations, careful hyperparameter tuning and, in Rotated MNIST, using the full training data to construct our domains (instead of using a 1000-image subset). While we are not the first to use any of these techniques alone, we may be the first to combine all of them. Interestingly, these results suggest that standard techniques for improving in-distribution generalization are also very effective at improving out-of-distribution generalization. Our result does not refute prior work: it is possible that with similar techniques, some competing methods may improve upon ERM. Rather, our results highlight the importance of comparing domain generalization algorithms to strong and realistic baselines. Incorporating novel algorithms into DOMAINBED is an easy way to do so. For an extensive review of results published in the literature about more than thirty algorithms, we refer the reader to Appendix A.5.
When all conditions are equal, no algorithm outperforms ERM by a significant margin
We observe this result in Table 4, obtained by running from scratch every combination of dataset, algorithm, and model selection criterion included in DOMAINBED. Given any model selection criterion, no method improves upon the average performance of ERM by more than one point. We do not claim that any of these algorithms cannot possibly improve upon ERM, but obtaining substantial domain generalization improvements over ERM on these datasets proved challenging.
Model selection methods matter
We observe that model selection with a training domain validation set outperforms leave-one-domain-out cross-validation across multiple datasets and algorithms. This does not mean that using a training domain validation set is the right way to tune hyperparameters. After all, it did not enable any algorithm to significantly outperform the ERM baseline. Moreover, the stronger performance of oracle-selection (+2%) suggests possible headroom for improvement.
6. Outlook
We have conducted an extensive empirical evaluation of domain generalization algorithms. Our results led to two major conclusions. First, empirical risk minimization achieves state-of-the-art performance when compared to eight popular domain generalization alternatives, also improving upon all the numbers previously reported in the literature. Second, model selection has a significant effect on domain generalization, and it should be regarded as an integral part of any proposed method. We conclude with a series of mini-discussions that answer some questions, but raise even more.
How can we push data augmentation further?
While conducting our experiments, we became aware of the power of data augmentation. Zhang et al. [2019] show that strong data augmentation can improve out-of-distribution generalization while not impacting in-distribution generalization. We think of data augmentation as feature removal: the more we augment a training example, the more invariant we make our predictor with respect to the applied transformations. If the practitioner is lucky and performs the data augmentations that cancel the spurious correlations varying from domain to domain, then out-of-distribution performance should improve. Given a particular domain generalization problem, what sort of data augmentation pipelines should we implement?
Is this as good as it gets?
We question whether domain generalization is expected in the considered datasets. Why do we assume a neural network should be able to classify cartoons, given only photorealistic training data? In the case of Rotated MNIST, do truly rotation-invariant features discriminative of the digit class exist? Are those features expressible by a neural network? Even in the presence of correct model selection, is the out-of-distribution performance of modern ERM implementations as good as it gets? Or is it simply as bad as every other alternative? How can we establish upper-bounds on what performance is achievable out-of-distribution via domain generalization techniques?
Are these the right datasets?
Some of the datasets considered in the domain-generalization literature do not reflect realistic situations. In reality, if one wanted to classify cartoons, the easiest option would be to collect a small labeled dataset of cartoons. Should we consider more realistic, impactful tasks for better research in domain generalization? Attractive alternatives include medical imaging in different hospitals and self-driving cars in different cities.
It is all about (untestable) assumptions
Every time we use ERM, we assume that training and testing examples are drawn from the same distribution, and every time this assumption is untestable. The same applies to domain generalization: each algorithm assumes a different (untestable) type of invariance across domains. Therefore, the performance of a domain generalization algorithm depends on the problem at hand, and only time can tell if we have made a good choice. This is akin to the generalization of a scientific theory such as Newton’s gravitation, which cannot be proved but has so far resisted falsification. We believe there is promise in algorithms with self-adaptation capabilities during test time.
Benchmarking and the rules of the game
While limiting the use of modern techniques makes experiments cheaper, it also pulls them away from the more realistic scenarios that are the focus of our study. Our view is that benchmark designers should balance these factors to promote a set of rules of the game that are not only well-defined, but realistic and well-motivated. Synthetic datasets are helpful tools, but we must not lose sight of the goal, which is artificial intelligence able to generalize in the real world. In the words of Marcel Proust:
Perhaps the immobility of the things that surround us is forced upon them by our conviction that they are themselves, and not anything else, and by the immobility of our conceptions of them.
Why do you have to make things hard for me with such a philosophical remark...
Broader impact
Current machine learning systems fail capriciously when facing novel distributions of examples. This unreliability hinders the application of machine learning systems in critical applications such as transportation, security, and healthcare. Here we strive to find robust machine learning models that discard spurious correlations, as we expect invariant patterns to generalize out-of-distribution. This should lead to fairer, safer, and more reliable machine learning systems. But with great power comes great responsibility: researchers in domain generalization must adhere to the strictest standards of model selection and evaluation. We hope that our results and the release of DOMAINBED are some small steps in this direction, and we look forward to collaborating with fellow researchers to streamline reproducible and rigorous research towards true generalization power.
C. Dataset details
DOMAINBED includes downloaders and loaders for seven multi-domain image classification tasks:
• Colored MNIST [Arjovsky et al., 2019] is a variant of the MNIST handwritten digit classification dataset [LeCun, 1998]. Domain d ∈ {0.1, 0.3, 0.9} contains a disjoint set of digits colored either red or blue. The label is a noisy function of the digit and color, such that color bears correlation d with the label and the digit bears correlation 0.75 with the label (see the construction sketch after this list). This dataset contains 70,000 examples of dimension (2, 28, 28) and 2 classes.
• Rotated MNIST [Ghifary et al., 2015] is a variant of MNIST where domain d ∈ { 0, 15, 30, 45, 60, 75 } contains digits rotated by d degrees. Our dataset contains 70,000 examples of dimension (1, 28, 28) and 10 classes.
• PACS [Li et al., 2017] comprises four domains d ∈ { art, cartoons, photos, sketches }. This dataset contains 9,991 examples of dimension (3, 224, 224) and 7 classes.
• VLCS [Fang et al., 2013] comprises photographic domains d ∈ { Caltech101, LabelMe, SUN09, VOC2007 }. This dataset contains 10,729 examples of dimension (3, 224, 224) and 5 classes.
• Office-Home [Venkateswara et al., 2017] includes domains d ∈ { art, clipart, product, real }. This dataset contains 15,588 examples of dimension (3, 224, 224) and 65 classes.
• Terra Incognita [Beery et al., 2018] contains photographs of wild animals taken by camera traps at locations d ∈ { L100, L38, L43, L46 }. Our version of this dataset contains 24,788 examples of dimension (3, 224, 224) and 10 classes.
• DomainNet [Peng et al., 2019] has six domains d ∈ { clipart, infograph, painting, quickdraw, real, sketch }. This dataset contains 586,575 examples of size (3, 224, 224) and 345 classes.
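As promised above, here is a hedged sketch of the Colored MNIST construction from the first bullet: a binary label that agrees with the digit identity with probability 0.75, and a color that agrees with the label with probability d. This follows the spirit of Arjovsky et al. [2019]; exact details may differ from DomainBed's implementation.

```python
import torch

def color_mnist_domain(images, digits, d, generator=None):
    """images: (n, 28, 28) grayscale digits; returns (n, 2, 28, 28) two-channel
    (red/blue) images and binary labels for one domain with color-label correlation d."""
    n = len(images)
    def bernoulli(p):
        return (torch.rand(n, generator=generator) < p).long()
    label = ((digits < 5).long() + bernoulli(0.25)) % 2      # digit agrees with label w.p. 0.75
    color = (label + bernoulli(1 - d)) % 2                   # color agrees with label w.p. d
    colored = torch.stack([images, images], dim=1).float()   # duplicate into two color channels
    colored[torch.arange(n), 1 - color, :, :] = 0            # zero out the unused channel
    return colored, label
```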
For all datasets, we first pool the raw training, validation, and testing images together. For each random seed, we then instantiate random training, validation, and testing splits.
D. Model architectures, hyperparameter spaces, and other training details
In this section we describe the model architectures and hyperparameter search spaces used in our experiments.
D.1. Architectures
We list the neural network architecture used for each dataset in Table 6 and give the details of our MNIST network in Table 7.
D.2. Hyperparameters
We list all hyperparameters, their default values, and the search distribution for each hyperparameter in our random hyperparameter sweeps, in Table 8.
D.3. Other training details
We optimize all models using Adam [Kingma and Ba, 2015].
'Research > NLP_YS2024' 카테고리의 다른 글
[MaPLe] Multi-modal Prompt Learning (0) 2024.12.05 [DPLCLIP] Domain Prompt Learning for Efficiently Adapting CLIP to Unseen Domains (0) 2024.12.05 Layer의 재사용에 대하여 (0) 2024.12.03 A High-level Overview of Large Language Models (0) 2024.12.01