A Distributional Approach to Controlled Text Generation
https://arxiv.org/pdf/2012.11635
May 2021 (ICLR 2021)
Abstract
We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach makes it possible to specify, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM — to our knowledge, the first model with such generality — while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence.
1. Introduction
Neural language models, such as GPT-2/3 (Radford et al., 2019; Brown et al., 2020a), pretrained on huge amounts of text, have become pre-eminent in NLP, producing texts of unprecedented quality. In this paper, we are concerned with the problem of controlling a generic pretrained LM in order to satisfy certain desiderata. For instance, we may want to avoid toxic content; prevent certain demographic biases; or steer generations towards a certain topic or style. Prior work, taking inspiration from Reinforcement Learning (RL), has aimed at inducing autoregressive models to optimize global objectives using task specific rewards such as BLEU and ROUGE for Machine Translation and Summarization (Ranzato et al., 2016; Bahdanau et al., 2017), or hand crafted rewards (Li et al., 2016b; Tambwekar et al., 2019) to improve certain a priori desirable features.
However, such an optimization process is not infallible; Liu et al. (2016a) noted that it often leads to “degeneration”, producing poor examples that improve the average reward but forgo coherence and fluency. This degeneration is often diagnosed as an effect of deviating too much from the original pretrained LM during optimization. Consequently, prior work has regarded proximity to the pretrained model as a prescription for sample quality. This view is most prominent in open-domain generation where no gold references are available for fine-tuning, making the pretrained LM itself the yardstick for fluency. Jaques et al. (2017); Ziegler et al. (2019) propose a conservative fine-tuning approach moderated by a KL penalty between the trained policy and the original LM, discouraging large deviations. A KL penalty was also used by Dathathri et al. (2020), this time in a plug-and-play rather than a fine-tuning context. However, the authors show that balancing policy deviations from the original LM while also satisfying the control conditions is delicate. To combat degeneration they had to combine the KL penalty with post-norm fusion, reranking, and early-stopping procedures.
Most of the existing work on Controlled Generation has taken what we refer to as a “pointwise” view, namely focusing on the quality of each individual output, a view that is encouraged by the standard RL goal of maximizing rewards computed at the individual level. Such techniques are incapable of enforcing “distributional” conditions, where some collective statistical properties are desired over the set of all generations.
Distributional control is key to solving the problem of social biases in LMs trained on large, uncurated Web corpora. Those LMs - dubbed “Stochastic Parrots” in (Bender et al., 2021) - tend to encode hegemonic biases that are harmful to marginalized populations. There has been a large body of work analysing these distributional biases (Blodgett et al., 2020; Stanovsky et al., 2019; Prates et al., 2020; Sheng et al., 2019a; Brown et al., 2020b). However, applying distributional control on pretrained models is still an understudied problem. Sheng et al. (2020) introduce a method relying on adversarial triggers (Wallace et al., 2019); this method does not de-bias the whole distribution but only obtains non-biased continuations of given prompts. Bordia & Bowman (2019) introduce a regularization term for reducing gender bias when training a language model from scratch (as opposed to de-biasing a pretrained model).
In this work, we present our Generation with Distributional Control (GDC) approach, in which we formalize the problem of controlled text generation as a constraint satisfaction problem over the probability distribution p representing the desired target LM. Namely, we require the expectations (“moments”) relative to p of certain output features to have specific values; this makes it possible, for instance, to require that all outputs speak about sports (a pointwise constraint) and that 50% of them mention female characters (a distributional constraint). Additionally, we require p to have a minimal KL divergence DKL(p, a) from the original pretrained LM a. This has the effect that p inherits favorable linguistic qualities from a. As we will explain, this formulation is a generalization of the Maximum Entropy Principle and leads to a unique solution P(x). P(x) is an unnormalized distribution, aka an Energy-Based Model (EBM) (Hinton, 2002; LeCun et al., 2006; Bakhtin et al., 2020), of which p(x) = 1/Z P(x) is the normalized version, where Z = Σ_x P(x) is the partition function of P.
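A compact restatement of this formalization, written as a LaTeX sketch (notation follows the paper, with μ̄_i denoting the target moments; for a purely pointwise constraint the exponential factor degenerates into a binary filter on the sequences):

```latex
% p is the distribution in the constraint manifold C closest to a in KL divergence
\begin{aligned}
  p &= \operatorname*{arg\,min}_{q \in C} D_{\mathrm{KL}}(q \,\|\, a),
  \qquad
  C = \bigl\{\, q : \mathbb{E}_{x \sim q}\,\phi_i(x) = \bar{\mu}_i,\ i = 1,\dots,k \,\bigr\}, \\[4pt]
  % the solution lies in an exponential family built on top of a
  P(x) &= a(x)\, e^{\langle \lambda,\, \phi(x) \rangle},
  \qquad
  p(x) = \frac{P(x)}{Z}, \qquad Z = \sum_x P(x),
\end{aligned}
```

where λ is chosen so that the moment constraints hold.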
Computing the EBM representation P is a crucial step, as it fully determines the optimal distribution p we are looking for. However, it is not the end of the story, because this representation does not enable us to sample directly from p, an essential property of any LM. To this end, we introduce KL-adaptive DPG (Distributional Policy Gradient), a variant of an algorithm recently proposed in (Parshakova et al., 2019b). We train the policy πθ to approximate p in an adaptive way, speeding up each round of approximation by exploiting the approximations obtained in previous rounds. At the end of this process, we obtain a final πθ, our target LM, on which we can estimate diverse metrics, including DKL(p, πθ), measuring the approximation quality of πθ relative to the optimal p, and DKL(πθ, a), measuring the divergence of πθ relative to the original LM a.
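To illustrate the KL-adaptive DPG idea, below is a minimal, self-contained sketch on a toy problem: the “sequence space” is just 50 atomic outcomes, a fixed categorical distribution stands in for the pretrained LM a, and the EBM encodes a single pointwise constraint. Here the target p, the exact Z, and the exact KL are all computable, which the real algorithm cannot do; the point is only to show the shape of the importance-weighted gradient and the adaptive proposal swap, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                                        # toy "sequence space": 50 atomic outcomes
a_logits = rng.normal(size=V)
a = np.exp(a_logits) / np.exp(a_logits).sum() # stand-in for the pretrained LM a(x)

phi = (np.arange(V) % 7 == 0).astype(float)   # binary pointwise feature phi(x)
P = a * phi                                   # EBM for a pointwise constraint: P(x) = a(x)*phi(x)
Z = P.sum()                                   # exact partition function (computable in the toy only)
p = P / Z                                     # optimal target distribution

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p1, p2):
    m = p1 > 0
    return float(np.sum(p1[m] * np.log(p1[m] / p2[m])))

theta = a_logits.copy()                       # policy pi_theta initialized at a
q = a.copy()                                  # initial proposal = a
lr, K = 0.1, 256
for step in range(2000):
    pi = softmax(theta)
    x = rng.choice(V, size=K, p=q)            # sample a minibatch from the proposal q
    w = P[x] / (Z * q[x])                     # importance weights p(x)/q(x)
    # DPG update: gradient of E_{x~p}[log pi_theta(x)] estimated with samples from q
    grad = (w[:, None] * (np.eye(V)[x] - pi[None, :])).mean(axis=0)
    theta += lr * grad
    # KL-adaptive step: adopt pi_theta as the new proposal when it is closer to p than q
    pi = softmax(theta)
    if kl(p, pi) < kl(p, q):
        q = pi.copy()

print("DKL(p || pi_theta):", kl(p, softmax(theta)))
print("E_{pi_theta}[phi] :", float(softmax(theta) @ phi))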
This two-step approach differs from much research in NLP-oriented work with EBMs, which tends to use EBM representations inside the training loops of neural networks, blurring different dimensions of the problem. By contrast — similarly to Parshakova et al. (2019a;b) in a different context — we clearly decouple the relatively simple problem of determining a “pivot” optimal EBM from the more difficult problem of exploiting this EBM at inference time. Such decoupling is valuable because it makes it easier to diagnose the important challenges to focus on.
Overall, our contributions can be summarized as follows:
1. We introduce a Distributional View for controlled text generation formalized as a constraint satisfaction problem combined with a divergence minimization objective, providing a single framework both for “distributional” constraints (collective statistical requirements) and for “pointwise” constraints (hard requirements on each individual) (§2.1). To our knowledge, this is the first framework with such generality for controlled text generation.
2. We show how these constraints lead to an optimal EBM for the target model (§2.2), propose the KL-Adaptive DPG algorithm for approximating the optimal EBM distribution by an autoregressive policy (§2.3), and show the effectiveness of this adaptive technique for obtaining faster convergence (§B.2).
3. We conduct experiments in a number of pointwise and distributional conditions, assessing results in terms of divergence from GPT-2, fluency and diversity, with better performance than strong baselines. The distributional experiments show the potential of our approach as a remedy to the current and important problem of bias in pretrained language models, providing a novel direction for addressing it (§3).
2. Formalization
"Minimizing D_kl(c, u) is equivalent to maximizing the entrpoy of c under the constraints - find the lease specific distribution satisfying constraints." - Maximum Entropy Principle
2.1. Constraints, Information Geometry, Exponential Families
2.2. From Moment Constraints to EBM
2.3. From EBM to Autoregressive Policy
3. Experiments, Results, and Evaluation
In this section we describe our evaluation methodology and perform experiments on pointwise constraints (§3.2) and on distributional and hybrid constraints (§3.3). The Appendix contains a detailed view of evaluation (§H), comparison with extra baselines (§D.2), and an ablation study (§B.2).
3.1. Evaluation Metrics
The main metrics we report are: (1) E_{x∼πθ} φ_i(x), assessing the ability of πθ to reach the expectation goal on the i-th constraint, (2) DKL(p||πθ), the forward KL divergence from the optimal distribution (which should be as close to 0 as possible), (3) DKL(πθ||a), the reverse KL divergence from the original GPT-2; for details on the estimation of these metrics see §B.1.
Previous work has mostly focused on the diversity of each individual output, using Dist-1,2,3 scores (Li et al., 2016a) to measure repetitions within a single generated sequence. However, the shortcomings of optimization techniques in terms of sample diversity when training generative models for text have recently been documented in (Caccia et al., 2020). We therefore additionally report Self-BLEU-3,4,5 (Zhu et al., 2018) to measure repetitions at a distributional level across the whole set of generated samples, and also provide a token/type frequency analysis (see Fig. 4 and §H.4). Note that the KL divergence from the original GPT-2 also implicitly captures sample diversity: a distribution that focuses all its probability mass on a few sequences typically displays high divergence from GPT-2. Implementation details and hyper-parameters are available in the Appendix (§F).
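To make the diversity metrics concrete, here is a minimal sketch (whitespace tokenization as a stand-in for the actual tokenizer; this is not the paper's evaluation code): Dist-n is the ratio of distinct n-grams to total n-grams within one sequence, and the same ratio over the pooled n-grams of many samples gives a cheap distribution-level check in the spirit of Self-BLEU.

```python
from itertools import chain

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(tokens, n):
    """Dist-n of a single sequence: distinct n-grams / total n-grams (Li et al., 2016a)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def corpus_dist_n(samples, n):
    """Same ratio over the pooled n-grams of a whole sample set: a cheap
    distribution-level diversity check (Self-BLEU, not reimplemented here,
    instead scores each sample with BLEU against all the others)."""
    grams = list(chain.from_iterable(ngrams(s, n) for s in samples))
    return len(set(grams)) / len(grams) if grams else 0.0

samples = [s.split() for s in [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "a dog ran across the park",
]]
print([round(dist_n(s, 1), 2) for s in samples])   # within-sequence diversity
print(round(corpus_dist_n(samples, 3), 2))         # across-sample diversity
```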
3.2. Pointwise Constraints Experiments
Pointwise constraints are of the form E_p φ_i(x) = 1, with φ_i a binary feature. Unlike distributional constraints, they can be directly associated with a “reward”, namely φ_i itself. RL-inspired baselines can then be introduced naturally, and this is what we do here.
Single-Word constraints:
Here we constrain the presence of a specific word w in the generated text, i.e. φ(x) = 1 iff w appears in the sequence x. We use 9 single-word constraints of different rarity levels: “US” (original frequency: 7·10^-3), “China” (4·10^-3), “Canada” (2·10^-3), “amazing” (1·10^-3), “Paris” (5·10^-4), “restaurant” (6·10^-4), “amusing” (6·10^-5), “Vampire” (9·10^-5), “Wikileaks” (8·10^-5).
Word-list constraints:
We use 4 different word lists among those proposed in (Dathathri et al., 2020), covering the following topics: “kitchen”, “fantasy”, “politics”, and “computers”. We set φl(x) = 1 if x contains at least one word from the word list l.
Classifier-based constraints:
We use pre-trained classifiers from (Dathathri et al., 2020), which consist of a linear head on top of GPT-2. We select 4 classes and define corresponding pointwise constraints: “very positive”, “positive”, “very negative” and “Clickbait”. See §F for details on constraint computations.
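All three constraint types reduce to a binary feature φ(x) over generated text. Below is a minimal sketch of how such features might be written (whitespace tokenization, the score_class argument, and the 0.5 threshold are illustrative assumptions, not the paper's code; the actual classifier-based features use the linear heads from Dathathri et al., 2020):

```python
def phi_word(text: str, word: str) -> int:
    """Single-word constraint: 1 iff `word` appears in the sequence."""
    return int(word in text.split())

def phi_wordlist(text: str, wordlist: set[str]) -> int:
    """Word-list constraint: 1 iff at least one word from the topic list appears."""
    return int(any(w in wordlist for w in text.split()))

def phi_classifier(text: str, label: str, score_class, threshold: float = 0.5) -> int:
    """Classifier-based constraint: 1 iff a (hypothetical) scorer assigns the
    target class a probability above `threshold`."""
    return int(score_class(text, label) > threshold)

print(phi_word("I visited Paris last summer", "Paris"))              # 1
print(phi_wordlist("boil the rice and add salt", {"boil", "oven"}))  # 1
```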
Baselines:
We compare our method GDC to three baselines: (1) REINFORCE (Williams, 1992b), using the reward φ(x), i.e. trying to maximize E_{πθ} φ(x); (2) REINFORCE_P(x): REINFORCE again, but now using the reward P(x) based on our energy model P, i.e. maximizing E_{πθ} P(x); this baseline starts from the same optimal EBM representation P as GDC but with a standard optimization objective rather than a distributional one; in other words, while GDC tries to obtain a sampling distribution similar to p, this baseline tries to obtain sequences of maximal probability p(x). (3) ZIEGLER (Ziegler et al., 2019): an approach relying on the Proximal Policy Optimization (PPO) RL algorithm (Schulman et al., 2017), which tries to maximize the objective E_{πθ} φ(x) − β·DKL(πθ, a), interpolating the reward φ(x) with a KL-divergence penalty from the pretrained model, but where the goal is not explicitly to satisfy a constraint; for a geometric illustration of the differences with GDC see §D.1. §D.2 provides a comparison of GDC with two additional baselines.
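For reference, the training objectives being compared can be written side by side (a LaTeX sketch of the objectives as described above; β is ZIEGLER's KL coefficient, and the GDC line corresponds to training πθ to approximate the optimal p):

```latex
\begin{aligned}
\text{REINFORCE:} \quad & \max_{\theta}\ \mathbb{E}_{x \sim \pi_\theta}\,[\phi(x)] \\
\text{REINFORCE}_{P(x)}: \quad & \max_{\theta}\ \mathbb{E}_{x \sim \pi_\theta}\,[P(x)] \\
\text{ZIEGLER:} \quad & \max_{\theta}\ \mathbb{E}_{x \sim \pi_\theta}\,[\phi(x)] - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, a) \\
\text{GDC:} \quad & \min_{\theta}\ D_{\mathrm{KL}}(p \,\|\, \pi_\theta), \qquad p = \operatorname*{arg\,min}_{q \in C} D_{\mathrm{KL}}(q \,\|\, a)
\end{aligned}
```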
Results:
Figure 2 shows the evolution of the metrics over training steps, aggregated across the 9 + 4 + 4 = 17 experiments. We observe the following: the baseline REINFORCE, which has no explicit link to the pretrained GPT-2 in its objective, converges very early in training, reaching a maximum value of E_{πθ} φ(x) at the expense of a very large deviation from the original GPT-2. High values of DKL(πθ||a) translate into low Dist-1 and very high Self-BLEU-5, indicating degeneration and lack of diversity. REINFORCE_P(x) maximizes the energy model P by peaking on a few sequences only; this can yield high values of E_{πθ} P(x), at the expense of low sample diversity, as demonstrated by the highest Self-BLEU-5 scores among baselines.
In the case of ZIEGLER we can see a positive effect of the interpolation factor β between the reward and the KL penalty in the objective function. In the aggregated experiments reported here, its reward is slightly better than with GDC, but with inferior diversity scores (see also Fig. 4, showing that GDC produces a richer vocabulary), and its stability is much worse (a detailed view of each experiment is provided in §H, showing more clearly the instability of this baseline). A complementary evaluation is provided by Figure 3, focusing on the ability of πθ to converge to the optimal distribution p. We see that GDC is superior to all baselines in terms of DKL(p||πθ) and is also much more stable.
In summary, in these experiments, we see that with GDC the constraint expectation Eπθ φ(x) smoothly increases while πθ maintains the lowest divergence from GPT-2, becomes closest to the optimal p, and has the best diversity scores overall. On the other hand, we also note that at the point where we stop training (30K steps), the average over experiments of Eπθ φ(x), while still increasing, does not reach 100%, an issue that we discuss at the end of the paper (§4).
3.3. Distributional and Hybrid Constraints Experiments
As formalized in §2, GDC makes it possible to define pointwise and distributional constraints, as well as any mix between them. This unique feature makes it very suitable for remedying biases that the text generation model may have, a problem identified in several previous works (Sheng et al., 2019b).
We employ GDC to balance gender and profession distributions across biographies generated by a GPT-2 model fine-tuned on Wikipedia Biographies (Lebret et al., 2016) (henceforth GPT-2bio) (§G gives additional details). The bias in GPT-2bio is significant: we calculated that this model generates only around 7% female biographies. It also displays a large imbalance between professions related to “Science” (1.5%), “Art” (10.0%), “Business” (10.9%) and “Sports” (19.5%).
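Measuring such distributional bias amounts to a Monte Carlo estimate of the feature moment E_{x∼πθ} φ(x) over a large sample. A minimal sketch follows, where phi_female is a crude pronoun heuristic used purely for illustration (the paper's actual attribute detectors are described in §G):

```python
def estimate_moment(samples, feature):
    """Monte Carlo estimate of E_{x ~ pi_theta}[phi(x)] from a set of generations."""
    return sum(feature(x) for x in samples) / len(samples)

def phi_female(text):
    # hypothetical detector: flags a biography as female if it uses she/her pronouns
    padded = f" {text.lower()} "
    return int(" she " in padded or " her " in padded)

bios = ["He was a physicist ...", "She was a painter and her murals ...", "He played football ..."]
print(estimate_moment(bios, phi_female))   # ≈ 0.33 on this tiny toy sample
```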
Experiment 1: Single Distributional Constraint
We use the distributional constraint Ex∼p φ_female(x) = 0.5; GDC is able to reduce the bias of GPT-2bio to obtain 35.6% female biographies rather than only 7.4% (see Fig. 2 for this experiment and the next ones).
Experiment 2: Multiple Distributional Constraints
We then test our framework with several distributional constraints of different values and control directions. We specify four distributional constraints all at once, with the goal of increasing the expectations of “science” and “art” to 40% and decreasing those of “sports” and “business” to 10%. GDC is able to increase the expectations of the first two professions from 1.5% to 20.3% and from 10.0% to 31.6% respectively, and to decrease those of “business” and “sports” from 10.9% to 10.2% and from 19.5% to 11.9% respectively, reaching expectations close to the desired ones for all features using a single training method.
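To give a feel for how the exponential weights λ of the EBM can be tuned to hit several distributional targets at once, here is a minimal sketch on a toy categorical space where the moments under p_λ are exactly computable (the paper instead estimates such expectations by importance sampling from a; the feature names and targets below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 50
a = rng.dirichlet(np.ones(V))                    # stand-in base distribution
feats = np.stack([
    (np.arange(V) < 5).astype(float),            # toy "female" feature
    ((np.arange(V) >= 5) & (np.arange(V) < 15)).astype(float),  # toy "science" feature
])                                               # shape (k, V)
targets = np.array([0.5, 0.4])                   # desired moments mu_bar

lam, lr = np.zeros(2), 1.0
for step in range(5000):
    P = a * np.exp(lam @ feats)                  # p_lambda(x) ∝ a(x) exp(<lam, phi(x)>)
    p = P / P.sum()
    moments = feats @ p                          # E_{p_lambda}[phi_i]
    # gradient descent on the max-ent dual: grad = E_{p_lambda}[phi] - mu_bar
    lam += lr * (targets - moments)

print("achieved moments:", np.round(feats @ p, 3))   # ≈ [0.5, 0.4]
```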
Experiments 3,4,5,6: Hybrid Constraints
Here we want to de-bias the model as in the previous case, but we single out biographies of scientists, artists, etc. Formally, our requirements become E_{x∼p} φ_profession(x) = 1.0, a pointwise constraint, and E_{x∼p} φ_female(x) = 0.5, a distributional constraint. In those 4 hybrid experiments we can clearly see that GDC addresses both pointwise and distributional constraints simultaneously, increasing each by just the right amount to reach the desired expectations. Appendix §G further elaborates on Fig. 2 (convergence curves).
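In the EBM representation, such a hybrid specification combines a hard filter for the pointwise constraint with an exponential tilt for the distributional one (a LaTeX sketch in the paper's notation; λ is set so that the 50% moment holds):

```latex
P(x) = a(x)\,\underbrace{\phi_{\text{profession}}(x)}_{\text{pointwise filter}}\;
       \underbrace{e^{\lambda\,\phi_{\text{female}}(x)}}_{\text{distributional tilt}},
\qquad
p(x) = \frac{P(x)}{Z},
\qquad
\mathbb{E}_{x \sim p}\,\phi_{\text{female}}(x) = 0.5 .
```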
4. Discussion
Our approach to controlled text generation is distinguished by its breadth — the first one to handle distributional along with pointwise constraints, with applications to the important problem of Bias in pretrained LMs — and by the transparency of the supporting formalism. It decouples the training objective along two different dimensions. The first consists in solving the initial constraints specification, and leads through a direct algorithm to an optimal solution in EBM format. The second, where the real computational difficulty lies, consists in approximating this EBM with an autoregressive policy for use at inference time.
Sampling from an EBM is an important, hard, and well-identified challenge in the literature. Our approach there consists in proposing a KL-adaptive version of the DPG algorithm, which exploits ascertained improvements of the trained policy to speed up convergence.
This is an effective method for rare events, as we show in an ablation study (§B.2). In the case of pointwise constraints, where comparisons with baselines can be done, our experiments show the method’s superiority in satisfying the constraints while avoiding degeneration. Some baselines can sometimes reach close to 100% of samples meeting the constraints, but only at a severe cost in terms of quality and sample diversity. Of course, if we do not care about such aspects, obtaining 100% constraint satisfaction is trivial: just generate one sentence satisfying the pointwise constraint!
Our method does not suffer from degeneration, but our end policies still generate a number of samples not satisfying the constraints. A possibility, left for future work, might consist in filling the moderate residual gap with MCMC techniques, which would be guaranteed to reach our optimal p in the limit. We do not go this route here, but conduct an experiment (see §C) to better understand the nature of the problem. In the simple case of a single-word constraint (x includes “amazing”), we sample directly 1M samples from GPT-2 and keep the roughly 5K samples containing amazing (a variant of rejection sampling, taking two processing days). We then do a standard supervised fine-tuning of GPT-2 with these samples, stopping training when the CE validation loss starts to increase, and observe that this model exhibits a worse constraint satisfaction rate than ours. This experiment does not mean that a much larger fine-tuning dataset, obtained in this slow, non-adaptive way, would not reach better statistics, but it raises doubts about the ability of the GPT-2 architecture to fine-tune over such a non-standard constraint as containing a given word somewhere in its output.
Overall, we believe that the proposed decomposition into two sub-problems is a methodological advantage compared to most other works, which directly aim at training a policy with the goal of improving certain evaluation metrics, but without clearly defining what qualifies as an optimal solution. The computational challenge of fully bridging the gap between the optimal EBM and an efficient sampling engine remains, and we hope that the formalism we propose, along with initial applications and experimental validations, will motivate further research along these lines.
B. Adaptivity
B.1. Details on KL-Adaptivity
In this section we provide details on the comparison step in our KL-Adaptive version of the DPG Algorithm, introduced in section 2. We want to assess whether the current πθ is closer than q to p, and if the test is positive, we set πθ as the new proposal, hoping to make the proposal more effective for importance sampling.
There are several ways to compute similarity between distributions, two of the most popular being the KL divergence and the Total Variation Distance (TVD) — where TVD(p, p′) = ½ Σ_x |p(x) − p′(x)| — which is often used in probability and MCMC theory. Calculating these metrics relative to p is not straightforward, since the distribution p ∝ P is only implicitly represented by the unnormalized EBM P, and we cannot easily obtain direct samples from p. In this section we describe a workaround.
Given P and a proposal distribution q that we can sample from, using importance sampling (Owen, 2013), one can calculate the partition function Z as follows: Z = E_{x∼q}[P(x)/q(x)], estimated by Ẑ = 1/K Σ_{k=1..K} P(x_k)/q(x_k) with x_1, …, x_K ∼ q.
In §B.2 we run an ablation study comparing the use of DKL (line 6 of Algorithm 2) with its replacement by TVD.
For both metrics, we need an estimate of Z. The precision of this estimate depends on the sample size and the quality of the proposal distribution q. We maintain a moving-average estimate Z_MA of Z, which is used inside the estimates of DKL(p||πθ) and DKL(p||q) (Algorithm 3, lines 7 and 8). Z_MA is updated at each iteration of training, and the moving average is valid because each Ẑ_i, based on K samples, is an unbiased estimate of Z, and therefore so is Z_MA. In this way, the estimate benefits from all the samples produced during training; moreover, because the proposal distribution q evolves and gets closer to the target distribution p, the quality of the importance-sampling estimates of both DKL(p||πθ) and Z_MA increases (equation 7). A similar approach is taken in the case of TVD (not shown).
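A minimal sketch of these estimators on a toy discrete space, where the exact quantities are available for checking; the proposal q is held fixed and πθ is an arbitrary distribution here, purely to show the form of the importance-sampling estimates and the moving average Z_MA (this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
V = 200
a = rng.dirichlet(np.ones(V))                     # base distribution (stand-in for GPT-2)
phi = (np.arange(V) % 11 == 0).astype(float)      # binary pointwise feature
P = a * phi                                       # unnormalized EBM
true_Z, p = P.sum(), (a * phi) / (a * phi).sum()  # exact values, for checking only

q = a.copy()                                      # proposal we can sample from
pi = rng.dirichlet(np.ones(V))                    # some policy pi_theta to evaluate

K, Z_MA, n_batches = 4096, 0.0, 0
for it in range(20):                              # only the estimators are shown here
    x = rng.choice(V, size=K, p=q)
    w = P[x] / q[x]                               # importance weights P(x)/q(x)
    n_batches += 1
    Z_MA += (w.mean() - Z_MA) / n_batches         # moving-average estimate of Z (each batch mean is unbiased)

    # DKL(p||pi) = (1/Z) E_q[ (P/q)(log P - log Z - log pi) ]; terms with P(x)=0 contribute 0
    terms, m = np.zeros(K), w > 0
    terms[m] = w[m] * (np.log(P[x][m]) - np.log(Z_MA) - np.log(pi[x][m]))
    kl_est = terms.mean() / Z_MA

    # TVD(p, pi) = 1/2 E_q[ |P/Z - pi| / q ]
    tvd_est = 0.5 * np.mean(np.abs(P[x] / Z_MA - pi[x]) / q[x])

print("Z   :", round(Z_MA, 4), "vs exact", round(true_Z, 4))
print("DKL :", round(kl_est, 3), "vs exact",
      round(float(np.sum(p[p > 0] * np.log(p[p > 0] / pi[p > 0]))), 3))
print("TVD :", round(tvd_est, 3), "vs exact", round(0.5 * float(np.abs(p - pi).sum()), 3))
```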
B.2. Ablation on Adaptivity
Here we run an ablation experiment on the adaptivity step of KL-Adaptive DPG (§2). We compare three variants of our proposed method: DPG-KLD uses the KL divergence from the target distribution p to measure the quality of the trained policy πθ, i.e. if DKL(p||πθ) < DKL(p||q) we update the proposal distribution q ← πθ. DPG-TVD is similar but uses the total variation distance (TVD) instead. In non-Adaptive, the initial proposal q is kept fixed during training.
We run 3 pointwise experiments with single-word constraints at three rarity levels in the original GPT-2 distribution, namely: “Vampire” (1/10^4), “Paris” (1/10^3), “US” (1/10^2). For each, we use 3 different seeds and train for 10k gradient updates.
Figure 6 shows the training trends of the three ablations. We find a significant difference in convergence speed in favour of the adaptive methods. The efficiency gap between the adaptive and non-adaptive methods grows as the constraints become rarer, i.e. as the initial proposal distribution q starts farther from the target distribution p, since the efficiency of the DPG algorithm depends on how close the proposal q is to the target p. When q is continuously adapted, the proposal distribution moves closer to p and training remains efficient regardless of how far the initial proposal is from p. We observe similar convergence rates for DPG-KLD and DPG-TVD.
D. More Comparisons
D.1. Illustration Comparing GDC, REINFORCE, and Ziegler
The figure below illustrates the difference between GDC, the RL-based REINFORCE and ZIEGLER baselines for a pointwise constraint. The main points to note are: (1) REINFORCE is trying to find a distribution pR maximizing r(x) (meaning that pR lies on the C manifold), but this pR is free to land anywhere on this manifold, and (2) ZIEGLER is trying to find a distribution pZ that interpolates (with a weight β) between a high average r(x) and the KL divergence from a; unless β = 0, in which case we are back to REINFORCE, pZ does not satisfy the constraint and falls outside of the manifold.
G. Distributional and Hybrid Control Experiments for Debiasing Language Models
Large pretrained Language Models are often trained on uncurated data from the internet, where several demographics are severely underrepresented. One of those demographics is women, whose biographies make up only 18.58% of English Wikipedia’s biographies (Graells-Garrido et al., 2015). It is expected that such bias is transferred, if not amplified, by Language Models. Previous work has suggested associations of certain demographics with certain professions, sentiments and stereotypes (Sheng et al., 2019b; Brown et al., 2020b; Nadeem et al., 2020). This shows that bias in LMs also takes forms other than simple under-representation, and that the task of debiasing LMs could require a more complex control method. GPT-2bio demonstrates a large initial bias: over a large sample of 20,480 examples generated with top-p sampling (p = 0.9), it produces only around 7% female biographies, and a large imbalance between profession types “Science” (1%), “Art” (10%), “Business&Politics” (10%) and “Sports” (20%).
In this set of experiments, we demonstrate the potential of GDC as a flexible, general framework that can control pretrained Language Models to impose pointwise constraints, distributional constraints, or a mix of the two (hybrid constraints). We design a set of 6 experiments whose descriptions and results are displayed in the figures below. Generation examples are provided in Table 7.
H. Extra Details on Pointwise Experiments
H.1. Approximating the desired distribution
H.2. More Details on Point-wise Constraints Experiments
H.4. Token Frequency Analysis
To analyse in depth the effect of deviating from the original GPT-2, for the policies obtained with our method and with each baseline we draw a large sample and filter it down to 4000 sequences satisfying the imposed pointwise constraint, for each of the 17 pointwise experiments described in §3. Figures 35, 36 and 37 plot a token frequency analysis for each of the training methods.
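For concreteness, the kind of token/type statistics behind such an analysis can be sketched as follows (whitespace tokenization as a stand-in for the GPT-2 tokenizer; illustrative only):

```python
from collections import Counter

def token_frequency_profile(samples):
    """Count token occurrences across a set of generations and report simple
    diversity statistics: number of distinct types, total tokens, the type/token
    ratio, and the most frequent tokens (a long, heavy tail indicates richer vocabulary)."""
    counts = Counter(tok for s in samples for tok in s.split())
    total = sum(counts.values())
    return {
        "types": len(counts),
        "tokens": total,
        "type_token_ratio": len(counts) / total,
        "top": counts.most_common(5),
    }

profile = token_frequency_profile([
    "the scientist published a paper on galaxies",
    "the scientist published a paper on galaxies",   # a degenerate policy repeats itself
    "an artist painted murals across the old city",
])
print(profile)
```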
The vanilla policy gradient baseline REINFORCE suffers from very low diversity of generations; in the examples shown in section H.5 we note strong degeneration, in which all generations are composed of a few repeated tokens.
REINFORCE_P(x) suffers from a token diversity issue. As noted and confirmed by the generated examples shown in section H.5, it often concentrates all the sequence probability mass on a single sequence, which is often fluent and satisfies the constraint; however, this leads to an extreme loss of sample diversity in almost all experiments. This shows the usefulness of our proposed analysis — in addition to the Self-BLEU metrics — for distinguishing diversity at the sequence level from diversity at the distribution level. Similarly, ZIEGLER (Ziegler et al., 2019) often suffers from the same lack of sample diversity (5 out of the 17 experiments). GDC obtains the highest diversity among all baselines, as demonstrated by the long tail in the figures below. It is important to note here that low sample diversity is also captured by the KL deviation from the original GPT-2 model, i.e. DKL(πθ||a); GDC identifies the target distribution as the one which minimally deviates from the original policy while satisfying the constraints (p = arg min_{q∈C} DKL(q, a)) and is thus expected to preserve the high sample diversity of the original GPT-2.
H.5. Generation Examples