Complex Reasoning
※ Summary written after taking the 「Advanced NLP - Carnegie Mellon University」 course
https://www.youtube.com/watch?v=mPd2hFmzjWE&list=PL8PYTP1V4I8DZprnWryM4nR8IZl1ZXDjg&index=19
https://phontron.com/class/anlp2024/assets/slides/anlp-21-reasoning.pdf
What is reasoning? The basic idea is using evidence and logic to arrive at conclusions and make judgements. From the philosophical standpoint, there are two varieties of this. One is formal reasoning, and formal reasoning is mostly based on strict truth values: you can definitely say this is true, or you can definitely say this is not true. In real life, there's very little actual formal reasoning outside of, for example, mathematics, algorithms in computer science, and other things like that.
And then separately from that, we have informal reasoning, which is based on experience and intuition. This was rather elusive until large language models; people were working on it, but it was really hard, and I think this is one of the big breakthroughs of the past few years.
I should note that the paper referenced here is a survey of reasoning in large language models.
There are three kinds of reasoning. The first one is deductive reasoning: using logic to go from premises to a conclusion (for example, "all mammals have kidneys; all cows are mammals; therefore all cows have kidneys"). This is largely what people talk about when they think about formal reasoning.
And separately, there's inductive reasoning: from observations, predict a likely conclusion, or a likely generalized conclusion (for example, "every crow we have seen so far is black, so crows are probably black"). This is kind of a soft version of deduction.
The final one is abductive reasoning: from an observation, predict the most likely explanation (for example, "the grass is wet, so it probably rained").
Before getting into the bulk of the talk, which is going to be about LLMs, I want to talk about some pre-LLM reasoning methods. The first one is formal reasoning within computational semantics. This has been around for a really long time; it's also what powered the systems that worked over knowledge bases. The way it works is, it does derivational reasoning by starting out with certain premises and deriving final conclusions. Neural networks can actually do this variety of reasoning through Chain-of-thought, but it's a very rough approximation.
This also handles queries that require multiple steps. For example: did all people who work at CMU get their PhD after 1990? The answer is obviously no, and a formal reasoner would be able to find the counter-evidence to that, whereas LLMs would not be guaranteed to be able to do so.
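To make the contrast concrete, here is a minimal, hypothetical sketch of this kind of formal reasoning: hand-written facts and a universally quantified claim checked by exhaustively searching for counter-evidence. The relation names and facts are made up for illustration, not taken from any real system.

```python
# Hypothetical facts: (relation, subject, value) triples, all strictly true or absent.
facts = {
    ("works_at", "alice", "CMU"),
    ("works_at", "bob", "CMU"),
    ("phd_year", "alice", 1985),
    ("phd_year", "bob", 1995),
}

def all_cmu_phds_after(year):
    """Check 'everyone at CMU got their PhD after <year>' by searching for counter-evidence."""
    cmu_people = [p for (rel, p, place) in facts if rel == "works_at" and place == "CMU"]
    for person in cmu_people:
        phd = next(y for (rel, p, y) in facts if rel == "phd_year" and p == person)
        if phd <= year:
            return False, person      # strict counter-example: the claim is definitely false
    return True, None                 # no counter-example found: the claim is definitely true

print(all_cmu_phds_after(1990))       # (False, 'alice') -- the counter-evidence an LLM might miss
```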
There are a couple of problems with it. The first is that it really only traffics in strictly true or strictly false statements, and that's a really big issue: if anything is soft, formal reasoning starts breaking down.
The second thing, which is actually a really big problem, is that once you start dealing with more complex things, there are always exceptions, and it becomes computationally expensive to prove anything that's non-trivial.
Another thing that's useful to talk about is memory networks. This isn't very popular right now, but it might become more popular in the future as we start hitting the limits of what we can fit into long context windows for neural models. The way memory networks work is, they have the ability to write to and read from memory. You have a query, you get the embedding of the query, you take the inner product with the memory, and you take the softmax of the inner products. So this looks like attention: you look up embeddings, take the weighted sum of the embeddings, and you get a summary of the memory.
So this is attention over a big memory base. But memory networks also have the ability to go in and update the memory, so they also have write operations. You can read from and write to the memory base.
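Here is a toy sketch of the read and write operations just described, with made-up shapes and randomly initialized memory. It is only meant to show the attention-like read (inner product, softmax, weighted sum) and a very simplified write; it is not the architecture from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 64))   # 100 memory slots, 64-dim embeddings (made-up sizes)
query = rng.normal(size=(64,))        # embedded query

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Read: inner product of the query with every slot, softmax, weighted sum = memory summary.
weights = softmax(memory @ query)     # (100,) attention weights over memory slots
summary = weights @ memory            # (64,) summary vector read from memory

# Write: a deliberately simplified update that blends new information into the most relevant slot.
new_info = rng.normal(size=(64,))
slot = weights.argmax()
memory[slot] = 0.5 * memory[slot] + 0.5 * new_info
```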
The reason why I say this might become more popular is that one of the big issues with large language models nowadays is that they don't get to continually update their memory. One way you can do that is to just add text to the memory, but there are limits to that: text isn't necessarily the best way to encode all of the things you've seen in the past. So I feel that figuring out how to attach these sorts of architectures to language models might be an interesting research direction for the future.
Another approach that's actually been around for a while is solving questions with symbolic reasoning. The way it works is, you have a text, and based on the text, you can run symbolic operations - find, filter, find-max-num and relocate. These explicitly manipulate the attention, and you can do things like filtering down to find the largest number, for example. This is interesting because some of the things neural networks are bad at are finding the largest number in a big dataset, or finding all of the things where something applies and throwing out all of the things where it doesn't.
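As a rough illustration (not the actual system from the lecture), the filter and find-max-num operations might look something like this, with hand-written spans standing in for what the model would extract from the passage:

```python
# Hypothetical (text span, number) pairs extracted from a passage.
passage_numbers = [("touchdown", 35), ("field goal", 48), ("touchdown", 12), ("field goal", 22)]

def filter_spans(spans, keyword):
    """Keep only the spans whose text matches the keyword (a toy 'filter' module)."""
    return [s for s in spans if s[0] == keyword]

def find_max_num(spans):
    """Return the span with the largest number (a toy 'find-max-num' module)."""
    return max(spans, key=lambda s: s[1])

# "What was the longest field goal?" -> filter, then find-max-num
print(find_max_num(filter_spans(passage_numbers, "field goal")))  # ('field goal', 48)
```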
So, this isn't used super widely in large language models right now because people have been focusing on prompting techniques in order to do this sort of reasoning, but I think this is another thing that's worth thinking about, taking a closer look at, and seeing if there are ways to incorporate it with current models.
All of the things I decided to introduce in this section are things that current models are still not particularly good at: reasoning over many steps across sets of inputs (computational semantics), reading from and writing to memory so you can remember things over long periods (memory networks), and filtering large pieces of text down into smaller pieces to find relevant information (symbolic reasoning).
Compared to standard prompting, where we have a question and an answer, in Chain-of-thought we have a question and then a derivation of the answer. By adding this to the prompt, you get the model to also do these derivations at test time, and this greatly improves some tasks where we can't immediately predict the answer directly.
I also previously talked about Zero-shot Chain-of-thought reasoning, where we just prompt the model with something like "let's think step by step" and then the model becomes able to do this chain-of-thought reasoning.
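As a quick illustration, here is roughly what the three prompting styles look like. The arithmetic example is the standard tennis-ball one from the chain-of-thought literature, and the exact wording is illustrative rather than copied from any paper.

```python
# Standard prompting: question followed directly by the answer.
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: 11"""

# Few-shot chain-of-thought: the exemplar includes the derivation before the answer.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11."""

# Zero-shot chain-of-thought: no exemplar, just a trigger phrase appended to the question.
zero_shot_cot_suffix = "Let's think step by step."
```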
Now I'd like to talk about some of the more advanced methods that people use for reasoning as well. This is by no means an exhaustive list; these are just the ones that I found interesting. The first one is Self-Ask. One of the issues with large language models nowadays is that they're not very good at asking follow-up questions, or maybe it's not that they're not very good at it, but that they're not trained to do it.
So if you play around with ChatGPT, I have never had ChatGPT ask me a follow-up question. I don't think it's because large language models aren't capable of doing it; OpenAI must think it's a bad user experience to have a language model that asks you follow-up questions. That's the only reason I can think of.
But what Self-Ask does is, it explicitly prompts the model to ask whether there are follow-up questions.
In this particular paper, this is just another variety of Chain-of-thought. It's not using it to incorporate any external information. It's just trying to more directly elicit information from the model. Nonetheless they demonstrate that this is useful.
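A Self-Ask style prompt looks roughly like the example below. The wording is paraphrased for illustration rather than copied verbatim from the paper.

```python
self_ask_prompt = """Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft."""
```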
There are also other methods that actually try to look up information explicitly to answer these questions, which are even more powerful than what we have here. That's what I'd like to introduce next.
This is a method that, instead of just doing Chain-of-thought, retrieves relevant sentences while doing the Chain-of-thought. In Self-Ask, we had the follow-up questions, but if the model itself doesn't know how old somebody was when they died, it won't be able to answer them.
So what they do in order to make this happen is, they do BM25-based retrieval over Wikipedia for each of the Chain-of-thought answers. And then they use the retrieved documents - I think it's 10 documents, multiple retrieved documents - to prompt the model to continue its chain-of-thought.
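A rough sketch of that interleaving is below, assuming hypothetical `bm25_search` and `llm` helpers standing in for a BM25 index over Wikipedia and a language model call. This is only one plausible way to wire the loop, not the exact algorithm from the paper.

```python
def retrieval_augmented_cot(question, bm25_search, llm, max_steps=5, k=10):
    """Interleave retrieval with chain-of-thought: retrieve, write one reasoning sentence, repeat."""
    context, chain = [], []
    query = question
    for _ in range(max_steps):
        context.extend(bm25_search(query, k=k))            # retrieve documents for the current step
        prompt = "\n".join(context) + "\nQ: " + question + "\n" + " ".join(chain)
        sentence = llm(prompt, stop="\n")                   # generate the next chain-of-thought sentence
        chain.append(sentence)
        if "answer is" in sentence.lower():                 # stop once the model commits to an answer
            return chain
        query = sentence                                     # retrieve again using the new sentence
    return chain
```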
So this is another variety of things that you can do in order to improve.
This is a pretty interesting result. The next series of results are based on the quality of the reasoning chains the model uses in Chain-of-thought, and this one is a simple heuristic for improving that quality. One thing I should mention is that the quality of the reasoning chain is definitely connected to the quality of the output. People have noticed that if your explanation is wrong, your prediction also tends to be wrong; if you make mistakes in intermediate steps of your explanation, it tends to mess up your final prediction. So one interesting way people have found to improve explanation quality comes from the observation that longer explanations tend to be better: if they give you more reasoning steps, they tend to be more accurate, and they actually demonstrate that in this paper.
Here's a simple reasoning chain (43% accuracy) and a more complex reasoning chain (58.5% accuracy) for exactly the same problem, and you can see they get about a 15-point boost. And these are naturally occurring reasoning chains; they didn't train the model to give longer reasoning chains. Among the naturally occurring reasoning chains, the longer ones tend to be better.
And this fact can simply be used to improve accuracy. The way they did this is, they sampled multiple reasoning paths and performed self-consistency over only the longer reasoning paths.
Self-consistency is, you do majority voting over the answers for multiple reasoning paths. So they threw out the lower quality ones and that improved overall accuracy.
So that's a thing that you can do.
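In code, the heuristic might look something like the sketch below, where `sample_chain` is a hypothetical function that samples one (reasoning text, answer) pair from the model; the keep fraction is an arbitrary choice for illustration.

```python
from collections import Counter

def long_chain_self_consistency(question, sample_chain, n_samples=20, keep_frac=0.5):
    """Sample several reasoning chains, keep only the longer ones, majority-vote over their answers."""
    chains = [sample_chain(question) for _ in range(n_samples)]   # each item: (reasoning_text, answer)
    chains.sort(key=lambda c: len(c[0]), reverse=True)            # longer reasoning chains first
    kept = chains[: max(1, int(len(chains) * keep_frac))]         # drop the shorter (lower-quality) half
    votes = Counter(answer for _, answer in kept)
    return votes.most_common(1)[0][0]                             # majority-vote answer
```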
One of the big results that's actually really important to know about is that this sort of Chain-of-thought reasoning is considered to be an emergent ability in large language models. What we mean by an "emergent ability" is typically something whose performance increases dramatically once the model size gets past a certain point.
What you see is, up until a certain point, you get zero accuracy and then outputs improve.
So for a while people were really confused about this: why does this happen? It feels like magic that you get a really powerful model and then suddenly it gets better at the very end.
Actually, there's a much simpler explanation; there's not that much magic to this. This paper from 2023 expressed it very clearly, so I highly recommend you take a look at it if you're interested in emergent abilities in language models.
The thing about "emergent abilities" is that they're mostly a matter of how you measure your model's accuracy.
I should check this paper myself; I couldn't quite follow the professor's explanation.
Professor, I can't see the whiteboard!
I had been curious about this: emergent abilities.
Another corollary of this is, let's say you want to do interesting experiments about reasoning on smaller models and see how they improve at reasoning. I would definitely encourage you to measure not only accuracy, because you might see very little change in accuracy, but also the log likelihood of the reasoning chains or something like that, because you'll see a smoother curve.
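As a small sketch of that measurement point: exact-match accuracy is all-or-nothing, while the log likelihood of a gold reasoning chain changes smoothly with model quality. Here `token_logprobs` is a hypothetical helper returning the per-token log probabilities the model assigns to the gold chain.

```python
def exact_match(pred_answer, gold_answer):
    """Discrete metric: all-or-nothing credit for the final answer."""
    return float(pred_answer.strip() == gold_answer.strip())

def chain_log_likelihood(gold_chain, token_logprobs):
    """Smooth metric: total log probability the model assigns to the gold reasoning chain."""
    return sum(token_logprobs(gold_chain))

# Averaged over a dataset, exact_match can sit near zero for many model sizes and then jump,
# while chain_log_likelihood usually improves gradually as models get larger.
```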
There are two ways you could be reasoning. One way is to give an explanation first and then predict the answer. The other way is to predict the answer first and then give the explanation.
In general, if you have a reasonably strong model, any of the modern frontier models right now, doing the explanation first and then making the prediction is better. The reason is that Chain-of-thought works: the model is able to break the question down into simpler questions, for mathematical reasoning or something like that, and then give the answer.
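As a tiny illustration, the two prompt orderings look something like this (the wording is made up):

```python
# Explanation first, then the answer: the final answer is generated after the reasoning.
explain_then_predict = "Q: {question}\nExplanation: <step-by-step reasoning>\nAnswer: <answer>"

# Answer first, then the explanation: the reasoning becomes an after-the-fact justification.
predict_then_explain = "Q: {question}\nAnswer: <answer>\nExplanation: <justification>"
```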
--------------------------------------------------------------------------------------------------------------------------------
One of the things I didn't talk about here is that this paper measures not just the accuracy of the answer with chain of thought, but also the factuality of the explanation (whether the explanation is a good explanation for the actual derivation) and the consistency of the answer with the explanation (whether the answer and the explanation match up with each other).
And they did this with some synthetic datasets where you could actually measure the reasoning steps by using math. What they found is that the answer and the explanation tend to be consistent, especially for the stronger models.
And that also meant that higher factuality in the explanation translated into higher accuracy of the actual prediction. I would bet that these numbers are even higher nowadays; I bet the consistency is even higher with more modern models than text-davinci-002.
The reason being, number one, models are stronger, and number two, all models are trained for chain-of-thought pretty aggressively now. So that would make the difference there.
The other thing I'd like to talk about is training for Chain-of-thought. There's a fair amount of work in this general direction. From my point of view, there are two ways people do this nowadays. The first way is usually through generating lots of synthetic data that represents chain-of-thought, and then using that to train models. This is the most famous version of this; the paper cites a lot of other ones, but basically they generate a large and diverse Chain-of-thought dataset from GPT-3.5 and GPT-4. It includes 5 million complex instructions: they generated 1 million from GPT-4 and 4 million from GPT-3.5, just because generating long sequences from GPT-4 is expensive and they didn't want to do that many.
And then they achieved correspondingly high accuracy on Chain-of-thought related things compared to other datasets: compared to Alpaca, which is much smaller and doesn't have as much Chain-of-thought, and also Vicuna, which is similarly less focused on Chain-of-thought, they were able to do a good job.
This paper was by Microsoft, and they didn't actually release the Orca dataset, for whatever reason, legal or competitive reasons or whatever. But there's another Open Orca dataset that you can download and use that attempts to replicate it, and it's reasonably good. So you can keep that in mind if you're interested.
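Schematically, this kind of synthetic chain-of-thought data generation looks something like the sketch below. `teacher_llm` is a hypothetical stand-in for the GPT-3.5/GPT-4 API, and the system instruction is illustrative, not the exact one used in the paper.

```python
SYSTEM = "You are a helpful assistant. Explain your reasoning step by step before answering."

def build_synthetic_cot_dataset(instructions, teacher_llm):
    """Collect (instruction, step-by-step response) pairs from a strong teacher model."""
    dataset = []
    for instruction in instructions:
        response = teacher_llm(system=SYSTEM, user=instruction)   # teacher's CoT-style answer
        dataset.append({"instruction": instruction, "response": response})
    return dataset  # used afterwards to fine-tune a smaller student model
```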
This is another really interesting paper on trying to create automatic assessments of how good chains of thought are.
What they do is relatively simple. They get human feedback on each step of a derivation: they just ask people "is this step of the derivation good?", and if the answer is yes, they give it a smiley face; if the answer is no, they give it a frowny face. And they use this to train a reward model that predicts whether each step of the derivation is good.
They have the model generate chains of thought, assess them with the reward model, and upweight answers that have good chains of thought.
So the good thing about this is, they don't need the correct answers to train the model this way, and because of that, they can also train the model on lots of other questions. The reason this works is that Chain-of-thought makes it easier to generate each of the steps in the derivation, and it's also easier to assess whether an individual step in a derivation is wrong than to assess whether the overall answer is correct, so this feedback signal is easier to provide than feedback on the answer itself.
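Here is a small sketch of how such a step-level reward model might be used at inference time. `step_reward_model` and `sample_chain` are hypothetical helpers, and multiplying per-step scores is only one plausible way to aggregate them, not necessarily what the paper does.

```python
def score_chain(question, steps, step_reward_model):
    """Aggregate per-step correctness scores into a single score for the whole chain."""
    score = 1.0
    for i, step in enumerate(steps):
        score *= step_reward_model(question, steps[:i], step)   # estimated P(this step is good)
    return score

def best_of_n(question, sample_chain, step_reward_model, n=16):
    """Sample n chains and keep the one whose steps the reward model likes the most."""
    chains = [sample_chain(question) for _ in range(n)]          # each chain: (steps, answer)
    return max(chains, key=lambda c: score_chain(question, c[0], step_reward_model))
```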
The final thing I'd like to talk about is abductive reasoning, or learning explanations from data.
The idea is: can we find a rule that underlies a pattern in data? You want to discover underlying rules based on the data.
Why would you want to do this? There are a couple of reasons. The first is that you might want something you can explain to humans: an underlying pattern that exists in this data and explains why the data appears the way it does. Then humans can go in and analyze it.
So recently there's been a big focus on using large language models for scientific inquiry by coming up with good explanations for why data is the way it is. So if we were able to do that, that would be really interesting.
Another thing is, language models are not particularly good at being consistent on difficult tasks across very large numbers of examples. So if you could look at all of the data at once, infer general rules from it, put those rules in a prompt, and then apply that prompt to make predictions on new examples, you might be able to raise your overall accuracy as well.
That's how humans learn as well. We don't just memorize each example; if we only look at a few examples, we might not generalize well to new ones, so we try to abstract away general rules.
This is also similar to program induction from input/output examples, which I talked about during the code generation class. You have input/output examples, and from them you would like to come up with general rules. But this is a little more general: it doesn't necessarily need to be a program that you're inducing; it could be a grammar, an explanation, or anything else.
So there's a bit of work on rule induction with LLMs. It's pretty recent work. But I think it's pretty interesting.
The first one is hypothesis generation. What it does is, it takes all of these input/output examples, and from them it predicts rules. And then you evaluate them. You can evaluate either using another language model or using a symbolic evaluator: if it's a program, you could use a symbolic evaluator; if it's natural language, you could just ask a language model to pick an answer.
Then you get lots of outputs, and you can compare them against the expected outputs to verify whether the rule is correct, whether the rule gives you the appropriate answer.
And once you've done that, you can go back and do hypothesis refinement, maybe even giving the model feedback about what was wrong, and gradually refine more accurate and more complex hypotheses.
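Putting the loop together schematically (with `propose_rules`, `apply_rule`, and `refine_rules` as hypothetical LLM-backed helpers), the generate-verify-refine cycle might look like this:

```python
def induce_rule(examples, propose_rules, apply_rule, refine_rules, n_rounds=3):
    """examples: list of (input, expected_output) pairs."""
    rules = propose_rules(examples)                   # initial hypotheses from the examples
    best_rule = None
    for _ in range(n_rounds):
        scored = []
        for rule in rules:
            errors = []
            for x, y in examples:
                pred = apply_rule(rule, x)            # run the hypothesis on every example
                if pred != y:
                    errors.append((x, y, pred))       # record what it gets wrong
            scored.append((rule, errors))
        scored.sort(key=lambda item: len(item[1]))    # fewest errors first
        best_rule, errors = scored[0]
        if not errors:
            return best_rule                          # the rule explains all observations
        rules = refine_rules(best_rule, errors)       # feed the mistakes back for refinement
    return best_rule
```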
This is another variant of this idea, which uses a different methodology. I think both are completely valid, but this one has somewhat higher data requirements. What we do is, we use hypotheses in Chain-of-thought reasoning and keep the ones that result in correct answers.