s1: Simple test-time scaling

LLMs/Reasoning 2026. 1. 4. 14:40

GitHub - simplescaling/s1: s1: Simple test-time scaling (github.com)

Scaling laws are well known. Building on them, large-scale pretraining has improved LLM performance.

This paper presents a new "test-time scaling" approach: reasoning performance is improved by increasing compute at test time.

After SFT on a very small, high-quality reasoning dataset, "budget forcing" is applied when the model generates its output at test time. That is, test-time compute is controlled so that the model is forced to keep reasoning for longer.

As a result, the model checks its own answer and corrects its errors, which leads to improved reasoning performance.
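A minimal sketch of the budget forcing idea described above, under stated assumptions: `next_token` is a hypothetical stand-in for one decoding step of an LLM, `END_OF_THINKING` is a made-up delimiter name, and the nudge string "Wait" and the `min_tokens` / `max_tokens` parameters are illustrative choices. It is not the authors' implementation; see the linked repository for that.

```python
from typing import Callable, List

END_OF_THINKING = "<end_think>"  # hypothetical end-of-thinking delimiter
WAIT = "Wait"                    # nudge appended when the model stops too early


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Generate a thinking trace while controlling how long the model reasons."""
    trace: List[str] = []
    while len(trace) < max_tokens:
        token = next_token(prompt + trace)
        if token == END_OF_THINKING:
            if len(trace) >= min_tokens:
                break               # budget satisfied: let the model stop
            trace.append(WAIT)      # too early: suppress the stop and nudge it on
            continue
        trace.append(token)
    # Close the thinking phase (forced if the maximum budget was reached).
    return trace + [END_OF_THINKING]


def toy_model(context: List[str]) -> str:
    """Toy stand-in for an LLM: tries to stop two steps after the last nudge."""
    since_nudge = 0
    for tok in reversed(context):
        if tok in (WAIT, "<question>"):
            break
        since_nudge += 1
    return "step" if since_nudge < 2 else END_OF_THINKING


if __name__ == "__main__":
    print(budget_forced_thinking(toy_model, ["<question>"], min_tokens=5, max_tokens=8))
```

Raising `min_tokens` makes the toy model "think" longer by inserting more nudges, while `max_tokens` caps the trace; with a real model the same loop would run on token ids instead of strings.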

OpenAI o1 is said to have applied test-time scaling. o1's large-scale RL approach was replicated by DeepSeek R1, but the strong reasoning capability that comes from test-time scaling was not.

This paper demonstrates that test-time scaling.

※ Because I focused on the method, I did not cover dataset curation in detail, but as with DeepSeekMath and R1, high-quality dataset curation is the most basic of basics. This paper also builds its dataset with great care, so that performance improves even with an extremely small dataset (it achieves extreme sample efficiency). Specifically, the samples were chosen to satisfy three criteria: difficulty, diversity, and quality.
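To make the three criteria concrete, here is a rough sketch of how such a selection could be wired together. Everything in it is an assumption for illustration: the field names `well_formatted`, `solved_by_baseline`, and `domain`, and the round-robin sampling, are hypothetical and are not taken from the paper's actual pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, List


def select_samples(pool: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Pick up to k samples using quality, difficulty, and diversity filters."""
    rng = random.Random(seed)

    # Quality: drop samples flagged as badly formatted (hypothetical flag).
    # Difficulty: drop samples a baseline model already solves (hypothetical flag).
    kept = [s for s in pool if s["well_formatted"] and not s["solved_by_baseline"]]

    # Diversity: group by domain, then draw round-robin across domains.
    by_domain = defaultdict(list)
    for sample in kept:
        by_domain[sample["domain"]].append(sample)
    for items in by_domain.values():
        rng.shuffle(items)

    selected: List[Dict] = []
    domains = sorted(by_domain)
    while len(selected) < k and any(by_domain[d] for d in domains):
        for d in domains:
            if by_domain[d] and len(selected) < k:
                selected.append(by_domain[d].pop())
    return selected
```

The round-robin step is only one simple way to spread the selection across domains; any sampling scheme that avoids over-representing a single domain would serve the same purpose.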

※ As an aside: in the DeepSeek series and in this s1 model alike, the reasoning dataset is built from the responses (reasoning processes) of a model that already shows high reasoning performance, which looks like a form of behavior cloning.

The distillation shown in R1 likewise does not use the standard approach (matching probability distributions) but has the student imitate the teacher's responses, which also looks like a form of behavior cloning.
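The contrast between the two styles of distillation can be written down as two loss functions. This is a toy sketch in plain Python over hand-written probability lists; the function names are my own, and a real setup would compute these on model logits over a full vocabulary.

```python
import math
from typing import List


def kl_distillation_loss(teacher_probs: List[float], student_probs: List[float]) -> float:
    """Standard distillation: KL(teacher || student) at one token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0  # terms with zero teacher probability contribute nothing
    )


def imitation_loss(teacher_token_ids: List[int], student_probs_per_step: List[List[float]]) -> float:
    """Response imitation (SFT): mean negative log-likelihood the student
    assigns to the tokens the teacher actually produced."""
    nll = [
        -math.log(probs[tok])
        for tok, probs in zip(teacher_token_ids, student_probs_per_step)
    ]
    return sum(nll) / len(nll)
```

The first loss needs the teacher's full distribution at every position, while the second only needs the teacher's sampled response, which is what makes it resemble behavior cloning: the student is trained on demonstrated actions (tokens) rather than on the teacher's full policy.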

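※ Another aside: I once wrote a blog post while reading about CoT. There is a little kid on YouTube whom I really like, and the kid's parents are very wise. They keep the conversation with the child going really well: when the child answers, they are good at asking questions that lead the child to think more before answering. The child's answers then evolve step by step, to a startling degree. I remember being amazed at how well the child spoke.

What is impressive here is that a mere "think more" signal at test time, one that keeps the model from cutting its answer short, was enough for the model to reflect on its own answer and boost its reasoning ability.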
