We introduce II-Thought-RL-v0, the first iteration of our effort to build a large-scale, multi-domain Reinforcement Learning (RL) dataset. By providing a high-quality, large-scale collection of question-answer pairs for RL, we aim to advance reasoning research. This foundational step paves the way for future iterations that incorporate more complex reasoning traces.

In recent months, several datasets have been introduced to advance reasoning research, including PrimeIntellect’s Synthetic1, which spans a broad range of domains and tasks; Hugging Face’s OpenR1, with 220k high-quality problems and reasoning traces; General Reasoning, with 323k diverse samples; and DeepScaler, which provides 40k high-quality math data points used to train state-of-the-art 1.5B models. While these efforts have been valuable and inspiring to our work, we found significant room for improvement in data quality, diversity, and integrity.

For instance, upon closer examination, we identified notable benchmark contamination in several datasets. Nearly 20 problems from Math-500 are duplicated in OpenR1, and at least 100 problems appear in both Math-500 and DeepScaler. Additionally, as General Reasoning is a crowd-sourced initiative, it contains a considerable amount of low-quality data that warrants further curation. Finally, there remains a gap in domain diversity and the availability of large-scale reinforcement learning datasets for reasoning tasks.
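
To make the contamination check concrete, here is a minimal sketch of the kind of overlap test involved, assuming the Hugging Face `datasets` library; the candidate dataset name and column names are placeholders for illustration, not the exact datasets or fields we used:

```python
import re
from datasets import load_dataset  # Hugging Face `datasets`

def normalize(text: str) -> str:
    """Lowercase, strip, and collapse whitespace so trivial formatting
    differences do not hide duplicates."""
    return re.sub(r"\s+", " ", text.strip().lower())

# Dataset identifiers and column names below are illustrative placeholders.
benchmark = load_dataset("HuggingFaceH4/MATH-500", split="test")
candidate = load_dataset("my-org/candidate-rl-dataset", split="train")

benchmark_problems = {normalize(row["problem"]) for row in benchmark}
contaminated = [
    row for row in candidate
    if normalize(row["question"]) in benchmark_problems
]
print(f"{len(contaminated)} candidate problems overlap with the benchmark")
```

A real pipeline would go beyond exact matching (e.g., fuzzy or n-gram overlap), but even this simple check surfaces the duplicates described above.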

To address these issues, our approach is grounded in four core principles:

This post will share insights into our dataset collection process, key findings, curation approach, preliminary analyses, and work-in-progress for future iterations.

1. Background

DeepSeek recently introduced R1, a reasoning model that quickly emerged as a transformative force in advancing cognitive reasoning within LLMs. R1 uses Group Relative Policy Optimization (GRPO) [5], paired with verifiable rewards and explicit reasoning prompts: <think> tags that nudge the model to analyze before giving the final answer. This approach does more than tweak outputs; it encourages the model to think step by step, refining its logic and uncovering insights. The result is complete reasoning chains, fewer superficial answers, and rare moments of spontaneous clarity that mirror human cognition.
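
For illustration, here is a minimal sketch of what such an explicit reasoning prompt can look like; the wording below is ours, and the exact template DeepSeek uses may differ:

```python
# A reasoning-prompt sketch in the spirit of R1-style training.
# It only illustrates the idea of asking the model to reason inside
# <think> tags before committing to a final answer.
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process and then provides the answer. The reasoning "
    "process is enclosed within <think> </think> tags and the final answer "
    "within <answer> </answer> tags."
)

def build_prompt(question: str) -> list[dict]:
    """Assemble a chat-style prompt that nudges step-by-step reasoning."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```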

These intricate reasoning traces can then be distilled into smaller models by fine-tuning on them. While DeepSeek has released several powerful distilled models based on the Qwen family [20], replicating their capabilities remains challenging for the open-source community.

High-quality reasoning stems from well-crafted question-answer pairs, with complex problems yielding deeper insight. Effective reinforcement learning therefore relies on such high-quality data. This principle guided us in building a robust, multi-domain RL dataset and lays the groundwork for refined reasoning traces in our next iteration of supervised fine-tuning.

2. Public RL Datasets

The success of DeepSeek R1 is primarily attributed to its collection of high-quality, verifiable question-answer pairs suitable for GRPO training. Here, “verifiable” refers to responses that can be deterministically confirmed or systematically validated, ensuring an objective and precise basis for reinforcement.

Tulu-3 [22] from AllenAI, one of the first models to achieve strong performance with Reinforcement Learning with Verifiable Rewards (RLVR), emphasizes that creating data for RLVR requires prompts paired with binary verifier functions: the dataset is constructed so that a verifier function accompanies each input x.
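
In code, this amounts to pairing every prompt with a function that returns a binary reward. Below is a minimal sketch of that interface, with names of our own choosing rather than anything taken from the Tulu-3 codebase:

```python
from dataclasses import dataclass
from typing import Callable

# A binary verifier maps (prompt, model_response) -> True/False.
Verifier = Callable[[str, str], bool]

@dataclass
class RLVRExample:
    """One RLVR training example: a prompt plus its verifier function."""
    prompt: str
    verifier: Verifier

def reward(example: RLVRExample, response: str) -> float:
    """Verifiable reward: 1.0 if the verifier accepts the response, else 0.0."""
    return 1.0 if example.verifier(example.prompt, response) else 0.0
```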

For example, math problems with deterministic results—where the final answer is formatted explicitly (e.g., enclosed in a box)—are inherently verifiable. Similarly, coding tasks can be validated by running predefined test cases, producing unambiguous outcomes that confirm the correctness of a response.
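
As a toy illustration of both cases, the sketch below checks a boxed math answer by string comparison and runs a candidate coding solution against predefined test cases; production verifiers normalize mathematical expressions far more carefully and sandbox code execution:

```python
import re

def verify_boxed_answer(response: str, gold: str) -> bool:
    """Math case: compare the last \\boxed{...} expression in the response
    against the reference answer after stripping whitespace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return bool(matches) and matches[-1].strip() == gold.strip()

def verify_code(solution_src: str, test_cases: list[tuple[tuple, object]]) -> bool:
    """Coding case: execute the candidate solution against predefined test
    cases. Assumes the response defines a `solve` function; a real harness
    would run this in a sandbox with time and memory limits."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)
        return all(namespace["solve"](*args) == expected
                   for args, expected in test_cases)
    except Exception:
        return False
```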