[Pre-training 2026] Reasoning Core: Beyond Web-Scraping to Foundational Symbolic Primitives
Abstract

The paper introduces Reasoning Core, a scalable procedural data generation suite designed to enhance language model reasoning through verifiable symbolic tasks (PDDL, FOL, CFGs, etc.). It provides a unified interface for pre-training and post-training data, achieving SOTA-level reasoning improvements in small models while maintaining language modeling quality.

TL;DR

Reasoning Core is a paradigm shift in how we think about "synthetic data." Instead of using LLMs to mimic human text, it uses procedural generators and rigorous external solvers to create an infinite supply of verifiable reasoning problems (Logic, Planning, Math). The result? Models that reason better from the first day of pre-training, without sacrificing their ability to speak human languages.

The Motivation: The "Wall" of Fixed Templates

In the quest to improve LLM reasoning, researchers often fall into two traps:

  1. Web-text Satiation: We are running out of high-quality human reasoning data.
  2. Narrow Proceduralism: Existing synthetic tasks like BlocksWorld or Dyck languages are too narrow. A model might master "moving blocks" but fail if the rules of the world change slightly.

The authors of Reasoning Core argue that for a model to truly internalize reasoning primitives, it needs to see the entire distribution of a formal domain. It’s the difference between memorizing a map (fixed puzzles) and learning how to use a compass (foundational logic).

Methodology: The "Reasoning Core" Architecture

The framework centers on four pillars: Generality, Scalability, Difficulty Control, and Rigorous Verification.

1. The Power of External Solvers

Unlike traditional datasets where the "label" is static, Reasoning Core connects to heavy-duty industrial solvers:

  • Vampire/E: For First-Order Logic.
  • FastDownward: For PDDL Planning.
  • Sympy: For algebraic equations.

This ensures that the model isn't just "matching tokens" but is being judged against absolute formal truth.
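
The verification loop is easy to picture with the Sympy case: instead of comparing against a stored label, the checker re-derives truth on the fly. Below is a minimal sketch of that idea, assuming a simple string-based task format; the function name and task encoding are illustrative, not the suite's actual API:

```python
# Sketch: verifying a generated algebra task with Sympy rather than a static
# label. The task/answer encoding here is invented for illustration.
import sympy as sp

def verify_algebra(equation_str: str, symbol: str, claimed_answer: str) -> bool:
    """Check a model's claimed root by substituting it back into the equation."""
    x = sp.Symbol(symbol)
    lhs, rhs = equation_str.split("=")
    eq = sp.Eq(sp.sympify(lhs), sp.sympify(rhs))
    candidate = sp.sympify(claimed_answer)
    # The claim is correct iff lhs - rhs simplifies to exactly zero after substitution.
    return sp.simplify(eq.lhs.subs(x, candidate) - eq.rhs.subs(x, candidate)) == 0

assert verify_algebra("2*x + 3 = 11", "x", "4")       # correct root accepted
assert not verify_algebra("2*x + 3 = 11", "x", "5")   # wrong root rejected
```

The same pattern generalizes to the heavier solvers: Vampire or FastDownward plays the role of `sp.simplify`, returning a formal accept/reject verdict rather than a token-level similarity score.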

2. Gramforge & Topological Control

A major technical contribution is gramforge, a framework for context-sensitive probabilistic grammars. To avoid the "spike" problem (where generated trees are deep but narrow), they introduced bushiness factors, forcing lateral structural complexity.
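
To make the bushiness idea concrete, here is a toy sketch of depth-versus-width control in a recursive generator. Gramforge's actual API is not shown in the post, so the function, parameter names, and branching rule below are all invented for illustration:

```python
# Toy sketch: a "bushiness" factor steering tree shape. bushiness = 0.0
# degenerates into a deep narrow chain (the "spike" problem); 1.0 forces
# lateral branching. Not gramforge's real interface.
import random

def gen_tree(depth_budget: int, bushiness: float, rng: random.Random):
    """Sample an expression-tree skeleton. Higher `bushiness` biases nodes
    toward many children (lateral width) instead of long narrow chains."""
    if depth_budget == 0 or rng.random() < 0.1:
        return ["leaf"]
    # bushiness = 0.0 always yields exactly 1 child; 1.0 yields 2-4 children.
    n_children = 1 + int(bushiness * rng.randint(1, 3))
    return [gen_tree(depth_budget - 1, bushiness, rng) for _ in range(n_children)]

spiky = gen_tree(6, bushiness=0.0, rng=random.Random(0))  # a chain
bushy = gen_tree(6, bushiness=1.0, rng=random.Random(1))  # laterally branching
```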

Figure 1 (System Overview): The Unified Task API allows seamless generation of prompts, answers, and even intermediate Chain-of-Thought (CoT) traces.

Experiments: Does it actually work?

The researchers didn't just test this on "toy" models; they conducted zero-shot evaluations against the GPT-5 family.

The Difficulty Knob

By adjusting a single "difficulty knob" (0.0 to 1.0), the system can scale from trivial arithmetic to logic problems that even frontier models struggle to solve.
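
One plausible way to picture such a knob is as a scalar that linearly interpolates the generator's structural parameters. The parameter names and ranges below are illustrative assumptions; the post does not spell out RC's internal mapping:

```python
# Sketch: a single difficulty scalar in [0, 1] fanned out to generator
# parameters. All names and ranges here are hypothetical.
def knob_to_params(difficulty: float) -> dict:
    """Map one scalar knob to several structural generator parameters."""
    assert 0.0 <= difficulty <= 1.0
    return {
        "max_depth": 2 + round(difficulty * 8),     # 2 .. 10 nesting levels
        "n_variables": 1 + round(difficulty * 7),   # 1 .. 8 free symbols
        "distractor_rate": 0.1 + difficulty * 0.6,  # irrelevant premises
    }

assert knob_to_params(0.0)["max_depth"] == 2   # trivial instances
assert knob_to_params(1.0)["max_depth"] == 10  # frontier-breaking instances
```

The appeal of a single scalar is that curriculum schedules and benchmark sweeps only need to vary one number.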

Figure 2 (GPT-5 Performance): Performance of GPT-5 on RC tasks. Notice how the hardest difficulty settings significantly degrade performance, proving the suite's utility as a benchmark.

The Pre-training Mix

The most significant finding is that mixing Reasoning Core data into pre-training (not just fine-tuning) improves reasoning reliability (measured by answer NLL on PlatinumBench) while slightly improving standard language modeling loss. This suggests that symbolic structure acts as a helpful inductive bias for natural language.
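
The arithmetic behind the r = 0.5 mix highlighted in Figure 3 is worth making explicit: if r is the ratio of symbolic to natural tokens, the symbolic fraction of the total mix is r / (1 + r), so r = 0.5 means one symbolic token for every two natural ones, i.e. one third of the total:

```python
# Worked check of the mix arithmetic: r = symbolic/natural token ratio,
# so the symbolic share of the total mix is r / (1 + r).
def symbolic_fraction(r: float) -> float:
    """Fraction of the total pre-training mix that is symbolic."""
    return r / (1.0 + r)

assert abs(symbolic_fraction(0.5) - 1 / 3) < 1e-12  # r = 0.5 -> 1/3 of the mix
```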

Figure 3 (Experimental Results): Evidence that adding symbolic tokens at a ratio of r = 0.5 (one third of the total mix) optimizes the reasoning-language tradeoff.

Critical Analysis & Future Outlook

Strengths:

  • Legal Safety: Fully procedural data bypasses copyright and licensing nightmares.
  • Contamination Proof: Since every instance is novel, models cannot "memorize" the benchmark.
  • Verifiable Rewards: Perfect for RLVR (Reinforcement Learning with Verifiable Rewards) setups like DeepSeek-R1.
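
The RLVR fit is straightforward to sketch: because every task ships with a formal checker, the reward signal can be a bare pass/fail from the solver, with no learned reward model in the loop. The function below is a stand-in illustration, not the suite's actual interface:

```python
# Sketch: solver-backed binary reward for an RLVR setup. `check_with_solver`
# is a hypothetical stand-in for the suite's verifiers (Vampire, FastDownward,
# Sympy, ...).
def verifiable_reward(model_answer: str, task, check_with_solver) -> float:
    """Binary reward: 1.0 iff the external checker formally accepts the answer."""
    return 1.0 if check_with_solver(task, model_answer) else 0.0

# Usage with a trivial stand-in checker:
reward = verifiable_reward("4", {"eq": "2*x+3=11"}, lambda task, ans: ans == "4")
assert reward == 1.0
```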

Limitations:

  • The study was performed on smaller models (<100M). While promising, the "Scaling Law" for this specific data mix in 70B+ parameter models remains to be proven.
  • The "transfer" to non-symbolic tasks (like legal or medical reasoning) is hypothesized but not yet empirically validated.

Conclusion

Reasoning Core provides the "weights" for the "gym" of LLM training. By moving away from static datasets toward dynamic, solver-verified environments, we are paving the way for Neurosymbolic AI—where the fluidity of language models meets the cold, hard logic of symbolic systems.
