
LongPage: 6K Books with Hierarchical Story Plans

What It Is

LongPage is a dataset containing over 6,000 complete books, each accompanied by hierarchical planning traces that decompose narratives into structured layers. Rather than presenting raw text alone, the dataset maps how stories flow from high-level outlines down through chapter summaries to individual scene descriptions.

The hierarchical traces function as a roadmap showing the structural decisions behind each book. A typical trace might start with a story premise, expand into chapter-level plot points, then drill down to scene-by-scene beats. This multi-level annotation reveals the architectural choices authors make when building long-form narratives.
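In code, one of these traces can be pictured as a nested structure. The sketch below is purely illustrative; the field names and nesting are assumptions, not the dataset's actual schema:

```python
# Hypothetical shape of a hierarchical planning trace.
# Field names here are illustrative, not LongPage's real schema.
trace = {
    "premise": "A cartographer discovers her maps alter the places they depict.",
    "chapters": [
        {
            "summary": "Mara notices a river shift after she redraws it.",
            "scenes": [
                "Mara surveys the valley at dawn.",
                "Back in her studio, the redrawn river appears outside.",
            ],
        },
    ],
}

# Each level refines the one above it: premise -> chapters -> scenes.
for i, chapter in enumerate(trace["chapters"], start=1):
    print(f"Chapter {i}: {chapter['summary']}")
    for beat in chapter["scenes"]:
        print(f"  - {beat}")
```

The key property is the containment: every scene belongs to a chapter, and every chapter elaborates the premise, so consistency checks can run top-down.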

Pageshift Entertainment maintains the dataset at https://huggingface.co/datasets/Pageshift-Entertainment/LongPage, where researchers can access the full collection. The team is currently using this data to train a model capable of generating complete novels, with plans to release it publicly once output quality reaches acceptable thresholds.

Why It Matters

Most language models learn from text sequences without understanding narrative architecture. They predict tokens based on local context, which works for short passages but struggles with the coherence demands of 80,000-word novels. A model might write compelling paragraphs while losing track of character arcs or plot threads introduced fifty pages earlier.

The hierarchical structure in LongPage addresses this limitation directly. Models trained on this data learn to think in layers: maintaining story-level consistency while generating chapter-specific content and scene-level details. This mirrors how human authors actually work, moving between outline, draft, and revision rather than writing linearly from first word to last.

Fiction writers and creative AI developers stand to benefit most immediately. Current tools can assist with brainstorming or polish individual chapters, but generating structurally sound full-length manuscripts remains beyond reach. A model trained on hierarchical planning traces could potentially draft complete novels that maintain thematic coherence and narrative momentum across hundreds of pages.

The broader AI research community gains a valuable benchmark for long-context reasoning. While technical documentation and code repositories test certain aspects of extended coherence, fiction presents unique challenges around character consistency, plot development, and thematic unity that push models in different directions.

Getting Started

Developers can access the dataset through the Hugging Face hub:


from datasets import load_dataset

dataset = load_dataset("Pageshift-Entertainment/LongPage")

# Examine a sample book with its hierarchical trace
sample = dataset['train'][0]
print(sample['outline'])
print(sample['chapter_summaries'])
print(sample['scene_descriptions'])

The dataset structure includes multiple levels of granularity for each book. Researchers experimenting with long-form generation can use these traces as training targets, teaching models to first generate outlines, then expand them into chapters, and finally flesh out individual scenes.
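One way to picture this staged training is to flatten a book's trace into (prompt, target) pairs, one pair per expansion step. This is a minimal sketch that reuses the field names from the access pattern shown above ('outline', 'chapter_summaries', 'scene_descriptions') but assumes their exact types; it is not an official training recipe:

```python
# Sketch: turn one book's hierarchical trace into staged training pairs.
# Assumes 'chapter_summaries' is a list of strings and
# 'scene_descriptions' is a parallel list of scene lists.

def staged_pairs(sample):
    pairs = []
    # Stage 1: task prompt -> outline.
    pairs.append(("Write a story outline.", sample["outline"]))
    # Stage 2: outline -> each chapter summary.
    for summary in sample["chapter_summaries"]:
        pairs.append((sample["outline"], summary))
    # Stage 3: chapter summary -> each scene in that chapter.
    for summary, scenes in zip(sample["chapter_summaries"],
                               sample["scene_descriptions"]):
        for scene in scenes:
            pairs.append((summary, scene))
    return pairs

# Toy example in the same shape:
book = {
    "outline": "A lighthouse keeper guards a door between seasons.",
    "chapter_summaries": ["Winter leaks into summer."],
    "scene_descriptions": [["Snow falls on the August harbor."]],
}
print(len(staged_pairs(book)))  # 3 pairs: outline, chapter, scene
```

Training on pairs like these teaches each expansion step separately, so at inference time a model can generate an outline first and condition each later stage on the one before it.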

Teams working on creative writing assistants might extract the planning methodology without necessarily training on the full books. The hierarchical traces demonstrate one approach to decomposing complex creative tasks into manageable steps that models can learn to replicate.

Context

Traditional novel-writing datasets typically provide finished text without exposing the planning process. Project Gutenberg offers thousands of public domain books, while BookCorpus contains modern fiction, but neither includes structural annotations. LongPage fills this gap by making the scaffolding visible.

The dataset does have limitations worth noting. Six thousand books represents substantial volume, but covers only a fraction of narrative styles and genres. The hierarchical traces reflect one planning methodology, while human authors employ diverse approaches: some outline meticulously, others discover structure through drafting.

Models trained on this data will likely excel at structured, plot-driven fiction but may struggle with experimental forms or stream-of-consciousness narratives that deliberately avoid conventional architecture. The planning traces also represent post-hoc analysis rather than the messy, iterative reality of actual book creation.

Alternative approaches to long-form generation include retrieval-augmented methods that maintain consistency by referencing earlier passages, or multi-agent systems where specialized models handle different narrative aspects. LongPage’s hierarchical training offers a complementary path focused on teaching models explicit planning capabilities.