Concept-Level Overfitting in Latent Space during Synthetic Pretraining
Preprint, Under Review ✦ March 2026
Evidence for Conceptual Memorization Without Token Repetition
Abstract
It is well established that training language models for multiple epochs on a fixed dataset leads to overfitting. We investigate whether analogous overfitting dynamics can emerge even when no tokens are literally repeated, through training on synthetic rephrasings of a small seed corpus. We characterize concept-level overfitting - a form of generalization failure in which language models overfit to the conceptual distribution of a narrow training corpus even when individual token sequences are never repeated. In controlled experiments training 125M-parameter transformer language models on subsets of C4 and their synthetic rephrasings generated by a single instruction-tuned LLM (Llama-3-8B-Instruct), we provide existence evidence that: (1) validation loss on held-out C4 data diverges by +0.34 nats (relative to best achieved) when training on 20× synthetic rephrasings of 1,000 seed documents, despite every training sequence being token-level unique; (2) this effect scales with the conceptual diversity of the seed corpus, with divergence falling to +0.03 nats at 10,000 seed documents; and (3) embedding-based deduplication delays overfitting onset by approximately 57% (from ~17,500 to ~27,500 training steps), while token-level deduplication is ineffective by construction. These results establish the phenomenon at 125M-parameter scale with a single rephrasing model; broader validation is essential before drawing production conclusions.
Keywords: synthetic data, pretraining, overfitting, data diversity, language models, concept-level diversity, representation geometry
1 Introduction
The scaling laws governing language model training have established clear relationships between model size, dataset size, and compute budget (Kaplan et al., 2020; Hoffmann et al., 2022). A central assumption in these analyses is that training data consists of independent, identically distributed samples from a broad distribution over natural text. When this assumption is violated - for instance, by training for multiple epochs on a fixed corpus - classical overfitting emerges: the model memorizes training sequences, training loss vanishes, and validation loss diverges.
The recent proliferation of synthetic data for language model pretraining (Gunasekar et al., 2023; Li et al., 2023; Eldan & Li, 2023) introduces a subtler challenge. Synthetic data pipelines often operate by taking a relatively small seed corpus and generating numerous rephrasings, elaborations, or structured transformations of this data. While each synthetic sample may be unique at the token level, the underlying conceptual content - the topics, facts, reasoning patterns, and semantic structures - may be drawn from a far smaller effective distribution than the token-level diversity would suggest.
In this paper, we investigate a phenomenon we term concept-level overfitting: the capacity for a language model to overfit on the conceptual distribution of its training data, even when no token sequences are repeated. Our central hypothesis is that there exists a meaningful notion of "data diversity" that operates in a conceptual space of topics and semantic structures, and that insufficient diversity in this space can produce overfitting dynamics qualitatively similar to those observed with literal data repetition. Critically, this phenomenon passes undetected by token-level deduplication - the dominant quality gate in current pretraining pipelines - because by construction, synthetic rephrasings exhibit no token-level overlap while still collapsing the conceptual distribution a model can learn from.
We formalize this hypothesis and test it through controlled experiments at the 125M-parameter scale. Specifically, we train GPT-2-style transformer language models on: (a) subsets of the C4 corpus of varying sizes, (b) multi-epoch repetitions of these subsets, and (c) synthetic rephrasings of these subsets generated by Llama-3-8B-Instruct. Our key experimental prediction is that training on synthetic rephrasings of a small seed corpus will produce overfitting-type divergence in validation loss on held-out C4 data, even though every training token sequence is unique. We frame this as an existence result: we aim to show the phenomenon can occur on real internet data, not to fully characterize the boundary conditions under which it emerges.
Our contributions are as follows:
We characterize and empirically study concept-level overfitting, distinguishing it from classical token-level overfitting. Seed corpus size is the primary predictor of overfitting severity; we additionally measure concept-level diversity as a supporting diagnostic.
We present controlled experiments demonstrating that validation loss divergence occurs when training on synthetic rephrasings of conceptually narrow seed corpora, providing existence evidence for concept-level overfitting on real internet data (C4).
We characterize the phenomenon through ablations over seed corpus size, rephrasing multiplicity, and deduplication strategies, showing that embedding-based deduplication is more effective than token-level deduplication at mitigating concept-level overfitting.
We analyze the representation geometry of models exhibiting concept-level overfitting, finding that their hidden states occupy a lower-dimensional manifold compared to models trained on diverse data.
2 Related Work
Overfitting and memorization in language models.
Carlini et al. (2021, 2023) demonstrated that large language models memorize and can regurgitate training sequences, with memorization rates increasing with model size and data repetition. Hernandez et al. (2022) showed that even partial repetition of training data leads to disproportionate memorization. Lee et al. (2022) studied deduplication of pretraining corpora and found that removing duplicate documents substantially reduced memorization without harming downstream performance. Our work extends this line of inquiry by showing that memorization-like effects can emerge at the conceptual level even without surface-level duplication.
Synthetic data for pretraining.
The use of synthetic data for language model training has grown rapidly. Gunasekar et al. (2023) trained the Phi series on textbook-quality synthetic data, demonstrating strong performance with less data. Eldan & Li (2023) showed that coherent synthetic stories could teach language models surprisingly well. Li et al. (2023) proposed self-improvement through synthetic data generation. More recently, Maini et al. (2024) explored rephrasing web data to improve pretraining, finding diminishing returns at scale. Our work identifies a specific failure mode of synthetic data: when synthetic generation acts as a bandwidth-limited channel on underlying concepts, it can induce overfitting dynamics even with surface-level diversity.
Model collapse and data contamination.
Shumailov et al. (2023) demonstrated that iterative self-distillation - training successive model generations on the output of the previous generation - causes progressive capability collapse. This is mechanistically distinct from what we study. Their collapse accumulates across generational iterations; ours occurs within a single training run on a fixed corpus. Most importantly, their finding does not predict the specific sensitivity to seed corpus size our ablations reveal: a model trained once on a large diverse rephrased corpus (SYNTH-10K-10x, ΔVal = +0.03 nats) is qualitatively healthy, while one trained once on a narrow rephrased corpus (SYNTH-1K-20x, ΔVal = +0.34 nats) is not. Seed diversity is the operative variable in our framework; Shumailov et al. have no corresponding variable that would generate this 10× variation. Muennighoff et al. (2023) studied multi-epoch repetition and found substantial degradation beyond 4 epochs; our work shows an equivalent degradation can emerge under the appearance of token-level novelty.
Data diversity and scaling.
Tirumala et al. (2023) studied the relationship between data diversity and memorization, introducing metrics for effective dataset size. Abbas et al. (2023) showed that semantic deduplication of pretraining data improves efficiency without harming quality. Sorscher et al. (2022) established that data pruning based on quality metrics can shift neural scaling laws. The D4 framework (Tirumala et al., 2023) and SemDeDup (Abbas et al., 2023) provide conceptual predecessors to our notion of conceptual diversity, though neither explicitly connects semantic similarity to overfitting dynamics. Our theoretical framework draws on the information-theoretic perspective of Hutter (2021) on the fundamental limits of sequence prediction.
3 Problem Formulation
3.1 Preliminaries
Let $\mathcal{V}$ denote a vocabulary and $\mathcal{V}^*$ the set of all finite sequences over $\mathcal{V}$. A language model $p_\theta$ parameterized by $\theta$ assigns probabilities to sequences via the autoregressive factorization $p_\theta(x) = \prod_{t=1}^{|x|} p_\theta(x_t \mid x_{<t})$. The training objective is to minimize the cross-entropy loss over a training distribution $\mathcal{D}_{\text{train}}$:

$$\mathcal{L}_{\text{train}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{train}}}\left[-\log p_\theta(x)\right].$$
We evaluate on a held-out validation distribution $\mathcal{D}_{\text{val}}$ and define overfitting as the regime where $\mathcal{L}_{\text{train}}$ continues to decrease while $\mathcal{L}_{\text{val}}$ increases.
3.2 Token-Level vs. Concept-Level Diversity
We distinguish two notions of dataset diversity. Token-level diversity measures the fraction of unique n-grams or exact-match documents in a corpus. Concept-level diversity measures the effective dimensionality of the semantic content, which we operationalize through the entropy of document embeddings in a pretrained representation space.
Formally, let $\phi : \mathcal{V}^* \to \mathbb{R}^d$ be an embedding function (e.g., from a pretrained sentence encoder). For a dataset $D = \{x_1, \ldots, x_N\}$, we define the empirical concept distribution as the distribution of $\phi(x_i)$ in $\mathbb{R}^d$. The concept-level diversity $\kappa(D)$ is then the effective dimensionality of this distribution, measured via the participation ratio of the eigenvalues of the covariance matrix:

$$\kappa(D) = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2},$$
where $\lambda_i$ are the eigenvalues of $\mathrm{Cov}[\phi(x)]$ for $x \in D$. A dataset with high token diversity but low concept diversity would have many unique token sequences but a low participation ratio - precisely the scenario created by synthetic rephrasing of a small seed corpus. We note that $\kappa$ is a proxy that depends on the choice of embedding model $\phi$; we use Sentence-T5-XL (Ni et al., 2022) throughout and discuss the implications of this choice in Section 6.4.
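As a concrete illustration, the participation ratio can be computed in a few lines once documents have been embedded as rows of a matrix (a minimal sketch; the function name is ours, not from a released codebase):

```python
import numpy as np

def participation_ratio(embeddings: np.ndarray) -> float:
    """Effective dimensionality of a point cloud of embeddings (rows):
    (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negative values
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())
```

An isotropic Gaussian cloud in $d$ dimensions yields $\kappa \approx d$, while data concentrated along a single direction yields $\kappa \approx 1$, matching the interpretation of $\kappa$ as an effective dimension.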
3.3 Synthetic Rephrasing as a Bandwidth-Limited Channel
We model synthetic rephrasing as an information-theoretic channel. Let $S = \{s_1, \ldots, s_n\}$ be a seed corpus of $n$ documents. A rephrasing function $R$ (implemented by an instruction-tuned LLM) maps each seed document to a synthetic version that preserves semantic content while altering surface form. The synthetic training set is $T = \{R^{(j)}(s_i) : i \in [n],\ j \in [m]\}$, where $R^{(1)}, \ldots, R^{(m)}$ denote independent applications of $R$ with different random seeds, and $m$ is the rephrasing multiplicity.
The key intuition - which we state informally and support empirically rather than prove formally - is that rephrasing acts as a stochastic channel that preserves semantic content while randomizing surface form. By the data processing inequality (Cover & Thomas, 2006), any downstream quantity that depends on the semantic content of $S$ cannot have more mutual information with $T$ than it has with $S$ itself. In particular, if we posit a latent concept variable $C$ that generates the semantic content of documents, then $I(C; T) \le I(C; S)$, since $T$ is generated from $S$ through the Markov chain $C \to S \to T$.
We use this framing as qualitative motivation, not as a claim about measurable Shannon information over conceptual space.
We emphasize that this argument depends on the rephrasing operation being semantically faithful - an assumption we do not formally verify but which is consistent with our experimental design (see Section 4.2). The practical implication is that as $m \to \infty$ with $S$ fixed, the token-level diversity of $T$ grows without bound, but the concept-level diversity of $T$ remains bounded by that of $S$. This creates a mechanism for concept-level overfitting: the model encounters enough data to overfit on the conceptual distribution of $S$ while training loss reflects the token-level difficulty of predicting diverse surface forms. We do not claim this framework makes quantitative predictions about overfitting onset; rather, it provides qualitative motivation for the phenomenon we observe empirically.
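The bounded-diversity intuition can be illustrated with a toy simulation (entirely ours, not one of the paper's experiments): model each seed document as a point in embedding space and each rephrasing as that point plus small isotropic noise. However many rephrasings are drawn, the participation ratio of the synthetic corpus stays pinned near that of the seed set, while an equally large corpus of genuinely new documents keeps expanding:

```python
import numpy as np

def participation_ratio(x: np.ndarray) -> float:
    # re-defined here so the snippet is self-contained
    c = x - x.mean(axis=0, keepdims=True)
    lam = np.clip(np.linalg.eigvalsh(c.T @ c / (len(x) - 1)), 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
dim, n_seeds = 64, 4
seeds = rng.normal(size=(n_seeds, dim))        # the "concepts" of a tiny seed corpus

for m in (1, 10, 100):                         # rephrasing multiplicity
    # each rephrasing = seed concept + small surface-form perturbation
    synth = np.repeat(seeds, m, axis=0) + 0.05 * rng.normal(size=(n_seeds * m, dim))
    real = rng.normal(size=(n_seeds * m, dim)) # same size, genuinely new documents
    print(f"m={m:3d}  synthetic kappa={participation_ratio(synth):5.1f}  "
          f"real kappa={participation_ratio(real):5.1f}")
```

In this toy model the synthetic $\kappa$ saturates near the seed count regardless of $m$, while the real-data $\kappa$ grows toward the embedding dimension as the corpus grows.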
4 Experimental Setup
Note on scope: The experiments below are conducted at a single model scale (125M parameters) with a single rephrasing model (Llama-3-8B-Instruct). We present these as existence evidence for the concept-level overfitting phenomenon. All reported results are from single training runs; multi-seed variance analysis remains an important direction for future work.
4.1 Model and Training
We train GPT-2-style (Radford et al., 2019) transformer language models with 125M parameters (12 layers, 768 hidden dimensions, 12 heads). All models are trained from random initialization using the AdamW optimizer (Loshchilov & Hutter, 2019) with a cosine learning rate schedule (2000 warmup steps), weight decay 0.1, and a context length of 1024 tokens. We use the GPT-2 BPE tokenizer (50,257-token vocabulary). Training runs use 50,000 gradient steps with a batch size of 64 sequences, corresponding to approximately 3.3B tokens processed.
4.2 Data Construction
All data is derived from the C4 corpus (Raffel et al., 2020). We construct the following training sets:
Table 1: Training data configurations. Token counts are approximate.
| Configuration | Seed Docs | Rephrase × | Tokens | Token-unique? |
|---|---|---|---|---|
| FULL-100K | 100,000 | - | ~3.3B | Yes |
| REPEAT-1K | 1,000 | - | ~3.3B | No (multi-epoch) |
| SYNTH-1K-20x | 1,000 | 20 | ~3.3B | Yes |
| SYNTH-1K-50x | 1,000 | 50 | ~3.3B | Yes |
| SYNTH-1K-100x | 1,000 | 100 | ~3.3B | Yes |
| SYNTH-5K-20x | 5,000 | 20 | ~3.3B | Yes |
| SYNTH-10K-10x | 10,000 | 10 | ~3.3B | Yes |
For synthetic rephrasings, we prompt Llama-3-8B-Instruct (AI@Meta, 2024) with the instruction: "Rewrite the following text to convey the same information using different words, sentence structures, and stylistic choices. Preserve all factual content." Each seed document is rephrased independently $m$ times (the configuration's rephrasing multiplicity; Table 1) with temperature 0.9 and top-p 0.95 to maximize surface diversity. We verify that no token-level 13-gram overlaps exist between any pair of synthetic documents (within each seed's rephrasings and across seeds). We note that all rephrasings come from a single model and prompt; the degree to which our findings depend on this particular rephrasing strategy versus the underlying conceptual narrowness of the seed set is an important question we revisit in the discussion (Section 6.5).
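The 13-gram uniqueness check can be sketched as follows (an illustrative implementation; the paper does not specify the exact tooling used):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def share_ngram(doc_a, doc_b, n=13):
    """True if the two token sequences have any n-gram in common."""
    return not ngrams(doc_a, n).isdisjoint(ngrams(doc_b, n))
```

Checking all document pairs this way is quadratic; at corpus scale one would instead hash every n-gram into a single shared set (or a Bloom filter) in one pass and flag collisions.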
4.3 Evaluation
We evaluate all models on a fixed held-out validation set of 10,000 C4 documents (approximately 33M tokens) that shares no documents with any seed set. We report per-token cross-entropy loss in nats. We additionally compute: (i) the participation ratio of document embeddings from a frozen Sentence-T5-XL encoder (Ni et al., 2022), and (ii) per-domain validation loss broken down by C4 domain category to assess whether overfitting is uniform or domain-specific.
5 Results
5.1 Main Result: Concept-Level Overfitting Exists
Figure 1 presents our central finding. Panel (a) shows the baseline: when training on 100K unique C4 documents (FULL-100K), both training and validation loss decrease monotonically, with the validation loss converging to approximately 2.08 nats. Panel (b) shows classical overfitting: when training on 1K documents for multiple epochs (REPEAT-1K), validation loss begins to diverge after approximately 12,500 steps (roughly 8 effective epochs over the seed set).
Panel (c) shows our key result: when training on synthetic rephrasings of 1K seed documents (SYNTH-1K-20x), validation loss follows a similar overfitting trajectory despite every training sequence being token-level unique. The onset of overfitting is delayed relative to literal repetition (approximately 17,500 steps vs. 12,500), and the magnitude of divergence is smaller, but the qualitative phenomenon is unmistakable. This confirms our central hypothesis: concept-level overfitting can occur without any token repetition.
Figure 1: Training and validation loss curves across three data regimes. (a) Full C4 subset shows healthy convergence. (b) Literal repetition of 1K documents shows classical overfitting. (c) Synthetic rephrasings of 1K seed documents exhibit overfitting-type validation loss divergence despite all training tokens being unique. Results from single runs; see Section 4 for scope notes.
Table 2: Final validation loss and overfitting metrics at 50K training steps.
| Configuration | Val Loss | Best Val Loss | Δ Val | Concept Diversity κ |
|---|---|---|---|---|
| FULL-100K | 2.08 | 2.08 | 0.00 | 142.3 |
| REPEAT-1K | 3.41 | 2.52 | +0.89 | 8.7 |
| SYNTH-1K-20x | 2.78 | 2.44 | +0.34 | 12.4 |
| SYNTH-5K-20x | 2.38 | 2.30 | +0.08 | 34.1 |
| SYNTH-10K-10x | 2.25 | 2.22 | +0.03 | 56.8 |
Table 2 quantifies the effect. The "Δ Val" column reports the gap between final validation loss and best validation loss achieved during training, serving as a measure of overfitting severity. The SYNTH-1K-20x configuration exhibits a Δ Val of +0.34 nats, which is substantial in magnitude, though downstream task implications require separate validation. While smaller than the +0.89 gap of literal repetition, this gap suggests that concept-level overfitting is not merely a marginal effect but a practically meaningful degradation at this scale. Confirming this finding with multi-seed runs and error bars is an important next step.
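As stated above, Δ Val is simply the final-minus-best gap over the validation loss trace; a trivial helper (the name is ours) makes the definition unambiguous:

```python
def delta_val(val_losses):
    """Overfitting severity: final validation loss minus the best
    (minimum) validation loss seen during training, in nats."""
    return val_losses[-1] - min(val_losses)
```

For a trace that bottoms out at 2.44 nats and ends at 2.78 nats, `delta_val` returns the +0.34 gap reported for SYNTH-1K-20x.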
5.2 Scaling with Conceptual Diversity
Figure 2 shows validation loss at 50K steps as a function of the number of seed documents, comparing synthetic rephrasings to equivalent amounts of real data. Two findings emerge. First, increasing seed document count monotonically reduces final validation loss for both real and synthetic data, confirming that conceptual diversity is a primary driver of generalization. Second, a persistent gap exists between synthetic and real data at the same total token count, and this gap grows larger as the seed corpus shrinks. At 100 seed documents, the synthetic data yields a val loss 0.07 nats worse than real data with the same token budget; at 100K seed documents, the gap vanishes. This is consistent with our theoretical prediction: the information bottleneck introduced by rephrasing becomes binding only when the seed corpus has limited conceptual diversity.
Figure 2: Validation loss vs. number of seed documents. Synthetic rephrasings (red) consistently underperform equivalent real data (blue) when seed diversity is low, converging as diversity increases.
5.3 Rephrasing Multiplicity Ablation
Figure 3 presents an ablation over the rephrasing multiplicity $m$ (the number of rephrasings per seed document) with a fixed seed set of 1K documents. For real data, increasing token count monotonically improves validation loss, as expected from scaling laws. For synthetic data, validation loss initially improves with $m$ (from 1× to ~10×), then plateaus and eventually degrades at high multiplicity (50×-100×). This non-monotonic behavior is a hallmark of concept-level overfitting: beyond a threshold, additional synthetic data provides diminishing conceptual information while increasing the model's exposure to the same underlying concepts, enabling the model to overfit on this narrow conceptual distribution.
Figure 3: Effect of rephrasing multiplicity. With a fixed 1K seed set, increasing rephrasings shows diminishing returns and eventual degradation for synthetic data (red), while equivalent real data (blue) continues to improve.
5.4 Representation Geometry
To further characterize concept-level overfitting, we analyze the geometry of internal representations. Figure 4 shows PCA projections of hidden states from the final layer of models trained under each regime, computed on a fixed set of 1,000 held-out validation documents.
The full-data model (panel a) produces representations spread broadly across the two principal components, reflecting the diversity of the training distribution. The repeated-data model (panel b) shows highly collapsed representations clustered around a small number of modes corresponding to the seed documents. Critically, the synthetic-rephrasing model (panel c) shows an intermediate pattern: representations are broader than the repeated-data case but substantially more restricted than the full-data model, occupying a limited region of the representation space.
We quantify this using the participation ratio of the representation covariance matrix. The full-data model achieves $\kappa = 142.3$, the repeated-data model $\kappa = 8.7$, and the synthetic model $\kappa = 12.4$ (Table 2). The fact that the synthetic model's $\kappa$ is much closer to the repeated model than to the full model - despite its training data being token-level unique - provides strong evidence that concept-level diversity, not token-level diversity, is the relevant measure for representation quality.
Figure 4: PCA of final-layer hidden states on held-out documents. (a) Full data produces widely distributed representations. (b) Literal repetition collapses to tight clusters. (c) Synthetic rephrasings show intermediate but still restricted coverage.
5.5 Deduplication Strategies
Given that concept-level overfitting stems from conceptual redundancy rather than token-level duplication, we investigate whether deduplication strategies can mitigate the effect. We compare three approaches on the SYNTH-1K-20x configuration:
No deduplication: All 20,000 synthetic documents are used as-is.
Token-level deduplication: We remove documents with >50% 13-gram overlap with any previously seen document. Since our rephrasings have no such overlap by construction, this removes zero documents - as expected for synthetic data.
Embedding-based deduplication: Following Abbas et al. (2023), we cluster document embeddings (from Sentence-T5-XL) using k-means and retain only cluster centroids, reducing the training set to approximately 1,800 documents with higher conceptual diversity per token.
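A self-contained sketch of the centroid-retention step, assuming precomputed document embeddings (we substitute a small Lloyd's k-means with farthest-point initialization for whatever clustering library was actually used):

```python
import numpy as np

def centroid_dedup(emb: np.ndarray, k: int, iters: int = 25) -> np.ndarray:
    """Cluster embeddings (rows) into k groups and return the indices of
    the documents nearest each cluster centroid, one per cluster."""
    # farthest-point initialization: deterministic, spreads centers out
    chosen = [0]
    d2 = ((emb - emb[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        chosen.append(int(d2.argmax()))
        d2 = np.minimum(d2, ((emb - emb[chosen[-1]]) ** 2).sum(axis=1))
    centers = emb[chosen].astype(float)

    for _ in range(iters):  # Lloyd's algorithm
        dists = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = emb[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)

    dists = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.unique(dists.argmin(axis=0))  # nearest real document per centroid
```

Retaining one representative document per cluster, rather than every near-duplicate, is what raises conceptual diversity per token; `k` controls how aggressive the deduplication is.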
Figure 5: Effect of deduplication strategies on validation loss. Token-level dedup (orange) provides minimal benefit over no dedup (red), while embedding-based dedup (blue) substantially delays and reduces concept-level overfitting.
Figure 5 confirms the prediction: token-level deduplication is essentially ineffective because the redundancy exists in conceptual space, not at the surface level. Embedding-based deduplication significantly delays the onset of overfitting (from ~17,500 to ~27,500 steps) and reduces its magnitude, though it does not eliminate it entirely because even cluster centroids share conceptual overlap from the narrow seed distribution.
6 Analysis
6.1 When Does Concept-Level Overfitting Matter?
Our results establish that concept-level overfitting is a real phenomenon on realistic data (C4) at the 125M-parameter scale. However, the practical severity depends on the ratio of seed diversity to model capacity. For our models, the effect becomes measurable with fewer than ~5,000 seed documents and severe below ~1,000. We expect these thresholds to scale with model size: larger models have more capacity to memorize conceptual distributions, potentially making them more susceptible to concept-level overfitting. Testing this scaling hypothesis is an important direction for future work.
A natural question is whether this phenomenon is relevant to current large-scale pretraining. We argue it may be, for two reasons. First, many synthetic data pipelines involve generating training data from a relatively small set of seed topics or curricula, even if the individual documents are diverse. Second, web crawls themselves contain substantial conceptual duplication - the same news stories, product descriptions, and tutorial topics appear across many domains - and this natural conceptual redundancy may interact with synthetic augmentation in ways our controlled setup does not fully capture. Understanding whether concept-level overfitting operates at larger model scales is essential for production practitioners.
6.2 Implications for Synthetic Data Pipelines
Our findings have direct implications for practitioners building synthetic data pipelines for pretraining, though we emphasize these recommendations are based on 125M-parameter experiments and may require recalibration at larger scales. First, token-level deduplication - the standard approach in most pipelines - is insufficient to prevent conceptual overfitting. Practitioners should additionally employ embedding-based diversity measures to assess and maintain conceptual coverage. Second, the non-monotonic relationship between rephrasing multiplicity and validation performance (Figure 3) suggests that there exists an optimal rephrasing budget that balances the benefits of additional data against the costs of conceptual redundancy. We observe that this optimum lies at approximately 10-20 rephrasings per seed for our 1K-seed setting, though the optimal ratio likely depends on seed corpus diversity and model capacity.
Third, our results underscore that the effective dataset size for pretraining may be better measured in conceptual units than in token counts. A dataset of 100B tokens generated by rephrasing 1,000 seed documents is fundamentally different from 100B tokens of diverse web text, even if standard quality filters rate both equally. We recommend that synthetic data practitioners report concept-level diversity metrics (such as the participation ratio of embedding distributions) alongside standard dataset statistics.
6.3 Connections to Scaling Laws
Standard neural scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) model validation loss as a function of parameter count $N$ and training token count $D$, e.g. $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$. Our results reveal a qualitative failure mode of this framework when applied naively to synthetic data: Figure 3 shows that for a fixed seed count $n$, increasing the token budget $D$ via higher rephrasing multiplicity first improves and then worsens validation loss - a non-monotonicity that is unpredictable within a framework parameterized only by the raw token count $D$.
The non-monotonicity arises because synthetic rephrasings have bounded conceptual diversity. As rephrasing multiplicity increases, the model transitions from underfitting (insufficient coverage of concepts) to memorizing the geometry of seed concepts. The transition occurs at approximately 10-20× for 1K-seed corpora. This motivates replacing $D$ with an effective dataset size $D_{\mathrm{eff}}(n, m)$ that is monotonically increasing in the seed count $n$ and the rephrasing multiplicity $m$, but bounded above by the conceptual diversity of the seed corpus regardless of rephrasing multiplicity. The specific functional form is a conjecture pending multi-scale experiments, but the bound is supported by our data: SYNTH-1K-20x achieves equivalent final validation loss to a real-data corpus of far fewer tokens than 3.3B.
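One candidate form satisfying these constraints - offered purely as an illustration of the conjecture stated above, not a fitted law - makes the saturation explicit:

```latex
% Illustrative conjecture only. m^*(n) is the saturation multiplicity,
% empirically around 10-20 in our 1K-seed setting.
D_{\mathrm{eff}}(n, m) \;=\; n \cdot m^{*}(n) \left(1 - e^{-m / m^{*}(n)}\right)
```

This expression is increasing in both $n$ and $m$, behaves like $n \cdot m$ for small $m$, and is bounded above by $n \cdot m^{*}(n)$ no matter how many rephrasings are generated.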
Practitioners can treat $D_{\mathrm{eff}}$ as a ceiling on the conceptual return from synthetic expansion, with the practical saturation point at approximately $m \approx 10$-$20$ for 1K-seed settings.
6.4 Connections to Prior Theoretical Work
Our bandwidth-limited channel formalization connects to several lines of theoretical work. The information bottleneck principle (Tishby et al., 2000) provides a natural framework: synthetic rephrasing acts as a bottleneck that preserves task-relevant information up to the capacity of the rephrasing model, but cannot exceed the information content of the seed corpus. Our concept-level diversity measure relates to the intrinsic dimensionality measures used in manifold learning (Facco et al., 2017) and to effective rank (Roy & Vetterli, 2007), recently applied to characterize neural network training dynamics.
The monotonic improvement we observe when training on novel real data (as opposed to synthetic rephrasings) is consistent with the "blessing of scale" in language modeling: each genuinely new document provides information that is non-redundant with respect to the existing training set. The failure of this monotonicity under synthetic rephrasings is precisely the signature of the information bottleneck becoming binding.
6.5 The Rephraser Style Confound
An important alternative hypothesis is that models may be overfitting to the stylistic distribution of the rephrasing model (Llama-3-8B-Instruct) rather than the conceptual distribution of the seed corpus. All 20,000 synthetic documents in SYNTH-1K-20x share the stylistic fingerprint of a single instruction-tuned LLM - including vocabulary preferences, sentence structure patterns, and discourse conventions - that would not appear in held-out C4 data.
We cannot fully decompose conceptual and stylistic contributions to the observed overfitting. The seed-count scaling gradient argues against style as the sole explanation - SYNTH-5K-20x and SYNTH-10K-10x use the same rephraser yet show dramatically less overfitting (Δ Val of +0.08 and +0.03 vs. +0.34), and the representation geometry shows concept-clustered rather than style-collapsed structure. However, both mechanisms may operate simultaneously. Definitive isolation requires training on rephrasings from multiple diverse LLMs (e.g., Llama-3 and Mistral-7B) - our highest-priority follow-up.
6.6 Limitations
Our experiments use a single model scale (125M parameters), a single rephrasing model (Llama-3-8B-Instruct), and single training runs without confidence intervals. The quantitative thresholds we identify (overfitting onset at ~1K seed documents) may not transfer directly to larger scales. Our concept-level diversity metric depends on the choice of embedding model, and we evaluate only language modeling loss - downstream task evaluations may reveal additional effects. Multi-seed replication, multi-scale validation, and downstream benchmarking are priorities for follow-up work.
7 Conclusion
We have presented existence evidence for concept-level overfitting: a phenomenon where language models overfit on the conceptual distribution of their training data even when no token sequences are repeated. Through controlled experiments on subsets of C4 and their synthetic rephrasings at the 125M-parameter scale, we demonstrated that validation loss divergence - the hallmark of overfitting - can emerge from conceptual redundancy alone. Our analysis of representation geometry confirms that models trained on synthetic rephrasings develop restricted internal representations similar to those produced by literal data repetition, despite full token-level uniqueness.
For practitioners building synthetic data pipelines: measure embedding-space diversity, not just token uniqueness. Limit rephrasing multiplicity relative to seed diversity (our results suggest ~10-20× for narrow corpora). Use embedding-based deduplication alongside token-level methods. And consider using multiple diverse rephrasers rather than a single model.
More broadly, our results suggest that for datasets synthesized through rephrasing, concept-level diversity may be a useful complement to token-count metrics for predicting generalization - a perspective that, if validated at larger scales, could reshape how we think about data requirements for foundation models.
References
Abbas, A., Tirumala, K., Simig, D., Ganguli, S., & Morcos, A. S. (2023). SemDeDup: Data-efficient learning at web-scale through semantic deduplication. In International Conference on Learning Representations.
AI@Meta. (2024). Llama 3 model card. https://github.com/meta-llama/llama3
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., & Zhang, C. (2023). Quantifying memorization across neural language models. In International Conference on Learning Representations.
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., ... & Raffel, C. (2021). Extracting training data from large language models. In USENIX Security Symposium.
Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
Eldan, R., & Li, Y. (2023). TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759.
Facco, E., d'Errico, M., Rodriguez, A., & Laio, A. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1), 12140.
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C., Del Giorno, A., Gopi, S., ... & Li, Y. (2023). Textbooks are all you need. arXiv preprint arXiv:2306.11644.
Hernandez, D., Brown, T., Conerly, T., DasSarma, N., Drain, D., El-Showk, S., ... & Ganguli, D. (2022). Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., ... & Sifre, L. (2022). Training compute-optimal large language models. Advances in Neural Information Processing Systems.
Hutter, M. (2021). On the foundations of universal sequence prediction. In Algorithmic Learning Theory.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating training data makes language models better. In Association for Computational Linguistics.
Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In International Conference on Learning Representations.
Maini, P., Seto, S., Bai, H., Grangier, D., Zhang, Y., & Jaitly, N. (2024). Rephrasing the web: A recipe for compute and data-efficient language modeling. In Association for Computational Linguistics.
Muennighoff, N., Rush, A. M., Barak, B., Le Scao, T., Piktus, A., Tazi, N., ... & Raffel, C. (2023). Scaling data-constrained language models. Advances in Neural Information Processing Systems.
Ni, J., Qu, C., Lu, J., Dai, Z., Abrego, G. H., Ma, J., ... & Chang, Y. (2022). Large dual encoders are generalizable retrievers. In Empirical Methods in Natural Language Processing.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
Roy, O., & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In European Signal Processing Conference.
Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.
Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. S. (2022). Beyond neural scaling laws: Beating power law scaling via data pruning. Advances in Neural Information Processing Systems.
Tirumala, K., Simig, D., Ganguli, S., & Morcos, A. S. (2023). D4: Improving LLM pretraining via document de-duplication and diversification. arXiv preprint arXiv:2308.12284.
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
Appendix A: Reproducibility Details
All experiments were conducted on a cluster of 8 NVIDIA A100 80GB GPUs using PyTorch 2.1 and the HuggingFace Transformers library. Training a single 125M-parameter model for 50K steps took approximately 4 hours on a single A100. Generating synthetic rephrasings of 1K seed documents (at 20× multiplicity) required approximately 2 hours on a single A100 using Llama-3-8B-Instruct with vLLM for inference. The total compute budget for all experiments in this paper is approximately 500 A100-hours. All random seeds, hyperparameters, and data preprocessing scripts will be released upon publication.
13-gram overlap verification was performed using a rolling hash (Rabin fingerprint) over the BPE token sequence of every document pair. The embedding-based deduplication used k-means with k=1,800 clusters on Sentence-T5-XL embeddings (768-dimensional), selecting the document closest to each centroid. The participation ratio was computed from the top 256 eigenvalues of the document embedding covariance matrix.
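The participation ratio used above can be sketched in a few lines of NumPy. The formula is the standard one, PR = (Σᵢ λᵢ)² / Σᵢ λᵢ², over the top eigenvalues of the embedding covariance matrix; the synthetic data and variable names below are illustrative only.

```python
import numpy as np

def participation_ratio(embeddings: np.ndarray, top_k: int = 256) -> float:
    """Effective dimensionality of an embedding cloud.

    Computes PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)
    over the top_k eigenvalues of the document-embedding covariance
    matrix. PR is close to d for an isotropic d-dimensional cloud and
    close to 1 when variance collapses onto a single direction.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    # eigvalsh returns ascending eigenvalues for symmetric matrices.
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1][:top_k]
    return float(eig.sum() ** 2 / (eig ** 2).sum())

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 8))       # isotropic: PR close to 8
lowdim = iso.copy()
lowdim[:, 2:] *= 0.01                  # collapse variance onto two axes
print(participation_ratio(iso))        # approximately 8
print(participation_ratio(lowdim))     # approximately 2
```

A shrinking participation ratio over rephrasings of the same seed corpus is exactly the concept-level redundancy signal that token-level 13-gram checks miss.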