descriptive claim
Repeated exposure to the same datapoint during training causes generative models to shift from novel, high-entropy completions toward verbatim regurgitation; in a GPT-2 fine-tuning experiment on Shakespeare, one pass over the data produced incoherent novel output, five passes produced a mixed regime of partially memorized and partially novel tokens, and ten passes produced memorized regurgitation of the beginning of Coriolanus.
desc_data_repetition_causes_memorization
confidence 0.75
Evidence (1)
supports (1)
- When does generative AI qualify for fair use? (direct_measurement, weight 0.70)
locator: Factor (3), Data repetition during training subsection
“after training on each datapoint ten times, it ends up memorizing the beginning of the play Coriolanus and regurgitating it when prompted”
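A minimal sketch of how the claimed experiment could be reproduced, assuming the Hugging Face transformers library, GPT-2 small, and a local shakespeare.txt corpus; the block size, learning rate, sampling settings, and prompt below are illustrative choices, not taken from the source. It fine-tunes for 1, 5, and 10 passes over the data and prompts each model with the opening line of Coriolanus to compare the three regimes.

```python
# Hedged sketch: vary the number of passes over the training data and inspect
# how completions move from novel text toward verbatim regurgitation.
# Assumptions (not from the source): GPT-2 small, a local "shakespeare.txt",
# and the hyperparameters below.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def fine_tune(epochs: int, corpus_path: str = "shakespeare.txt") -> GPT2LMHeadModel:
    """Fine-tune GPT-2 for `epochs` full passes; each epoch is one exposure per datapoint."""
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    text = open(corpus_path, encoding="utf-8").read()
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    # Split the corpus into fixed-length training blocks.
    block = 512
    chunks = [ids[i : i + block] for i in range(0, len(ids) - block, block)]
    optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for _ in range(epochs):
        for chunk in chunks:
            batch = chunk.unsqueeze(0).to(device)
            loss = model(batch, labels=batch).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model


def sample(model: GPT2LMHeadModel, prompt: str, max_new_tokens: int = 100) -> str:
    """Sample a continuation; compare it against the training text for verbatim overlap."""
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)


# Opening line of Coriolanus, used here as the memorization probe.
prompt = "Before we proceed any further, hear me speak."
for n_epochs in (1, 5, 10):
    model = fine_tune(n_epochs)
    print(f"--- {n_epochs} pass(es) ---\n{sample(model, prompt)}\n")
```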