BookCorpus
Updated: 06/22/2024 by Computer Hope
Also called the Toronto Book Corpus, BookCorpus is a collection of about 11,000 free, self-published books, roughly 7,000 of them unique, that researchers scraped from the eBook distribution site Smashwords. The dataset contains about 985 million words, and its books span many genres, from science fiction to romance. BookCorpus has been widely used to train LLMs (large language models) such as OpenAI's original GPT and Google's BERT (Bidirectional Encoder Representations from Transformers).
AI Terms, Dataset, Internet terms, Natural language processing, Research, Text mining