Welcome to the GitHub repo for our book!
In this book, we explore machine learning for text analysis as it relates to the data product pipeline. We discuss data ingestion and wrangling to preprocess input text into a corpus that can be analyzed. We explore methods for reading and analyzing corpora to build models that can be used in data applications, while monitoring those models for change. Finally, we discuss how to begin operationalizing these methods, moving toward a complete picture of how to build language-aware data products.
Note #1 This book is currently in early release, so the code and other content are still in raw, unedited form. Updates are forthcoming. Please bear with us in the interim.
Note #2 Much of the code in this book is based on a large corpus of ingested RSS feeds. We encourage you to construct your own sample using the tools detailed in Chapter 2: Building a Custom Corpus. However, we have also made available a random 515.9 MB sample of 33,232 files (~5% of the full data source). This sample has been post-processed. You can obtain a 7.2 GB raw sample here.
A note on the copyright of the corpus: this corpus is intended for academic purposes only, and we expect you to use it as though you had downloaded the documents from the RSS feeds yourself. What does that mean? It means the copyright of each individual document belongs to its owner, who grants you the ability to download a copy for reading, analysis, and so on. We expect you to respect that copyright and not republish this corpus or use it for anything other than tutorial analysis.
Cheers, Rebecca, Ben, and Tony
Below are links to the other data sets used throughout the book (in order of appearance).