id author title date pages extension mime words sentences flesch summary cache txt
altman-building-2020 altman-building-2020 .docx application/vnd.openxmlformats-officedocument.wordprocessingml.document 6071 311 60 I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models. As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to perform on your dataset. ./cache/altman-building-2020.docx ./txt/altman-building-2020.txt