key: cord-0048529-1kjvr8rw authors: Westfall, Jon title: Project 5: The Change Alert! date: 2020-07-01 journal: Practical R 4 DOI: 10.1007/978-1-4842-5946-7_9 sha: 93fecc017cd234a19ead933ba0f54a4c833fbd6b doc_id: 48529 cord_uid: 1kjvr8rw

In late January 2020, I sat at my desk thinking about the projects for this book. I had come up with all of the projects except one, the one that we're going to talk about in this chapter. Looking for inspiration, I went down the hallway to my colleagues and asked them what "pain points" they dealt with on a daily basis. One of them lamented that she wished another office would let her know when they added information to a report. Because they didn't, she had to pull the report every day just to see if anything had been added. It got me thinking about how many times I do a similar thing, checking information just to see if it has changed. Thus, the inspiration for this chapter!

Change is the only constant in the universe, yet finding changes can be uniquely frustrating. In a perfect world, anytime someone (or something) made a change, they (or it) would proactively tell us that things were different. The whole point of this chapter is solving the problems that the lack of such notifications causes. Depending on what you're trying to track, your methods may differ. In this section, I'll talk about three different sources or situations that you might find data in and some solutions for tracking them. I'll also provide example code that you can adapt for your situation.

Perhaps nothing is more frustrating than receiving a dataset from a colleague and having them say "I don't think this is the most up-to-date… I seem to remember another file that had more data" or "Yesterday I could have sworn that there were more people in that other group". In a dataset of a thousand or so observations, no one has time to go through and compare them line by line to see what's changed.
That's why we have computers and, in this case, why we have the arsenal package and its function comparedf(). First, it should be noted that there are a lot of packages that will do file or data frame differencing. The reason I prefer arsenal is the summary report it can provide and the options that you can use to get exactly the report you need.

To show you what I mean, I've taken the survey data we collected in Chapter 3 and provided two versions in the code for this chapter. The file survey_file_1.csv is the same as the complete data from Chapter 3. In survey_file_2.csv, I took out ten lines and changed two random demographic variables. Opening the files side by side, we can easily see the ten missing lines (Figure 9-1). However, it can be a bit difficult to see the differences in the demographics. The following code will find these two needles in the haystack pretty quickly:

install.packages("arsenal")
library(arsenal)

df1 <- read.csv(file = "survey_file_1.csv")
df2 <- read.csv(file = "survey_file_2.csv")

# 10 new rows of data, and 2 demographics changes made
comparedf(df1, df2)
summary(comparedf(df1, df2, by = "Response.ID"))

Running that code, we get a very comprehensive report, two snippets of which are worth calling out. Right away we can see that there are 11 observations in y (our second data frame) that aren't in x (our first data frame). Farther down in the list, we actually get a table that lists the Response.IDs of each observation that isn't shared. And what about those two demographic pieces that I changed? We can find them easily too. These might have been changed for legitimate reasons (e.g., the person entering the data made a typo that they fixed later in the day, after the first report had gone out), or they might have been changed for less than legitimate reasons. Security professionals will tell you that individuals who break into computers can modify log files to cover their tracks.
Imagine having a central monitoring script that recorded user activity and periodically auditing the logs to see if the user count was retroactively changed. You might find something before it's too late.

As I mentioned earlier, comparedf() has some features that other packages lack. What if your column names are slightly different ("Response.ID" vs. "response.id")? It can handle that using the tol.vars option. What about having a level of acceptable difference? Pass tol.num.val with an absolute difference, and differences will only be flagged if the threshold is exceeded. comparedf() can also support user-defined tolerance functions, which means you can customize your criteria even more. Example 1 in the comparedf vignette shows how to ignore items with a newer date, on the theory that those differences are intentional updates.

If only all data could be that nice. And it's possible it might be: a lot of data can be coerced into a data frame in some way. However, in doing so, you may lose some of the things that make that data special, like the actual content! What might we do if the content is what we want to monitor, whether we get it by downloading it via an API or directly from the rendered web page or document? Enter the next two scenarios!

Data comes in many ways, with a good chunk coming through API calls. One of the more common formats that you can receive data in is JSON (JavaScript Object Notation). R can deal with this data through several packages, one of which is jsonlite (if your API of choice uses XML, then you can check out the easily remembered XML package). What makes JSON particularly nice is that in many cases you can download it directly, without having to save it to a file. The data tends to be rather small, and tons of websites support it.
In the following example, I'm going to download the current "hot" list of topics from Reddit and display the top item:

install.packages("jsonlite")
library("jsonlite")

hot <- "https://www.reddit.com/hot/.json"
hot.df <- fromJSON(hot)
hot.df$data$children$data$title[1]

As we'll see later, once I have the item, there are ways in which I can store it and then compare it later. In the following example, we'll actually check it once an hour to see if it changes, although you can check it as often as you like. Well, almost… Here's a caveat about API data: not all of it is free. Many websites understand that if you're using an API to access their content, that's fewer eyes on their actual web page. And while it's nice of you to lighten their server load, you're also lightening their pockets by reducing their ad revenue. So many of them require that you provide an API key in order to access their API. You purchase that key and a certain number of data calls with it. Reddit itself allows up to 60 requests per minute and requires that you authenticate if you're building a client application. So don't look to bog down a server with a ton of calls; even if you are able to do a few, you might find yourself blocked if you're grabbing the same data 10-20 times per minute.

In situations where the data is expensive or an API isn't available, what option do we have? The majority of the time, our best option is some form of web scraping, like I discussed in Chapter 2. The Web has a ton of data, but it isn't always in a format friendly to us. That's why we have tools like R. In Chapter 2, I gave you a very basic version of scraping data from a web page with the rvest package. When it comes to web scraping, there are likely two different goals you might have in mind: comparing two versions of a page (highlighting the differences) or mining data out of the page to use in your script. The first is actually easiest, thanks to the diffr package.
The following code grabs a copy of the homepage of Delta State University from the Internet Wayback Machine at archive.org at two separate points in time. It then analyzes the differences and produces a difference report of the raw HTML that highlights the differences in file2.html vs. file1.html (Figure 9-3 shows the output of the diffr function). As we can see from the report, quite a bit of code changed, mostly due to a difference in file paths.

install.packages("diffr")
library(diffr)
install.packages("shiny")
library("shiny")

addr <- "https://web.archive.org/web/20200102195729/http://www.deltastate.edu/"
addr2 <- "https://web.archive.org/web/20191202220905/http://www.deltastate.edu/"

download.file(addr, "file1.html")
download.file(addr2, "file2.html")

diffr("file1.html", "file2.html", contextSize = 0)

diffr can work on any text file to create a difference report. It's not going to be your solution, however, for compiled files such as a PDF. As of this writing, there isn't a utility in R to compare PDF documents; however, one does exist for Python. And as hinted earlier, diffr is not going to be your solution if the document you download has specific data you need. For that, rvest is your best option. However, as a caveat, rvest or any other package is going to be pretty worthless if what you download doesn't include the data you need. Many web pages today use JavaScript to download data after the page is loaded, through Asynchronous JavaScript and XML (AJAX) calls. When you download a web page using the download.file() function, you get the raw HTML. That HTML might include only placeholders for the data that you want, because the data gets loaded later in your web browser. This means that rvest won't be able to find anything. Later in this chapter, we'll look at a solution for this in our ChangeAlert script, by emulating the page loading experience in a "headless" web browser. For now, let's turn back to the basics of detecting change.
Because once we can get the data one time, we can get it in the future as often as we like to compare against that first copy.

In a perfect world, we'd always do what we want to do at the time we want to do it. As I write this, in the midst of the COVID-19 pandemic, my morning routine looks a lot different than it usually does. I wake up around 6:50 AM, and I'm at the "office" (a.k.a. my wife's craft room that I'm given a corner of) after a 10-second commute. Around 7:10 AM I make a trip to the coffee shop (my kitchen), and I am back by 7:15 AM. If I need to do something every morning, I can easily think about it based off my schedule: "right after coffee". But normally my days are a bit more chaotic. I get to my office at 8:00 AM and bounce between meetings, classes, conferences, lunches with friends, and more. I don't have anything constant to hook on to, so if I have to download a report and compare it against the previous day's version, I will forget on more days than I will remember. Thankfully, we have a few ways in R to schedule our task to run on a schedule that we set once and only update as needed.

When it comes to scheduling, we have two options: having our script run at a set time or having our script run on an interval. We may also do a combination of the two. In the first scenario, we use the task scheduler on our operating system (cron for most Linux, Unix, or macOS users) and tell it to run the Rscript command we discussed in Chapter 8. Additionally, packages exist that help schedule tasks from within R: cronR for our non-Windows users and taskscheduleR for the Windows crowd. This is great if I know I want to run something at 8:00 AM every workday, or every hour on the hour. It's not so great if I want to run something over a shorter period of time (e.g., every 10 seconds) or if I want to monitor the output in real time, able to start and stop the task as needed. For that, we need a package that will let us run something on a loop.
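Before we move on to loops, here is what the set-time approach looks like in code. This is a minimal sketch using cronR, with the taskscheduleR equivalent in comments; the script paths, task id, and start time are hypothetical placeholders for your own setup.

```r
# Schedule an existing R script with cronR (Linux/Unix/macOS).
# "/home/me/changealert.R" and the id below are hypothetical.
library(cronR)

cmd <- cron_rscript("/home/me/changealert.R")
cron_add(cmd, frequency = "daily", at = "08:00",
         id = "changealert",
         description = "Run the ChangeAlert script every morning")

# On Windows, taskscheduleR does the same job:
# library(taskscheduleR)
# taskscheduler_create(taskname = "changealert",
#                      rscript  = "C:/Users/me/changealert.R",
#                      schedule = "DAILY", starttime = "08:00")
```

Once added, the job runs whether or not an R session is open, because the operating system's scheduler, not R, is doing the timing.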
Enter the aptly named later package and its function of the same name. later() allows you to specify a function and a delay. Take a look at the following code:

install.packages("later")
library("later")

loop <- TRUE
myfunc <- function() {
  print("This is the output of the scheduled later loop")
  while (loop) { later(myfunc, 10); break }
}

Typing myfunc() into R after running that code will print "This is the output of the scheduled later loop" repeatedly while the loop variable remains true. Type loop <- FALSE and press Enter, and you'll get one more loop of the myfunc() command, and then it will be done. We can also interrupt the later() function by using code such as this:

myfunc2 <- function() {
  cat("this is the output of the scheduled later loop, run at ",
      format(Sys.time(), "%a %b %d %X %Y"))
  cancelfunc <<- later(myfunc2, 10)
}
cancelfunc <- later(myfunc2, 10)

That code will start the function, and it will keep going every 10 seconds as it "reschedules" itself each time it runs. However, typing cancelfunc() into R and pressing Enter will cause the loop to be cancelled. Unlike the first method, you don't get an additional run out of the code; it stops as soon as you execute cancelfunc(). If you'd like the option to cancel after the next run (using the loop variable) or before the next run (using cancelfunc()), you can modify the code to this:

myfunc3 <- function() {
  print("This is the output of the scheduled later loop")
  while (loop) { cancelfunc <<- later(myfunc3, 10); break }
}

The best of both worlds, at least in terms of loop flexibility. And flexibility is important, because just as humans don't always wake up at the right time of the morning, our computers can have off days as well. Power outages, data connectivity issues, confused system administrators who reboot the system while people are using it, and more all mean that your script might not keep running in perpetuity as you intended. Let's talk about a little bit of redundancy. In the R world, we have a few choices for how to store data for either short- or long-term retrieval.
In some cases, I can simply declare a variable with the data I want and then compare my data to that variable later. Imagine that I write a function to get the top hot thread on Reddit, named gettophot() (if you can't wait, you can see this code later in our ChangeAlert script). Running gettophot() returns whatever the top hot article title is at that moment. In that scenario, the following code can be very helpful. It checks to see if I have a tophothist variable in my R environment, and if I do not (because this is the first run of my script), it creates it and saves the current top hot thread to tophothist:

ifelse(!exists("tophothist"), tophothist <- gettophot(), "")

As long as my R environment is saved when I exit R, I can always come back and compare against this. It's fast, and it's also easy to modify if I want to test my script: I don't have to wait for the top hot title to change, I simply need to change tophothist to something new, and my script will detect it as a change. However, as you may have guessed, this can be problematic if my R environment isn't saved by default or if I'm working on multiple machines that don't replicate my R environment. For this scenario, I might want to store my data in a file. Options for this vary depending on the type of data you're storing. In the following example, where I'm storing web pages, I might use download.file() to get the original copy of my file and read_html() from the rvest package to load it. I could also store R objects in a .rds file. Saving tophothist to one would be done using saveRDS(tophothist, file="tophothist.rds"), and it would be restored using readRDS(file="tophothist.rds"). Finally, you can also programmatically save and restore your entire R workspace image using the save.image() function, which stores your image in a file named ".RData" by default (you can specify a filename using the file= option). You restore the image by using the load() function.
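To see the .rds round trip in action, here is a small self-contained sketch. The title string is just a stand-in for a real gettophot() result, and the filename matches the one used above.

```r
# Save a baseline value, wipe it from the session, and restore it.
tophothist <- "Example thread title"        # stand-in for gettophot()
saveRDS(tophothist, file = "tophothist.rds")

rm(tophothist)                              # simulate a fresh R session
tophothist <- readRDS(file = "tophothist.rds")
stopifnot(identical(tophothist, "Example thread title"))
```

Because the value lives on disk rather than in the workspace, it survives even if your session exits without saving its environment.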
You'll have to give some thought to how you want to store your "baseline" data, the data you want to compare new data against. In some cases, you may have files always on your hard drive that you're comparing. In other cases, you might need the flexibility to have a new file downloaded, either by your script or through another means. For example, many cloud storage providers such as Dropbox and OneDrive allow you to keep files in sync across many devices, with a local copy stored on each. In this way, you could run your R change script from multiple places, and it won't matter where you last ran it if it's always writing to the same file. Once you've decided what you want to track and how you want to schedule it, you next need to decide how you want to alert yourself to a change! Once we talk about that, we can see all of this in action in the ChangeAlert script!

So how do you want to be disturbed? Or how disturbed do you want to be by technology? That's the question for this next section. We'll talk about various ways that we can learn about our change. Perhaps the simplest way to learn that something has changed is for R to let us know about it directly in the console. We've seen this in the later() example, with R using the print() command to output a notification. We'll also see it a bit in the following ChangeAlert script. It's simple and a great backup method if you aren't quite sure your other methods are 100% reliable. If you want to get a little bit fancier than a simple print output, you can use one of the several packages that R has for logging information. futile.logger has a ton of features for multiple logs and different notification thresholds. Combining those together, you could choose to silently log certain levels of change, without being disturbed until you want to check on the results later. Another option would be to have a pop-up dialog box or alert.
Something like the svDialogs package will do the trick:

install.packages("svDialogs")
library(svDialogs)

dlg_message("This is a test")

These methods are great if you are running your code on a machine you can easily access and view; however, they might not work if you're scheduling your script (or at least not work in ways that you would find useful or intuitive). Thus, you may want to send a notification elsewhere, in a few different ways that we'll explore.

Perhaps the most intuitive way to be notified would be through email or text message. These methods are fairly well used and accepted. We've seen how to send emails in Chapter 6, and dedicated texting options do exist. While this might be your knee-jerk notification go-to, it's worth pointing out a few potential problems. First, email can be held up or dropped for looking untrustworthy. A message that says "This is a notification that your job has finished" will look rather suspicious to most email filters. Your message may be dropped without you even realizing it. The same goes for text messaging. Unless you're using a commercial service designated for texting, you might run into the same "suspicious" problem. At the very least, you'll have to figure out how to get your email to text, which can be confusing, as this feature isn't as widely used as it once was. At the end of this section, we'll talk about a method that I endorse over email and text for its reliability and customizability, named Pushover. However, there is one more way you might get your notification across: updating your web page.

A final "basic" method of notifying yourself comes in a simple solution: push an update file to a server that you can access to check. This works best in situations where you want a status update when you want it, not when it wants to send. After a change is detected, your computer could write a text file with the status update and use a package such as ssh to move the file to a web server.
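As a sketch of that last idea, the ssh package can handle the upload. The hostname and server path below are hypothetical placeholders, and in practice you would authenticate with your own SSH key or password.

```r
# Write a timestamped status file and copy it to a web server.
# "me@example.com" and "/var/www/html/" are hypothetical placeholders.
library(ssh)

writeLines(paste("Change detected at",
                 format(Sys.time(), "%a %b %d %X %Y")),
           "status.txt")

session <- ssh_connect("me@example.com")
scp_upload(session, "status.txt", to = "/var/www/html/")
ssh_disconnect(session)
```

With the file sitting in the web root, you can check the status from any browser whenever you choose.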
Alternatively, you could place the file in a cloud storage folder and let the cloud client on your computer (e.g., Dropbox, Google Drive, OneDrive) push the file up to the cloud. You can then browse to that folder at your leisure and check the results. I'd suggest putting a timestamp into any message, so you know whether it's new or old. The stamp that I like to use looks like this in R: format(Sys.time(), "%a %b %d %X %Y"), or in human speak, something like "Tue Apr 14 22:05:50 2020".

We've just discussed three ways to send notifications, each with a strength. Outputting to the console allows you to output a lot of information, some urgent, some just informative. Sending to email or text is intuitive and simple: one line of code (after you're set up) and the email can go out. But email can be unreliable. A status file on a web server or in cloud storage is nice, but it means you need to keep track of a lot of different filenames to check. The service I'll discuss next, Pushover, can do all of those things. Sadly, unlike most everything else I've mentioned in this book, Pushover isn't free: it's $5 for a lifetime license on your choice of platform, iOS, Android, or Desktop. And it's worth the very small price, as we shall discover.

To understand Pushover, you need to think back to a time when push notifications were first being widely used, almost 10 years ago. To get a push notification, you needed to be a developer with an infrastructure to send these notifications. You also needed to pay to access the notification APIs, either directly or by having an app in the Google Play Store or the Apple App Store that supported push. It wasn't really accessible to individuals. Pushover changed that. At its core, Pushover is a push notification service that offers a variety of customizations. You purchase a Pushover license for a one-time cost of $4.99 per platform. I bought my Pushover license for iOS in 2012, and it's the same one I'm using today.
No subscription costs, no ads, just a reliable service. To send a notification, you can simply email a special email address Pushover assigns to you. But you won't want to do that if you're using R, because you can use the pushoverr package! Here's a small listing of use scenarios that Pushover supports, with R code:

• Simple setup: Create an app to get an API key (see Figure 9-9). Then put your API key and user key at the top of your script:

install.packages("pushoverr")
library("pushoverr")
set_pushover_app(token="ad7w7uqezsfd3v81sze7znjhaz1") # Change this to your API key
set_pushover_user(user="GnoCrXCawlXUwcBDFDkhKBUgC1IMSO") # Change this to your user key

• A simple one-line notification that will be delivered within 10 seconds or so (Figure 9-10, at the bottom):

pushover(message = "This is a test message")

• The ability to send silent notifications: They will be delivered to the Pushover app on the person's device, but won't pop up (Figure 9-11):

pushover_quiet(message = "This message doesn't vibrate or pop up on an Apple Watch")

• The ability to send priority notifications: They will be delivered and cause the device to make a noise, even if on mute, as long as you've given the client permission (Figure 9-12, on an Apple Watch):

pushover(message = "High priority message", priority = 1)

• The ability to require the user to acknowledge the notification, or the client will keep notifying them on a regular interval. This is great if you're prone to forgetting that you got a notification. The pushoverr package can also track these acknowledgments, so that you can check to see if you have acknowledged yet (Figure 9-13, on an Apple Watch).
# Send an emergency message
emer_msg <- pushover_emergency(message = "This message will make a noise on an iPhone even with the mute switch on!")

# Check to see if the emergency message was acknowledged
is.acknowledged(emer_msg$receipt)
check_receipt(emer_msg$receipt)

• The ability to set "quiet hours" in your settings, so that Pushover won't wake you up with notifications that could wait until morning.

• Group support: If you have multiple people who you want to be notified, each of them can send you their Pushover user key, and you can create a "group" key that will contact everyone.

• Support for different sounds and the iOS Critical Alerts feature.

I've been a user of Pushover for a number of years, which is why I recommend their service when you need a notification that's reliably delivered and trackable. Over the years, I've used Pushover for the following situations outside of R:

• Server monitoring, making sure that the load on a particular machine doesn't get too high
• Weather alerts
• Alerts through IFTTT.com's network of sources
• A custom button on my website that sent an alert when it was pressed

And I'll likely use it for others. But before that, let's put everything we've discussed together in this chapter and create our ChangeAlert script!

The ChangeAlert Script: Tracking What's Hot, and When Email Tops 130,000,000,000!

Now that we've talked about all of the things you might want to track and how you'd like to track them, let's look at putting it all together. There are two ChangeAlert scripts provided in the book's code download, and we'll walk through both of them. In ChangeAlert-JSON.R, we see how to download JSON data and alert ourselves that something has changed. In ChangeAlert-Rendered.R, we look at a situation where a raw download won't work.
# Get Top hot Reddit Thread
install.packages("jsonlite")
install.packages("later")
library("jsonlite")
library("later")

# Track the change of the top hot
gettophot <- function() {
  hot <- "https://www.reddit.com/hot/.json"
  hot.df <- fromJSON(hot)
  return(hot.df$data$children$data$title[1])
}

tracktophot <- function() {
  # If it doesn't exist, start tracking it.
  ifelse(!exists("tophothist"), tophothist <<- gettophot(), "")
  same <- tophothist == gettophot()
  if (!same) {
    tophothist <<- gettophot()
    cat(" at ", format(Sys.time(), "%a %b %d %X %Y"),
        " New Top Hot: ", gettophot())
  }
  cancelfunc <<- later(tracktophot, 3600)
}
tracktophot()

The preceding code takes an example I mentioned earlier in the book and fully develops it with other concepts we've introduced. It first installs and loads the jsonlite and later libraries. Next, the code creates a function named gettophot(). This function declares the URL for the JSON download of the top hot threads from Reddit and then returns the top hot thread. Theoretically, we could always just call these three lines of code, but by putting them in their own function, we make things a bit more elegant in the following section. In that next section, we create another function, tracktophot(). This function first checks to see if we have a variable named tophothist. If we do not, it creates it and puts the current top hot thread title into it. If we didn't have this line, R would complain that we were referencing a variable that didn't exist when we compare the current top hot thread to our tophothist variable. Next, we ask R to compare the current top hot thread (by calling the gettophot() function) to the history that we've saved. The variable same is now either TRUE (if the current top thread is the same as the history variable) or FALSE (if it's different). If it's TRUE, we don't need to do anything; there hasn't been a change. But if it's FALSE (which we test by using the !same statement), we then need to alert someone to the change.
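As an aside, tracktophot() above can call gettophot() up to three times per run, which means up to three downloads of the same JSON. If you'd rather fetch once per check, here is a lightly reworked sketch with the same behavior otherwise. It assumes gettophot() and the later library from the script above are already loaded.

```r
library(later)

tracktophot <- function() {
  current <- gettophot()                  # one download per run
  if (!exists("tophothist")) tophothist <<- current
  if (current != tophothist) {
    tophothist <<- current
    cat(" at ", format(Sys.time(), "%a %b %d %X %Y"),
        " New Top Hot: ", current)
  }
  cancelfunc <<- later(tracktophot, 3600)
}
```

Storing the fetched title in a local variable keeps the function's network usage, and its API footprint, to a single request per hour.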
In this example, I'm simply writing to the console that, at a given time, the top hot thread changed. I then output what it changed to. Technically, I'm being a bit wasteful here: I'm calling the JSON URL twice and downloading it twice. I could modify my code to store the result temporarily in a variable in the function, but since this only runs infrequently, the difference is pretty minuscule in processing time, bandwidth, and API usage. Finally, I use the later() function to schedule my loop to run every hour (3600 seconds). I then launch it by calling tracktophot(). If I want to cancel my running loop, I can do so by calling cancelfunc(). When I resume, because tophothist is a global environment variable, the comparison will be based on my last unchanged thread title. In thinking about what you would like this type of script to do, you can obviously swap out the console logging for any of the other notification options. For example, changing cat( to pushover( after loading the pushoverr library and setting your user key and API key will cause the message to get pushed to your cell phone instead of written to the console.

Our goal in the second script is to monitor the web page InternetLiveStats.com, let us know when the daily email sent total reaches 130 billion, and notify us with Pushover (with a second version that's pushier than the first). There are a lot of other stats on that page we could also use, but given how much email I feel like I send and receive, the email number seemed darkly comedic.

# Monitor the number of emails sent on the internet; alert me when it's above 130 billion
library("pushoverr")
library(rvest)
library(later)

set_pushover_app(token="ad7w7uqezsfd3v81sze7znjhaz1") # Change this to your API key
set_pushover_user(user="GnoCrXCawlXUwcBDFDkhKBUgC1IMSO") # Change this to your user key

# This function downloads the live stats and checks them. It then
# notifies you if the number of emails sent is above 130,000,000,000
checkmail <- function() {
  system("./phantomjs get_internetlivestats.js")
  page <- read_html("livestats.html")
  node <- html_nodes(page, "span")
  # The number we want, Emails sent Today, is element 22
  # html_text(node)[22]
  num.emails <- as.numeric(gsub(",", "", html_text(node)[22]))
  # We can now alert if that number is above our critical cutoff
  # (130 billion, 130,000,000,000)
  if (num.emails > 130000000000) {
    pushover(message = "It's Above!")
  } else {
    pushover(message = "It's Still Below!")
  }
  cancelfunc <<- later(checkmail, 300)
}
checkmail()

There is a lot going on in this script, including calling another application to do some heavy lifting, so let's walk through it! First, we load our libraries and set our Pushover values. Next, we create a function that downloads the live stats and checks them. This is actually a lot harder than it sounds. Thus far, we've downloaded static web pages, where the data we need lives in the raw HTML files. If you download the raw HTML of internetlivestats.com, you get a page with placeholder values. That's because the authors of internetlivestats use JavaScript to load the values with AJAX calls in the background. This lets them keep the numbers rolling higher and higher as the person views the page, but it also means that the data lives somewhere other than the raw HTML. Our way around this is to emulate what we would do if we went to the web page in our browser, and download the fully rendered page. We can do this using a piece of open source software named PhantomJS (https://phantomjs.org). PhantomJS is a command-line "headless" browser, which will take a URL and render it, saving the output. To instruct PhantomJS in what we need it to do, we use the system() function to call it with a small control script, get_internetlivestats.js, which renders the page and saves it as livestats.html.

The modified version of the function requires that you acknowledge the alert, or it will keep sending:

# This modified function downloads the live stats and checks them. It then
# notifies you if the number of emails sent is above 130,000,000,000.
# It also will require that you acknowledge it, or it will keep sending.
checkmail <- function() {
  system("./phantomjs get_internetlivestats.js")
  page <- read_html("livestats.html")
  node <- html_nodes(page, "span")
  # The number we want, Emails sent Today, is element 22
  # html_text(node)[22]
  num.emails <- as.numeric(gsub(",", "", html_text(node)[22]))
  # We can now alert if that number is above our critical cutoff
  # (130 billion, 130,000,000,000)
  if (num.emails > 130000000000) {
    if (exists("msg")) {
      if (!is.acknowledged(msg$receipt)) {
        pushover(message = "It's Still Above, and No One Acknowledged")
      } else {
        pushover(message = "It's Above, but someone has acknowledged. No further alerts needed.")
      }
    } else {
      msg <<- pushover_emergency(message = "It's Above!")
    }
  }
  cancelfunc <<- later(checkmail, 300)
}
checkmail()

Here we use the pushover_emergency() function to send a high-priority message. We then check, on each run of the function, first whether we've sent an emergency notification before (exists("msg")) and then whether it's been acknowledged (!is.acknowledged...). A few things to note in this code that you may have noticed but not understood: first, you'll see that when we send the emergency pushover, we use the <<- operator to save the result in msg, so that it persists in the global environment between runs of the function.