This is a follow-up to the work that Michael Ludwig (one of our interns from last summer) conducted and is continuing as part of his PhD research.
For more details, please see this about page for the experiment. Thank you.
Musings — mostly about color art, science, and technology
The so-called Twitter Anomaly Detection function for R is excellent but also very minimalistic. The input is a two-column data frame where the first column consists of the timestamps and the second column contains the observations. In addition to a plot, the output is a data frame comprising timestamps, values, and optionally, expected values.
In practice, we usually have some semantic information that we would also like to include in the output, so we do not have to refer back to the original data. Fortunately, there is a quick-and-dirty way to add a description to the outlier data frame.
We start with the annotated data frame containing at least columns with the timestamps, the observations, and factors providing contextual or semantic information on each observation. We then create a simple data frame with just the first two columns, which we pass to the outlier detection function.
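As a minimal sketch of this setup, the two data frames could be prepared as follows. The data are invented for illustration, and the column names `timestamp`, `value`, and `note` are assumptions; only `timestamp` and `note` are implied by the function shown further down.

```r
# Annotated data frame: timestamps, observations, and a semantic note per row.
# All values are made up for illustration.
annotated <- data.frame(
  timestamp = as.POSIXct("2017-01-17 06:50:00", tz = "UTC") + 60 * (0:5),
  value     = c(101, 99, 102, 209, 100, 98),
  note      = factor(c("ok", "ok", "ok", "gear display flashing", "ok", "ok"))
)

# Simple two-column data frame for the detector:
# timestamps first, observations second, same row order as `annotated`.
simple <- annotated[, c("timestamp", "value")]

# The detector call is left as a comment because it needs the
# AnomalyDetection package and a realistically long series:
# anomalies <- AnomalyDetection::AnomalyDetectionTs(simple, plot = FALSE)
```

Keeping the two data frames in the same row order is what makes the later lookup by row index possible.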
We can write a trivial function that for each outlier finds the row index in the simple data frame and looks up the semantic information in the annotated data frame:
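    # series1:  simple data frame (timestamp, value) passed to the detector
    # series2:  annotated data frame with the same rows plus a `note` column
    # outliers: list returned by the detector; `anoms` holds the outliers
    AddDescription <- function(series1, series2, outliers) {
      # nrow() also copes with an empty `anoms` data frame
      quantity <- nrow(outliers$anoms)
      if (quantity < 1) return(NULL)

      result <- NULL
      for (i in seq_len(quantity)) {
        # Row of this outlier in the simple data frame
        rowIndex <- which(series1$timestamp == outliers$anoms$timestamp[i])
        # Look up the note in the same row of the annotated data frame
        newRow <- data.frame(outliers$anoms$timestamp[i],
                             outliers$anoms$anoms[i],
                             as.character(series2$note[rowIndex]))
        result <- rbind(result, newRow)
      }
      colnames(result) <- c("timestamp", "outlier_value", "description")
      return(result)
    }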
This function is just an elementary example. It is easy to extend it so that each outlier carries more detailed information compiled from the full data frame.
| | timestamp | outlier_value | description |
|---|---|---|---|
| 1 | 2017-01-17 06:53:00 | 209 | gear display flashing |
| 2 | 2017-09-19 09:10:00 | 206 | gear shift failure |
| 3 | 2017-11-17 07:26:00 | 211 | check engine lamp on |
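One possible way to attach more than a single description is a vectorized lookup with `match()` instead of a loop. The sketch below is not part of the original function: the name `AddDetails`, its `columns` argument, and the `vehicle` column are hypothetical, and the outlier list is mocked by hand rather than produced by the detector.

```r
# Hypothetical helper: attach any set of columns from the annotated
# data frame to the outliers, using a vectorized lookup.
AddDetails <- function(annotated, outliers, columns = "note") {
  anoms <- outliers$anoms
  if (is.null(anoms) || nrow(anoms) < 1) return(NULL)

  # For each outlier timestamp, find the first matching row in `annotated`.
  # Timestamps are compared as seconds since the epoch.
  idx <- match(as.numeric(as.POSIXct(anoms$timestamp)),
               as.numeric(annotated$timestamp))

  details <- annotated[idx, columns, drop = FALSE]
  rownames(details) <- NULL
  cbind(data.frame(timestamp = anoms$timestamp,
                   outlier_value = anoms$anoms),
        details)
}

# Invented data; `vehicle` is an assumed extra semantic column.
annotated <- data.frame(
  timestamp = as.POSIXct("2017-01-17 06:50:00", tz = "UTC") + 60 * (0:5),
  value     = c(101, 99, 102, 209, 100, 98),
  note      = c("ok", "ok", "ok", "gear display flashing", "ok", "ok"),
  vehicle   = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Mocked detector output with a single outlier (the fourth row).
outliers <- list(anoms = data.frame(timestamp = annotated$timestamp[4],
                                    anoms = 209))

detailed <- AddDetails(annotated, outliers, columns = c("note", "vehicle"))
```

Outliers whose timestamp has no match in the annotated data frame end up with `NA` in the added columns rather than raising an error.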
Dates are a sore point of analytics: they always get you. When no time zone is specified, i.e., tz = "", R assumes the local time zone. In the data frame returned by Twitter's AnomalyDetectionTs function, the time column has UTC as the time zone. Therefore, the following statement is useful after the call to AnomalyDetectionTs:
anomalies$anoms$timestamp <- as.POSIXct(anomalies$anoms$timestamp, tz = "")
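When changing time zones, it can be worth checking which of two effects a conversion has on your data: keeping the same instant and only changing how it is displayed, or keeping the clock reading and moving the instant. The base R sketch below illustrates the difference with a made-up timestamp; the zone `America/Los_Angeles` is only an example and assumes the system's time zone database is available.

```r
# A made-up timestamp labeled as UTC.
x <- as.POSIXct("2017-01-17 06:53:00", tz = "UTC")

# Effect 1: same instant, displayed in another zone.
# Only the tzone attribute changes; the underlying number of seconds does not.
relabeled <- x
attr(relabeled, "tzone") <- "America/Los_Angeles"

# Effect 2: same clock reading, reinterpreted in another zone.
# The text "2017-01-17 06:53:00" is parsed again, so the instant moves.
reinterpreted <- as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"),
                            tz = "America/Los_Angeles")

# Offset between the two instants, in hours.
shift_hours <- as.numeric(difftime(reinterpreted, x, units = "hours"))
```

Comparing `as.numeric()` of a timestamp before and after a conversion shows which of the two effects took place.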
When we think about places we have never visited, we build on other information about the place—the stereotypes—we have gained from variou...