I always enjoy seeing crossover between statistics and computer science. In fact, one of my very first jobs involved writing code in S+ (the closed-source precursor to R) to support a textbook my professor was writing at the time. These days, machine learning is usually what comes to mind for that mix, but occasionally digital forensics can make use of these techniques as well.
Stochastic forensics
At Black Hat a few weeks ago, I attended a presentation by Jonathan Grier on stochastic forensics. I had visions of Markov models of user activity and malware Monte Carlo simulations.
As it turns out, this wasn’t too far off. Essentially, the idea is that we can infer certain data from a system by looking at its collective characteristics. In other words, we can measure across a large number of individual members and observe the behavior of the body as a whole to draw conclusions. The initial case related to data exfiltration. A client organization wanted to prove that a user had copied a large number of files containing proprietary data to an external drive. Windows doesn’t normally track this information except to a very limited extent in file access times, but even that only records the last time a file was accessed. So subsequent access to a file will overwrite that time stamp and destroy any previous record. For an individual file, then, we might have significant difficulty proving that someone had copied it. Worse, the user in this case had legitimate access to the data, so any single data point would prove nothing.
By taking a statistical view of the system, however, particularly looking at entire directory trees, we can plot a histogram of last access times and compare it to control data (from other directory trees not under suspicion). Under normal usage, most files go untouched and recent accesses are limited to a small set of files. But if a tree has been copied wholesale, such as via a drag-and-drop operation, zipping it up, or some other recursive copy, then the access times would look different. You would (hopefully) see a clear cutoff, with every file in the tree accessed at or after one particular time, followed by a sort of power law distribution reflecting normal access patterns since then.
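To make the histogram idea concrete, here is a minimal Python sketch. It is not Grier's tool or method. The function names, the one-hour bin width, and the "busiest bin" summary are all my own assumptions for illustration. It gathers last access times for a directory tree, bins them, and reports what fraction of the files fall into the single busiest bin.

```python
import os
from collections import Counter


def collect_access_times(root):
    """Return last-access timestamps (seconds since epoch) for files under root."""
    times = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                times.append(os.stat(os.path.join(dirpath, name)).st_atime)
            except OSError:
                continue  # unreadable or vanished file; skip it
    return times


def access_histogram(times, bin_seconds=3600):
    """Count files per time bin; keys are bin start times in seconds."""
    return Counter(int(t // bin_seconds) * bin_seconds for t in times)


def dominant_bin(times, bin_seconds=3600):
    """Return (bin_start, fraction_of_files) for the busiest bin, or None if empty."""
    if not times:
        return None
    hist = access_histogram(times, bin_seconds)
    start, count = hist.most_common(1)[0]
    return start, count / len(times)
```

You would run `dominant_bin(collect_access_times(path))` on both the suspect tree and a control tree and compare the fractions. A much higher fraction in the suspect tree only flags an anomaly worth investigating. It also assumes the file system actually records access times, which is not the case on every configuration.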
As I listened to the presentation, I noted a few weaknesses in his approach: manipulation of time stamps, for example, or other plausible explanations for this type of pattern. In particular, the file system simulator he wrote as an initial test did not strengthen his argument at all, because it essentially only verified the model he coded into the simulation rather than telling us something useful about “real” systems. He went on to explain the improved testing methods he used later, and his responses mostly mollified me: this won’t always work, so you need to test carefully on a system (for example, check whether the AV software overwrites time stamps). And the most it will give you is circumstantial evidence that something happened on the system at that time. Perhaps the user took a legitimate backup, for example. But now you have something to investigate further.
Forensic research
This led me to muse on the nature of research in information security. Sometimes we have a tendency toward the perfectionist fallacy: if it’s not perfect, then it’s worthless. In forensics in particular, this occurs for understandable reasons, because we have definitive standards of proof to meet (e.g. “preponderance of the evidence” in civil trials or “beyond a reasonable doubt” in criminal trials). So of course we really do need to look at the weaknesses of a system or an approach.
But if we find weaknesses, that shouldn’t be the end of the story. Instead, perhaps it can point the way for future research: if you think antivirus scanning will overwrite the time stamps, then test and report it. If you think that comparing access timestamp patterns only identifies anomalies, then say so and identify what sorts of anomalies might generate this pattern. Partial results can still provide value, even if not as much as we’d like. And of course further testing to invalidate a hypothesis or show problems with an approach provides great research value.
The research on stochastic forensics I discussed above will not revolutionize digital forensics. No long-standing large-scale theories will topple. On the other hand, we have an incremental result for other researchers to consider and try to validate or invalidate. We also have an idea that we can try to apply in other areas like network forensics.
Most scientific research advances not in great leaps of intuition and revolutions that wipe the slate clean with an entirely new look at things, but in small evolutionary steps that work us closer to our goals of knowledge and information. If we want to emulate the progress other fields of science have enjoyed, we must treat our discipline as a science and not just an art.






