Tag Archives: analysis

Konig: malware, graph theory, and fuzzy hashes

As a small personal research and learning project, I spent a few hours this weekend writing Konig. This is intended to evolve into a framework for investigating relationships between fuzzy hashes (e.g. a corpus of malware gathered with mwcrawler) using graph-theoretical methods. Underneath, it basically just marries NetworkX and ssdeep.

At the moment, the code is fairly barebones: create the hash library based on files in a particular directory, then construct a graph of the relationships between those files where the similarity meets or exceeds a user-specified threshold. Also, please keep in mind that my Twitter bio for a while just said “I write bad code”, and for good reason: I do. The GUI consists purely of a matplotlib window and needs a lot of work. (I have less experience with interfaces than almost anything, so keep your expectations even lower.) I’ve added some very basic information on the properties of the graph (order, density, etc.), as well as the ability to select the connected component that includes a node (file) of interest.
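As a rough illustration of that workflow (not Konig's actual code), the sketch below builds a threshold graph from pairwise similarity scores and then pulls out the connected component around one file. It uses only the standard library so it runs anywhere: `toy_similarity` is a stand-in scorer based on `difflib` and is not the ssdeep algorithm, and the file names and signature strings are made up. In practice the `compare` argument would be a real fuzzy-hash comparison that returns a score from 0 to 100, and the adjacency dict would be a NetworkX graph.

```python
from collections import deque
from difflib import SequenceMatcher
from itertools import combinations


def toy_similarity(a: str, b: str) -> int:
    """Stand-in scorer returning 0-100; NOT the ssdeep algorithm."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_graph(signatures, threshold, compare=toy_similarity):
    """Return an adjacency dict linking items whose score meets the threshold."""
    graph = {name: set() for name in signatures}
    for left, right in combinations(signatures, 2):
        if compare(signatures[left], signatures[right]) >= threshold:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def component_of(graph, start):
    """Breadth-first search for the connected component containing start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


# Hypothetical signatures keyed by file name (illustrative strings only).
signatures = {
    "a.exe": "abcdefghijklmnopqrstuvwxyz",
    "b.exe": "abcdefghijklmnopqrstuvwxy0",
    "c.exe": "0123456789",
}
graph = build_graph(signatures, threshold=90)
cluster = component_of(graph, "a.exe")
print(sorted(cluster))
```

Comparing every pair is quadratic in the number of files, which is tolerable for a few thousand samples but would need pruning for a much larger corpus.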

Example output:

kmaxwell@gauss:~/src/konig$ python konig.py -d ~/data/mwcrawler/unsorted/PE32 -t 90 -i PE32.json
Loading saved hash database
Calculating fuzzy hashes for all files in /home/kmaxwell/data/mwcrawler/unsorted/PE32...
Creating graph structure for files with similarity >= 90...
Name:
Type: Graph
Number of nodes: 2932
Number of edges: 265625
Average degree: 181.1903
Graph density: 0.0618185990375
Preparing plot of graph structure...
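
The average degree and density in that output follow directly from the node and edge counts, assuming the usual definitions for a simple undirected graph. A small check, with helper names invented for this example:

```python
def average_degree(nodes: int, edges: int) -> float:
    # Each undirected edge adds one to the degree of two nodes.
    return 2 * edges / nodes


def density(nodes: int, edges: int) -> float:
    # Share of the n * (n - 1) / 2 possible undirected edges that exist.
    return 2 * edges / (nodes * (nodes - 1))


nodes, edges = 2932, 265625
print(f"Average degree: {average_degree(nodes, edges):.4f}")
print(f"Graph density: {density(nodes, edges):.10f}")
```

A density of about 0.06 means roughly 6% of all possible file pairs scored at or above the threshold.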

Konig screenshot

The goals here include refreshing my knowledge of graph theory, as the last time I seriously studied this stuff, I think the OJ Simpson verdict hadn’t come back. Also, this code will help pave the way for some related work I have slated to use mwcrawler and vxcage together. In fact, I really think of Konig as a proof-of-concept implementation to throw away before doing something more useful and robust.

Getting into the guts of mwcrawler

Earlier this week, my buddy Ken Pryor mentioned a project with which I had no prior familiarity:

So I went over and dug into mwcrawler. From the project README:

mwcrawler is a simple python script that parses malicious url lists from well-known websites (i.e. MDL, Malc0de) in order to automatically download the malicious code. It can be used to populate malware repositories or zoos.

It turns out that it really is pretty simple and hackish, which fits my needs perfectly. This is all a very experimental side project just to keep me amused during the (relatively) cold weather here in Texas.

Given how much I already love Github, I forked the project, then made a few improvements to allow for the use of a proxy (for OPSEC reasons) and to specify a dump directory from the command line. Requiring the user to modify source just to change config options works fine for alpha, but a little bit of polish goes a long way. I’ve also started implementing some logging to keep the metadata (like source URLs for each file). And yes, I’ve submitted pull requests, but neither mine nor the user agent randomization patch from Ben Jackson has gotten any response from the project owner. Hopefully that will change now that the holidays have finally run their course.
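
For readers who want a picture of what command-line proxy and dump-directory options can look like, here is a minimal Python 3 sketch. It is not the patch submitted to mwcrawler: the flag names `--proxy` and `--dumpdir`, the default values, and the helper names are all hypothetical.

```python
import argparse
import urllib.request


def parse_args(argv=None):
    """Parse hypothetical crawler options (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Hypothetical crawler options")
    parser.add_argument("--proxy", default=None,
                        help="proxy URL, e.g. http://127.0.0.1:8118")
    parser.add_argument("--dumpdir", default="malware",
                        help="directory for downloaded samples")
    return parser.parse_args(argv)


def build_opener(proxy):
    """Route HTTP and HTTPS requests through the proxy when one is given."""
    if proxy is None:
        # The default opener still honours proxy environment variables.
        return urllib.request.build_opener()
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


args = parse_args(["--proxy", "http://127.0.0.1:8118", "--dumpdir", "/tmp/samples"])
opener = build_opener(args.proxy)
print(args.dumpdir, args.proxy)
```

Nothing is fetched here; a crawler would call `opener.open(url)` so that every request goes through the configured proxy.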

Now that I have all this data, I want to do something with it. Just for messing around, I went with the old standby of ssdeep to find relationships. That doesn’t mean it’s a final step at all; this weekend, I’ll run the samples through the VirusTotal API, for example, to classify known samples by hash, and perhaps also incorporate something like pyew for clustered analysis to pull out interesting features. mwcrawler also features integration with thug, which I’ve not started running yet. Some bugs still exist, like unhandled exceptions when the script can’t reach the page or dependence on the semi-deprecated Beautiful Soup 3.

But my current tiny little repository includes 227 MB in 344 PE32 executables (not counting other file types like archives and such). As an extremely simple preview, even basic fuzzy hashing as mentioned above creates some interesting clusters (graph generated with awk and Maltego):

mwcrawler-ssdeep
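
The awk step is not shown here, so the following is a Python stand-in for the same idea rather than the script actually used: turn pairwise match lines into a de-duplicated edge list that a graphing tool could import. It assumes each match line has the shape `<file A> matches <file B> (<score>)`; the sample lines are invented, not real results.

```python
import csv
import io
import re

# Assumed line shape: "<file A> matches <file B> (<score>)"
MATCH_LINE = re.compile(r"^(?P<left>.+) matches (?P<right>.+) \((?P<score>\d+)\)$")


def parse_matches(lines, threshold=0):
    """Yield (left, right, score) once per unordered pair at or above threshold."""
    seen = set()
    for line in lines:
        found = MATCH_LINE.match(line.strip())
        if not found:
            continue  # skip anything that is not a match line
        score = int(found["score"])
        pair = frozenset((found["left"], found["right"]))
        # Drop weak matches, self-matches, and the mirrored copy of a pair.
        if score < threshold or len(pair) < 2 or pair in seen:
            continue
        seen.add(pair)
        yield found["left"], found["right"], score


# Hypothetical sample lines, not real results.
sample = [
    "a.exe matches b.exe (97)",
    "b.exe matches a.exe (97)",
    "a.exe matches c.exe (40)",
    "not a match line",
]
edges = list(parse_matches(sample, threshold=50))

# Write the edges as CSV text with a header row.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source", "target", "score"])
writer.writerows(edges)
print(buffer.getvalue())
```

Keeping the score as a third column lets the graphing tool weight or filter edges later without re-running the comparison.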

Kent doctrine for security intelligence analysis

I’ve said before that log management matters, but log analysis matters more. Extracting and communicating useful information (analysis) requires collecting and storing your security data as well as processing the data quickly. But having all the data available won’t matter to anybody except auditors if you don’t use it in ways that inform good decisions. Mike Rothman of Securosis expressed this exceptionally well in his preview of the upcoming RSA Conference:

You will see a bunch of vendors talking about their new alerting engines taking advantage of these cool new data management tactics, but at the end of the day, it’s not how something gets done – it’s still what gets done.
So a Hadoop-based backend is no more inherently helpful than that 10-year-old RDBMS-based SIEM you never got to work. You still have to know what to ask the data engine to get meaningful answers. Rather than being blinded by the shininess of the BigData backend focus on how to use the tool in practice. On how to set up the queries to alert on stuff that maybe you don’t know about.

To paraphrase Socrates, unexamined data are not worth collecting. So analysis methodology and critical thinking skills matter. Rothman is spot on with this: the value of big data tech comes when you need to grow past the capabilities that traditional SIEM and RDBMS provide. By way of analogy: if you don’t understand algebra, then don’t take a course in calculus until you have the basic prerequisites down. You’ll just frustrate yourself and waste your tuition dollars.

Sherman Kent (image provided by CIA)

In this vein, then, I appreciated the pointer from the OSINT and analysis training firm Treadstone 71 to a CIA paper on the background and work of Sherman Kent, the “father of intelligence analysis”.

He promoted an analytic doctrine that boils down to nine key points, listed in the CIA paper above. That doctrine applies across domains, not just for the sorts of military and geopolitical analysis we expect from government intelligence agencies. I highly recommend that everyone read at least that section of the paper, but here are some applications for those of us involved in security intelligence analysis, especially in the private sector.

  1. Focus on Policymaker Concerns: What keeps your management up at night? Hopefully security isn’t the only thing, of course. So assuming that your CxOs understand the general threat landscape, analysts need to ensure that they track relevant areas that can lead to useful changes and decisions at strategic and tactical levels.
  2. Avoidance of a Personal Policy Agenda: Many analysts focus on threats that concern them for reasons outside of their organization. Maybe they disagree with the politics of the Occupy movement and overemphasize threats to entirely unrelated organizations, or worry about APT China because of Sinophobia rather than a reasoned assessment of the situation. Or maybe they want to drive decision makers to a particular tech solution. Even worse, they may use their analyses as weapons for corporate political plays. Doing that represents a disservice to the organization and an unprofessional approach.
  3. Intellectual Rigor: This area stands as-is: “Estimative judgments are based on evaluated and organized data, substantive expertise, and sound, open-minded postulation of assumptions. Uncertainties and gaps in information are made explicit and accounted for in making predictions.”
  4. Conscious Effort to Avoid Analytic Biases: None of us can completely avoid cognitive bias, but we can make sure we understand it and try to correct for it where possible. That principally means application of the scientific method. As previously noted, whether or not faith and dogma have a place in one’s personal life, they certainly do not in one’s professional analyses.
  5. Willingness to Consider Other Judgments: Fight for your ideas, but “playing devil’s advocate” should rest on a better intellectual basis than simply spreading FUD. Recognize that others may in fact know more than you do or have insights that can help you.
  6. Systematic Use of Outside Experts: In addition to seeking out and understanding the work of other analysts, don’t restrict yourself solely to your field or even industry. Work with a community and keep bringing in fresh concepts from other disciplines.
  7. Collective Responsibility for Judgment: Eventually, your team will produce a report. You may not have agreed with everything that went into it, but that’s the way the sausage gets made. Once that report goes to its audience, support it. Throwing the rest of your analysis team under the bus by telling the audience “I told them so” doesn’t actually make you look smarter. It makes you look unprofessional. That doesn’t mean that you should ignore all criticism; rather, it means that you should be willing to take lumps with the rest of the group. If someone asks you for your opinion, give it – but clarify that it doesn’t represent the considered opinion of the rest of the team.
  8. Effective Communication of Policy-Support Information and Judgments: Analysts need three core skills: domain expertise, critical thinking skills, and communication ability. This includes targeting your analysis to the level appropriate to your audience. You must be able to summarize your findings in understandable and accurate ways. And you must be able to handle points of uncertainty properly.
  9. Candid Admission of Mistakes: You won’t always be right. Admit it, and review past work to see what you can learn for improvement the next time. “Try again. Fail again. Fail better.”

Security intelligence analysts should learn from previous work, instead of simply trusting in their own domain expertise and innate intelligence. Dr. Kent led the way, and even we non-spooks can still learn from his work.

Physical vs virtual document analysis

"it ain't a REAL paper without mentioning an alien invasion" by fling93

When I need to read and analyze a document, I usually print it out to a hard copy so I can highlight and make marginal notes. I can see it from a larger scale, taking in a whole page or even multiple pages at once. This works particularly well for long-form articles, whether blog posts or essays or other documents. Once I have those notes, I can synthesize my analysis where appropriate (or just summarize properly).

But at the same time, I’d rather not have to do it that way. The hard copies inevitably turn into clutter or just more “stuff” I need to file and store physically. Even though I find this method highly productive, it feels highly wasteful of physical (natural) resources. Instead of printing out documents for markup, we should have a method of doing this on our desktops: perhaps an extra layer on top of the document (e.g. PDF) where we can attach visible notes, highlight connections between two sections, and even sketch in simple diagrams.

If anyone knows of a tool that can do this, please let me know. The trees thank you.
"Lord of The Rings Tree"

Review: The Analyst’s Cookbook (Volume 2)

"RAF Binbrook Officers kitchens" by DigiTaL~NomAd

Due to my interest in threat intelligence and security monitoring, I picked up the Kindle Edition of The Analyst’s Cookbook (Volume 2) while looking for discussions on intelligence analysis. Some of the material might prove somewhat useful, though that varies.

The book consists of apparent term papers written by college seniors or, more likely, graduate students at the Mercyhurst College Institute of Intelligence Studies. While of high quality, they all follow a similar structure:

  • Description of the technique
  • Strengths and Weaknesses
  • How-To (high level)
  • Personal Application (case study)

This structure provides a lot of value, though some techniques have better explanation and review than others. Some of the techniques, however, seem ill-suited to intelligence analysis (e.g. patent analysis). While I already knew about some of the techniques, the in-depth review still had value. In particular, I liked the discussions on the following:

  • Delphi Method Analysis
  • Social Network Analysis
  • Social Media Collection
  • Game Theory
  • Red Teaming Analysis

For the $5 I spent on the Kindle version, I found it useful and wish they had Volume 1 online as well.

Analysis failures in .IL.US SCADA incident

"Blue sky" by Bousure

The infosec community this week has buzzed with news of an embarrassing non-incident at an Illinois water plant. You can get the whole story at Threat Level, but the gist goes something like this: a water treatment plant experienced a failure in one of its pumps. The staff initially treated it as another common mechanical failure, but at some point someone saw logs that indicated a remote login from a Russian IP address and decided that one event caused the other. Subsequent investigation by a government terrorism and intelligence fusion center revealed that the two events had no relation and we can all flush our toilets relatively free of anxiety. The story has much more detail, all of which will cause moments of extreme facepalm.

Others have addressed the ongoing questions around SCADA security and vendor FUD. But I want to discuss some common failures in threat analysis and incident response that this particular case has highlighted.

Someone at some point jumped to the conclusion that a security incident had occurred. Before I even read the article, the precise phrasing “out of an abundance of caution” seemed like the cliché of the moment, and sure enough, a water district trustee had used those precise words. Apparently, the sight of a .ru address led to fantasies of a “digital Red Dawn” scenario and thus escalation to the intelligence and counter-terrorism community. But equipment failures occur much more frequently than SCADA intrusions by any measure – by many orders of magnitude. As doctors know, when you hear hooves, you should look for a horse, not a zebra.

Additionally, no one validated the login with the user. In this case, a contractor had logged in from an unusual location, and investigating that could make sense for a small water treatment plant in the Midwestern United States. But an anomalous event may have a good explanation; a quick phone call or email to the user would have straightened this out quickly.

No other corroborating evidence appears to have existed. If an attacker had logged in with stolen credentials five months earlier and somehow caused one pump to fail many months later, additional artifacts should have existed: perhaps some exploratory probing inside the network, or other logins, or (one might think) greater damage. In other words: if an investigator has a hypothesis to explain one data point, then he should seek other data that could confirm it. The lack of that data should cause him to re-examine his hypothesis. In particular, the timeline alone should have given pause based on critical thinking (the core of good troubleshooting).

I have no particular knowledge of this specific incident. Therefore, while I will talk about possible root causes for the analysis failure, I can only base it on my experience in similar situations across a number of organizations.

  • Undertrained analysts: I don’t mean the tech who originally noted the address, although as noted above, he should have thought about the timeline. But the analysts at the fusion center clearly lacked the training, judgment, and experience for even this simple scenario.
  • Poor validation workflow: Once the center received the report and, I assume, a first-level analyst looked at it, more senior analysts should have validated it. Either those senior analysts never saw the analysis before it got out to the public, or they have even less qualification for their roles than front-line analysts.
  • Institutional culture: In many such centers (whether private or public sector), the culture rewards analysts who find “things”. After all, if the center receives enough data, then many organizations will assume the data contain lots of evil. This can happen when the management does not construct their metrics with care, for example. But human nature also plays a part: finding evil is fun and sexy; finding banality is not.

I’d draw two core lessons here. First, train analysts in analysis, not just technology. Many organizations focus on spending huge amounts on systems and possibly data feeds, trusting in “smart people” who understand the tech to know how to analyze it. Certainly, analysts must have domain-specific technical qualifications, but the mental toolbox matters every bit as much. As an example, one of the best texts ever for analysts of any stripe is Turning Numbers into Knowledge by Dr. Jonathan G. Koomey. The book doesn’t focus on statistics or any particular methodology. Instead, it discusses the mindset of an analyst and the sorts of thinking required to do this successfully. Organizations need to focus on this kind of training to help analysts sift through data to find useful information.

Second, government doesn’t have special powers to find and understand threats. While clearly this problem has existed for a long time, it bears repeating at a time when the drive to suck up greater and greater amounts of data for mining and analysis threatens to infringe on core civil liberties. Seeking out evil, whether online or on the streets, doesn’t (necessarily) mean lots of data.

It means getting relevant data and knowing what to do with it. And clearly, we still have a long way to go.