Category Archives: Notes

Brain dump of DFIR and network security research ideas

Maybe I could get more of these done with this.

I’ve seen several people talk about lacking ideas for research projects, often around DFIR or network security. Personally, I have the opposite problem: endless ideas for projects, often with the barest hint of a start, but not enough time to pursue them all. So I thought I’d publish a bit of a brain dump. I actually have made good progress on a few of these, and I have concrete plans around others (beyond just “wouldn’t it be cool if…”), but in any case I’d love to see other people pick them up and run with them.

If you do happen to get interested in any of the following, I wouldn’t mind a quick note to touch base to see about possibilities for collaboration or at least an acknowledgement in whatever you publish. Don’t interpret that as any sort of requirement, though; ideas have no value without execution, so all the hard work hasn’t even begun.

  • Malware
    • Classification across a large corpus
    • Automated IOC extraction and publication
  • Threat Actors
    • Profiling systems, particularly based on OSINT
    • Underanalyzed crime groups (e.g. drug cartels’ involvement in malware, spam, and fraud)
    • Hacktivism motivations and methods
  • Passwords
    • Cracking lab setups
    • Useful entropy calculations
  • Quantitative analysis of incidents
    • DDoS attacks (hard to get numbers on these)
    • Defacements and low-level leaks
  • Active Defense
    • Honeypots and honeyclients
    • Vocabulary or taxonomy on various methods
    • Callback Trojans in documents
    • C2 / RAT vulnerability research

CFAA and foreign computers

As part of some research into “active defense”, I decided to review the actual text of the Computer Fraud and Abuse Act (CFAA). This law has a number of well-documented problems, which I don’t plan to address in this post, partly because IANAL and partly because I want to focus on how the Act describes a “protected computer”:

the term “protected computer” means a computer—
(A) exclusively for the use of a financial institution or the United States Government, or, in the case of a computer not exclusively for such use, used by or for a financial institution or the United States Government and the conduct constituting the offense affects that use by or for the financial institution or the Government; or
(B) which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States

(Emphasis mine.) Specifically, I want to think about the implications related to a “computer located outside the United States”. Assuming that such a system doesn’t affect US commerce or communications (whether or not that activity takes place within the US), would it fall under the definition of a protected computer? For example, if a US person gains access to a command-and-control system in another country and takes some action that would otherwise certainly violate the CFAA were the C2 in the United States, perhaps the CFAA does not apply. Or maybe somebody accesses an exploit server or malware host to gather additional information: does the CFAA cover this? (Other statutes, particularly in the host country, may apply, so don’t do anything that might get you thrown in prison, kids. We’re just thinking about what the law may cover.)

Google may have done something akin to this when investigating the Aurora incident. According to the New York Times story after the incident, Google:

managed to gain access to a computer in Taiwan that it suspected of being the source of the attacks. Peering inside that machine, company engineers actually saw evidence of the aftermath of the attacks, not only at Google, but also at at least 33 other companies, including Adobe Systems, Northrop Grumman and Juniper Networks, according to a government consultant who has spoken with the investigators.

(Emphasis mine again.) So, according to this story, Google somehow accessed a system that presumably did not belong to them. Depending on that system’s function, perhaps this didn’t violate the CFAA. Certainly, neither the USSS, the Department of Justice, nor Secretary Clinton publicly expressed concern about this. As far as we know, Google’s engineers didn’t shut down the system or otherwise damage it, so while they could have concerns about Taiwanese law if they actually did any of this, they might not have to worry about the CFAA.

This post does not advocate so-called hack back retaliation, but my initial non-lawyerly analysis makes me wonder if other people already depend on this interpretation for various sorts of activities.

Maltrieve: retrieving malware for research

As I continued to hack on mwcrawler over the last month, I found that it didn’t really meet my needs for various reasons: slowness, difficulty of maintaining and adding sources, repeated grabbing of the same URL, and lack of response from the original author. So I’ve rewritten it and released Maltrieve, which (as the name indicates) retrieves malware directly from the sources listed at a number of sites. Improvements listed in the README include:

  • Proxy support
  • Multithreading for improved performance
  • Logging of source URLs
  • Multiple user agent support
  • Better error handling
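
To make a few of those items concrete, here is a minimal sketch (not Maltrieve’s actual code) of skipping repeated URLs, fetching in a thread pool, and logging the source URL of each sample. The `retrieve_all` name, the `fetch` callable, and the choice to key results by SHA-256 are all assumptions for illustration; a real fetcher would wrap an HTTP client configured with the proxy and user agent.

```python
import hashlib
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retriever-sketch")


def retrieve_all(urls, fetch, max_workers=4):
    """Fetch each distinct URL once; return {sha256_hex: source_url}.

    `fetch` is a hypothetical callable that takes a URL and returns bytes.
    """
    unique_urls = list(dict.fromkeys(urls))  # drop repeats, keep order
    results = {}

    def worker(url):
        try:
            return url, fetch(url)
        except Exception as exc:  # one dead source should not stop the run
            log.warning("failed to fetch %s: %s", url, exc)
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, body in pool.map(worker, unique_urls):
            if body is None:
                continue
            digest = hashlib.sha256(body).hexdigest()
            if digest not in results:  # same content from two URLs counts once
                results[digest] = url
                log.info("saved %s from %s", digest, url)
    return results
```

Passing the fetcher in as an argument keeps the threading and bookkeeping testable without touching the network.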

Right now, Maltrieve only looks at four meta-sources because two of the six in mwcrawler appear offline. But I have at least four more on deck, and mwcrawler didn’t parse all of its meta-sources correctly in any case. I also know of a few bugs that I haven’t figured out how to squash yet, but the core functionality works and it needs a broader audience to bang on it. Thus, I’ve tagged this version “beta-1”. Don’t rely on this for serious production, please.

If you use it, please let me know just so I can bask in the warm glow of productivity. The project itself remains under the GPL, of course. Suggestions, bug reports, etc. also would make me happy, whether via issues and pull requests on Github, contacting me on Twitter, or comments here.

Konig: malware, graph theory, and fuzzy hashes

As a small personal research and learning project, I spent a few hours this weekend writing Konig. This is intended to evolve into a framework for investigating relationships between fuzzy hashes (e.g. a corpus of malware gathered with mwcrawler) using graph-theoretical methods. Underneath, it basically just marries NetworkX and ssdeep.

At the moment, the code is fairly barebones: create the hash library based on files in a particular directory, then construct a graph of the relationships between those files where the similarity exceeds a user-specified threshold. Also, please keep in mind that my Twitter bio for a while just said “I write bad code”, and for good reason: I do. The GUI purely consists of a matplotlib window and needs a lot of work. (I have less experience with interfaces than almost anything, so keep your expectations even lower.) I’ve added some very basic information on the properties of the graph (order, density, etc.), as well as the ability to select the connected component that includes a node (file) of interest.

Example output:

kmaxwell@gauss:~/src/konig$ python konig.py -d ~/data/mwcrawler/unsorted/PE32 -t 90 -i PE32.json
Loading saved hash database
Calculating fuzzy hashes for all files in /home/kmaxwell/data/mwcrawler/unsorted/PE32...
Creating graph structure for files with similarity >= 90...
Name:
Type: Graph
Number of nodes: 2932
Number of edges: 265625
Average degree: 181.1903
Graph density: 0.0618185990375
Preparing plot of graph structure...

Konig screenshot
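
For readers who want the idea without the dependencies, the thresholded similarity graph described above can be sketched with the standard library alone. This is not Konig’s code: `difflib.SequenceMatcher` stands in for ssdeep comparison, a dict of sets stands in for a NetworkX graph, and every function name here is made up for illustration.

```python
from collections import deque
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in score from 0 to 100; Konig itself compares ssdeep hashes."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def build_graph(samples, threshold):
    """samples maps a name to its content; returns an adjacency dict of sets."""
    graph = {name: set() for name in samples}
    for left, right in combinations(samples, 2):
        if similarity(samples[left], samples[right]) >= threshold:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def density(graph):
    """Edges present divided by edges possible in an undirected graph."""
    order = len(graph)
    if order < 2:
        return 0.0
    edges = sum(len(neighbors) for neighbors in graph.values()) / 2
    return 2 * edges / (order * (order - 1))


def component(graph, start):
    """Breadth-first search for the connected component containing start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node] - seen:
            seen.add(neighbor)
            queue.append(neighbor)
    return seen
```

The density formula is the same one behind the example output: 2 × 265625 / (2932 × 2931) ≈ 0.0618, and 2 × 265625 / 2932 ≈ 181.19 for the average degree.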

The goals here include refreshing my knowledge of graph theory, as the last time I seriously studied this stuff, I think the OJ Simpson verdict hadn’t come back. Also, this code will help pave the way for some related work I have slated to use mwcrawler and vxcage together. In fact, I really think of Konig as a proof-of-concept implementation to throw away before doing something more useful and robust.

Getting into the guts of mwcrawler

Earlier this week, my buddy Ken Pryor mentioned a project with which I had no prior familiarity:

So I went over and dug into mwcrawler. From the project README:

mwcrawler is a simple python script that parses malicious url lists from well-known websites (i.e. MDL, Malc0de) in order to automatically download the malicious code. It can be used to populate malware repositories or zoos.

It turns out that it really is pretty simple and hackish, which fits my needs perfectly. This is all a very experimental side project just to keep me amused during the (relatively) cold weather here in Texas.

Given how much I already love Github, I forked the project, then made a few improvements to allow for the use of a proxy (for OPSEC reasons) and to specify a dump directory from the command line. Requiring the user to modify source just to change config options works fine for alpha, but a little bit of polish goes a long way. I’ve also started implementing some logging to keep the metadata (like source URLs for each file). And yes, I’ve submitted pull requests, but neither mine nor the user agent randomization patch from Ben Jackson has gotten any response from the project owner. Hopefully that will change now that the holidays have finally run their course.

Now that I have all this data, I want to do something with it. Just for messing around, I went with the old standby of ssdeep to find relationships. That doesn’t mean it’s a final step at all; this weekend, I’ll run the samples through the VirusTotal API, for example, to classify known samples by hash, and perhaps also incorporate something like pyew for clustered analysis to pull out interesting features. mwcrawler also features integration with thug, which I’ve not started running yet. Some bugs still exist, like unhandled exceptions when the script can’t reach the page or dependence on the semi-deprecated Beautiful Soup 3.

But my current tiny little repository includes 227 MB in 344 PE32 executables (not counting other file types like archives and such). As an extremely simple preview, even basic fuzzy hashing as mentioned above creates some interesting clusters (graph generated with awk and Maltego):

mwcrawler-ssdeep

Bodhi drop

Jesse Ventura

The Body, not The Bodhi

I started using Bodhi Linux a couple of months ago when the Linux Format magazine shipped it on their DVD. I liked the extremely minimalistic approach: the default install only includes the core OS, a browser, and the Enlightenment window manager that inspires its name. Happily, my Dell Precision M4600 ran it just fine: no problems with networking, sound, or anything.

Even with all that said, though, I’ve decided to move back to Xubuntu. Two primary reasons drive this decision.

First, the repositories feel insecure. Much like the rest of Ubuntu, they don’t use HTTPS (which probably makes sense given the state of SSL certificate checking by most non-browser applications), but they also intentionally do not sign the packages. This strikes me as a bad idea because now you have absolutely no assurance whatsoever that the package you get is what you expected to get. Using encryption properly doesn’t solve all security problems but we can’t handwave it away, either. Other than this, the distribution itself has worked pretty well thus far.

Second, and perhaps more pragmatically, the rest of my family already uses Xubuntu (because I don’t want to deal with Unity or Amazon lenses). It looks nice for them, does everything they want without much extra work on my part beyond installing Google Chrome, and I just find it easier to stick with one. Yeah yeah, monoculture and homogeneous networks. But it makes my life easier, and I’ve reached the point where running Linux has less to do with cool factor or trying new things and more about getting stuff done quickly and effectively with a minimum of inefficiency. I don’t see moving them to Bodhi as the solution, either.

Far pointers: threat intel concepts and CIF-Maltego edition

Not Grover, although Andy Grove ran Intel whose segmented architecture made them necessary… wow, was Jim Henson trying to tell us something?

I wrote a post on the Verizon Business Security Blog titled Concepts in Sharing Threat Intelligence. You should read it; I hope you like it. Comments over there, please! It makes my bosses happy when you read and comment on my stuff there. And when they’re happy, I’m happy. And when I’m happy, everybody[1] is happy.

Maltego and CIF

So as part of my recent work on all things CIF, I wrote a Maltego transform with a little help from the fantastic Andrew MacPherson. Assuming you already know how to use both, you’ll have no trouble with this.

In Maltego, in the menu bar near the top, select Manage > Local Transforms. You can call it whatever you like, such as something imaginative like “CIF lookup”, but be sure to specify the “Input entity type” as an IPv4 address. The transform set doesn’t really matter, I don’t believe, but I put it under “IP owner detail” because that seemed to make the most sense to me. Then point Maltego at the script and it should work. You’ll need to have the CIF client in /usr/local/bin or otherwise change the Popen() call in the script.
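
If you’d rather see the moving parts, a local transform of this kind boils down to roughly the following. Treat it as a sketch under assumptions, not the script itself: the `maltego.Phrase` entity type, the function names, and passing each line of client output through unparsed are my choices for illustration, and real parsing depends on what the CIF client prints.

```python
import subprocess
import sys
from xml.sax.saxutils import escape

CIF_CLIENT = "/usr/local/bin/cif"  # location mentioned above; change if needed


def build_response(values, entity_type="maltego.Phrase"):
    """Wrap each value in the XML envelope Maltego reads from stdout."""
    entities = "".join(
        '<Entity Type="%s"><Value>%s</Value></Entity>' % (entity_type, escape(v))
        for v in values
    )
    return (
        "<MaltegoMessage><MaltegoTransformResponseMessage>"
        "<Entities>%s</Entities>"
        "</MaltegoTransformResponseMessage></MaltegoMessage>" % entities
    )


def query_cif(address):
    """Run the CIF client for one address and return its non-empty lines.

    Assumption: lines are passed through as-is rather than parsed.
    """
    proc = subprocess.Popen([CIF_CLIENT, "-q", address],
                            stdout=subprocess.PIPE, universal_newlines=True)
    output, _ = proc.communicate()
    return [line.strip() for line in output.splitlines() if line.strip()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Maltego hands the entity value (here, an IPv4 address) to the script
    # as the first command-line argument.
    print(build_response(query_cif(sys.argv[1])))
```

Escaping the values matters because anything the client prints ends up inside XML that Maltego has to parse.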

I have plans for more Maltego transforms (e.g. VirusTotal), but if you run into any issues with this one, or want something changed, please let me know. This will work just fine with Maltego Community Edition, by the way, but I highly recommend buying a Maltego commercial license if you’re doing anything serious with it. The folks there are incredibly responsive and helpful and they deserve something for all their hard work if you’re using it.

[1]: For small values of “everybody”.

Terrible Presentation Stress Disorder: “Finding APT with Big Data”

Yesterday, I attended the North Texas ISSA Meeting with my friends Michelle and Ryker. The talk carried the fascinating title Threatscape 2012: Finding Advanced Persistent Threats with ‘Big Data’ Analysis and Correlation, presented by A.N. Ananth, CEO of Prism Microsystems, and one of the leaders of Global DataGuard. (I think the actual presenter was not the person listed, though I might have missed that.) Among the lessons I “learned”:
Condescending Wonka wants to hear all about Big Data and the APT

  • Doing recon on a target is what makes the APT “advanced”, especially if you use LinkedIn to figure out more about a company.
  • “0 day” attacks are the Ebola and bird flu of information security.
  • 100 Gb (sic) of data is Big Data.
  • All the traffic to your web site is Big Data. But we’ve been dealing with Big Data like that since the 1970s, so we know how to handle it.
  • Log data from 25 servers is Big Data, because Big Data doesn’t really have anything to do with how much data you have. (This theme came up a lot.)
  • Attackers only care about customer records, so clearly the talk focused heavily on real-world experience with the APT.
  • The Verizon DBIR[1] mostly only covers North America.
  • Log analysis is advanced behavior correlation analytics.
  • You must do regression testing on your network behavior adaptive learning modeling.
  • Comparing one million points of data is big data but you can use metadata for higher level indicators.
  • Detecting Stuxnet would have been obvious because it’s a new process.
  • Databases suck for analyzing security data.

I swear I’m not winking at you. That’s an eye twitch from TPSD (Terrible Presentation Stress Disorder). Because lunchtime has arrived and I feel a little frisky now that I’ve had my Dr Pepper, let’s take just a few examples to demonstrate the already-obvious cluelessness of this presentation.

First, do a bit of basic research before you cite anything. The Verizon DBIR states in the Executive Summary on page 2:

We also welcome the Australian Federal Police (AFP), the Irish Reporting & Information Security Service (IRISS), and the Police Central eCrimes Unit (PCeU) of the London Metropolitan Police. These organizations have broadened the scope of the DBIR tremendously with regard to data breaches around the globe.

In addition, Verizon performs data breach investigations around the world, not just domestically. Page 12 shows the countries in which confirmed breaches occurred as part of the analyzed caseload. I don’t believe we published data showing the specific geographic distribution of cases, but the statement that the DBIR mostly covers only North America has no support.

Second, I’ve written a good bit here about the APT, and others have done so far more extensively. The APT – nation-state threat actors with significant cyber capabilities, usually meaning “China” or similar when used cluefully – doesn’t care so much about customer records. After all, if you own a significant chunk of the national debt of the United States, credit card numbers are small fry. When you start talking about research plans, sensitive business documents, source code, and the like, now you’ve started to address the target assets. That’s not to say that nobody cares about customer data, of course. Look at the tremendous amount of fraud coming from all over the world, largely but not exclusively centered in Russia and Eastern Europe, not to mention the “hacktivism” related breaches in 2011.

Third, while basic security measures would certainly prevent most common breaches, this doesn’t hold true for truly advanced attacks (by definition). If you think that any system that simply monitors for new processes would detect Stuxnet in an obvious manner, either you don’t really know much about enterprise monitoring or malware, or you are lying. I will charitably assume the former and recommend that you sit down and buy a beer for an actual incident responder or malware analyst to get the real story.

Finally, Big Data absolutely does have something to do with the volume of your data, though that’s not the only factor involved. The presenters correctly stated in the midst of their chaotic confusion that Big Data means data for which traditional RDBMS and similar systems just don’t work. That doesn’t mean 100 data points or even 100 gigabytes (again, charitably assuming a typo here). It means that you have so much data arriving so quickly and in such different forms (schemas) that you can’t simply stream it into a traditional database. This differs significantly from data science and analytics in which we try to find patterns and anomalies in the data, sometimes with advanced methods like machine learning and distributed computation. These two concepts aren’t identical, they’re orthogonal: you may perform analytics on smaller data sets, or you may have a very large data set that maps to well-understood models. The phrase “regression testing on your network behavior adaptive learning modeling” is gobbledygook.

I could go on, but really, the only other thing I want to say to the gentlemen who presented this useless waste of an hour is:
Zoidberg: Your presentation is bad and you should feel bad!


[1]: Disclosure: I work for the Verizon RISK team that produces the DBIR, though I joined after the publication of the 2012 edition and had no hand in it.

Coopetition and sharing threat intelligence

Imagine a street market with lots of vendors hawking their wares. Customers wander in and out of the market, some of whom you don’t see every day while you know others as regular visitors. Perhaps you are one of several selling coffee beans[1]. Now imagine that you’ve realized that there’s a thief in the market, and you know more or less what he looks like or perhaps a little about his modes of operation. It’s in your interest to let the other coffee bean sellers (and perhaps even other vendors) know, along with perhaps the local police, because you don’t want that thief robbing you, your suppliers, or your customers – nor your competition.

Some of my recent thinking about sharing and cooperation stems from recent discussions about the CISPA and similar initiatives, while some of it stems from thinking about the fact that, in many areas of business, we frequently compete with organizations whose employees we may consider friends. And of course, competition in business should only go so far. I subscribe to the belief that “there’s no such thing as business ethics” in the most positive sense: we cannot simply limit our ethical behavior to certain areas of life, then turn around and act unethically in other areas.

All of that musing sets the stage for thinking more about sharing threat intelligence. Clearly, we never want to share threat intelligence with the adversaries that may pose a threat to us. This explains why most experienced incident responders recommend not sending malware samples immediately to an antivirus vendor, particularly during an open investigation: that intelligence can easily leak back to the attacker and compromise your operational security. At the same time, we can find benefits in sharing data with our ostensible competition. For example, payment processors have formed a group within the FS-ISAC to share “information about fraud, threats, vulnerabilities and risk mitigation in the payments industry”. Yes, this means that corporations that compete doggedly for merchant accounts and transaction fees will help each other with security intelligence, since that information has more value when aggregated: each processor gets more intel from the group than they put into it. As a result, the marketplace can function more cleanly, to the benefit of all (honest) participants.

That doesn’t mean that an organization should share all of its security secrets. Generally speaking, we can say that the operational security risk from sharing intelligence rises with the specificity of the intelligence. So discussing the (fairly well-known) idea that a lot of fraud originates in Russia and Eastern Europe doesn’t increase the risk to an organization. Sharing information about specific BINs with extremely high fraud levels might incur slightly more risk, but not much (and that primarily from an operational or possibly legal perspective, rather than technical). When we start sharing indicators of compromise and known attacker addresses, then we have to take greater care to ensure that the information doesn’t leak to the adversary. But again, the adversary here isn’t the company next door trying to expand their market share, possibly at the expense of yours. The adversary wants information from both of you, to the detriment of others in the marketplace like cardholders, merchants, and so on.

I don’t quite know what I think about how this might extend to groups (including vendors) whose business includes collecting and selling threat intelligence, including my own employer[2] and other companies with which I’ve maintained good working relationships. But I do think that there’s value in some level of cooperation even among these groups, and I’m interested to know what others think.

[1]: Despite my surname, I don’t have any affiliation with Maxwell House Coffee, and I don’t even drink their stuff. I just like thinking about coffee. Mmm, coffee.
[2]: To repeat what should be obvious, my opinions here are my own, if anyone’s. Sometimes I end up not even agreeing with myself, so don’t expect that anybody else will!

Introduction to the Collective Intelligence Framework

CIRTs and related organizations often handle incident detection as well as response. Both of these roles produce and consume threat intelligence in different ways. For example, we often want to correlate our network traffic with OSINT indicators (known bad IP addresses and URLs, MD5 hashes of suspicious files, etc.). I’ve started looking at the Collective Intelligence Framework as a way to fulfill these needs. CIF development is sponsored by the REN-ISAC and National Science Foundation, with most of the coding (and everything else!) handled by Wes Young. Everything is open source for those of us who like – or need – to hack directly on the code.

In this article, I’ll explain CIF, give some usage examples, and discuss test deployment scenarios.

Understanding CIF

From the perspective of a user, CIF allows you to run queries against many data sources at once. If you have other private data sources available, particularly via XML (RSS), JSON, or in a file (e.g. CSV), you can incorporate those, as well as additional OSINT sources. CIF comes preconfigured for:

Use cases include manually querying the database for specific indicators (e.g. “do we have any records for this IP address?”) as well as pulling feeds of various sorts for use by security systems (e.g. “what URLs should we block at the proxy?”). CIF includes concepts of severity and confidence as well as privilege. This allows you to provide feeds of high-confidence public data to some systems while still allowing investigators to query private, unconfirmed data.
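
As a rough illustration of how confidence, severity, and privilege combine when building a feed, consider the sketch below. It does not reflect CIF’s internal data model: the `Record` fields, the `build_feed` function, and the low/medium/high ordering are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical ordering; only "medium" appears in the examples in this post.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Record:
    indicator: str
    confidence: int    # percentage, 0-100
    severity: str      # a key of SEVERITY_RANK
    restriction: str   # "public" or "private"


def build_feed(records, min_confidence=50, min_severity="medium",
               allow_private=False):
    """Return indicators that clear both thresholds and the privilege check."""
    floor = SEVERITY_RANK[min_severity]
    return [
        r.indicator
        for r in records
        if r.confidence >= min_confidence
        and SEVERITY_RANK[r.severity] >= floor
        and (allow_private or r.restriction == "public")
    ]
```

A blocking proxy would consume the default call, while an investigator’s query would lower the thresholds and set `allow_private=True`.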

Essentially, CIF ingests data – typically on an hourly or daily basis, depending on the source – indexes it on the fly for performance reasons, performs correlation analytics (e.g. so that a URL also turns into domain and IP address information), and then makes it available in feeds via various output plugins. These plugins include tables and HTML for viewing by a user, but also iptables rules, Snort rules, JSON, and CSV for processing by other security systems.
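
The “URL also turns into domain and IP address information” step can be pictured with a small sketch. This only shows the part that needs no network access; it does not resolve domains to addresses the way a real correlation pass might, and `derive_indicators` is a made-up name rather than anything in CIF.

```python
import ipaddress
from urllib.parse import urlsplit


def derive_indicators(url):
    """Return (type, value) pairs that follow directly from a URL."""
    host = urlsplit(url).hostname
    if host is None:
        return []  # not something we can split into scheme and host
    derived = [("url", url)]
    try:
        ipaddress.ip_address(host)
        derived.append(("ip", host))      # the host part is a literal address
    except ValueError:
        derived.append(("domain", host))  # otherwise treat it as a domain
    return derived
```

Each derived pair could then be indexed alongside the original record, so that a later query for the domain or address also finds the URL.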

Usage examples

Everything below comes from the Perl client. I haven’t yet dealt with the Python client, much less hacked on it, but that’s coming Soon™.

cif -q infrastructure/malware -c 50 -s medium

gives a fairly large list of IP addresses associated with malware. (I used medium severity and 50% confidence in these examples.)

Even if you don’t use a proxy server, you might find CIF useful for checking suspicious URLs:

cif -q url -c 50 -s medium -p snort

You now have a list of Snort rules to pull into your IDS.

Or if you have your own list of IP addresses to check, such as when an ongoing case has new indicators, you can put them in a file and query each of them:

for host in $(cat hostlist.txt); do cif -q "$host" >> specific-ip.txt; done

This yields another list. You might see a few lines in that example with a “private” restriction and impact as “search”. This happens because, by default, CIF will log every query for a specific indicator. A number of searches, such as from other investigators, may have significance apart from any data. However, if you don’t want CIF to log a query, just use the “-n” parameter.
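
If you would rather drive the same batch lookup from Python, something along these lines should work. It is a sketch that assumes `cif` is on your PATH and uses only the `-q` and `-n` options described in this post; `cif_command` and `query_file` are names invented for the example.

```python
import subprocess


def cif_command(indicator, log_query=True, client="cif"):
    """Build the argument list for one lookup; -n asks CIF not to log it."""
    args = [client, "-q", indicator]
    if not log_query:
        args.append("-n")
    return args


def query_file(path, outfile, log_query=True, run=subprocess.check_output):
    """Query every non-blank line of `path`, appending output to `outfile`.

    `run` is injectable so the loop can be exercised without a CIF server.
    """
    with open(path) as hosts, open(outfile, "ab") as out:
        for line in hosts:
            indicator = line.strip()
            if indicator:
                out.write(run(cif_command(indicator, log_query)))
```

Calling `query_file("hostlist.txt", "specific-ip.txt", log_query=False)` mirrors the shell loop above while keeping your searches out of the log.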

If you’d like to play with it some more, contact me for an API key and the address of my semi-public CIF server. Twitter or email both work fine.

Appendix: CIF on the Amazon cloud

Amazon Web Services provide a decent platform for testing CIF or running a public instance like mine. The following assumes some familiarity with Linux administration and at least a basic understanding of the Elastic Compute Cloud (EC2).

You can start with a small instance for the installation, but you’ll quickly want to move to a medium instance at least. I run a large instance using the Ubuntu Cloud Guest server image. In general, follow the server install instructions for CIF. You’ll also want to note the specifics for Ubuntu as they contain a few workarounds you will need. Allocate an Elastic IP and register it in DNS someplace, such as with Amazon Route 53. For the Security Group, only add HTTPS and SSH. You won’t need anything else, and I recommend leaving it at this minimal state for security purposes. You’ll also need an Elastic Block Store. While you can start with 10GB, expect that to grow a few GB per week, so you’ll need to resize from time to time or create a larger volume at the beginning. While not required for CIF installation, I can’t recommend enough that you use git to manage config files. Srsly.

When installing Postgres, note that “peer” may appear in the original file instead of “ident sameuser”. Also, I did not use the values in the CIF doc, as Postgres didn’t like them. I left everything at the defaults except:

work_mem = 512MB
checkpoint_segments = 32

When setting up BIND9, first check /etc/resolv.conf for the IP addresses you should use as forwarders.