Pre-processing threat intelligence

I usually like to think of threat intelligence as either “high level” or “low level”. High-level intelligence includes human-understandable information that we can’t immediately parse into specific data, like a warning that “hacktivists” have targeted an organization. In contrast, low-level intelligence usually consists of atomic data (network addresses, malware indicators, payment card information, etc.).

This low-level threat intelligence often arrives in a raw form that requires its own analysis and processing before we can act on it. Automated actions require a great deal of careful thought before implementation: while we can certainly take a list of, say, IP addresses and put it directly into a firewall, that doesn’t necessarily give the best result. What if those addresses mistakenly include internal ranges, or key business partners? What if the list of IP addresses has domain names or URLs mixed into it?
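As a rough illustration of those questions, here is a minimal Python sketch that separates IP addresses from domain names and URLs, and holds back anything that falls in an internal or partner range. The function names and the `PARTNER_NETWORKS` list are hypothetical, and the partner range shown is only a documentation placeholder; treat this as a starting point, not a finished filter.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical partner ranges (a documentation range is used as a placeholder).
# Replace with your organization's own list.
PARTNER_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def classify(indicator):
    """Return 'ip', 'url', or 'domain' for one raw feed entry."""
    value = indicator.strip()
    try:
        ipaddress.ip_address(value)
        return "ip"
    except ValueError:
        pass
    if urlsplit(value).scheme in ("http", "https"):
        return "url"
    return "domain"


def is_safe_to_block(ip_text):
    """Reject private, loopback, link-local, and partner addresses."""
    ip = ipaddress.ip_address(ip_text)
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return False
    return not any(ip in net for net in PARTNER_NETWORKS)


def split_feed(lines):
    """Sort raw entries into blockable IPs and entries needing human review."""
    blockable, review = [], []
    for line in lines:
        value = line.strip()
        if not value:
            continue
        if classify(value) == "ip" and is_safe_to_block(value):
            blockable.append(value)
        else:
            review.append(value)
    return blockable, review
```

Anything that lands in the review list (internal addresses, domains, URLs) goes to a person or to a different use case instead of straight into the firewall.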

Workflow

Because of these issues, I find it useful to map out a workflow for managing incoming threat intelligence:

High-level threat intel workflow

Organize your incoming threat intelligence into use cases, perhaps based on the source. You may have access to data feeds from a trusted partner, like US-CERT or an ISAC. Certainly, you should be developing intelligence based on your existing caseload, tugging on the threads you find to unravel the bad guys’ sweater. Various OSINT sources like blacklists, pastebin monitoring, or social media might also come into play for some organizations, though these often require more advanced capabilities and understanding. Each type of data will usually require its own use case, though as you work through more of them, you will start to consolidate on a core workflow that will make building new use cases easier.

Pre-processing

CSVs and RSS feeds may look highly parseable, and in a very real sense they are. But that doesn’t mean they don’t need clean-up and validation. Cleaning smaller data sets may only require some massaging in a text editor or spreadsheet, or perhaps a small shell script. For larger data sets, though, you will likely need more power. I like Google Refine for its rich abilities to categorize and edit data, though it can break down at really large scales, such as gigabytes of data. In those cases, you might want a powerful statistical package like R – although you should probably instead re-examine your use case to evaluate whether that data set really will work for threat intelligence.
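For the small-script end of that range, a clean-up pass might look something like the sketch below. It assumes a CSV feed with a column named `indicator` (a hypothetical name; real feeds vary) and handles only whitespace, blank entries, duplicates, and the common “defanged” notation such as `hxxp://` and `[.]`.

```python
import csv
import io


def refang(value):
    """Undo common 'defanging' such as hxxp:// and [.] in shared indicators."""
    return (value.replace("hxxp", "http")
                 .replace("[.]", ".")
                 .replace("[:]", ":"))


def clean_feed(csv_text, column="indicator"):
    """Return normalized, de-duplicated indicators in their original order."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(column) or "").strip()
        if not raw:
            continue  # skip rows with no indicator value
        value = refang(raw)
        if value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned
```

The plain string replacement in `refang` is deliberately naive and could alter an indicator that legitimately contains `hxxp`, so it is worth spot-checking the output against the raw feed.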

Validation means trying to eliminate as many spurious alerts as possible, so that you don’t classify http://www.google.com/webhp?hl= as a malicious URL or look for malware with the MD5 hash d41d8cd98f00b204e9800998ecf8427e (the result for the empty string). Maltego does this particularly well, not just because of its visualization abilities but also because of its ability to run transforms (“lookups”) on data sets. As an example, imagine you have a set of suspicious IP addresses. Before leaping to conclusions, you need to know who or what those addresses represent. You can dump the addresses into Maltego and quickly determine their reverse DNS, WHOIS contact info, and a good estimate of their geolocation. From here, you can look over the data to determine whether you’ve cast your net too widely, then look through proxy logs based on the reverse DNS lookup. Conversely, if you have a list of DNS names, you can quickly resolve those to IP addresses for blocking in a firewall or looking up in other data sources.
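The two spurious-alert cases mentioned above can be caught with very little code. This is a minimal sketch, not a complete validator: `ALLOWLISTED_DOMAINS` is a hypothetical allowlist that you would build from your own baseline, and the set of benign hashes here contains only the empty-input MD5.

```python
import hashlib
from urllib.parse import urlsplit

# Hashes of trivially common content; the empty input is the classic case.
BENIGN_MD5 = {hashlib.md5(b"").hexdigest()}

# Hypothetical allowlist; a real one would come from your own traffic baseline.
ALLOWLISTED_DOMAINS = {"google.com"}


def is_benign_hash(md5_hex):
    """True if the hash matches known-harmless content such as an empty file."""
    return md5_hex.strip().lower() in BENIGN_MD5


def is_allowlisted_url(url):
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWLISTED_DOMAINS)
```

Matching on the parsed hostname, not on a substring of the whole URL, keeps a host like `google.com.evil.example` from slipping through as allowlisted.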

To take a related example, the other day I found traffic to a set of addresses that needed a closer look:

72.21.203.26
74.125.113.108
74.125.113.109
74.125.115.108
74.125.115.109
74.125.157.108
74.125.157.109
74.125.65.108
74.125.65.109
98.139.215.231

After pasting these into Maltego and looking for reverse DNS, contact info, and geolocation, I had this graph:

Maltego graph based on IP addresses

These turned out to be Gmail and Yahoo! Mail servers, plus one AWS server. That makes sense, since the traffic I saw was on TCP 993 (IMAP over SSL/TLS).
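If Maltego isn’t at hand, the reverse DNS part of this check can be approximated with Python’s standard library. The sketch below is an assumption about how you might script it, not what Maltego does internally; the `resolver` parameter is a hypothetical hook that makes the function testable without network access, and it covers only PTR lookups, not WHOIS or geolocation.

```python
import socket


def reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are both OSError subclasses
        return None


def annotate(ips, resolver=socket.gethostbyaddr):
    """Map each IP address to its reverse DNS name (or None)."""
    return {ip: reverse_dns(ip, resolver) for ip in ips}
```

Calling `annotate` on the list of addresses above would perform live lookups, so results depend on the resolver you use and on the current state of DNS.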

Good pre-processing eliminates a lot of pain further along in your workflow and gives you more confidence when deciding to pre-emptively block an address or to incur the performance cost of searching a system for specific hashes.
