## If you feel big data is useless for security, you are doing something wrong

Richard Stiennon recently wrote a post titled “If you feel you need big data for security, you are doing something wrong.” It’s worth reading, and I agree with his recommendation of implementing threat intelligence data and techniques. But his key thesis lacks quite a bit of support, largely due to analyzing technologies in isolation and criticizing them for not providing a complete security package by themselves.

First, after reviewing the (very real) issues with managing intrusion detection systems (IDS), he states that he “declared IDS dead as a functioning security solution.” A lot of practitioners – including this one – would disagree strongly. While IDS can certainly generate far too many signatures and alerts, good management practice winnows this down fairly quickly. For example, alerting on every teardrop attack in 2012 is totally useless cruft. No MSSP or clueful analyst actually ever needs to investigate millions or billions of alerts a day. Instead, you apply good analysis techniques and find the ones worth your time. I’ll grant that the concept of network security monitoring, including full packet capture, provides far more value, but even then the security infrastructure should include a properly-managed IDS as one component.

He goes from there to a very closely connected idea: using a security information & event management system (SIEM) to manage and correlate these data with other logs and asset models (vulnerability data, for example). Stiennon criticizes this technology because “these solutions addressed the data overload issue but did little to address security. They failed to curtail the rise of targeted attacks that are now wreaking havoc upon businesses and critical infrastructure operators.” This argument has two fundamental problems. First, SIEMs provide analytic and detective capabilities. While an organization’s processes can take those and feed the data back into preventive controls (e.g. firewalls), their primary purpose is to understand what is happening in your environment. Second, SIEMs provide critical abilities to investigate the attacks he cites. Log analysis has not itself detected nearly enough attacks for a number of reasons (in my view, largely due to poorly-trained analysts). However, gathering logs from literally hundreds of different sources after a compromise has come to light via some other method (e.g. third-party notification) creates a huge roadblock in the investigation. If you assume that you are already compromised and will be again, then you need a SIEM to close the loop as quickly as possible.

Stiennon finishes up his criticisms by attacking the idea of Big Data. Applying these techniques in infosec represents an evolution in SIEM usage, as opposed to a revolution. If you already collect full packet captures, all system logs, event logs from every device in your network – including the IDS – then you’ve rapidly entered the world of Big Data:

“Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.”

We still have a lot of work to do in order to understand how to analyze this data set, but we improve significantly every day due to the hard work of a lot of very smart people applying themselves to the problem.

So when Stiennon recommends security intelligence, he’s right to do so. I completely agree that organizations should bring in threat data on attack sources and understand their adversaries in greater detail. But you need to do something with that intelligence. In large part, that means correlating it with your existing data to find the attacks in your environment so that you can contain, investigate, and eradicate the intrusion. Of course, you should never assume that threat intelligence is the final piece of the puzzle, either, because that will lead to the same problems as the ones Stiennon identifies with IDS, SIEM, and even Big Data tech.
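As a rough illustration of that correlation step, here is a minimal Python sketch that checks already-parsed log events against a list of suspect networks. Everything named here is an assumption for the example: the feed contents (documentation-only address ranges), the event fields (`host`, `remote_ip`, `action`), and the function `match_intel`. In practice both sides would come from your SIEM and your intelligence provider.

```python
import ipaddress

# Hypothetical threat-intel feed: networks attributed to known attack sources.
# These are RFC 5737 documentation ranges, not real intelligence.
THREAT_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def match_intel(events, networks=THREAT_NETWORKS):
    """Return the events whose remote address falls inside any listed network."""
    hits = []
    for event in events:
        addr = ipaddress.ip_address(event["remote_ip"])
        if any(addr in net for net in networks):
            hits.append(event)
    return hits


# Hypothetical log records, e.g. exported from a SIEM search.
events = [
    {"host": "web01", "remote_ip": "203.0.113.45", "action": "login_success"},
    {"host": "web01", "remote_ip": "192.0.2.10", "action": "login_success"},
    {"host": "db02", "remote_ip": "198.51.100.17", "action": "outbound_conn"},
]

for hit in match_intel(events):
    print(hit["host"], hit["remote_ip"], hit["action"])
```

The matches are only a starting point: each one still needs an analyst to decide whether it represents an intrusion worth containing and investigating.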

Security intelligence matters, and I’m personally committed to expanding our abilities to gather, analyze, and react to these data. But it works in concert with more fundamental systems; it doesn’t replace them.

## Thoughts on network security dashboards

Given the modern security environment in which our networks and systems exist, organizations like incident response teams and security operations centers focus on the state of our own systems and networks. Outside of unusually large spikes, such as those from a Slammer-scale worm, global threat level is relatively uninteresting in this context because it isn’t actionable. We might be interested in regular summaries of those global data (e.g. daily/weekly briefings), but not for minute-to-minute dashboards. So the sorts of cruft that many organizations throw into their dashboards really provide zero or even negative value: high-level threat intelligence or generic indicator feeds like those from ThreatExpert or DShield may have their uses, but not here. And the color-coded DHS-style “threat levels” provide even less value, because they tell us nothing about the risks in our own environments.

So what would we want to see on network security dashboards? Focus on the realities of modern network security, which means monitoring high-probability threats rather than what’s easiest to understand by executives or, worse, based on a decade-old threat model. We don’t care about port scans or negatively correlated IDS alerts (e.g. those for which we know the asset is not vulnerable). Firewall logs can provide some value, primarily as a proxy for netflow-type data. In general, if a perimeter security device blocks the attack, we don’t need to display it on a dashboard. The firewall or similar device did its job, and we care about detecting incidents rather than all attacks.
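To make the filtering above concrete, here is a small Python sketch of one way to drop blocked attacks and negatively correlated IDS alerts before they reach a dashboard. The asset model, the alert fields (`target`, `cve`, `blocked`), the placeholder CVE identifiers, and the function name `dashboard_alerts` are all hypothetical; a SIEM would normally do this with its own asset and vulnerability data.

```python
# Hypothetical asset model: the vulnerabilities each host is known to have.
# The CVE identifiers are placeholders, not real advisories.
ASSET_VULNS = {
    "10.0.0.5": {"CVE-0000-0001"},
    "10.0.0.9": set(),  # scanned, nothing relevant found
}


def dashboard_alerts(alerts, asset_vulns=ASSET_VULNS):
    """Keep only alerts that were not blocked and are not negatively correlated."""
    shown = []
    for alert in alerts:
        if alert["blocked"]:
            continue  # the perimeter device already did its job
        known = asset_vulns.get(alert["target"])
        if known is not None and alert["cve"] not in known:
            continue  # we know the asset is not vulnerable
        shown.append(alert)  # vulnerable, or unknown asset we cannot rule out
    return shown


alerts = [
    {"id": 1, "target": "10.0.0.5", "cve": "CVE-0000-0001", "blocked": False},
    {"id": 2, "target": "10.0.0.9", "cve": "CVE-0000-0001", "blocked": False},
    {"id": 3, "target": "10.0.0.5", "cve": "CVE-0000-0001", "blocked": True},
    {"id": 4, "target": "10.0.0.77", "cve": "CVE-0000-0002", "blocked": False},
]

print([a["id"] for a in dashboard_alerts(alerts)])
```

Note the design choice for assets missing from the model: the sketch keeps those alerts, on the assumption that an unknown host is a reason for more attention rather than less.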

Dashboards need to present data that have two principal qualities: low false positive rates and immediate action available. The data should also allow the analyst to drill down to more detail for maximum utility, because data without context can create confusion and inefficiencies. In a general sense, we can divide up our dashboards into two types based on their appropriate level of visibility.

### Primary

Primary dashboards need to be constantly visible and updated relatively frequently. These should include the types of data that would be on the large monitors visible to everyone in a SOC.

• Anomalous outbound connections or extrusion detection. Outbound flows can help identify compromised systems, once you whitelist normal connections like those from your proxy servers, mail relays, etc.
• Anomalous VPN logins, meaning those not whitelisted as coming from known-good addresses at normal times. As a first pass, you might show logins from foreign countries depending on your organization’s needs. This might be a candidate for machine learning to identify normal addresses and login times.
• Host sweep results, such as from a tool like Mandiant Intelligent Response looking for indicators of compromise. Other host-based IDS systems may also provide value here, but they often have very high false-positive rates that limit their usefulness on dashboards.
• Network-based malware detection, such as FireEye or similar. Securosis has a particularly useful introduction to this tech.
• Bandwidth utilization, possibly plotted against normal or past data. Most network management teams already have a tool like MRTG. This helps you see DDoS attacks as they occur.
• Correlated IDS alerts, hopefully filtered through a SIEM so that you properly integrate asset and vulnerability data. I make a point of only listing high-confidence, high-severity alerts in my dashboards.
• WAF notifications help identify attacks using port 80 that firewalls will ignore and most regular network IDS do not handle very well. SQL injection is still a significant vector, and you ignore it at your own peril.
• Social media monitoring may deserve its own separate process, but it can play a role here as part of OSINT monitoring. Think in terms of watching pastebin and Twitter for interesting and relevant hits. For now, I’d consider this highly experimental in most organizations.
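
The first item in this list, anomalous outbound connections, might be sketched in Python roughly as follows. The internal address range, the whitelist of (source, destination port) pairs, the flow fields (`src`, `dst`, `dport`), and the function name `anomalous_outbound` are assumptions made for illustration; real flow records would come from netflow or firewall logs.

```python
import ipaddress

# Hypothetical internal address space.
INTERNAL = ipaddress.ip_network("10.0.0.0/8")

# Hypothetical whitelist of expected outbound traffic: (source IP, destination port).
ALLOWED = {
    ("10.0.1.10", 80),   # proxy server
    ("10.0.1.10", 443),  # proxy server
    ("10.0.1.25", 25),   # mail relay
}


def anomalous_outbound(flows):
    """Return outbound flows that are not covered by the whitelist."""
    result = []
    for flow in flows:
        src = ipaddress.ip_address(flow["src"])
        dst = ipaddress.ip_address(flow["dst"])
        outbound = src in INTERNAL and dst not in INTERNAL
        if outbound and (flow["src"], flow["dport"]) not in ALLOWED:
            result.append(flow)
    return result


flows = [
    {"src": "10.0.1.10", "dst": "203.0.113.5", "dport": 443},   # proxy, expected
    {"src": "10.0.5.77", "dst": "198.51.100.9", "dport": 6667}, # workstation, unexpected
    {"src": "10.0.5.77", "dst": "10.0.1.10", "dport": 3128},    # internal only
]

for flow in anomalous_outbound(flows):
    print(flow["src"], "->", flow["dst"], flow["dport"])
```

A static whitelist like this is the simplest starting point; it will need regular tuning as legitimate services change.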

### Secondary

Secondary dashboards should be available for rapid review so that they can immediately present useful info to an analyst. These include summaries and visualizations that an analyst will want to have immediately available but will not need immediately visible at all times. As an example, my secondary dashboards include:

• Session lists to show logged-in users, especially on VPN. Windows domain logins can also be useful, though take care with scaling issues.
• Recent traffic, though think carefully about what to exclude here. I don’t find it useful to have a dashboard showing TCP 80 traffic to web servers, TCP 25 on SMTP relays, or UDP 53 to DNS servers. I keep those logs available, but not displayed like this. Your environment will dictate the sorts of things you want to display here, but you can start by looking at the logic for anomalous outbound connections and removing some of the filters.
• IDS data also has value on a secondary dashboard, including uncorrelated or lower-priority alerts. You may identify attackers that use ineffective vectors before they find the ones that work.

### Conclusion

This should just present a starting point to think about your own near real-time displays and dashboards. Look for what makes sense for your organization and will enable you to detect incidents, rather than possible “security-relevant” data that seems easier to understand at first glance.

## OSINT monitoring with scripts

My last post mentioned briefly the difference between “high level” and “low level” threat intelligence.

High level intelligence includes human-understandable information that we can’t immediately parse into specific data, like a warning that “hacktivists” have targeted an organization. In contrast, low level intelligence usually consists of atomic data (network addresses, malware indicators, payment card information, etc.).

However, we should see this as a spectrum rather than a dichotomy: continuous, not discrete. Consider monitoring social media from within your SIEM. Many analysts have noted the value of Pastebin as an OSINT source, so Xavier Garcia wrote a post on monitoring Pastebin leaks. That served as the basis for Xavier Mertens to post on monitoring Pastebin.com within your SIEM. Maybe you can use this to look for compromised logins on your domain, then correlate against login attempts for those accounts?
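A minimal Python sketch of that last idea might look like the following. It does not reproduce either of the posts mentioned above; the domain `example.com`, the paste format, the login event fields (`user`, `src`, `result`), and both function names are hypothetical, and fetching the paste text itself is left out.

```python
import re

DOMAIN = "example.com"  # hypothetical: substitute your own domain

# Match addresses at DOMAIN, but not at longer names such as example.com.evil.org.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@" + re.escape(DOMAIN) + r"(?!\.?[A-Za-z0-9-])",
    re.IGNORECASE,
)


def leaked_accounts(paste_text):
    """Return the set of lowercased addresses at DOMAIN found in a paste."""
    return {match.lower() for match in EMAIL_RE.findall(paste_text)}


def logins_for_leaked(paste_text, login_events):
    """Return login events whose user appears in the paste."""
    leaked = leaked_accounts(paste_text)
    return [e for e in login_events if e["user"].lower() in leaked]


# Hypothetical paste contents and login events.
paste = """alice@example.com:hunter2
bob@example.org:letmein
Carol@Example.com:123456
mallory@example.com.evil.org:nope
"""

logins = [
    {"user": "alice@example.com", "src": "203.0.113.7", "result": "failure"},
    {"user": "dave@example.com", "src": "10.0.5.20", "result": "success"},
    {"user": "carol@example.com", "src": "198.51.100.3", "result": "success"},
]

for event in logins_for_leaked(paste, logins):
    print(event["user"], event["src"], event["result"])
```

A successful login for a leaked account from an unfamiliar address is the kind of hit that probably deserves a case in your workflow.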

This has grown, of course, and so now we have examples of monitoring RSS feeds and tracking tweets from within a SIEM environment. If you tie this to case management (which many of us do within the SIEM, e.g. using ArcSight), then you’ve got a head start on OSINT monitoring. I suspect you could combine this with Yahoo! Pipes to monitor all sorts of loosely-structured data, whether for correlation or integration into your workflow.

## Kent doctrine for security intelligence analysis

I’ve said before that log management matters, but log analysis matters more. Extracting and communicating useful information (analysis) requires collecting and storing your security data as well as processing the data quickly. But having all the data available won’t matter to anybody except auditors if you don’t use it in ways that inform good decisions. Mike Rothman of Securosis expressed this exceptionally well in his preview of the upcoming RSA Conference:

“You will see a bunch of vendors talking about their new alerting engines taking advantage of these cool new data management tactics, but at the end of the day, it’s not how something gets done – it’s still what gets done.
“So a Hadoop-based backend is no more inherently helpful than that 10-year-old RDBMS-based SIEM you never got to work. You still have to know what to ask the data engine to get meaningful answers. Rather than being blinded by the shininess of the BigData backend focus on how to use the tool in practice. On how to set up the queries to alert on stuff that maybe you don’t know about.”

To paraphrase Socrates, unexamined data are not worth collecting. So analysis methodology and critical thinking skills matter. Rothman is spot on with this: the value of big data tech comes when you need to grow past the capabilities that traditional SIEM and RDBMS provide. By way of analogy: if you don’t understand algebra, then don’t take a course in calculus until you have the basic prerequisites down. You’ll just frustrate yourself and waste your tuition dollars.


In this vein, then, I appreciated the pointer from the OSINT and analysis training firm Treadstone 71 to a CIA paper on the background and work of Sherman Kent, the “father of intelligence analysis”.

He promoted an analytic doctrine that boils down to nine key points, listed in the CIA paper above. That doctrine applies across domains, not just for the sorts of military and geopolitical analysis we expect from government intelligence agencies. I highly recommend that everyone read at least that section of the paper, but here are some applications for those of us involved in security intelligence analysis, especially in the private sector.

1. Focus on Policymaker Concerns: What keeps your management up at night? Hopefully security isn’t the only thing, of course. So assuming that your CxOs understand the general threat landscape, analysts need to ensure that they track relevant areas that can lead to useful changes and decisions at strategic and tactical levels.
2. Avoidance of a Personal Policy Agenda: Many analysts focus on threats that concern them for reasons outside of their organization. Maybe they disagree with the politics of the Occupy movement and overemphasize threats to entirely unrelated organizations, or worry about APT China because of Sinophobia rather than a reasoned assessment of the situation. Or maybe they want to drive decision makers to a particular tech solution. Even worse, they may use their analyses as weapons for corporate political plays. Doing that represents a disservice to the organization and an unprofessional approach.
3. Intellectual Rigor: This area stands as-is: “Estimative judgments are based on evaluated and organized data, substantive expertise, and sound, open-minded postulation of assumptions. Uncertainties and gaps in information are made explicit and accounted for in making predictions.”
4. Conscious Effort to Avoid Analytic Biases: None of us can completely avoid cognitive bias, but we can make sure we understand it and try to correct for it where possible. That principally means application of the scientific method. As previously noted, whether or not faith and dogma have a place in one’s personal life, they certainly do not in one’s professional analyses.
5. Willingness to Consider Other Judgments: Fight for your ideas, but “playing devil’s advocate” should rest on a better intellectual basis than simply spreading FUD. Recognize that others may in fact know more than you do or have insights that can help you.
6. Systematic Use of Outside Experts: In addition to seeking out and understanding the work of other analysts, don’t restrict yourself solely to your field or even industry. Work with a community and keep bringing in fresh concepts from other disciplines.
7. Collective Responsibility for Judgment: Eventually, your team will produce a report. You may not have agreed with everything that went into it, but that’s the way the sausage gets made. Once that report goes to its audience, support it. Throwing the rest of your analysis team under the bus by telling the audience “I told them so” doesn’t actually make you look smarter. It makes you look unprofessional. That doesn’t mean that you should ignore all criticism; rather, it means that you should be willing to take lumps with the rest of the group. If someone asks you for your opinion, give it – but clarify that it doesn’t represent the considered opinion of the rest of the team.
8. Effective Communication of Policy-Support Information and Judgments: Analysts need three core skills: domain expertise, critical thinking skills, and communication ability. This includes targeting your analysis to the level appropriate to your audience. You must be able to summarize your findings in understandable and accurate ways. And you must be able to handle points of uncertainty properly.
9. Candid Admission of Mistakes: You won’t always be right. Admit it, and review past work to see what you can learn for improvement the next time. “Try again. Fail again. Fail better.”

Security intelligence analysts should learn from previous work, instead of simply trusting in their own domain expertise and innate intelligence. Dr. Kent led the way, and even we non-spooks can still learn from his work.

## 3 reasons why big data matters for SIEM

“Big data” isn’t just a buzzword, and it doesn’t just mean “big piles o’ bits”. It’s jargon, but it has a particular meaning:

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”

Alternately, “big data” refers to data of such volume that storage, management, processing, and analysis present engineering challenges beyond traditional IT solutions. If it fits in, say, a traditional RDBMS setup like MySQL or Oracle, then it may be a lot of data, but it’s not “big data”.

This new tech has lots of useful applications in social policy, business intelligence, science, and IT, among others. In the SIEM world, we’ve got to start looking at applying some of this tech where it makes sense, for at least a few specific reasons:

1. Traditional SQL databases don’t fit the data model. Most SIEM implementations don’t need the full ACID (atomicity, consistency, isolation, durability) guarantees, and shoehorning our needs into what exists holds us back.
2. Big data tech (specifically, NoSQL database design) allows us to focus on the area of the CAP theorem (consistency, availability, partition tolerance) that really matters to us: partition tolerance. Of the remaining two, we can usually settle for availability and eventual consistency.
3. IT organizations consistently experience significant budget pressure as organizations focus on reducing expenses. This applies even more to security, where we provide loss avoidance rather than growing top line revenues. We need architecture that allows us to use cheaper, commodity hardware while still enabling us to maintain appropriate performance.

We haven’t reached the point yet where we need to focus too strongly on particular aspects of “big data”. Do you need Hadoop? What analysis tasks fit map-reduce algorithms? Should you try to leverage Amazon EC2 or another cloud provider? As Jon Oltsik writes:

“While ‘big data’ will intersect with security intelligence, the actual ‘big data’ technology aspects are irrelevant. CISOs need the analytics capabilities but really don’t care what’s under the hood. Let’s focus on data analysis and situational awareness and avoid a debate about OLAP, Massively-Parallel Processing (MPP), and Hadoop.”

Those things will matter when building an implementation (e.g. to a vendor). SIEM users, though, should generally focus on what capabilities they actually want, such as data sources and analysis methods.

Oltsik’s piece makes another cogent point about security intelligence:

“Security intelligence demands more data. Early SIEMs collected event and log data then steadily added other data sources like NetFlow, packet capture, Database Activity Monitoring (DAM), Identity and Access Management (IAM), etc. Large enterprises now regularly collect gigabytes or even terabytes of data for security intelligence, investigations, and forensics. Many existing tools can’t meet these scalability needs.”

Users will see this as the real driving force: to do the job effectively, the SIEM has to do more than just bring in firewall, IDS, and operating system logs. And it needs to support better exploratory data analysis, rather than just reporting and notifications.

I don’t know of many vendors that currently have products built on this approach, though I don’t doubt we’ll see a lot of them hurriedly slapping the label on their material even when it doesn’t fit: witness the APT debacle.

## Scope expansion for data science

I’ve discussed my interest in data science and big data quite a bit on Twitter. This partly has to do with my contention that good SIEM and log analysis work should overlap significantly with data science, among other fields. It also has to do with my ongoing search for fulfillment in finding ways to work on stuff that matters (i.e. not pure infosec).

So then today I just asked the question straight out:

I got a bit of feedback from some of my usual Twitter crowd, encouraging me to simply grow the scope of this site. I have two concerns: one, will the (relatively small) existing reader base get frustrated with posts that have, at best, a tangential relationship to security? Two, will any new readers pigeonhole the blog – or me – as an information security blog, passing over the data content?

The sorts of things I intend to start including, whether here or elsewhere, include technical discussion of data analysis, walkthroughs of techniques as I’m exploring them myself, and applications in other fields. As an example, right now I have some processes running to analyze refugee trends based on data provided by the United Nations High Commissioner for Refugees.

Any thoughts, suggestions, or other pointers?

## Data analysis will change the world

One of my favorite infosec thinkers, Andrew Hay, had a pair of recent posts that have given me lots to chew on.

This provoked a wide-ranging conversation about what that means. We’ll find tremendous value in applying big data techniques to security data. (Actually, I think data analysis will change the world, but that’s a bit larger scope than this post can comfortably handle.) We can then start to bring in additional data feeds past what traditional SIEMs handle. Think along the lines of more OSINT, network flows, and possibly even business data. At that point, you can really start to grasp the qualitative and quantitative improvements to data protection.

The next day, he wrote an article in which he asked an oft-heard data analysis question: “Where’s my ‘Minority Report’ dashboard?” We have to unpack that a little, though, because the data analysis scenes involved a few different useful things.

First, and perhaps most memorably, Cruise’s character used a gesture-based interface to work with the data he had available. As Hay notes, this tech has started to push down into consumer electronics like game consoles, but not generally into business applications like SIEM. While this might seem natural, we will have to move beyond the standard desktop metaphor and start to think of data as objects. It certainly won’t happen completely intuitively, but the long existence of similar ideas in various cultures (think mudras and sign language) and scientific research into the connection between words and gestures seem to indicate that we still have a lot of potential here.

Second, note how many disparate data feeds he had available. Apart from the fictional visualizations from the “precogs” (for which we can use surveillance video as a stand-in), he had social profiles, financial records, and more. While most of the entities we need to visualize aren’t always so human, we can assume some of the analogues I mentioned above for deploying “big data” tech. Data mining and machine learning will help here, particularly in knowledge discovery to hypothesize and test for correlations among the various data.

Third, the system latency seemed absurdly low. Try running a DB query on unstructured, near-realtime data, and tell me if it happens that immediately. While we’ve seen significant leaps in these areas, we need lots more advancement. Much of the tech today has started to move back towards a batch processing model rather than direct interaction and exploration, for example. Don’t think of this as just an engineering problem, because latency greatly matters when talking about trying to analyze data at anything remotely resembling the speed of thought.

Finally, the analyst clearly had excellent spatial reasoning skills. As younger generations continue to move into adulthood, we’ll likely see more applications of spatial reasoning. This means more research into data dimensionality: human brains don’t really visualize high-dimensional spaces very well, so we need to improve our models and analysts. It might turn out, for example, that we need to conceive of data as a hypercube as we drill down into specific nodes. Analysts already need to understand the foundations of graph theory when working in a lot of knowledge domains.

The future of data analysis excites me, and I really geek out over the possibilities. This has fractal-type potential: no matter whether we’re looking at data science from the MBA-typical “thirty-thousand foot view” or ångström altitude, we can find ways to change the world. (And if you’re working on this stuff and want some cross-domain thinking, let’s talk.)

## Two Things: SIEM and DFIR edition

Thanks to Hacker News, I ran across the charming and thought-provoking concept of Two Things:

“You know, the Two Things. For every subject, there are really only two things you really need to know. Everything else is the application of those two things, or just not important.”

You also might think of these things as first principles, though these might represent something even more basic. After spending some time thinking about it, I came up with the following. Feel free to add your own or point out what I’ve missed.

Two things for DFIR:

1. The bad guys always leave evidence behind.
2. You aren’t looking for it in time.

Two things for SIEM:

1. Log analysis matters more than log management.
2. SIEM analysts eventually become DBAs. (Bejtlich’s Principle)

I don’t know whether anybody else has called it that before, but I sure wish I could find the canonical reference for Bejtlich’s Principle.

## Adapting intelligence analysis for DFIR

We can define an analyst as a function taking data and caffeine as inputs that outputs (hopefully useful) knowledge:

$analyst(data,caffeine) \to knowledge$

But analysts need more than just good data and properly brewed coffee (or tea, if that’s your thing). We need well-written “internal code”: our thought processes, if you will. As I’ve previously mentioned, too much material focuses on the data and not enough on the processing. If you look for information on log management, you can find endless advice on how to collect your logs, and how to store them. If you look for information on SIEM systems, you can find lots of vendor “marketecture”, compliance guidance, and so forth – but not enough guidance on what to do with the information you find there.

To find what we really need, two things have to happen. First, we need to look outside the IT security echo chamber. Simply repeating the same endless mantras won’t advance the state of the art at all, but looking at other fields with related problems and finding ways to cross-pollinate certainly can bear fruit. In my view, the intelligence community has spent decades working through similar issues. One really useful reference I’ve found lately is “Psychology of Intelligence Analysis” (which largely discusses “Tools for Thinking” and “Cognitive Biases”). But another document, “Basic Counterintelligence Analysis in a Nutshell”, has much better applicability to DFIR. Some things work directly, like the section on “Analytic Traps and Mindsets”, others have simply gone out of date, and other concepts have useful analogues. For example, map analysis usually doesn’t reveal very much if invoked in a geographic context (since network links and physical proximity don’t correlate very well), but when you overlay your data on a network map, it certainly can.

So in February, I intend to take the “Basic Counterintelligence Analysis in a Nutshell” document and adapt the ideas in it to network security investigations in particular. But to do this justice takes more than a simple post, so instead of posting that here as originally intended, I’ll spend some time on it and get feedback when it’s ready. This post mostly serves the purpose of getting it out there so that my colleagues, friends, and readers can hold me accountable next month.