Tag Archives: Big Data

Terrible Presentation Stress Disorder: “Finding APT with Big Data”

Yesterday, I attended the North Texas ISSA Meeting with my friends Michelle and Ryker. The talk carried the fascinating title Threatscape 2012: Finding Advanced Persistent Threats with ‘Big Data’ Analysis and Correlation, presented by A.N. Ananth, CEO of Prism Microsystems, and one of the leaders of Global DataGuard. (I think the actual presenter was not the person listed, though I might have missed that.) Among the lessons I “learned”:
Condescending Wonka wants to hear all about Big Data and the APT

  • Doing recon on a target is what makes the APT “advanced”, especially if you use LinkedIn to figure out more about a company.
  • “0 day” attacks are the Ebola and bird flu of information security.
  • 100 Gb (sic) of data is Big Data.
  • All the traffic to your web site is Big Data. But we’ve been dealing with Big Data like that since the 1970s, so we know how to handle it.
  • Log data from 25 servers is Big Data, because Big Data doesn’t really have anything to do with how much data you have. (This theme came up a lot.)
  • Attackers only care about customer records, so clearly the talk focused heavily on real-world experience with the APT.
  • The Verizon DBIR1 mostly only covers North America.
  • Log analysis is advanced behavior correlation analytics.
  • You must do regression testing on your network behavior adaptive learning modeling.
  • Comparing one million points of data is big data but you can use metadata for higher level indicators.
  • Detecting Stuxnet would have been obvious because it’s a new process.
  • Databases suck for analyzing security data.

I swear I’m not winking at you. That’s an eye twitch from TPSD (Terrible Presentation Stress Disorder). Because lunchtime has arrived and I feel a little frisky now that I’ve had my Dr Pepper, let’s take just a few examples to demonstrate the already-obvious cluelessness of this presentation.

First, do a bit of basic research before you cite anything. The Verizon DBIR states in the Executive Summary on page 2:

We also welcome the Australian Federal Police (AFP), the Irish Reporting & Information Security Service (IRISS), and the Police Central eCrimes Unit (PCeU) of the London Metropolitan Police. These organizations have broadened the scope of the DBIR tremendously with regard to data breaches around the globe.

In addition, Verizon performs data breach investigations around the world, not just domestically. Page 12 shows the countries in which confirmed breaches occurred as part of the analyzed caseload. I don’t believe we published data showing the specific geographic distribution of cases, but the statement that the DBIR mostly covers only North America has no support.

Second, I’ve written a good bit here about the APT, and others have done so far more extensively. The APT – nation-state threat actors with significant cyber capabilities, usually meaning “China” or similar when used cluefully – doesn’t care so much about customer records. After all, if you own a significant chunk of the national debt of the United States, credit card numbers are small fry. When you start talking about research plans, sensitive business documents, source code, and the like, now you’ve started to address the target assets. That’s not to say that nobody cares about customer data, of course. Look at the tremendous amount of fraud coming from all over the world, largely but not exclusively centered in Russia and Eastern Europe, not to mention the “hacktivism” related breaches in 2011.

Third, while basic security measures would certainly prevent most common breaches, this doesn’t hold true for truly advanced attacks (by definition). If you think that any system that simply monitors for new processes would detect Stuxnet in an obvious manner, either you don’t really know much about enterprise monitoring or malware, or you are lying. I will charitably assume the former and recommend that you sit down and buy a beer for an actual incident responder or malware analyst to get the real story.

Finally, Big Data absolutely does have something to do with the volume of your data, though that’s not the only factor involved. The presenters correctly stated in the midst of their chaotic confusion that Big Data means data for which traditional RDBMS and similar systems just don’t work. That doesn’t mean 100 data points or even 100 gigabytes (again, charitably assuming a typo here). It means that you have so much data arriving so quickly and in such different forms (schemas) that you can’t simply stream it into a traditional database. This differs significantly from data science and analytics in which we try to find patterns and anomalies in the data, sometimes with advanced methods like machine learning and distributed computation. These two concepts aren’t identical, they’re orthogonal: you may perform analytics on smaller data sets, or you may have a very large data set that maps to well-understood models. The phrase “regression testing on your network behavior adaptive learning modeling” is gobbledygook.

I could go on, but really, the only other thing I want to say to the gentlemen who presented this useless waste of an hour is:
Zoidberg: Your presentation is bad and you should feel bad!


1: Disclosure: I work for the Verizon RISK team that produces the DBIR, though I joined after the publication of the 2012 edition and had no hand in it.

If you feel big data is useless for security, you are doing something wrong

Richard Stiennon recently wrote a post titled If you feel you need big data for security, you are doing something wrong. It’s worth reading, and I agree with his recommendation of implementing threat intelligence data and techniques. But his key thesis lacks quite a bit of support, largely due to analyzing technologies in isolation and criticizing them for not providing a complete security package by themselves.

First, after reviewing the (very real) issues with managing intrusion detection systems (IDS), he states that he “declared IDS dead as a functioning security solution.” A lot of practitioners – including this one – would disagree strongly. While IDS can certainly generate far too many signatures and alerts, good management practice winnows this down fairly quickly. For example, alerting on every teardrop attack in 2012 is totally useless cruft. No MSSP or clueful analyst actually ever needs to investigate millions or billions of alerts a day. Instead, you apply good analysis techniques and find the ones worth your time. I’ll grant that the concept of network security monitoring, including full packet capture, provides far more value, but even then the security infrastructure should include a properly-managed IDS as one component.

He goes from there to a very closely connected idea: using a security information & event management system (SIEM) to manage and correlate these data with other logs and asset models (vulnerability data, for example). Stiennon criticizes this technology because “these solutions addressed the data overload issue but did little to address security. They failed to curtail the rise of targeted attacks that are now wreaking havoc upon businesses and critical infrastructure operators.” This argument has two fundamental problems. First, SIEMs provide analytic and detective capabilities. While an organization’s processes can take those and feed the data back into preventive controls (e.g. firewalls), their primary purpose is to understand what is happening in your environment. Second, SIEMs provide critical abilities to investigate the attacks he cites. Log analysis has not itself detected nearly enough attacks for a number of reasons (in my view, largely due to poorly-trained analysts). However, gathering logs from literally hundreds of different sources after a compromise has come to light via some other method (e.g. third-party notification) creates a huge roadblock in the investigation. If you assume that you are already compromised and will be again, then you need a SIEM to close the loop as quickly as possible.

Stiennon finishes up his criticisms by attacking the idea of Big Data. Applying these techniques in infosec represents an evolution in SIEM usage, as opposed to a revolution. If you already collect full packet captures, all system logs, event logs from every device in your network – including the IDS – then you’ve rapidly entered the world of Big Data:

Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.

We still have a lot of work to do in order to understand how to analyze this data set, but we improve significantly every day due to the hard work of a lot of very smart people applying themselves to the problem.

So when Stiennon recommends security intelligence, he’s right to do so. I completely agree that organizations should bring in threat data on attack sources and understand their adversaries in greater detail. But you need to do something with that intelligence. In large part, that means correlating it with your existing data to find the attacks in your environment so that you can contain, investigate, and eradicate the intrusion. Of course, you should never assume that threat intelligence is the final piece of the puzzle, either, because that will lead to the same problems as the ones Stiennon identifies with IDS, SIEM, and even Big Data tech.
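
As a rough sketch of what "correlating it with your existing data" can look like at its simplest, the snippet below checks connection records against a list of indicator networks. The field names (`ts`, `src`, `dst`), the `correlate` helper, and the addresses (all from documentation ranges) are assumptions made up for illustration; no real feed or SIEM uses exactly this shape.

```python
import ipaddress


def correlate(events, indicators):
    """Return (event, field) pairs where src or dst falls inside a listed network."""
    networks = [ipaddress.ip_network(item) for item in indicators]
    hits = []
    for event in events:
        for field in ("src", "dst"):
            addr = ipaddress.ip_address(event[field])
            if any(addr in net for net in networks):
                hits.append((event, field))
                break  # one hit per event is enough for triage
    return hits


# Illustrative data only: hypothetical indicators and log records.
indicators = ["203.0.113.0/24", "198.51.100.7/32"]
events = [
    {"ts": "2012-03-20T14:02:11", "src": "10.0.0.5", "dst": "203.0.113.44"},
    {"ts": "2012-03-20T14:02:12", "src": "10.0.0.6", "dst": "192.0.2.10"},
    {"ts": "2012-03-20T14:05:40", "src": "198.51.100.7", "dst": "10.0.0.9"},
]

hits = correlate(events, indicators)
for event, field in hits:
    print(event["ts"], field, event[field])
```

A match like this only tells you where to start looking. Containing, investigating, and eradicating the intrusion still depends on having the surrounding logs to pull.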

Security intelligence matters, and I’m personally committed to expanding our abilities to gather, analyze, and react to these data. But it works in concert with more fundamental systems; it doesn’t replace them.

Twitter review: 2012-03-23

While I’m dissecting the Verizon DBIR and the Mandiant M-Trends report, plus preparing for my talk to the NAISG Dallas chapter next week (“Evolution of an IRT”), I thought I’d take a look at some relevant Twitter data.

Storification

First, I assembled a Storify to document a conversation on Twitter today related to those two reports. Take a look at DBIR and M-Trends: Different Perspectives. Credit to @bond_alexander for kicking it off.

Twitter dataviz

I also generated the below visualization using xefer. It shows activity by hour of the day, by day of the week, and the ratios of tweets / replies / retweets, all in US Central time (GMT-6 or -5 during DST).


A few things jumped out at me:

  • I usually go offline around 11pm and don’t get going again until 7 or 8am the next day. Typical sleep cycle.
  • Twitter activity declines during the noon (lunch) hour.
  • During the 5pm hour, I have very little to say. This represents the time I wrap up my daily work, drive home, and see my family.
  • Activity drops off during the weekend, when I spend time with the family or generally relax (e.g. gaming).
  • Thursday and Friday evenings slow down considerably compared to Monday through Wednesday evenings. I know why that happens on Friday (going out), but not Thursday.
  • Wow, I chat a lot. But if you follow me, you probably knew that already.
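
If you want to build the same kind of hour-of-day and day-of-week breakdown yourself, a minimal sketch follows. It is not how xefer works internally (I have no idea what it does under the hood). The timestamps are made up, and the fixed UTC-5 offset is a simplifying assumption; a fuller version would use a time zone database so the DST switch takes care of itself.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 (Central Daylight Time), valid for late March 2012.
CENTRAL = timezone(timedelta(hours=-5))
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def activity_profile(utc_timestamps):
    """Count posts per local hour and per local weekday."""
    by_hour, by_weekday = Counter(), Counter()
    for stamp in utc_timestamps:
        utc = datetime.fromisoformat(stamp).replace(tzinfo=timezone.utc)
        local = utc.astimezone(CENTRAL)
        by_hour[local.hour] += 1
        by_weekday[DAYS[local.weekday()]] += 1
    return by_hour, by_weekday


# Hypothetical tweet times, recorded in UTC.
sample = [
    "2012-03-19T14:05:00",  # Monday 09:05 local
    "2012-03-19T14:40:00",  # Monday 09:40 local
    "2012-03-20T03:15:00",  # Monday 22:15 local (already Tuesday in UTC)
    "2012-03-23T22:30:00",  # Friday 17:30 local
]

by_hour, by_weekday = activity_profile(sample)
print(sorted(by_hour.items()))
print(dict(by_weekday))
```

Note the third timestamp: converting to local time before bucketing is what keeps a late-evening post from landing on the wrong day.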

Blocking Trending Topics

Lots of us can’t stand to read the “trending topics” on Twitter. They usually revolve around celebrity “news” and other useless bits. If you have Adblock Plus for Chrome or Firefox, though, just add the following two lines to your filter list:

twitter.com##.trends-inner
twitter.com##.wide-trends

Other tweets this week

A few relevant Twitter postings:

Next time, I’m doing that in Storify.

Aside

I had the opportunity to participate briefly in a conversation on Twitter with @Beaker, @RaffaelMarty, and a few others. So I thought I’d experiment and put it all back together using Storify. Take a look at my recap of the …

Kent doctrine for security intelligence analysis

I’ve said before that log management matters, but log analysis matters more. Extracting and communicating useful information (analysis) requires collecting and storing your security data as well as processing the data quickly. But having all the data available won’t matter to anybody except auditors if you don’t use it in ways that inform good decisions. Mike Rothman of Securosis expressed this exceptionally well in his preview of the upcoming RSA Conference:

You will see a bunch of vendors talking about their new alerting engines taking advantage of these cool new data management tactics, but at the end of the day, it’s not how something gets done – it’s still what gets done.
So a Hadoop-based backend is no more inherently helpful than that 10-year-old RDBMS-based SIEM you never got to work. You still have to know what to ask the data engine to get meaningful answers. Rather than being blinded by the shininess of the BigData backend focus on how to use the tool in practice. On how to set up the queries to alert on stuff that maybe you don’t know about.

To paraphrase Socrates, unexamined data are not worth collecting. So analysis methodology and critical thinking skills matter. Rothman is spot on with this: the value of big data tech comes when you need to grow past the capabilities that traditional SIEM and RDBMS provide. By way of analogy: if you don’t understand algebra, then don’t take a course in calculus until you have the basic prerequisites down. You’ll just frustrate yourself and waste your tuition dollars.

Sherman Kent

Provided by CIA

In this vein, then, I appreciated the pointer from the OSINT and analysis training firm Treadstone 71 to a CIA paper on the background and work of Sherman Kent, the “father of intelligence analysis”.

He promoted an analytic doctrine that boils down to nine key points, listed in the CIA paper above. That doctrine applies across domains, not just for the sorts of military and geopolitical analysis we expect from government intelligence agencies. I highly recommend that everyone read at least that section of the paper, but here are some applications for those of us involved in security intelligence analysis, especially in the private sector.

  1. Focus on Policymaker Concerns: What keeps your management up at night? Hopefully security isn’t the only thing, of course. So assuming that your CxOs understand the general threat landscape, analysts need to ensure that they track relevant areas that can lead to useful changes and decisions at strategic and tactical levels.
  2. Avoidance of a Personal Policy Agenda: Many analysts focus on threats that concern them for reasons outside of their organization. Maybe they disagree with the politics of the Occupy movement and overemphasize threats to entirely unrelated organizations, or worry about APT China because of Sinophobia rather than a reasoned assessment of the situation. Or maybe they want to drive decision makers to a particular tech solution. Even worse, they may use their analyses as weapons for corporate political plays. Doing that represents a disservice to the organization and an unprofessional approach.
  3. Intellectual Rigor: This area stands as-is: “Estimative judgments are based on evaluated and organized data, substantive expertise, and sound, open-minded postulation of assumptions. Uncertainties and gaps in information are made explicit and accounted for in making predictions.”
  4. Conscious Effort to Avoid Analytic Biases: None of us can completely avoid cognitive bias, but we can make sure we understand it and try to correct for it where possible. That principally means application of the scientific method. As previously noted, whether or not faith and dogma have a place in one’s personal life, they certainly do not in one’s professional analyses.
  5. Willingness to Consider Other Judgments: Fight for your ideas, but “playing devil’s advocate” should rest on a better intellectual basis than simply spreading FUD. Recognize that others may in fact know more than you do or have insights that can help you.
  6. Systematic Use of Outside Experts: In addition to seeking out and understanding the work of other analysts, don’t restrict yourself solely to your field or even industry. Work with a community and keep bringing in fresh concepts from other disciplines.
  7. Collective Responsibility for Judgment: Eventually, your team will produce a report. You may not have agreed with everything that went into it, but that’s the way the sausage gets made. Once that report goes to its audience, support it. Throwing the rest of your analysis team under the bus by telling the audience “I told them so” doesn’t actually make you look smarter. It makes you look unprofessional. That doesn’t mean that you should ignore all criticism; rather, it means that you should be willing to take lumps with the rest of the group. If someone asks you for your opinion, give it – but clarify that it doesn’t represent the considered opinion of the rest of the team.
  8. Effective communication of policy-support information and judgments: Analysts need three core skills: domain expertise, critical thinking skills, and communication ability. This includes targeting your analysis to the level appropriate to your audience. You must be able to summarize your findings in understandable and accurate ways. And you must be able to handle points of uncertainty properly.
  9. Candid Admission of Mistakes: You won’t always be right. Admit it, and review past work to see what you can learn for improvement the next time. “Try again. Fail again. Fail better.”

Security intelligence analysts should learn from previous work, instead of simply trusting in their own domain expertise and innate intelligence. Dr. Kent led the way, and even we non-spooks can still learn from his work.

3 reasons why big data matters for SIEM

"Nesting Dolls" by Andy Ihnatko

“Big data” isn’t just a buzzword, and it doesn’t just mean “big piles o’ bits”. It’s jargon, but it has a particular meaning:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Alternately, “big data” refers to data of such volume that storage, management, processing, and analysis present engineering challenges beyond traditional IT solutions. If it fits in, say, a traditional RDBMS setup like MySQL or Oracle, then it may be a lot of data, but it’s not “big data”.

This new tech has lots of useful applications in social policy, business intelligence, science, and IT, among others. In the SIEM world, we’ve got to start looking at applying some of this tech where it makes sense, for at least a few specific reasons:

  1. Traditional SQL databases don’t fit the data model. We don’t necessarily care in most SIEM implementations about meeting the ACID standard. Shoehorning our needs into what exists holds us back.
  2. Big data tech (specifically, NoSQL database design) allows us to focus on the area of CAP that really matters to us: Partition Tolerance. Of the remaining two, we can usually settle for Availability and eventual Consistency.
  3. IT organizations consistently experience significant budget pressure as organizations focus on reducing expenses. This applies even more to security, where we provide loss avoidance rather than growing top line revenues. We need architecture that allows us to use cheaper, commodity hardware while still enabling us to maintain appropriate performance.
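
To make the first point a little more concrete, here is a small sketch of why security events sit awkwardly in a fixed-column table. Every source brings its own fields, so the union of columns grows while each row fills only a few of them. The event shapes, field names, and the `where` helper are hypothetical, chosen only for illustration; this is plain Python over JSON lines, not any particular NoSQL product.

```python
import json

# Hypothetical log lines from three different sources, each with its own fields.
raw_lines = [
    '{"source": "firewall", "action": "deny", "src": "203.0.113.9", "dport": 445}',
    '{"source": "ids", "signature": "example-signature", "src": "203.0.113.9", "severity": 2}',
    '{"source": "auth", "user": "alice", "result": "failure", "host": "web01"}',
]


def load_events(lines):
    """Parse each line as a free-form document; no shared schema is enforced."""
    return [json.loads(line) for line in lines]


def where(events, **criteria):
    """Select events that carry every requested field with the requested value."""
    return [e for e in events
            if all(e.get(key) == value for key, value in criteria.items())]


def field_names(events):
    """Union of all field names seen across the events."""
    return sorted({name for event in events for name in event})


events = load_events(raw_lines)
print(field_names(events))                      # nine distinct fields for three events
print(len(where(events, src="203.0.113.9")))    # firewall and IDS records both match
```

Three events already need nine columns, and most cells in a relational table built for them would be empty. Add NetFlow, packet metadata, and application logs, and the mismatch only gets worse.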

We haven’t reached the point yet where we need to focus too strongly on particular aspects of “big data”. Do you need Hadoop? What analysis tasks fit map-reduce algorithms? Should you try to leverage Amazon EC2 or another cloud provider? As Jon Oltsik writes:

While “big data” will intersect with security intelligence, the actual “big data” technology aspects are irrelevant. CISOs need the analytics capabilities but really don’t care what’s under the hood. Let’s focus on data analysis and situational awareness and avoid a debate about OLAP, Massively-Parallel Processing (MPP), and Hadoop.

Those things will matter when building an implementation (e.g. to a vendor). SIEM users, though, should generally focus on what capabilities they actually want, such as data sources and analysis methods.
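
For anyone wondering what kind of task "fits map-reduce" in the question above, counting events per source address is the classic example: each record can be processed independently, and the per-key results combine by simple addition. The sketch below runs the three phases in a single process with made-up records. It is only meant to show the shape of the computation, not how Hadoop or any other framework actually schedules the work.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, 1) pair for each record; the key here is the source address."""
    yield record["src"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


# Hypothetical records; only the src field matters for this job.
records = [
    {"src": "10.0.0.5"},
    {"src": "10.0.0.6"},
    {"src": "10.0.0.5"},
]

counts = run(records)
print(counts)
```

Questions that need many passes over the data or tight interaction between records tend to fit this model less comfortably, which is part of why the choice belongs to whoever builds the implementation.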

Oltsik’s piece makes another cogent point about security intelligence:

Security intelligence demands more data. Early SIEMs collected event and log data then steadily added other data sources like NetFlow, packet capture, Database Activity Monitoring (DAM), Identity and Access Management (IAM), etc. Large enterprises now regularly collect gigabytes or even terabytes of data for security intelligence, investigations, and forensics. Many existing tools can’t meet these scalability needs.

Users will see this as the real driving force: to do the job effectively, the SIEM has to do more than just bring in firewall, IDS, and operating system logs. And it needs to support better exploratory data analysis, rather than just reporting and notifications.

I don’t know of many vendors that currently have products built on this approach, though I don’t doubt we’ll see a lot of them hurriedly slapping the label on their material even when it doesn’t fit: witness the APT debacle.

Scope expansion for data science

"Connecting to the Interweb Tubes" by Nick Wheeler

I’ve discussed my interest in data science and big data quite a bit on Twitter. This partly has to do with my contention that good SIEM and log analysis work should overlap significantly with data science, among other fields. It also has to do with my ongoing search for fulfillment in finding ways to work on stuff that matters (i.e. not pure infosec).

So then today I just asked the question straight out:

I got a bit of feedback from some of my usual Twitter crowd, encouraging me to simply grow the scope of this site. I have two concerns: one, will the (relatively small) existing reader base get frustrated with posts that have, at best, a tangential relationship to security? Two, will any new readers pigeonhole the blog – or me – as an information security blog, passing over the data content?

The sorts of things I intend to start including, whether here or elsewhere, include technical discussion of data analysis, walkthroughs of techniques as I’m exploring them myself, and applications in other fields. As an example, right now I have some processes running to analyze refugee trends based on data provided by the United Nations High Commissioner for Refugees.

Any thoughts, suggestions, or other pointers?

Data analysis will change the world

One of my favorite infosec thinkers, Andrew Hay, had a pair of recent posts that have given me lots to chew on.

First, he asked:

This provoked a wide-ranging conversation about what that means. We’ll find tremendous value in applying big data techniques to security data. (Actually, I think data analysis will change the world, but that’s a bit larger scope than this post can comfortably handle.) We can then start to bring in additional data feeds past what traditional SIEMs handle. Think along the lines of more OSINT, network flows, and possibly even business data. At that point, you can really start to grasp the qualitative and quantitative improvements to data protection.

The next day, he wrote an article in which he asked an oft-heard data analysis question: Where’s my ‘Minority Report’ dashboard?. We have to unpack that a little, though, because the data analysis scenes involved a few different useful things.

First, and perhaps most memorably, Cruise’s character used a gesture-based interface to work with the data he had available. As Hay notes, this tech has started to push down into consumer electronics like game consoles, but not generally into business applications like SIEM. While this might seem natural, we will have to move beyond the standard desktop metaphor and start to think of data as objects. It certainly won’t happen completely intuitively, but the long existence of similar ideas in various cultures (think mudras and sign language) and scientific research into the connection between words and gestures seem to indicate that we still have a lot of potential here.

Second, note how many disparate data feeds he had available. Apart from the fictional visualizations from the “precogs” (for which we can use surveillance video as a stand-in), he had social profiles, financial records, and more. While most of the entities we need to visualize aren’t always so human, we can assume some of the analogues I mentioned above for deploying “big data” tech. Data mining and machine learning will help here, particularly in knowledge discovery to hypothesize and test for correlations among the various data.

Third, the system latency seemed absurdly low. Try running a DB query on unstructured, near-realtime data, and tell me if it happens that immediately. While we’ve seen significant leaps in these areas, we need lots more advancement. Much of the tech today has started to move back towards a batch processing model rather than direct interaction and exploration, for example. Don’t think of this as just an engineering problem, because latency greatly matters when talking about trying to analyze data at anything remotely resembling the speed of thought.

Finally, the analyst clearly had excellent spatial reasoning skills. As younger generations continue to move into adulthood, we’ll likely see more applications of spatial reasoning. This means more research into data dimensionality: human brains don’t really visualize high-dimensional spaces very well, so we need to improve our models and analysts. It might turn out, for example, that we need to conceive of data as a hypercube as we drill down into specific nodes. Analysts already need to understand the foundations of graph theory when working in a lot of knowledge domains.

The future of data analysis excites me, and I really geek out over the possibilities. This has fractal-type potential: no matter whether we’re looking at data science from the MBA-typical “thirty-thousand foot view” or ångström altitude, we can find ways to change the world. (And if you’re working on this stuff and want some cross-domain thinking, let’s talk.)