Tag Archives: GitHub

Maltrieve: retrieving malware for research

As I continued to hack on mwcrawler over the last month, I found that it didn’t really meet my needs for various reasons: slowness, difficulty of maintaining and adding sources, repeated grabbing of the same URL, and lack of response from the original author. So I’ve rewritten it and released Maltrieve, which (as the name indicates) retrieves malware directly from the sources listed at a number of sites. Improvements listed in the README include:

  • Proxy support
  • Multithreading for improved performance
  • Logging of source URLs
  • Multiple user agent support
  • Better error handling
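
To make the multithreading and repeated-URL points concrete, here is a minimal sketch of the general idea. It is not Maltrieve's actual code: the `crawl` function, the `fake_fetch` stand-in, and the example URLs are all hypothetical, and the fetch step is faked so the sketch runs without touching the network.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, workers=4):
    """Fetch each distinct URL once, in parallel; return {md5: (url, payload)}."""
    # Drop repeated URLs up front, keeping first-seen order.
    unique_urls = list(dict.fromkeys(urls))

    def safe_fetch(url):
        # One dead source should not take down the whole run.
        try:
            return url, fetch(url)
        except Exception:
            return url, None

    samples = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as unique_urls.
        for url, payload in pool.map(safe_fetch, unique_urls):
            if payload is None:
                continue
            digest = hashlib.md5(payload).hexdigest()
            # Keep the first source URL seen for each distinct payload.
            samples.setdefault(digest, (url, payload))
    return samples


# Hypothetical stand-in for a real download, so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/x.exe": b"payload-one",
    "http://b.example/y.exe": b"payload-one",  # same bytes, different URL
    "http://c.example/z.exe": b"payload-two",
}


def fake_fetch(url):
    return FAKE_WEB[url]  # raises KeyError for unknown URLs


urls = list(FAKE_WEB) + ["http://a.example/x.exe", "http://dead.example/q.exe"]
samples = crawl(urls, fake_fetch)
for digest, (url, _payload) in samples.items():
    print(digest, url)
```

Passing the fetch function in as a parameter keeps the threading and de-duplication logic separate from the network code, which is where proxy and user agent handling would live in a real tool.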

Right now, Maltrieve only looks at four meta-sources because two of the six in mwcrawler appear offline. But I have at least four more on deck, and mwcrawler didn’t parse all of its meta-sources correctly in any case. I also know of a few bugs that I haven’t figured out how to squash yet, but the core functionality works and it needs a broader audience to bang on it. Thus, I’ve tagged this version “beta-1”. Don’t rely on this for serious production, please.

If you use it, please let me know just so I can bask in the warm glow of productivity. The project itself remains under the GPL, of course. Suggestions, bug reports, etc. also would make me happy, whether via issues and pull requests on GitHub, contacting me on Twitter, or comments here.

Konig: malware, graph theory, and fuzzy hashes

As a small personal research and learning project, I spent a few hours this weekend writing Konig. This is intended to evolve into a framework for investigating relationships between the fuzzy hashes of a set of files (e.g. a corpus of malware gathered with mwcrawler) using graph-theoretical methods. Underneath, it basically just marries NetworkX and ssdeep.

At the moment, the code is fairly barebones: create the hash library based on files in a particular directory, then construct a graph of the relationships between those files where the similarity exceeds a user-specified threshold. Also, please keep in mind that my Twitter bio for a while just said “I write bad code”, and for good reason: I do. The GUI purely consists of a matplotlib window and needs a lot of work. (I have less experience with interfaces than almost anything, so keep your expectations even lower.) I’ve added some very basic information on the properties of the graph (order, density, etc.), as well as the ability to select the connected component that includes a node (file) of interest.
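
The threshold-graph idea can be sketched in a few lines. This is an illustration under loud assumptions rather than Konig's code: it uses only the standard library, `difflib.SequenceMatcher` stands in for ssdeep's comparison (it is not a fuzzy hash), a plain dict of sets stands in for a NetworkX graph, and the file names and signature strings are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in score from 0 to 100. NOT ssdeep; just something comparable in shape.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_graph(signatures, threshold):
    """Link every pair of files whose similarity meets the threshold."""
    graph = {name: set() for name in signatures}
    for a, b in combinations(signatures, 2):
        if similarity(signatures[a], signatures[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def density(graph):
    # Undirected density: 2E / (n * (n - 1)).
    n = len(graph)
    edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return 0.0 if n < 2 else 2 * edges / (n * (n - 1))


def component(graph, start):
    """Return the connected component containing the node of interest."""
    seen, stack = {start}, [start]
    while stack:
        for neighbor in graph[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Hypothetical signatures; in practice these would be fuzzy hashes of files.
signatures = {
    "a.exe": "abcdefghij",
    "b.exe": "abcdefghiX",
    "c.exe": "0123456789",
}
graph = build_graph(signatures, threshold=80)
print("order:", len(graph))
print("density:", density(graph))
print("component of a.exe:", sorted(component(graph, "a.exe")))
```

Comparing every pair is quadratic in the number of files, which is tolerable for a few thousand samples but would need rethinking for a much larger corpus.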

Example output:

kmaxwell@gauss:~/src/konig$ python konig.py -d ~/data/mwcrawler/unsorted/PE32 -t 90 -i PE32.json
Loading saved hash database
Calculating fuzzy hashes for all files in /home/kmaxwell/data/mwcrawler/unsorted/PE32...
Creating graph structure for files with similarity >= 90...
Name:
Type: Graph
Number of nodes: 2932
Number of edges: 265625
Average degree: 181.1903
Graph density: 0.0618185990375
Preparing plot of graph structure...
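
As a quick sanity check, the average degree and density in that output follow from the node and edge counts alone, assuming a simple undirected graph. The snippet below recomputes them from the standard formulas 2E/n and 2E/(n(n-1)); the variable names are mine.

```python
# Counts taken from the example output above.
nodes, edges = 2932, 265625

# Each edge contributes to the degree of two nodes.
average_degree = 2 * edges / nodes

# Fraction of all possible node pairs that are actually connected.
density = 2 * edges / (nodes * (nodes - 1))

print(f"Average degree: {average_degree:.4f}")
print(f"Graph density: {density}")
```

Both values agree with the reported figures, so about 6% of all possible file pairs scored 90 or higher.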

Konig screenshot

The goals here include refreshing my knowledge of graph theory, as the last time I seriously studied this stuff, I think the OJ Simpson verdict hadn’t come back. Also, this code will help pave the way for some related work I have slated to use mwcrawler and vxcage together. In fact, I really think of Konig as a proof-of-concept implementation to throw away before doing something more useful and robust.

Sapho: threat intelligence tool

Dunecat

We all need Sapho juice sometimes.

Poking around GitHub one night for interesting projects, I ran across Sapho. I dug into it more and found that my fellow tweep Scott Roberts had written it, which only heightened my interest.

Sapho was built as an off hours project to manage intelligence developed from computer network defense activities and third party sources. Building up on the considerable resources of DokuWiki Sapho automatically generates a framework of wiki resources for capturing and analyzing cyber threat intelligence and responding.

Sapho, as I understand it, consists solely of a template generator for DokuWiki to help you track intrusion campaigns, adversaries & groups, and targeted malware. Unlike Collective Intelligence Framework and other tools, Sapho primarily exists as a way for humans to review the intelligence rather than other systems. If I tell another analyst that a given intrusion appears tied to group alpha, for example, then he can easily review what we know about them specifically. Of course, intelligence groups with even basic competence do this to some degree already, but Sapho allows you to create a common structure for these data.

Given the tool’s simplicity, then, we could extend it in a lot of useful ways. Scott outlines a few other potentially-related tools on the project site, like log2timeline, Cuckoo Sandbox, and Maltego / Casefile. For example, Sapho could automatically ingest the reports from these and reformat them into DokuWiki syntax. I can imagine an output plugin for CIF that does something similar for its data.

Essentially, Sapho could become a tool to transform analytical output into a common human-readable format. Taking it a step further, it could recognize certain indicator types like a hash, IP address, or similar, and automatically create wiki pages for them as a sort of correlation method.
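
A rough sketch of that indicator-recognition idea follows. Nothing here comes from Sapho itself: the regular expressions are deliberately simple, the `indicators` namespace and the function names are hypothetical, and the only DokuWiki feature assumed is its `[[page|label]]` internal link syntax.

```python
import re

# Simple patterns; the IPv4 one does not check that each octet is <= 255.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[0-9a-fA-F]{64}\b"),
    "sha1": re.compile(r"\b[0-9a-fA-F]{40}\b"),
    "md5": re.compile(r"\b[0-9a-fA-F]{32}\b"),
}


def find_indicators(text):
    """Return a sorted list of (kind, value) pairs found in free text."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            found.add((kind, match.lower()))
    return sorted(found)


def wiki_link(kind, value):
    # Hypothetical layout: one page per indicator under indicators:<kind>:.
    # Dots become underscores just to keep the page ID simple.
    page = value.replace(".", "_")
    return f"[[indicators:{kind}:{page}|{value}]]"


report = "Dropper d41d8cd98f00b204e9800998ecf8427e beaconed to 192.0.2.10."
for kind, value in find_indicators(report):
    print(wiki_link(kind, value))
```

Because every report mentioning the same hash or address would link to the same page, the wiki's backlinks would do the correlation work for free.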

The approach here does not really scale to large databases – but I don’t think it should. This sort of intelligence analysis works best when looking at the operational level rather than a very large scope like that of ThreatExpert. And since the tool uses the Simplified BSD license, you can take the idea and even the basic code and turn it into whatever works for you.

Benefits of git

I finally created an account on GitHub, partly for a work-related project and partly for DCPU-16 projects. I’ve quickly fallen in love with the site and with git in particular. While I’m still a git beginner, here are a few things I’ve been messing with.

  • Git Immersion: An excellent tutorial in core git usage. Slightly OS X focused, but only on the margins. Uses Ruby examples but you don’t actually need to know Ruby (or even use their exact text). Take a couple of hours and it will be worthwhile.
  • Using git for /etc: If you’re not already keeping /etc in version control of some sort, you should go back to Sysadmin School. This technique avoids lots of problems when tweaking a system’s configuration and needing to go back to a non-broken state. This won’t necessarily use all of git’s features: e.g. you might not need a remote repository, but then again it can provide a handy way to keep several systems configured more-or-less identically.
  • Using git for /home: Many people who work in Unix-derived environments keep their documents in text or text-like files, such as XML and similar. Even if you have a lot of binary files, git can usually handle them intelligently. I don’t add the files in my Downloads folder, for example, but version histories can help me find previous thoughts or (rarely) save a corrupted file.
  • Installing open source software: GitHub has lots of useful code on it, some of which may not have binaries available or may simply appreciate a helping hand. For example, selfspy “continuously monitors and stores what you are doing on your computer”, in a manner vaguely reminiscent of Stephen Wolfram’s personal analytics. Even if you don’t have the time or ability to dig deeply into a project, you can likely help with documentation or testing – things hackers often don’t enjoy.

Most importantly, now that a lot of my workflow revolves around git, I feel comfortable again. Working with the Unix philosophy fits my mental models better, and git enables that in many ways.