Getting into the guts of mwcrawler

Earlier this week, my buddy Ken Pryor mentioned a project I hadn’t come across before.

So I went over and dug into mwcrawler. From the project README:

mwcrawler is a simple python script that parses malicious url lists from well-known websites (i.e. MDL, Malc0de) in order to automatically download the malicious code. It can be used to populate malware repositories or zoos.

It turns out that it really is pretty simple and hackish, which fits my needs perfectly. This is all a very experimental side project just to keep me amused during the (relatively) cold weather here in Texas.

Given how much I already love GitHub, I forked the project, then made a few improvements: support for a proxy (for OPSEC reasons) and a command-line option to specify the dump directory. Requiring the user to modify source just to change config options works fine for an alpha, but a little bit of polish goes a long way. I’ve also started implementing some logging to keep metadata (like the source URL for each file). And yes, I’ve submitted pull requests, but neither mine nor the user agent randomization patch from Ben Jackson has gotten any response from the project owner. Hopefully that will change now that the holidays have finally run their course.
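To make the command-line idea concrete, here is a minimal sketch of what proxy and dump-directory options could look like. It is not the code from the fork: it assumes Python 3 and the standard library only, and the option names (`--proxy`, `--dumpdir`), the default path, and both function names are hypothetical choices for illustration.

```python
import argparse
import urllib.request


def parse_args(argv=None):
    """Parse illustrative options; the flag names are assumptions, not mwcrawler's."""
    parser = argparse.ArgumentParser(description="Fetch samples from malicious URL lists")
    parser.add_argument("-p", "--proxy",
                        help="proxy URL, e.g. http://127.0.0.1:8118 (optional)")
    parser.add_argument("-d", "--dumpdir", default="/tmp/malware",
                        help="directory to save downloaded samples in")
    return parser.parse_args(argv)


def build_opener(proxy=None):
    """Return a urllib opener that sends HTTP and HTTPS requests through the proxy."""
    if proxy:
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    else:
        # An empty mapping disables proxies picked up from the environment.
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)
```

A caller would run `args = parse_args()` and then fetch every list through `build_opener(args.proxy).open(url)`, writing results under `args.dumpdir`, so nothing in the source needs editing to change either setting.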

Now that I have all this data, I want to do something with it. Just to mess around, I went with the old standby of ssdeep to find relationships. That’s by no means the final step; this weekend, for example, I’ll run the samples through the VirusTotal API to classify known ones by hash, and perhaps also incorporate something like pyew for clustered analysis to pull out interesting features. mwcrawler also features integration with thug, which I haven’t started running yet. Some bugs still exist, like unhandled exceptions when the script can’t reach a page, and it still depends on the semi-deprecated Beautiful Soup 3.
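Classifying samples by hash starts with computing a digest for every file in the dump directory. The sketch below shows one way to do that with the standard library, assuming Python 3; the function names are hypothetical, and the actual lookup against a service such as VirusTotal is deliberately left out.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory(dump_dir):
    """Map each regular file under dump_dir (recursively) to its SHA-256 digest."""
    return {
        str(path): sha256_file(path)
        for path in sorted(Path(dump_dir).rglob("*"))
        if path.is_file()
    }
```

Reading in chunks keeps memory use flat even for large samples, and the resulting dictionary can be fed to whatever lookup or logging step comes next.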

Still, my tiny little repository currently holds 344 PE32 executables totaling 227 MB (not counting other file types such as archives). As an extremely simple preview, even the basic fuzzy hashing mentioned above creates some interesting clusters (graph generated with awk and Maltego):

[Figure: mwcrawler-ssdeep]
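The graph above was built with awk and Maltego. As an alternative way to see the same grouping, the sketch below clusters samples in plain Python. It assumes the pairwise ssdeep match scores (0 to 100) have already been extracted into `(file_a, file_b, score)` tuples; the function name, the threshold of 50, and the sample names in the usage note are all hypothetical.

```python
from collections import defaultdict


def cluster_matches(matches, threshold=50):
    """Group samples into clusters from (file_a, file_b, score) tuples.

    Two samples share a cluster when a chain of matches scoring at or
    above the threshold connects them (single linkage via union-find).
    """
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for file_a, file_b, score in matches:
        root_a, root_b = find(file_a), find(file_b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for item in parent:
        clusters[find(item)].add(item)
    # Largest clusters first; ties broken alphabetically for stable output.
    return sorted((sorted(group) for group in clusters.values()),
                  key=lambda group: (-len(group), group))
```

For example, `cluster_matches([("a.exe", "b.exe", 88), ("b.exe", "c.exe", 61), ("d.exe", "e.exe", 20)])` puts `a.exe`, `b.exe`, and `c.exe` in one cluster and leaves `d.exe` and `e.exe` on their own, since their only match falls below the threshold.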

3 Responses to Getting into the guts of mwcrawler

  1. Nice post Kyle! I have included your forked version of mwcrawler in HoneyDrive 0.2. If you are interested in honeypots, malware, secviz, etc check it out at: http://bruteforce.gr/honeydrive and let me know if you have any suggestions and/or ideas. Regards, Ion.

  2. Pingback: Maltrieve: retrieving malware for research | Threat Thoughts
