It’s Clear We Need A Sorting Hat

A few of my earlier posts detailed some of the methods that I use to find malware. Those methods all essentially required you to actively hunt for malware. The exceptions were my articles on honeypots, where we can “let the malware come to us”.

Recently, I’ve been going through a really good classic book on malware analysis entitled “Malware Analyst’s Cookbook and DVD” by Michael Hale Ligh, Steven Adair, Blake Hartstein and Matthew Richard. The book has some really great tips, tricks and methods for finding, analyzing and understanding malware. Given that the book was published in 2011, it’s not surprising that some of the techniques and tools it presents have been superseded. Nonetheless, it is still a valuable resource and highly recommended.

Chapter 2 in the book discusses using honeypots to collect malware. I’ll state upfront that this chapter is what initially piqued my interest and prompted me to begin setting up my own honeypots. The first page of that chapter mentioned but didn’t discuss a tool called mwcollect. I had heard of this tool from other sources as well so I was curious. I quickly discovered that mwcollect was a tool that went out to a number of public malware repositories and pulled down malware samples. I also learned that a number of its sources were no longer accessible and that mwcollect was no longer being maintained.

A little more searching led me to a tool called mwcrawler created by Ricardo Diaz which seemed very similar to mwcollect but maybe a bit more recent. In fact, mwcrawler seems to use the same list of malware sources as mwcollect. This old video from TekDefense showed some promising functionality (skip ahead to the 2:40 mark). Alas, mwcrawler looks like it hasn’t been updated in about 4 years.

I found another video by TekDefense which discusses Mastiff and Maltrieve (he starts talking about Maltrieve at the 4:20 mark). Maltrieve is a fork of mwcrawler by Kyle Maxwell which has similar functionality but does things more quickly. You can find Maltrieve here. This is where I settled. In a short time, Maltrieve had downloaded quite a few malware samples.

So now, between the samples that I found using the methods discussed in previous posts, the downloads from my honeypots and the malware that I was able to grab with Maltrieve, I was left with a LOT of malware – often just named with a long file hash (see image below). Put another way, my somewhat obsessive task of collecting malware had led to another problem – how do I classify and make sense of all of this?


This is a natural progression from our malware search and a perfect segue into chapter 3 of the Malware Analyst’s Cookbook and DVD which is entitled “Malware Classification”.

The main ideas put forth by the book to tame this problem are to use ClamAV and YARA to scan the collection and identify which malware we actually have. While this did give me some better insight into what I had in my malware dump folder, it didn’t do any active classification – I was still left with a large malware dump folder. What I really wanted to do was create a malware zoo. I needed to put the malware in the appropriate cages.

I noticed a script in the Maltrieve repository called maltrievecategorizer. It uses the output of the Linux file command to classify all of the files in a folder and move them into subfolders by type. You can run it against the folder containing your malware collection and it will sort the files for you. Again, here’s another video from TekDefense that shows the categorizer in action. A blog article on the script is available here. The original maltrievecategorizer can also be downloaded from Kyle Maxwell’s GitHub repository here.

I liked the way that this script cleaned up my malware folder and how things were neatly placed in subfolders indicating their file type. One thing that the classifier does that I did not like is that it adds more nested subfolders based on size below each file type folder. I get the reason for doing this but I found that I lost some ready visibility as a result. For instance, when I look at the PE32 folder, I would rather see the files that fit that classification instead of seeing 4 subfolders named small, medium, large and xlarge. It was pretty easy to take that part out of the script so that’s what I did. If you’re interested, you can get my modified script here.
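
For anyone who wants to see the idea rather than the script itself, here is a minimal Python sketch of sorting a folder by the output of the file command, with no size-based subfolders. It is not the maltrievecategorizer code. The function names and the rule of using the first word of the file description as the folder name are assumptions made for illustration only.

```python
import re
import shutil
import subprocess
from pathlib import Path


def describe_with_file(path):
    """Return the one-line description printed by `file -b` for path."""
    result = subprocess.run(
        ["file", "-b", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def category_from_description(description):
    """Turn a `file` description into a folder name (illustrative rule only)."""
    # Look at the text before the first comma,
    # e.g. "PE32 executable (GUI) Intel 80386".
    head = description.split(",")[0]
    # Keep only the first word, so the folder becomes "PE32", "ELF", "HTML", ...
    words = head.split()
    first_word = words[0] if words else "unknown"
    # Replace anything that would be awkward in a folder name.
    return re.sub(r"[^A-Za-z0-9_+-]", "_", first_word)


def sort_folder(folder, describe=describe_with_file):
    """Move every regular file in folder into a subfolder named after its type."""
    folder = Path(folder)
    moved = {}
    # Build the full list first, because subfolders are created while we loop.
    for sample in sorted(p for p in folder.iterdir() if p.is_file()):
        target_dir = folder / category_from_description(describe(sample))
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(sample), str(target_dir / sample.name))
        moved[sample.name] = target_dir.name
    return moved
```

Calling sort_folder on a dump folder (the path is whatever you use locally) leaves one subfolder per detected type with the samples directly inside it. The describe parameter exists only so the sorting logic can be tried without the file command installed.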

Overall, I’m happier with the way things are stored now. I have several methods for downloading a lot of malware and I have a fast way to subdivide it all into separate folders by file type. I can run a ClamAV or YARA scan against any (or all) of these folders to see which samples in my collection are detected. This is workable. Probably the only thing I would still like is a way to classify AND store each sample according to detected malware type. That’s a project that I’m sure I’ll address in the future.
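
As a possible starting point for that future project, here is a rough Python sketch that groups files by the signature name ClamAV reports. It assumes you have saved the text output of a clamscan run, where each detection appears as a line of the form "path: SignatureName FOUND". The function name and the sample signature names in the comments are made up for illustration, and this only groups the paths; it does not move anything.

```python
from collections import defaultdict


def group_by_signature(clamscan_output):
    """Map each reported signature name to the list of files it matched."""
    groups = defaultdict(list)
    suffix = " FOUND"
    for line in clamscan_output.splitlines():
        line = line.strip()
        # Clean files end in "OK" and summary lines have no "FOUND" suffix.
        if not line.endswith(suffix):
            continue
        # Split on the last ": " so that colons inside the path are harmless.
        path, _, signature = line[: -len(suffix)].rpartition(": ")
        if path:
            groups[signature].append(path)
    return dict(groups)


# Example input (illustrative paths and signature names, not real results):
#   /zoo/PE32/abc: Win.Trojan.Example-1 FOUND
#   /zoo/PE32/def: OK
```

From a mapping like this, creating one folder per signature and moving the matching samples into it would be a small extra step.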

I would be remiss if I didn’t mention theZoo which can be downloaded here. This project is really close to what I want. It’s basically a malware framework.


Here’s a short video that highlights theZoo’s abilities.

The initial installation does download all of the malware samples to your hard drive – which I want. The only thing it doesn’t do (as far as I could tell) is let you add your own samples to the repository.

Honestly, for most “normal” people (those who aren’t obsessively downloading malware like I am), theZoo is probably a very good answer.

Well, now that we’re knee-deep in malware, we’ve got work to do!
