Automatically generating wordlist mangling rules, part 2

In the previous episode, I described the general approach. It is a bit challenging, namely :

  • The input dictionary can get really large. For example, I’m using right now the OpenWall “all” dictionary, which is 3721255 lines long, for a total of 40MB, but started with a 781MB one.
  • The password list is large, as it is the Rockyou list. To keep things simple, I chose to sort -u the list, and it is now 14338943 lines long.
  • The problem is large enough that I don’t believe any scripting language will cut it, so it is C.

I wrote two programs : one of them would find the longest matching word in each password, and output it with the corresponding prefix, suffix, and mangling rule. Then another program would load that data and solve the coverage problem. My performance constraints was that I only had 12GB of RAM available for the programs to run.

In order to match the words, first approach was to build some kind of tree with all the input words, and match them in the password list. The node structure looked like that :

#define FIRST_CHAR ' '
#define LAST_CHAR '~'

struct s_node {
    char character;
    struct s_node * next[CHARSETSIZE];
    char wordend;

Then, in order to match a word in a given password, it was required to keep N pointers (where N is the maximum length of the dictionary words you want to match), and follow the tree. At position i, pointer (i%N) would be reinitialized to the root of that tree. It turned out this was quite fast, but also used up too much memory, as each node eats 776 bytes, which means I could only store 16.6M nodes in 12GB. The next array was replaced by an avl tree (I started using the libavl that ships with Ubuntu), and the memory savings were huge. The problem was that it became pretty slow, and still took up 4GB, meaning I could not run several processes at once. It was still enough to get my first results.

Finally, I settled for a rolling hash technique, keeping the mangled words in an avl tree in order to sift through the false positives. It is faster and only eats about a Gb of memory, which means I can run several processes at the same time.

In the next episode I will describe the software that solves the covering problem, and, finally, will post some results.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s