Automatically generating wordlist mangling rules, part 3

The second part is the most complicated : it parses the output of the previous program, consolidates it, and solves the coverage problem.


This program starts with a line looking like that :

/i si1 /E sE3 Q    daniel13    dani        el13    11475

It means that with rule “/i si1 /E sE3 Q”, the string “el13” was generated by John, and was the longest found in the password “daniel13”. It was originally “eliE” or something like that in the source dictionary. The corresponding prefix is “dani”, and the suffix is “”. “daniel13″ was found at the line 11475 of the password file.

The program parses this, creates a rule ( /i si1 /E sE3 Q A0″dani” ), and adds password #11475 to the rule coverage.

Coverage problem

Once all the lines are parsed, it has a data structure with all the rules, and each rule has another data structure with all the passwords found. The program then proceeds like this :

  1. drop all rules where the coverage is less than a certain quantity of passwords
  2. find the rule with the largest coverage and print it (the max rule)
  3. remove all passwords in the max rule from all the other rules
  4. start again except if there are not rules to work on

My first approach was to use an AVL tree to represent all rules, and an AVL tree in each rule to represent coverage. A bit field could not be used because of the huge amount of memory it would require. It was slow as hell, so I improved the following items.

Streamlined process

The previous four steps can be done in four distinct loops. They can also be done in one. Not much to add on this …

Password removal optimization

At first, I iterated through the passwords in the max rule, and checked their presence in the other rules. It was of course stupid as all other rules are guaranteed to have less passwords in them than the max rule, so it was much more efficient the other way around. A bloom filter also helped reducing the quantity of searches in the large rule tree.

AVL to linked list

It was useless to keep the rules in an tree once the parsing was done, so I converted them into a linked list after the first pruning. Once that was done, the coverage problem would be solved in mere seconds.

Faster parsing

With an increase in test rules, it turned out that the loading and parsing part took the most time. An obvious solution was to multithread the loading part, but it also became obvious that this was really inefficient. Indeed, every time I added a rule to test, I could compute the string matching (see part 2) once, but then I had to reload all the rules to get the final result.

Makefile to the rescue

I now work with a makefile, which gives me free parallelization. There is a new step involved : the parsing is done by a “slimming” process that reads a single file, consolidates its content just as previously, then prunes it and outputs a parsed version in binary format. That way, adding a rule leads to much shorted analysis time. Indeed, the final steps now only reads a couple hundreds of megabytes of binary data instead of parsing 30 gigabytes of ascii data.

In the next episode I will post my results.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s