The second part is the most complicated: it parses the output of the previous program, consolidates it, and solves the coverage problem.
This program's input consists of lines looking like this:
/i si1 /E sE3 Q daniel13 dani el13 11475
It means that with the rule “/i si1 /E sE3 Q”, the string “el13” was generated by John, and was the longest match found in the password “daniel13”. It was originally “eliE” or something like that in the source dictionary. The corresponding prefix is “dani”, and the suffix is “”. “daniel13” was found at line 11475 of the password file.
The program parses this, creates a rule (/i si1 /E sE3 Q A0"dani"), and adds password #11475 to that rule's coverage.
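A sketch of that parsing step in Python, under the assumption (taken from the example line above) that the last four fields are always the password, the prefix, the matched string, and the line number; `parse_line` and the `coverage` map are names made up for this example:

```python
from collections import defaultdict

# rule string -> set of password line numbers it covers
coverage = defaultdict(set)

def parse_line(line, coverage):
    # The last four fields are fixed: password, prefix, matched string,
    # and line number in the password file; everything before them is
    # the John rule itself.
    rule, password, prefix, matched, lineno = line.rsplit(None, 4)
    # Append the prefix as an A0"..." (prepend) command to the rule.
    # A non-empty suffix (password[len(prefix) + len(matched):]) would
    # become an Az"..." command as well; this sketch skips that case.
    full_rule = '%s A0"%s"' % (rule, prefix)
    coverage[full_rule].add(int(lineno))
    return full_rule

parse_line('/i si1 /E sE3 Q daniel13 dani el13 11475', coverage)
```

`rsplit` with a maximum of four splits keeps the variable-length rule intact on the left, so rules with any number of commands parse the same way.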
Once all the lines are parsed, it has a data structure with all the rules, and each rule has another data structure with all the passwords found. The program then proceeds like this:
- drop all rules whose coverage is below a certain number of passwords
- find the rule with the largest coverage and print it (the “max rule”)
- remove all passwords covered by the max rule from all the other rules
- start again, unless there are no rules left to work on
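The steps above are a greedy set cover; a minimal Python sketch (the function name, the rule names, and the threshold are made up for the example):

```python
def solve_coverage(rules, min_coverage):
    """Greedy set cover: rules maps a rule string to the set of
    password line numbers it cracks. Returns the rules selected,
    most-covering first, with their coverage at selection time."""
    # Drop all rules whose coverage is below the threshold.
    rules = {r: pw for r, pw in rules.items() if len(pw) >= min_coverage}
    picks = []
    while rules:
        # Find the rule with the largest coverage (the "max rule").
        best = max(rules, key=lambda r: len(rules[r]))
        covered = rules.pop(best)
        picks.append((best, len(covered)))
        # Remove the max rule's passwords from all the other rules
        # and prune the ones that fall below the threshold.
        for r in list(rules):
            rules[r] -= covered
            if len(rules[r]) < min_coverage:
                del rules[r]
    return picks

print(solve_coverage({
    'sE3 Q': {1, 2, 3, 4},
    '/i si1': {2, 3},
    'A0"dani"': {5, 6, 7},
}, min_coverage=2))
```

In this toy run, “/i si1” disappears after the first pick because both of its passwords were already covered by “sE3 Q”.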
My first approach was to use an AVL tree to represent all rules, and an AVL tree in each rule to represent its coverage. A bit field could not be used because of the huge amount of memory it would require. It was slow as hell, so I worked on the following optimizations.
The previous four steps can be done in four distinct loops, but they can also be done in a single one. Not much more to add on this one…
Password removal optimization
At first, I iterated over the passwords in the max rule and checked for their presence in each of the other rules. That was of course backwards: all other rules are guaranteed to hold fewer passwords than the max rule, so it is much more efficient the other way around. A Bloom filter also helped reduce the number of searches in the large rule tree.
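The reversed iteration can be sketched like this (a Python sketch, not the original code; the Bloom filter is reduced to a single 1024-bit mask with one hash function, and `remove_covered` is a made-up name):

```python
def remove_covered(max_passwords, other_rules):
    """Remove every password covered by the max rule from the other
    rules, iterating over the smaller sets and probing the large one."""
    # Cheap pre-filter standing in for the Bloom filter: a miss here
    # guarantees the password is not in the max rule, so the more
    # expensive lookup in the big structure is skipped entirely.
    bloom = 0
    for pw in max_passwords:
        bloom |= 1 << (hash(pw) % 1024)
    for passwords in other_rules:
        # Iterate the small set, probe the large one.
        for pw in list(passwords):
            if (bloom >> (hash(pw) % 1024)) & 1 and pw in max_passwords:
                passwords.discard(pw)

max_rule = {1, 2, 3, 4}
others = [{2, 5}, {3}]
remove_covered(max_rule, others)
```

With an AVL tree holding the max rule's coverage, each skipped probe saves an O(log n) descent, which is where the Bloom filter pays off.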
AVL to linked list
It was useless to keep the rules in a tree once the parsing was done, so I converted them into a linked list after the first pruning. Once that was done, the coverage problem was solved in mere seconds.
As the number of test rules grew, it turned out that the loading and parsing part took most of the time. An obvious solution was to multithread the loading, but it also became obvious that this was really inefficient: every time I added a rule to test, I could compute the string matching (see part 2) once, but I then had to reload all the rules to get the final result.
Makefile to the rescue
I now work with a makefile, which gives me parallelization for free. There is a new step involved: the parsing is done by a “slimming” process that reads a single file, consolidates its content just as before, prunes it, and outputs a parsed version in a binary format. That way, adding a rule leads to a much shorter analysis time: the final step now only reads a couple hundred megabytes of binary data instead of parsing 30 gigabytes of ASCII data.
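A sketch of what such a makefile can look like (the directory layout and the tool names `slim` and `solve` are made up for the example):

```makefile
RULES   := $(wildcard matches/*.txt)       # one match file per tested rule
SLIMMED := $(RULES:matches/%.txt=slim/%.bin)

all: results.txt

# Parse, consolidate and prune a single match file into binary form.
# "make -j" runs as many of these in parallel as there are cores, and
# only reruns the ones whose input changed.
slim/%.bin: matches/%.txt
	./slim $< > $@

# The final step only reads the small binary files.
results.txt: $(SLIMMED)
	./solve $^ > $@
```

Adding a rule then means adding one file under `matches/`: make rebuilds that single `.bin` and relinks the final result, leaving every other slimmed file untouched.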
In the next episode I will post my results.