Complexity and Test-first 3

The story came from here.

At XPDay last year I ran a session exploring some of the ideas about complexity and TDD that I've discussed here before. Attendees at the session got a very early copy of a tool, "measure", which calculates some of the statistics I'm interested in. From that came some good feedback, which I've incorporated into a new version of the tool, available here. Note to XPDay attendees: this is a much more informative bit of software than the one you have, and I suggest you upgrade.

The session will be repeated at Spa this year.

New Tool Features

Measure now shows the parameters for both a Zipf and a Pareto distribution modelling how complexity is distributed across the codebase. It also shows the R-squared value for the regression through the data points. This was a particular request from the XPDay session and, with due care and attention, can be used to get some idea of how good a fit the distributions are.
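For readers wondering what the two views might look like as data, here is a rough sketch in Python. It is not the code inside measure. It assumes that the Zipf view counts the methods with exactly a given complexity and that the Pareto view counts the methods with at least that complexity. The function names and the sample values are made up for illustration.

```python
from collections import Counter


def zipf_points(complexities):
    """Pairs of (complexity, number of methods with exactly that complexity).

    Assumption: this is only one plausible reading of the "Zipf" view.
    """
    counts = Counter(complexities)
    return sorted(counts.items())


def pareto_points(complexities):
    """Pairs of (complexity, number of methods with at least that complexity).

    Assumption: the "Pareto" view is the cumulative form of the same data.
    """
    counts = Counter(complexities)
    points = []
    remaining = len(complexities)
    for value in sorted(counts):
        points.append((value, remaining))
        remaining -= counts[value]
    return points


# Hypothetical per-method cyclomatic complexities, not taken from any real codebase.
sample = [1, 1, 1, 1, 2, 2, 3, 5]
print(zipf_points(sample))
print(pareto_points(sample))
```

Plotted on log-log axes, each set of points would then get a straight line fitted through it. The cumulative form tends to be smoother, because one odd method moves every count to its left a little rather than adding a single stray point.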

Measure also now accepts a -h flag and will display some information about how to interpret the results.

New Results

Using this updated tool, I examined some more codebases (and reëxamined some from before). This table shows the Zipf distribution parameters. It's getting to be quite difficult to find Java code these days that doesn't come with tests, so some of the codebases with an N shown there are ones from SourceForge that haven't had a commit made for several years.
Codebase              Slope  Intercept  R^2   Automated unit tests?
Jasml 0.1             0.95   3.52       0.73  N
Smallsql 0.16         1.54   5.87       0.80  Y
m-e-c schedule α3-10  1.69   4.94       0.92  N
Xcool 0.1             1.93   4.60       0.84  N
MarsProject 2.79      2.33   7.9        0.96  Y
Log4j 1.2.14          2.43   7.34       0.96  Y

So, the lower slope and higher intercept of JRuby was a surprise. But note also that the R-squared is quite low. This causes me to refine my thinking about these (candidate) metrics. A low R-squared means that the linear regression through the (log-log) points describing the complexity distribution is a poor fit. This suggests that the actual distribution is not modelled well by a Zipf type of relation. Maybe there's something about language implementations that produces this? It's worth checking against, say, Jython.
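To make "linear regression through the (log-log) points" concrete, here is a minimal sketch of an ordinary least-squares fit that reports a slope magnitude, an intercept and an R-squared. It is an illustration, not the code inside measure. The choice of natural logarithms and the name loglog_fit are my assumptions, and the sample points are synthetic.

```python
import math


def loglog_fit(points):
    """Least-squares line through (ln x, ln y) for positive (x, y) pairs.

    Returns (slope magnitude, intercept, R-squared).
    Assumption: natural logs; a different base changes the intercept only.
    """
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need at least two distinct x values and two distinct y values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R-squared is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return abs(slope), intercept, r_squared


# Synthetic points lying exactly on y = 1000 * x^-2, so the fit should be perfect.
exact = [(1, 1000.0), (2, 250.0), (4, 62.5), (5, 40.0)]
slope, intercept, r_squared = loglog_fit(exact)
print(round(slope, 2), round(intercept, 2), round(r_squared, 2))
```

An R-squared near 1 says the points sit close to a straight line on log-log axes, which is what a Zipf type of relation predicts. A low value says the fitted slope and intercept summarise a line that the data do not really follow.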

Meanwhile, those codebases with a high R-squared (say, 0.9 or above) seem to confirm my hypothesis that (now I must say "in the case of codebases with very strongly Zipf distributions of complexity") having tests produces a steeper slope to the (log-log) linear regression through the distribution. And it looks as if a slope of 2.0 is the breakpoint. Still lots more to do, though.
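The refined hypothesis can be written down as a small rule, sketched here in Python. The thresholds are the 0.9 and 2.0 mentioned above, and the rows are the figures as I read them from the table. The function names are invented for this illustration and are not part of measure.

```python
R2_THRESHOLD = 0.9      # "say, 0.9 or above"
SLOPE_BREAKPOINT = 2.0  # the apparent breakpoint


def predict_tests(slope, r_squared):
    """Return True or False for "has tests", or None when the Zipf fit is too weak to say."""
    if r_squared < R2_THRESHOLD:
        return None
    return slope >= SLOPE_BREAKPOINT


def is_counterexample(slope, r_squared, has_tests):
    """A well-fitted codebase with tests but a slope below the breakpoint."""
    return r_squared >= R2_THRESHOLD and has_tests and slope < SLOPE_BREAKPOINT


# (codebase, slope, R-squared, automated unit tests?) as read from the table.
rows = [
    ("Jasml 0.1", 0.95, 0.73, False),
    ("Smallsql 0.16", 1.54, 0.80, True),
    ("m-e-c schedule α3-10", 1.69, 0.92, False),
    ("Xcool 0.1", 1.93, 0.84, False),
    ("MarsProject 2.79", 2.33, 0.96, True),
    ("Log4j 1.2.14", 2.43, 0.96, True),
]

for name, slope, r_squared, has_tests in rows:
    predicted = predict_tests(slope, r_squared)
    verdict = "no prediction" if predicted is None else ("agrees" if predicted == has_tests else "disagrees")
    print(name, verdict)
```

Three of the six rows fall below the R-squared threshold and so get no prediction at all. That is a reminder of how little data this rule rests on so far.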

Further Investigation

I'd be particularly interested to hear from anyone who finds a codebase with an R-squared of 0.9 or more and tests, but a slope less than 2.0.

The discovery of codebases whose complexity distribution is not well modelled by Zipf is not a surprise. The fact that such codebases do not match the "tests => steeper slope" hypothesis is very interesting, and suggests strongly that the next feature to add to measure is a chart of the actual complexity distribution to eyeball.

Note to publishers of code

This sort of research would be impossible without your kind generosity in publishing your code, and I thank you for that. If your code appears here, and has been given a low intercept, this should not be interpreted as indicating that your code is of low quality, merely that it does not exhibit particularly strongly the property that I'm investigating.

The Story continues here.


Nic McPhee said...

Cool tool - thanks for sharing! (I came here from a recent post by Uncle Bob.)

I ran this on a code base we've developed over about a decade to support our work in evolutionary computation and got:

Zipf Distribution Parameters
* slope magnitude: 2.42
* intercept: 5.98
* R-squared: 0.85
Pareto Distribution Parameters
* slope magnitude: 2.09
* intercept: 6.58
* R-squared: 0.93

The weaker R-squared on the Zipf appears to be because of a higher than expected number of methods with very high CC (> 70), so the line drifts up some on the right.

This code has gone through several generations and major re-writes, and started before we'd ever heard of unit testing (or agile). Some of the subsequent re-writes and extensions have been TDD, other sections have been tested after the fact, and others (especially highly stochastic sections) frankly aren't well tested :-(.

I don't know to what degree these numbers are interesting from your perspective, but I certainly found it interesting to run measure on our code. Thanks for the info.

keithb said...

Hi Nic,
Thanks for your interest, and for your numbers. What I've seen before is that parts of a codebase that have been developed in different ways can have very different numbers, so you might find it illuminating to dig around a bit.