Sunday, 22 November 2015

Revisiting the Cone

In a Twitter discussion with folk on the #NoEstimates thread again (I don't know why I go back there), Henrik Ebbeskog (@henebb) stated there is no reason to fix time. Absolutely, though as I've mentioned to others, fixing time means the other dimensions of software delivery [are allowed to, or must] vary. This isn't a surprise to most of us in the agile space, nor is it a surprise to management scientists.

As part of the discussion, Henrik mentioned that you could fix time to 1 second, and it reminded me of a discussion I once had with an old Senior PM (at the time), Duncan McCreadie. It centred around agile development some 5 years ago, and in it he stated that if he wanted to monitor and realign the delivery, he'd look to bring delivery rates down to every day or every 4 hours. He was spot on with this and I agree.

The reason is in part the Cone of Uncertainty I keep banging on about mathematically, and even get heckled over, but that doesn't change the maths, nor does it change the empirical numbers, which back it up.

Why Delivery Rate Matters

If you deliver software continuously, each delivery gives you knowledge about your process that you didn't have before. What has happened has happened; you can't change that, but you can both learn from it and treat its variance as zero (it's happened, so there is no variance; it's a certainty), and if you're doing that, you can make things happen faster.

This is what I illustrate in:

http://goadingtheitgeek.blogspot.co.uk/2014/10/cone-head.html

In essence:
Each flip of a coin is a delivery. Once you've delivered a coin flip, its outcome is known, and within a fixed number of delivery cycles that changes the expected outcome for all coin flips and their potential overall variance. Now substitute 'coin flip' with 'story' and 'all coin flips' with 'major release' and reread the above.
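
To put a number on it (a small worked example, not project data): suppose a major release is $n$ fair coin flips (stories), each scoring 1 for heads. Before any are flipped, the expected total is $n/2$ with variance $n/4$. Once $k$ flips have landed, showing $h$ heads, those outcomes carry no variance at all, so:

\[E[\text{total} \mid k \text{ flips seen}] = h + \frac{n-k}{2}, \qquad \operatorname{Var}[\text{total} \mid k \text{ flips seen}] = \frac{n-k}{4}\]

Every delivery shifts the expectation towards what actually happened and shrinks the remaining variance, until it reaches zero at the final flip.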

As you can see from the graphs of actual project data shown at:

http://goadingtheitgeek.blogspot.co.uk/2015/04/monitoring-value.html

This applies across the board. There isn't a real delivery process in the entire world which doesn't follow this rule. The only possible exception is if the process extends to infinity, since that just pushes the variance out to perpetuity, which follows the last [sketched] graph at:

http://goadingtheitgeek.blogspot.co.uk/2015/04/lean-agile-metrics-like-it-or-not-stats.html

However, you'll note that nothing exists for infinite time. Even the Universe isn't considered able to exist for infinite time, and I'd argue that your project budget would run out long before then. By monitoring faster, as well as employing lean architectures, you make the most of that budget and make it easier to either realign or find a new direction.



E

Sunday, 15 November 2015

Reading the Maximum of an Array

There's an interesting question on Quora doing the rounds at the moment. It's entitled:

"What's the fastest algorithm to find the largest number in an unsorted array?"

Traditional O(N)

Repeatedly finding the largest value of any array supplied to a function is O(N) per call: you have to look at each element once, checking whether it is a new maximum. However, can we do better?

I postulated this left-of-field idea in response to a comment from the original poster, who said their lecturer had asked them "what is it that an array automatically has which may be used to improve on this?"


"How about setting the array 'Base' address (pointer to the address of the first item)  to the largest number? After all, all you want is the one dereferenced pointer value. Not to sort or search the array. So you're then effectively always reading the value at address array[0]. Therefore O(1) each time"

In principle, this is the same as other answers that have appeared, which propose storing the index of the maximum location as you build the array.

The Building

Building an array somewhere takes at least constant time. Lower bounds like this are written using the Omega asymptotic notation, so:

\[\Omega(1)\]


The Reading

Here is where it gets interesting. Normally, to find the maximum value of an array, we iterate through each element and compare it to our running maximum, i.e. the following algorithm (in C#):

        // Requires: using System.Collections;
        public static int Maxi(ArrayList array)
        {
            // Start from int.MinValue so arrays holding only negative values are handled correctly.
            var maxValue = int.MinValue;
            for (var i = 0; i < array.Count; i++)
            {
                // Compare each element once against the running maximum.
                if ((int)array[i] > maxValue)
                    maxValue = (int)array[i];
            }

            return maxValue;
        }

This is an $O(n)$ algorithm.

Efficiency Step

My proposal was essentially to create the efficiency by combining the two operations into one: build the array, but keep hold of the maximum value as you build it. You can encapsulate this in a class, which we'll call MyMaximus just for fun, and it is essentially:

    // Requires: using System; using System.Collections;
    public class MyMaximusArray
    {
        // The maximum is tracked as the array is built. Values here are non-negative,
        // so starting from the default of zero is safe.
        public int Maximum { get; private set; }

        // Keep the underlying array so the class can still be used as one.
        public ArrayList Items { get; private set; }

        private MyMaximusArray()
        {
        }

        public static MyMaximusArray CreateFrom(int arrayLength)
        {
            var result = new MyMaximusArray();
            var random = new Random();
            result.Items = new ArrayList(arrayLength);

            for (var i = 0; i < arrayLength; i++)
            {
                var value = random.Next(int.MaxValue);
                result.Items.Add(value);

                // Track the largest value seen so far as part of the build.
                if (value > result.Maximum)
                    result.Maximum = value;
            }

            return result;
        }
    }


Now, just reading the Maximum property gives you the maximum value. Alternatively, you can have the static function store the index of the maximum rather than its value, and you have what I was suggesting.
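
For completeness, here is a minimal sketch of that index-storing variant. It assumes an int[] backing store, and the MyMaximusIndexedArray, Items and MaxIndex names are mine, chosen purely for illustration:

    // Sketch only. Requires: using System; assumes arrayLength >= 1.
    public class MyMaximusIndexedArray
    {
        // Index of the largest value seen so far during the build.
        public int MaxIndex { get; private set; }

        public int[] Items { get; private set; }

        private MyMaximusIndexedArray()
        {
        }

        public static MyMaximusIndexedArray CreateFrom(int arrayLength)
        {
            var result = new MyMaximusIndexedArray();
            var random = new Random();
            result.Items = new int[arrayLength];

            for (var i = 0; i < arrayLength; i++)
            {
                result.Items[i] = random.Next(int.MaxValue);

                // Track where the maximum lives rather than what it is.
                if (result.Items[i] > result.Items[result.MaxIndex])
                    result.MaxIndex = i;
            }

            return result;
        }

        // Dereferencing the stored index is still an O(1) read.
        public int Maximum
        {
            get { return Items[MaxIndex]; }
        }
    }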

The expectation is that you trade a slightly slower array build for the elimination of the read-time search when locating the maximum item in the array. This is particularly suitable where the data doesn't change.

How Much Quicker?

So, the proof of the pudding is in the eating. Hence, I wrapped the two of these up in a little console app with a tick-timer which read the ticks at the start and end and output the result, including for the sub-phases of building the array and reading the maximum. For those unfamiliar with ticks, a tick is 1/10,000 of a millisecond (1/10,000,000 of a second, i.e. 100 nanoseconds), which is sufficient resolution for most applications.

The tests were each run three times, over a logarithmic scale of array sizes from 1 to 10,000,000 elements, and an average taken across the three runs.

The algorithms were then modified to behave as if they would be written to once and the maximum read many times.
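
For reference, here is a minimal sketch of the kind of harness involved. The Ticks and RunOnce helpers, the BuildRandomArray method and the use of DateTime.UtcNow.Ticks are my assumptions for illustration; the original console app may well have been wired differently:

        // Requires: using System; using System.Collections;
        // Hypothetical helper: builds a plain ArrayList of random ints
        // (mirrors MyMaximusArray.CreateFrom, minus the maximum tracking).
        private static ArrayList BuildRandomArray(int arrayLength)
        {
            var random = new Random();
            var array = new ArrayList(arrayLength);
            for (var i = 0; i < arrayLength; i++)
                array.Add(random.Next(int.MaxValue));
            return array;
        }

        // Returns the elapsed ticks (1 tick = 100 nanoseconds) for the supplied action.
        private static long Ticks(Action action)
        {
            var start = DateTime.UtcNow.Ticks;
            action();
            return DateTime.UtcNow.Ticks - start;
        }

        // Single-run comparison for one array size.
        public static void RunOnce(int size)
        {
            ArrayList plain = null;
            var buildTicks = Ticks(() => plain = BuildRandomArray(size));   // build the plain array
            var searchTicks = Ticks(() => Maxi(plain));                     // O(n) scan for the maximum

            MyMaximusArray maximus = null;
            var maxBuildTicks = Ticks(() => maximus = MyMaximusArray.CreateFrom(size)); // build and track
            var maxValue = 0;
            var maxReadTicks = Ticks(() => maxValue = maximus.Maximum);     // O(1) property read

            Console.WriteLine("{0},{1},{2},{3},{4}",
                size, buildTicks, searchTicks, maxBuildTicks, maxReadTicks);
        }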

The results were pretty conclusive:

Build Time

The expectation would be that the 'search optimised' algorithm (MyMaximus) would perform worse than the regular algorithm when first building the array, and sure enough it did, though surprisingly not by as much as I thought. Both algorithms at this stage of the process would be O(n), with only a difference in coefficient. The 'doubling' due to the introduction of the if comparison didn't quite occur, though I speculate this may be due to JIT optimisation on the .NET platform.




Maximum Value Search 

Here is where the MyMaximus version was expected to make the gains according to the maths. This also behaved as expected:


The blue line is following the axis because it genuinely was zero. Here are the actual data points:

[Table: maximum item search times, in ticks]

The reason it is zero is that I am running this on a 4 GHz system with a 20x clock multiplier and 1866 MHz RAM. All in all, this means a CPU cycle, including the memory access (the slowest part of the process), completes in around 0.000000000526 of a second, so a single read instruction will never register against a tick of 0.0000001 of a second. Hence, this result.
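
Putting those two figures side by side shows why a single read rounds down to zero ticks:

\[\frac{1\ \text{tick}}{1\ \text{read}} \approx \frac{10^{-7}\ \text{s}}{5.26\times 10^{-10}\ \text{s}} \approx 190\]

so roughly 190 reads would need to happen back to back before the timer registered even one tick.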

Total Time 

The combination of build and read fulfils the full scenario. Here, we expected MyMaximus to achieve similar asymptotic performance on the single run, but to perform substantially better on continual searches, its total cost tending towards the $\Omega(n)$ build lower bound the more searches that happened.

[Chart: single-run comparison]


[Chart: total search performance by size of array]

So overall, the performance of MyMaximus versus a regular search resulted in a small and, I'd argue, insignificant (chi-squared another day) win in the single-search case. What happens when the array is queried for its maximum multiple times? The expectation is that the average will start off about the same when building the array, but the performance of the queries will be much faster with MyMaximus.
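
A minimal sketch of that repeated-query comparison, reusing the hypothetical Ticks and BuildRandomArray helpers from the harness sketch above:

        // Hypothetical repeated-query test: build each structure once,
        // then ask for the maximum many times (1,000 in the runs below).
        public static void RunRepeated(int size, int queries)
        {
            var plain = BuildRandomArray(size);
            var maximus = MyMaximusArray.CreateFrom(size);

            var regularTicks = Ticks(() =>
            {
                for (var q = 0; q < queries; q++)
                    Maxi(plain);                 // O(n) scan on every query
            });

            var maximusTicks = Ticks(() =>
            {
                var last = 0;
                for (var q = 0; q < queries; q++)
                    last = maximus.Maximum;      // O(1) property read on every query
            });

            Console.WriteLine("{0},{1},{2}", size, regularTicks, maximusTicks);
        }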

To test this, I created the same setup, but this time asking for the maximum 1000 times per array size. The end results seemed to live up to that expectation:



So what exactly happens?

It's pretty straightforward. MyMaximus pays its $\Omega(n)$ lower-bound cost once, at build time, after which every query is $O(1)$; the regular algorithm pays $O(n)$ on every single query, and for an unsorted array that's the best a search can do. So, over repeated queries, MyMaximus tends to the lower bound, whilst the regular algorithm does not.
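
Written out for $k$ queries over an array of $n$ items (constant factors suppressed):

\[T_{\text{regular}}(n, k) \approx n + k\,n, \qquad T_{\text{MyMaximus}}(n, k) \approx n + k\]

so as $k$ grows, the regular approach grows with $k\,n$ whilst MyMaximus stays pinned near its $\Omega(n)$ build cost.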



Conclusion

The context in which a particular algorithm runs is as important to the asymptotic complexity as the individual optimisation itself. This is a well-known idea in electronic engineering, where logic circuitry, such as NAND gate representations of other circuitry, is often evaluated to remove unnecessary gates and save money. Here you're doing it to save time, which in this era of cloud computing also happens to save you money, and indeed avoids the losses you'd incur if customers were put off.

In any case, the question on Quora had an academic slant. Just be aware there's more there than meets the eye :)