Left vs Right Recursion: Results

Previously we described a potential transformation that we believe will increase the amount of parallelism available at runtime, and use a constant number of stacks rather than a linear number of stacks with respect to the number of parallel conjunctions in a singly-recursive predicate. Read more to see the underwhelming results and conclusion.

First we shall see the effect this optimisation has when compiled for sequential execution, that is thread safety is not used, there is only one program thread, and no parallelism is exploited. Each program is executed 25 times and the mean and standard deviation wall-clocks times are shown. The program used here is the ray-tracer we've seen in other examples. In this case the render_rows predicate is recursive and makes a call to render_row and render_rows in it's recursive case. In the right-recursive (or normal) example the recursive call is the second call, tail recursion is not used as a all is made after this to concatenate the resulting pixel values into a cord, this final call is significantly cheaper than the other calls. In the left-recursive case the recursive call is made first.

Mean Standard Deviation
Right Recursion 125.00s 0.61s
Left Recursion 104.23s 0.34s

As we can see the left-recursive case is much faster than the right recursive case. This was unexpected, I thought that there would be a negligible difference between these alternatives during sequential execution. I cannot explain this, but I suspect it has something to do with cache performance.

The following table shows the results of running this in a parallel grade. --max-contexts-per-thread was set to 2 (the default) for this table. The standard deviation for each score is shown in parentheses.

One processor Two processors Four processors
Right Recursion 119.03 (1.65) 116.67 (1.36) 113.60 (1.00)
Left Recursion 105.12 (2.03) 104.95 (1.45) 104.31 (1.58)

While left recursion is an optimisation here also, but little benefit is gained due to parallelism, especially in the left recursive case, (adding extra processors does not significantly speed up the left recursive case. This is probably due to some of the known problems in the parallel runtime system, namely that --max-contexts-per-thread has to be set to a somewhat high value before the benefits of parallelism are seen as it limits the actual amount of parallel work that we can exploit at runtime.

Lets see what happens when we set --max-contexts-per-thread to 16.

One processor Two processors Four processors
Right Recursion 118.29 (1.71) 100.28 (0.69) 93.89 (0.60)
Left Recursion 104.29 (0.72) 104.38 (0.85) 104.59 (1.34)

Well, As we've seen before --max-contexts-per-thread makes it easier for the runtime to exploit the parallelism in a program. But that's obviously not true for left recursive code. This is not what I expected, naïvely I was expecting this to be a simple transformation that could be applied to all programs to produce more optimal execution times.

What is actually happening is that the runtime tracks how many engines are idle and how long the global spark queue is. Any new potentially-parallel work is allocated onto the global queue provided that there is at least one idle engine and we haven't exceeded the --max-contexts-per-thread limit in terms of either contexts or sparks. If these tests fail the work is allocated to a local queue that is worked on sequentially by the current engine. In the case of right-recursive code the parallel conjunction within a recursive call is more-likely to be executed on a different engine, since it's in the forked off computation. Therefore, any locally queued work created by recursive parallel conjunctions is likely to be added to the local queues of different contexts.

Alternatively in the left-recursive case, recursive parallel conjunctions are executed by the current engine. And since no synchronisation needs to be done between entering the recursive parallel conjunctions the parallel jobs are forked off very quickly. This almost certainly fills the run queue quickly and then allocates the remaining work to the local queue where it will execute sequentially.

The more general problem here, as identified by Peter Wang in the past, is that the decision whether to parallelise a computation is done prematurely, in this case well before the code will actually be executed. This decision makes the assumption that the runtime conditions observed at the time of the condition are the same as those observed before the conjunct might be executed.

It's well understood that work stealing queues are an appropriate solution to this.