Local spark scheduling is bad


I've been trying to improve the performance of parallel programs, because by default their performance appears to be terrible.

In particular I've been using a modified version of the icfp_2000 ray tracer. It has been modified to be trivially parallelisable, so it is reasonable to expect decent performance when parallelising it. It renders a row at a time, and within each row renders a pixel at a time. The render_rows predicate has two calls in a conjunction: render_row and a recursive call to render_rows. These calls are independent, so their results must be merged afterwards (we use concatenation of cords), and therefore render_rows is not tail-recursive. By making this conjunction a parallel conjunction we can easily parallelise the program. The code can be found at progs/icfp2000_par_pbone within the benchmarks CVS module.
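The shape of render_rows described above can be sketched in Python. This is only an analogue, not the Mercury code from the benchmark: the thread pool stands in for the parallel conjunction, a plain list stands in for the cord, and render_pixel, WIDTH and the row count are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH = 4  # hypothetical image width, for illustration only


def render_pixel(x, y):
    # Stand-in for the real ray-tracing work: any pure function of (x, y).
    return (x * 31 + y * 17) % 256


def render_row(y):
    # Within a row, pixels are rendered one at a time.
    return [render_pixel(x, y) for x in range(WIDTH)]


def render_rows(pool, y, max_y):
    # Base case: no rows left to render.
    if y >= max_y:
        return []
    # The two calls are independent: this row is handed to the pool while
    # the recursive call deals with the remaining rows.
    this_row = pool.submit(render_row, y)
    rest = render_rows(pool, y + 1, max_y)
    # The merge happens after the recursive call returns, which is why
    # the recursion is not a tail call.
    return this_row.result() + rest


with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = render_rows(pool, 0, 8)

# Rendering the same rows in order, without the pool, gives the same result.
sequential = [pixel for y in range(8) for pixel in render_row(y)]
```

Because each row depends only on its own index, the order in which rows finish does not matter; only the order of the final concatenation does.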

When running this program with MERCURY_OPTIONS="-P4" in a parallel grade it performs only marginally better than the sequential version, and we can show that an improvement this small can come from the parallel mark phase in the garbage collector (which is enabled in all parallel grades). The performance continues to improve as we increase --max-contexts-per-thread, which allows for more parallelism by scheduling more computations on the global spark queue.

Graph showing wall-time for icfp2000 with varying values for --max-contexts-per-thread

The above graph shows boxplots of the wall time (from 10 samples) of the icfp_2000 ray tracer as we vary the value of --max-contexts-per-thread. The first boxplot shows the execution of the same program compiled for sequential execution. Each subsequent plot doubles the value of --max-contexts-per-thread, starting from the default of two.

                              mean    standard deviation
main_asmfast-gc               85.23   0.39
main_asmfast-gc-par_p4_c2     76.76   0.19
main_asmfast-gc-par_p4_c4     73.74   0.32
main_asmfast-gc-par_p4_c8     70.08   1.04
main_asmfast-gc-par_p4_c16    66.41   0.51
main_asmfast-gc-par_p4_c32    63.35   1.20
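To put these means in perspective, a rough speedup calculation helps. The sketch below is plain arithmetic on the figures in the table, written in Python for convenience. It assumes that -P4 means four worker threads, so an ideal speedup would be 4; the names speedup and efficiency are just illustrative helpers.

```python
# Mean wall times copied from the table above.
sequential_mean = 85.23
parallel_means = {2: 76.76, 4: 73.74, 8: 70.08, 16: 66.41, 32: 63.35}

WORKERS = 4  # assumption: -P4 runs four worker threads


def speedup(parallel_time, sequential_time=sequential_mean):
    # Ratio of sequential to parallel wall time; 1.0 means no gain.
    return sequential_time / parallel_time


def efficiency(parallel_time):
    # Fraction of the ideal WORKERS-fold speedup actually achieved.
    return speedup(parallel_time) / WORKERS


for contexts, mean in parallel_means.items():
    print(f"c{contexts}: speedup {speedup(mean):.2f}, "
          f"efficiency {efficiency(mean):.0%}")
```

Even the best configuration here stays well under a 1.5x speedup, which is far from what four threads should manage on a trivially parallelisable program.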