Results of an October 2006 discussion on performance in comp.lang.ada

Intended audience: all readers.


Matrix Multiply Comparison (GFORTRAN) from comp.lang.ada

In October 2006, a thread was started on comp.lang.ada on a very specific performance question: comparing the performance of some simple matrix multiply code between GNAT and GFORTRAN.

The original poster was looking for the correct set of flags to get the GNAT performance to be similar to the GFORTRAN performance. The consensus of the group was that the two programs were not quite comparable: the Ada version included some of the final IO in its timing, and the random number generation was also being timed. While timing the random number libraries might be interesting, the real "meat" was the attempt to compare the performance of the matrix multiply code itself.

The initial versions of the modified consensus code follow. Note that in this case I am asserting what the group consensus was about what should be timed, not (perhaps) a consensus on the "best" implementation in either language.

Initial (Consensus version) Ada

with Ada.Numerics.Float_Random;      use Ada.Numerics.Float_Random;
with Ada.Command_Line;               use Ada.Command_Line;
with Ada.Text_IO;                    use Ada.Text_IO;
with Ada.Calendar;                   use Ada.Calendar;

procedure tst_array is
   package F_IO is new Ada.Text_IO.Float_IO(Float);
   package D_IO is new Ada.Text_Io.Fixed_Io(Duration);
   type Real_Matrix is array(Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;

   N    : Positive;
   G    : Ada.Numerics.Float_Random.Generator;

   A,B,C    : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   N := Positive'Value (Argument (1));


   A := new Real_Matrix(1..N, 1..N);
   B := new Real_Matrix(1..N, 1..N);
   C := new Real_Matrix(1..N, 1..N);
   for I in A'range(1) loop
      for J in A'range(2) loop
         A(I,J) := Ada.Numerics.Float_Random.Random(G);
         B(I,J) := Ada.Numerics.Float_Random.Random(G);
      end loop;
   end loop;

   Start := Ada.Calendar.Clock;
   for I in A'range(1) loop
      for J in A'range(2) loop
         Sum := 0.0;
         for R in A'range(2) loop
            Sum := Sum + A(I,R)*B(R,J);
         end loop;
         C(I,J) := Sum;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C(1,1)); F_IO.Put (C(1,2)); New_Line;
   F_IO.Put (C(2,1)); F_IO.Put (C(2,2)); New_Line;

   Put ("Time: "); D_IO.Put(Finish-Start); New_Line;New_Line;

end tst_array;

Initial (Consensus version) FORTRAN 95

program test_array
   integer,parameter :: seed = 86456
   integer :: N
   integer :: numArgs
   real, dimension(:,:), allocatable :: a, b, c
   real, dimension(:,:), allocatable :: d, e, f
   real :: sum
   character*100 buffer
   real :: begin, finish
   integer :: I, J, R

   call getarg(1,buffer)
   read(buffer,*) N

   call srand(seed)


   allocate( a(N,N) )
   allocate( b(N,N) )
   allocate( c(N,N) )
   forall (I = 1:N, J = 1:N)
      a(i,j) = rand()
      b(i,j) = rand()
   end forall

   !c = matmul(a, b)

   begin = second()
   do I = 1,N
      do J = 1,N
         sum = 0.0
         do R = 1,N
            sum = sum + A(I,R)*B(R,J)
         end do
         C(I,J) = sum
      end do
   end do
   finish = second()
   print *, c(1,1), c(1,2), c(2,1), c(2,2)

   print *, 'Time: ', (finish-begin)

end program test_array

Initial Timing Results

All evaluations were done with an array size of 600 (as specified by the command line parameter). Runtimes are the smallest of 3 trials, rounded in the last digit where appropriate. -gnatp (suppress checks) is used on all Ada runs. The GFORTRAN performance factor is the GNAT runtime divided by the GFORTRAN runtime (for example, 3.37 / 1.34 ≈ 2.5).

Environment (both rows): Centos 4.4, Dual Xeon 2.8 GHz, gcc version 4.2.0 20061020

Common Flags                                                   GNAT Runtime   GFORTRAN Runtime   GFORTRAN Performance Factor
-O2 -march=pentium4 -fomit-frame-pointer                       3.37           1.34               2.5
-O2 -march=pentium4 -fomit-frame-pointer -mfpmath=sse -msse3   3.32           1.34               2.48

What's Happening

While one could argue that there might be better ways to implement this, the approach used in both cases is essentially identical and does nothing particularly exotic in either language. So what is happening? Digging into the generated assembly language shows that, for some reason, GNAT is doing a particularly poor job on the inner loop.

Note in both cases below, I've omitted some of the loop setup and just shown the part that runs 'N' times.

Ada Inner Loop

 a01:	83 c6 01             	add    $0x1,%esi
 a04:	89 f0                	mov    %esi,%eax
 a06:	2b 44 24 68          	sub    0x68(%esp),%eax
 a0a:	03 44 24 18          	add    0x18(%esp),%eax
 a0e:	8b 6c 24 20          	mov    0x20(%esp),%ebp
 a12:	f3 0f 10 4c 85 00    	movss  0x0(%ebp,%eax,4),%xmm1
 a18:	8b 17                	mov    (%edi),%edx
 a1a:	8b 47 0c             	mov    0xc(%edi),%eax
 a1d:	8b 5f 08             	mov    0x8(%edi),%ebx
 a20:	8b 4c 24 10          	mov    0x10(%esp),%ecx
 a24:	29 d9                	sub    %ebx,%ecx
 a26:	89 f5                	mov    %esi,%ebp
 a28:	29 d5                	sub    %edx,%ebp
 a2a:	89 ea                	mov    %ebp,%edx
 a2c:	83 c0 01             	add    $0x1,%eax
 a2f:	29 d8                	sub    %ebx,%eax
 a31:	01 c0                	add    %eax,%eax
 a33:	31 db                	xor    %ebx,%ebx
 a35:	01 c0                	add    %eax,%eax
 a37:	0f 48 c3             	cmovs  %ebx,%eax
 a3a:	0f af d0             	imul   %eax,%edx
 a3d:	8d 0c 8a             	lea    (%edx,%ecx,4),%ecx
 a40:	8b 44 24 24          	mov    0x24(%esp),%eax
 a44:	f3 0f 10 04 01       	movss  (%ecx,%eax,1),%xmm0
 a49:	f3 0f 59 c1          	mulss  %xmm1,%xmm0
 a4d:	f3 0f 58 d0          	addss  %xmm0,%xmm2
 a51:	3b 74 24 6c          	cmp    0x6c(%esp),%esi
 a55:	75 aa                	jne    a01 <_ada_tst_array+0x29a>

FORTRAN Inner Loop

 353:	8d 04 0b             	lea    (%ebx,%ecx,1),%eax
 356:	f3 0f 10 02          	movss  (%edx),%xmm0
 35a:	8b 6c 24 64          	mov    0x64(%esp),%ebp
 35e:	f3 0f 59 44 85 04    	mulss  0x4(%ebp,%eax,4),%xmm0
 364:	f3 0f 58 c8          	addss  %xmm0,%xmm1
 368:	83 c1 01             	add    $0x1,%ecx
 36b:	01 f2                	add    %esi,%edx
 36d:	39 f9                	cmp    %edi,%ecx
 36f:	75 e2                	jne    353 <MAIN__+0x353>

Both the GNAT version and the GFORTRAN version are (in this case) generating scalar SSE code for the single precision multiply/accumulate needed by the inner loop, and both do so in a relatively compact manner. Where GNAT falls apart is in calculating the addresses of the array elements: the FORTRAN loop simply advances an index and a pointer each iteration, while the Ada loop re-derives the element address from the array bounds (the extra subtractions, the imul and the cmovs) on every pass.

Initial Thoughts

While performance factors like this are substantially different from what I have become used to in a more global sense, the performance difference in this case is substantial and can't be ignored. A test case was created from the original code, along with a top level discussion of the resulting performance and trees, and submitted to the GCC bug database as issue 29543, "Ada produces substantially slower code than FORTRAN for identical inputs - looping over double subscripted arrays".

Other Versions

After the bug was submitted, there were suggestions from some on bugzilla as well as alternate approaches posted on comp.lang.ada. In addition, at least one poster indicated that he was unable to reproduce the same poor performance with gcc 4.0.2. So I gathered a few different versions of the program and compiled them with two versions of GCC for comparison. All tests were run on a lightly loaded Centos 4.4 Dual Xeon 2.8 GHz machine, with multiple runs to get the "best" time for each version. There is still some variability from "session" to "session"; from a few eye-balled runs it looks like these numbers can be trusted to about ±0.03 seconds.

Semantically Equivalent Versions

Some of the changes suggested in bugzilla and on comp.lang.ada were able to improve the performance of the code, but at the expense (in my opinion) of changing the user visible semantics of the test. In particular, it was suggested that part of the poor performance was due to the additional indirection in the Ada version caused by using an access type to an unconstrained array.

I must admit that my knowledge of FORTRAN is a little rusty, and the original comparison program was written in a newer dialect (F95) than I have ever used, but it is my belief that a pointer to an unconstrained array (or, at the very least, unconstrained arrays in general) is required for semantic equivalence.
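
For reference, this is the pairing being claimed as equivalent, repeating the declarations from the two listings above (a sketch only; N is the size read from the command line, as in tst_array):

   --  Ada: the bounds are chosen when the object is allocated, so an
   --  access value must carry (or point at) the bounds of the object
   --  it designates.
   type Real_Matrix   is array (Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;
   A : Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);  --  N read at run time

   --  FORTRAN 95 counterpart, from the listing above:
   --     real, dimension(:,:), allocatable :: a
   --     allocate( a(N,N) )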

So, first we will look at the differences and timing of the versions I deemed to be fair (apologies in advance for the naming conventions, which really got out of hand):

Ada Versions

  • tst_array - A slightly modified version of the posting from comp.lang.ada that started it all. The modification basically limits the timing to just the matrix multiply and makes the output quieter.
  • tst_array_2 - Modifies the above version to do the matrix multiply in a procedure (inside the main procedure); however, the matrices are not passed as parameters but are accessed directly. This actually seemed like a silly step, but the run-time results were surprising.
  • tst_array_3 - A very small variation on the original tst_array. The only difference is that the actual work of the matrix multiply is done via a procedure call, with the access types de-referenced as the procedure is called (see the sketch after this list).
  • tst_array_4 - Again, no real changes here, except that I made a new package containing the matrix type definitions and the matrix multiply routine. Note that in most ways this is probably the most "pure" version, since it is probably closest to what one would end up with when writing more than little toy programs.
  • tst_array_task_2 - A two task version that splits the work of the matrix multiply between two tasks. A few notes: first, I am running the benchmarks on a machine with 2 Xeons, so some real speedup is possible here. Second, this version is similar to one posted on comp.lang.ada; however, that version used fixed length arrays and thus does not show up in my list of equivalent programs.
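
As a rough illustration of the tst_array_3 approach, the multiply procedure might look something like the following, declared inside tst_array (so that Real_Matrix is visible) and called as Multiply (A.all, B.all, C.all). This is a sketch of mine (the name Multiply included), not necessarily the exact code in the SVN archive:

   procedure Multiply (A, B : in Real_Matrix; C : out Real_Matrix) is
      Sum : Float;
   begin
      --  Inside the procedure the compiler works on plain array
      --  parameters rather than through the access values.
      for I in A'Range (1) loop
         for J in A'Range (2) loop
            Sum := 0.0;
            for R in A'Range (2) loop
               Sum := Sum + A (I, R) * B (R, J);
            end loop;
            C (I, J) := Sum;
         end loop;
      end loop;
   end Multiply;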

FORTRAN Versions

  • tst_array - A slightly modified version of the posting from comp.lang.ada that started it all. The modification basically limits the timing to just the matrix multiply and makes the output quieter.
  • tst_array_matmul - A version that uses the matmul library call to do the work rather than user code. Again, I am not trying to find the world's fastest matrix multiply, but felt this should be included for completeness.

Run-times GCC 4.0.3

Program                       Runtime (Seconds)
tst_array                     2.10
tst_array_2                   4.26
tst_array_3                   1.68
tst_array_4                   2.02
tst_array_task_2              0.88
tst_array (FORTRAN)           1.37
tst_array_matmul (FORTRAN)    0.86

Run-times GCC 4.2.0 (non released 20061020 version)

Program                       Runtime (Seconds)
tst_array                     3.33
tst_array_2                   4.64
tst_array_3                   1.87
tst_array_4                   2.13
tst_array_task_2              1.03
tst_array (FORTRAN)           1.34
tst_array_matmul (FORTRAN)    0.42

Run-time Comments

  • There are all sorts of interesting things to see here. First, GCC 4.2.0 is slower in every case for Ada, with essentially no slow-down for the FORTRAN code (and quite a speed-up for the library call).
  • The 2 task version is more than 2x faster than the original code (though just about the expected 2x when compared to its closest cousin, tst_array_3). A sketch of the two-task split follows this list.
  • Without any obvious changes to what "should" be happening, the performance of the code is all over the map. On the single threaded code there is more than a 2x speed difference between some very, very similar code - code so similar that I don't think users would expect to see this sort of difference.
  • While it is possible to get native GNAT to beat native GFORTRAN code on this benchmark, it takes two processors and tasks. One could argue that this is somewhat fair because tasking is built in, but matmul is "built in" to FORTRAN, is single threaded, and still trounces the 2 task GNAT version.
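
For readers wondering what the two-task split looks like, the following is a rough sketch of the approach (my own illustration, with my own names, not the exact tst_array_task_2 code from the archive): each task handles half of the rows of the result, and the enclosing procedure cannot return until both tasks have terminated.

   procedure Multiply_Two_Tasks (A, B : in Real_Matrix; C : out Real_Matrix) is

      task type Worker (First_Row, Last_Row : Integer);

      task body Worker is
         Sum : Float;
      begin
         --  Each worker computes the rows First_Row .. Last_Row of C.
         for I in First_Row .. Last_Row loop
            for J in A'Range (2) loop
               Sum := 0.0;
               for R in A'Range (2) loop
                  Sum := Sum + A (I, R) * B (R, J);
               end loop;
               C (I, J) := Sum;
            end loop;
         end loop;
      end Worker;

      Mid : constant Integer := (A'First (1) + A'Last (1)) / 2;

      Lower_Half : Worker (A'First (1), Mid);
      Upper_Half : Worker (Mid + 1, A'Last (1));

   begin
      null;  --  The procedure completes only after both tasks terminate.
   end Multiply_Two_Tasks;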

Semantically Non-Equivalent Versions

  • tst_array_bugzilla - A modification of the original version posted in the bug report (which was tst_array) that steps back from pointers to unconstrained arrays and instead uses pointers to fixed length arrays (although the size of the type is still determined at run-time); a sketch of the idea follows this list. Stepping back to fixed length arrays seems to go against the spirit and capability of the original FORTRAN version.
  • tst_array_task_1 - A two task version that splits the work of the matrix multiply into two tasks. It inherits the fixed length array "problem" of tst_array_bugzilla, so it too is deemed less fair.
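
A minimal sketch of what "fixed length, but sized at run time" means here (my own illustration of the idea, not the code from the bug report): the array type is declared after N has been read, so its bounds are fixed for the type even though N is only known at run time, and an access value to it no longer needs to carry the bounds that an access to an unconstrained array would.

with Ada.Command_Line; use Ada.Command_Line;

procedure sketch_bugzilla is
   N : constant Positive := Positive'Value (Argument (1));

   --  Constrained array type: the bounds are fixed at this declaration,
   --  even though N itself comes from the command line.
   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   type Matrix_Access is access Real_Matrix;

   A : constant Matrix_Access := new Real_Matrix'(others => (others => 0.0));
begin
   A (1, 1) := 1.0;
end sketch_bugzilla;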

Run-times GCC 4.0.3

Program               Runtime (Seconds)
tst_array_bugzilla    1.44
tst_array_task_1      0.75

Run-times GCC 4.2.0 (non released 20061020 version)

Program               Runtime (Seconds)
tst_array_bugzilla    1.68
tst_array_task_1      0.86

Run-time Comments

Again we see the same pattern of reduced performance on GCC 4.2.0 compared to GCC 4.0.3.

Conclusion

The tasking GNAT Ada version is indeed faster (on a dual processor machine, at least) than the single task version of the inline FORTRAN, but still quite a bit slower than the call to the intrinsic matmul routine. GNAT's performance on this admittedly extremely narrow benchmark is very poor. The example is matrix multiplication, but (without evidence other than guessing from the assembly language) I fear that this sort of performance will extend to any 2d array traversal. Having said that, be aware that I would never have guessed I'd find a more than 2x difference between the fastest and slowest single threaded Ada versions presented here, so my guessing track record is nothing to get excited about.

All of the code and scripts used to prepare this writeup are present in the GNUAda project SVN archive.


Ada programming, © 2005,2006 the Authors, Content is available under GNU Free Documentation License.