Consider the kernel:
...
DO j = i+1, n
  A(j,i) = A(j,i)/Swap(i)
  A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
  Y(j) = Y(j) - A(j,i)*Temp
END DO
We want to minimise communication in the loop:
!HPF$ ALIGN Y(:) WITH A(:,*)       ! Y aligned with each col of A
!HPF$ ALIGN Swap(:) WITH A(*,:)    ! Swap aligned with each row of A
!HPF$ DISTRIBUTE A(CYCLIC,CYCLIC)  ! onto default grid
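CYCLIC gives a good load balance: the active rows and columns (i+1 to n) shrink as i increases, and a cyclic distribution keeps the remaining ones spread across all processors.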
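The load-balance point can be checked with a small serial sketch. The program below is plain Fortran 90, not HPF, and only counts how many of the active rows i+1..n each processor would own along one dimension under BLOCK and under CYCLIC. The values n = 16, np = 4 and i = 8 are assumptions chosen for illustration, processors are numbered from 0 for convenience, and the block size is taken as ceiling(n/np).

```fortran
! Sketch: rows of the active range i+1..n owned by each processor
! under BLOCK and CYCLIC distributions of one array dimension.
! n, np and i are illustrative values, not taken from the notes.
PROGRAM load_balance
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 16, np = 4
  INTEGER :: i, j, blk
  INTEGER :: nblock(0:np-1), ncyclic(0:np-1)

  blk = (n + np - 1) / np   ! BLOCK block size: ceiling(n/np)
  i = 8                     ! elimination step: rows i+1..n are still active
  nblock = 0
  ncyclic = 0

  DO j = i+1, n
    ! BLOCK: consecutive rows go to the same processor
    nblock((j-1)/blk) = nblock((j-1)/blk) + 1
    ! CYCLIC: rows are dealt out round-robin
    ncyclic(MOD(j-1, np)) = ncyclic(MOD(j-1, np)) + 1
  END DO

  PRINT *, 'BLOCK  rows per processor:', nblock
  PRINT *, 'CYCLIC rows per processor:', ncyclic
END PROGRAM load_balance
```

With these values BLOCK gives counts 0, 0, 4, 4, so half the processors are already idle, while CYCLIC gives 2, 2, 2, 2.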