Consider the kernel:
...
DO j = i+1, n
  A(j,i) = A(j,i)/Swap(i)
  A(j,i+1:n) = A(j,i+1:n) - A(j,i)*Swap(i+1:n)
  Y(j) = Y(j) - A(j,i)*Temp
END DO
We want to minimise communication within the loop:
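!HPF$ ALIGN Y(:) WITH A(:,*)
! Y(j) is aligned with A(j,k) for every k, i.e. Y is aligned with each column of A
!HPF$ ALIGN Swap(:) WITH A(*,:)
! Swap(k) is aligned with A(j,k) for every j, i.e. Swap is aligned with each row of A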
!HPF$ DISTRIBUTE A(CYCLIC,CYCLIC) ! onto default grid
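CYCLIC gives a good load balance: assuming the kernel sits inside a loop over i, the active part of A (rows and columns i+1:n) shrinks as i increases, so a BLOCK distribution would leave the processors holding the leading rows and columns idle early on.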
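The load-balance claim can be checked with a small counting sketch. It is not part of the original kernel. It assumes an enclosing loop over i = 1, n-1, looks only at a one-dimensional distribution of the rows of A, and uses illustrative values n = 16 and np = 4. The program name and the arrays work_block and work_cyclic are hypothetical.

```fortran
! Sketch: count how many row updates each processor performs when the
! rows of A are distributed BLOCK or CYCLIC over np processors.
! n, np and all names here are illustrative assumptions.
PROGRAM load_balance
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 16, np = 4
  INTEGER :: i, j, blk
  INTEGER :: work_block(0:np-1), work_cyclic(0:np-1)

  work_block  = 0
  work_cyclic = 0
  blk = (n + np - 1) / np            ! BLOCK size: ceiling(n/np)

  DO i = 1, n - 1                    ! assumed outer elimination loop
    DO j = i + 1, n                  ! rows updated at step i
      ! BLOCK: row j is owned by processor (j-1)/blk
      work_block((j-1)/blk) = work_block((j-1)/blk) + 1
      ! CYCLIC: row j is owned by processor MOD(j-1, np)
      work_cyclic(MOD(j-1, np)) = work_cyclic(MOD(j-1, np)) + 1
    END DO
  END DO

  PRINT *, 'BLOCK  :', work_block
  PRINT *, 'CYCLIC :', work_cyclic
END PROGRAM load_balance
```

For these values the BLOCK counts are 6, 22, 38, 54 and the CYCLIC counts are 24, 28, 32, 36. Both sum to 120 row updates, but the busiest processor does 54 under BLOCK against 36 under CYCLIC.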