MPI workloads performance on MapR Data Platform Part 2 — Matrix Multiplication

by Nicolas A Perez, February 12th, 2019

In the first part of this series, we showed how we can use MPI on top of the MapR Data Platform to successfully find prime numbers within a rather large range. We also compared our Sieve of Eratosthenes implementation in MPI with the same algorithm in Spark, to discover how they both behave while looking at some interesting implementation details.

In this second part of the series, we are going to implement matrix multiplication in both MPI and Apache Spark. Again, we will look at how each implementation behaves when running on MapR. Our MPI implementation will be based on the Cannon Algorithm, while in Spark we will use the MLlib BlockMatrix functions for multiplying matrices.

Cannon Algorithm

Parallel computing comes with implicit costs: memory requirements tend to increase as we increase the number of processors participating in the computation. This is not obvious, but it can be shown analytically.

The Cannon Algorithm is extremely scalable: as we increase the number of processors, the memory requirements per processor stay low. The following is a scalability analysis of matrix multiplication, comparing plain matrix-to-matrix multiplication against the block-decomposition multiplication used by the Cannon Algorithm.
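
In outline, the argument runs as follows (a sketch of the standard isoefficiency analysis, where M(n) denotes the memory needed for a problem of size n and C is a constant):

```latex
% Matrix-to-matrix multiplication: the isoefficiency relation gives
% n \geq Cp, and the memory requirement is M(n) = n^2, so per processor:
\frac{M(Cp)}{p} = \frac{C^2 p^2}{p} = C^2 p \quad \text{(grows linearly with } p\text{)}

% Cannon Algorithm (block decomposition): the isoefficiency relation gives
% n \geq C\sqrt{p}, so per processor:
\frac{M(C\sqrt{p})}{p} = \frac{C^2 p}{p} = C^2 \quad \text{(constant)}
```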

As we can see, in the first part (Matrix-Matrix) we don't get any scalability, since the memory per processor depends on p, the number of processors. In the second part (Cannon Algorithm) the memory requirement per processor is independent of the number of processors; more specifically, it is a constant, which allows us to scale much better.

Cannon Algorithm Implementation using MPI

Now we are going to show the main pieces of the implementation, its core, and then link the entire source code for those interested in reviewing it.

First of all, let's look at the functions we are going to use, so we can get a better sense of the application flow.
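
A minimal sketch of the kind of interface involved is shown below; the names and signatures are illustrative assumptions, not the exact ones from the repository.

```c
#include <mpi.h>

/* Illustrative prototypes (assumed names, not the repository's exact API). */

/* Each process reads only its own block of the n x n input matrix,
   determined by its coordinates in the cartesian process grid. */
void read_block_matrix(const char *path, double **block, int n,
                       MPI_Comm grid_comm);

/* Each process writes its own block of the result to the shared
   output file. */
void write_block_matrix(const char *path, const double *block, int n,
                        MPI_Comm grid_comm);

/* Serial multiply of two bn x bn local blocks, accumulating into c. */
void multiply_accumulate(const double *a, const double *b, double *c,
                         int bn);
```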

Notice that reading and writing the matrices from/to files happens in a parallel and distributed fashion: each processor is in charge of its own block (piece) of the matrix in question.

Our application then becomes the following.
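
(Below is a condensed, illustrative sketch of the Cannon flow in MPI; the exact code in the repository differs in details such as I/O, argument parsing, and error handling.)

```c
#include <mpi.h>

/* Condensed sketch of Cannon's algorithm. Assumes grid_comm is a
   periodic sqrt(p) x sqrt(p) cartesian communicator (MPI_Cart_create)
   and each process holds bn x bn blocks a, b, and c (c zeroed). */
void cannon_multiply(double *a, double *b, double *c, int bn,
                     MPI_Comm grid_comm) {
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);

    int src, dst;

    /* Initial alignment: shift row i of A left by i positions and
       column j of B up by j positions. */
    MPI_Cart_shift(grid_comm, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, bn * bn, MPI_DOUBLE, dst, 0, src, 0,
                         grid_comm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid_comm, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, bn * bn, MPI_DOUBLE, dst, 0, src, 0,
                         grid_comm, MPI_STATUS_IGNORE);

    /* Neighbors for the single-step shifts in the main loop. */
    int left, right, up, down;
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);

    /* sqrt(p) rounds: multiply local blocks, then shift A left
       and B up by one position. */
    for (int step = 0; step < dims[0]; step++) {
        for (int i = 0; i < bn; i++)          /* local multiply */
            for (int k = 0; k < bn; k++)      /* c += a * b     */
                for (int j = 0; j < bn; j++)
                    c[i * bn + j] += a[i * bn + k] * b[k * bn + j];

        MPI_Sendrecv_replace(a, bn * bn, MPI_DOUBLE, left, 0, right, 0,
                             grid_comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(b, bn * bn, MPI_DOUBLE, up, 0, down, 0,
                             grid_comm, MPI_STATUS_IGNORE);
    }
}
```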

As we can see, this is not a simple piece of code, but it presents the core elements of the Cannon Algorithm. The entire source code can be found in this GitHub repository.

In Spark, the same can be achieved by simple constructs. The following code snippet shows this process.
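
(The snippet below is a sketch of what such a helper can look like; the multiply name matches the usage shown later, but the file format and implementation details are assumptions.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Sketch: load "row col value" text entries, build a BlockMatrix, square it,
// and return the elapsed time in milliseconds.
def multiply(path: String)(sc: SparkContext): Long = {
  val entries = sc.textFile(path).map { line =>
    val Array(i, j, v) = line.trim.split("\\s+")
    MatrixEntry(i.toLong, j.toLong, v.toDouble)
  }

  val matrix = new CoordinateMatrix(entries).toBlockMatrix().cache()

  val start = System.currentTimeMillis()
  matrix.multiply(matrix).blocks.count() // force the distributed multiply
  System.currentTimeMillis() - start
}
```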

As we can see, it is very simple, but let's see how fast it is compared to our MPI version.

Using MPI

1000x1000 matrix multiplication

mpirun -np 16 multiply 1000x1000.in 1000x1000.in 1000x1000.out

elapsed time: 2.082910 // around 2 seconds

10000x10000 matrix multiplication

mpirun -np 16 multiply 10000x10000.in 10000x10000.in 10000x10000.out

elapsed time: 307.200000 // around 5 minutes

In Spark

1000x1000 matrix multiplication

multiply("1000x1000.txt")(sc)

res19: Long = 4118 // around 4 seconds

10000x10000 matrix multiplication

multiply("10000x10000.txt")(sc)

res0: Long = 973943 // around 16 mins

As we can see, the time taken by Spark grows much faster as the matrices get bigger, while MPI handles the larger workload with a comparatively small time penalty. Moreover, when multiplying the 10000x10000 matrices, Spark ran out of heap space, and executor memory had to be increased to 12g for the operation to complete.

Again, there is a tradeoff we cannot avoid: the simplicity and reliability of Spark versus the high performance, but harder coding and maintenance, of MPI.

At the end of the day, it is all about choosing the right tool for the job, and the MapR Data Platform is able to run workloads built on either Spark or MPI without any problems.

Notes on MPI reading from files

In order to multiply matrices, our MPI application must read the corresponding input from files. These files are stored in MapR-FS, which supports different interfaces for interacting with it, such as HDFS, POSIX, and others. Our C code for MPI uses regular POSIX calls to read the matrices, so that each process reads only its own part of the matrix, a block. Spark, when possible, exploits data locality, which helps speed up the process, since each executor reads its data locally from the node it is running on. In MPI, that depends on the MapR POSIX client.
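
As an illustration, a per-process block read over a POSIX mount can look like the following sketch (it assumes the n x n matrix is stored row-major as binary doubles; names and layout are assumptions, not the repository's exact code):

```c
#include <stdio.h>

/* Read the (block_row, block_col) block of size bn x bn from an
   n x n matrix stored row-major as binary doubles. Each MPI process
   calls this for its own block only. */
int read_own_block(const char *path, double *block, int n, int bn,
                   int block_row, int block_col) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    for (int r = 0; r < bn; r++) {
        /* Offset of this block's slice of matrix row (block_row*bn + r). */
        long offset = ((long)(block_row * bn + r) * n + (long)block_col * bn)
                      * (long)sizeof(double);
        if (fseek(f, offset, SEEK_SET) != 0 ||
            fread(block + (long)r * bn, sizeof(double), (size_t)bn, f)
                != (size_t)bn) {
            fclose(f);
            return -1;
        }
    }
    fclose(f);
    return 0;
}
```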

Conclusions

Even though MPI presents some limitations when interacting with the MapR distributed file system, the overall processing time is quite impressive compared to what Spark offers for matrix multiplication. Spark, on the other hand, offers constructs for doing these operations with ease while keeping the entire process reliable and fault tolerant, features that MPI lacks.

The tradeoff between simplicity of use and high performance has been around for years and cannot be ignored; being aware of it helps us decide what to use in each situation.