欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

matrix multiplication

程序员文章站 2022-07-15 12:29:32
...

Overview of Matrix Multiplication

matrix multiplication

For Matrix Multiplication A ( i , j ) ∗ B ( j , k ) = C ( i , k ) A(i,j) * B(j,k) = C(i,k) A(i,j)B(j,k)=C(i,k)

( i , k ) = ( i , j 1 ) ( j 1 , k ) + ( i , j 2 ) ( j 2 , k ) + . . . + ( i , j n ) ( j n , k ) (i,k) = (i,j_1)(j_1,k) + (i,j_2)(j_2,k)+...+(i,j_n)(j_n,k) (i,k)=(i,j1)(j1,k)+(i,j2)(j2,k)+...+(i,jn)(jn,k)

The i i i row of the A matrix and the k k k column of the B matrix are multiplied and then summed. The rows and rows of A and the columns and columns of B do not affect each other and can be parallelized.

For non-parallel matrix multiplication, the complexity is O ( n 3 ) O(n^3) O(n3)

For matrix multiplication using map reduce,speed up can be:

S = 1 1 M + 1 N S = \frac{1}{\frac{1}{M} + \frac{1}{N}} S=M1+N11

How to implement Map & Reduce

Take the following matrix multiplication as an example:
1 2 3 4 × 5 6 7 8 \begin{matrix} 1&2\\ 3&4 \end{matrix} \times \begin{matrix} 5&6\\ 7&8 \end{matrix} 1324×5768

The basic idea is to use the coordinates of the final result (matrix C) as the key, so that the row of A and the column of B can be determined, and the product of the corresponding elements of the row and column can be further summed.

Specific method 1

The first method is to find each element of each row of matrix A and the corresponding element of each column of matrix B, combine them in pairs, and use the position of their final result as their key. In the reduce phase, according to this representation, all the values at the same position are multiplied and added.

Mapper

For matrix A: k e y : ( i , k ) key: (i,k) key:(i,k) , v a l u e : ( A , j , A i j ) value: (A, j, A_{ij}) value:(A,j,Aij)

For matrix B: k e y : ( i , k ) key: (i,k) key:(i,k) , v a l u e : ( B , j , A j k ) value: (B, j, A_{jk}) value:(B,j,Ajk)

Reducer

For all ( i , k ) = = > S u m m a t i o n ( A i j ∗ B j k ) (i,k) ==> Summation(A_{ij}*B_{jk}) (i,k)==>Summation(AijBjk) for all j j j

# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have 
# 2 values 1 & 2, each case can have 2 further
# values of j=1 and j=2. Substituting all values 
# in formula

k=1  i=1  j=1   ((1, 1), (A, 1, 1)) 
          j=2   ((1, 1), (A, 2, 2))   
     i=2  j=1   ((2, 1), (A, 1, 3))
          j=2   ((2, 1), (A, 2, 4)) 

k=2  i=1  j=1   ((1, 2), (A, 1, 1))
          j=2   ((1, 2), (A, 2, 2))   
     i=2  j=1   ((2, 2), (A, 1, 3))
          j=2   ((2, 2), (A, 2, 4)) 

Computing the mapper for Matrix B

i=1  j=1  k=1   ((1, 1), (B, 1, 5)) 
          k=2   ((1, 2), (B, 1, 6))   
     j=2  k=1   ((1, 1), (B, 2, 7))
          j=2   ((1, 2), (B, 2, 8)) 

i=2  j=1  k=1   ((2, 1), (B, 1, 5))
          k=2   ((2, 2), (B, 1, 6))   
     j=2  k=1   ((2, 1), (B, 2, 7))
          k=2   ((2, 2), (B, 2, 8)) 

Therefore computing the reducer:

# We can observe from Mapper computation 
# that 4 pairs are common (1, 1), (1, 2),
# (2, 1) and (2, 2)
# Make a list separate for Matrix A & 
# B with adjoining values taken from 
# Mapper step above:

(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
        Blist ={(B, 1, 5), (A, 2, 7)}
        Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)

(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
        Blist ={(B, 1, 6), (A, 2, 8)}
        Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)

(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
        Blist ={(B, 1, 5), (A, 2, 7)}
        Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)

(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}
        Blist ={(B, 1, 6), (A, 2, 8)}
        Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)

Specific method 2

The basic data block size processed by the mapper of the second method is larger, which is a whole row of data. Then the search and operation of the corresponding elements are handed over to the reducer.

Mapper

For matrix A: k e y : ( i , k ) key: (i,k) key:(i,k) , v a l u e : ( A , i , [ . . . . ] ) value: (A, i, [....]) value:(A,i,[....])

For matrix B: k e y : ( i , k ) key: (i,k) key:(i,k) , v a l u e : ( B , k , [ . . . . ] ) value: (B, k, [....]) value:(B,k,[....])

The difference from the above is that the key-value pairs sent by the mapper directly send an entire row of the A matrix or an entire column of the B matrix to the reducer.

Reducer:

Multiply the key value to the same corresponding row and corresponding column and then sum them.

Specific method 3

Divide the matrix into n blocks, each block contains n rows (n columns). Then perform operations between blocks.

At this time, the key is no longer the row number and column number represented by i , k i,k i,k, but the block number.


Java code

According to the data files I have, I chose method 2.

The format of the data file is as follows:

Matrix A(1024 * 1024):
matrix multiplication
L is the identifier of matrix A, the number is the row number i i i, and the rest is the value of each element of the matrix, the length is 1024.

Matrix B(1024 * 1024):
matrix multiplication

This data file is matrix B that has been transposed, which is convenient for data reading and manipulation. Therefore, each row here represents each column of matrix B.

R is the identifier of matrix B, and the number at the second position is the column number j j j,


below are the main part of the code:

// <mapper>
public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text>{

    private Logger logger = Logger.getLogger(Mapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        String[] indicesAndValue = line.split(" ", 3);
        Text outputKey = new Text();
        Text outputValue = new Text();
        
        for (int i = 0; i < indicesAndValue.length; ++i) {
            System.out.println(indicesAndValue[i]);
        }
        if (indicesAndValue[0].equals("L")) {
            try {
                for (int k = 1; k <= p; k++) {  // 这个范围内的 k 都要用到这个 i 所在的行
                    outputKey.set(indicesAndValue[1] + "," + k); // (i, k)
                    String[] values = Arrays.copyOfRange(indicesAndValue, 2, indicesAndValue.length);
                    outputValue.set("L," + indicesAndValue[1] + "," + Arrays.toString(values)); // (L, lineno, [....])
                    context.write(outputKey, outputValue);
                }
            } catch (Exception e) {
                logger.error(e.getMessage());
                logger.error("--------Left--------");
                logger.error(indicesAndValue);
            }
        }
        else {
            try {
                for (int i = 1; i <= m; i++) {
                    outputKey.set(i + "," + indicesAndValue[1]); // (i, k)
                    String[] values = Arrays.copyOfRange(indicesAndValue, 2, indicesAndValue.length);
                    outputValue.set("R," + indicesAndValue[1] + "," + Arrays.toString(values));  // (R, lineno, [....])
                    context.write(outputKey, outputValue);
                }
            } catch (Exception e) {
                logger.error(e.getMessage());
                logger.error("--------Right--------");
                logger.error(indicesAndValue);
            }
        }
    }
}
// <reducer>
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text>{
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String[] value;
        HashMap<String, String> hashA = new HashMap<String, String>();
        HashMap<String, String> hashB = new HashMap<String, String>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("L")) { // hashA: (i,k),[]
                hashA.put(key.toString(), value[2]);
            }
            else {                      // hashB: (i,k),[]
                hashB.put(key.toString(),value[2]);
            }
        }
        int result = 0;
        if (hashA.containsKey(key.toString()) && hashB.containsKey(key.toString())) {
            String Row = hashA.get(key.toString());
            String Column = hashB.get(key.toString());
            String[] row = Row.replaceAll("\\[", "").replaceAll("\\]", "").split(" ");
            String[] column = Column.replaceAll("\\[", "").replaceAll("\\]", "").split(" ");

            for (int i = 0; i < row.length; ++i) {
                byte[] bytes = parseHexBinary(row[i]);
                byte[] bytes1 = parseHexBinary(column[i]);
                ByteBuffer wrapped = ByteBuffer.wrap(bytes); // big-endian by default
                ByteBuffer wrapped1 = ByteBuffer.wrap(bytes1);
                int r = wrapped.getInt();
                int c = wrapped1.getInt();

                result += r * c;
            }
        }
        if (result != 0) {
            System.out.printf("%d\n", result);
            // key 应该是 (i, k)
            context.write(null, new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}

github address


Result analysis

The following results are obtained by running on a distributed cluster.

When the number of mappers is 3 and the number of reducers is 1, the experimental results are as follows:

The time consumed by the CPU is 1426690ms = 142.669s

Since the number of reducers is set to 1, there is no parallel operation in the reduce phase.

The total time consumed by Map is: 251578ms = 251.578s

The total time consumed by the Reducer is: 920940ms = 920.94s

Since we assign the multiplication and addition operations of the matrix to the reducer, the mapper is mainly responsible for the correspondence between the rows and columns of the matrix, so the reducer takes more time.

matrix multiplication

matrix multiplication

When the number of mappers is 3 and the number of reducers is 8, the experimental results are as follows:

matrix multiplication

The total time consumed by Map is: 262805 = 262.805s

The total time consumed by the Reducer is: 1653154ms = 1653.154s

Increasing the number of reducers did not increase the execution speed. The reason may be that when the amount of data is small, one reducer can complete all the work. The addition of reducers increases the time consumption of collaborative work between different reducers and synchronization of all data blocks.


Data file preprocessing

The original data file uses the hexadecimal ASCII code, so it needs to converted to an integer

Hex and bytes and Int conversion

How to read ASCII data similar to hexadecimal format in data file?

Python:

Val = int(struct.unpack("I", binascii.a2b_hex(Matrix[i][k][2:]))[0])

By converting hex to binary, and then using the unpack method to interpret the binary data.

Java:

byte[] bytes = parseHexBinary(row[i]);
ByteBuffer wrapped = ByteBuffer.wrap(bytes); // big-endian by default 
int r = wrapped.getInt();

Read the byte stream, and then use ByteBuffer to convert to an integer.


How to modify the number of mapper and reducer

REF

Note that it is not recommended to actively modify the number of mappers.

Number of mappers set to run are completely dependent on

  1. File Size
  2. Block Size

The MapReduce framework in Hadoop allows us only to

  1. suggest the number of Map tasks for a job
  2. demand that it provide n reducers.

Reference

http://dblab.xmu.edu.cn/blog/820-2/

https://www.geeksforgeeks.org/matrix-multiplication-with-1-mapreduce-step/

https://github.com/shask9/Matrix-Multiplication-Hadoop

https://www.jianshu.com/p/c6c543e4975e

https://docs.python.org/zh-cn/3/library/binascii.html

https://docs.python.org/zh-cn/3/library/struct.html

https://www.journaldev.com/776/string-to-array-java

https://*.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks

https://*.com/questions/28861620/hadoop-map-reduce-total-time-spent-by-all-maps-in-occupied-slots-vs-total-time

相关标签: Java mapreduce