Tuesday, July 7, 2020

GPU-based parallel computations in the browser with GPU.js

I've been tinkering with General Purpose GPU (GPGPU) programming in JavaScript. The original motivation for this was to accelerate the compute-intensive LBM fluid simulation from my previous post.

This video was my launching pad into in-browser GPGPU. It shows how the WebGL API can be used to leverage the massively parallel computational potential of graphics cards. While originally intended for graphics operations in the web browser (a web port of OpenGL, if you will), WebGL can be (ab)used for parallelized general-purpose computations across the hundreds of cores found in today's GPUs. I found out that there are a couple of options for GPGPU in JavaScript. By far the most popular is the GPU.js library, owing to its simplicity and the level of abstraction it provides.
There is also the turbo.js library (if you can call it that; the source isn't much to look at), but I didn't know what to make of it: documentation is non-existent and there are but one or two examples. Maybe that's because it requires writing the kernel code in GLSL, and I don't know GLSL. (A kernel is a piece of code that runs on all the cores of a GPU simultaneously, each instance typically operating on a different part of the same data structure, say different elements of an array, thus allowing for parallel computation.)
One of the cornerstones of GPGPU code (read: kernels), from what I've gathered, is that the same piece of code has to be able to access different parts of a data structure. GPU.js does this beautifully using this.thread.x for 1D array data; this.thread.x and this.thread.y for 2D array data; and this.thread.x, this.thread.y and this.thread.z for 3D array data. Each instance of the kernel running on an independent GPU core is assigned different values for these variables, so each instance can operate on a different part of the given array at the same time.
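For instance, here's a minimal sketch of a 1D kernel that squares every element of a small array (the kernel and variable names are my own, just for illustration):

const gpu = new GPU();

// One kernel instance runs per output element; this.thread.x tells each
// instance which index it is responsible for.
const square = gpu.createKernel(function (arr) {
  return arr[this.thread.x] * arr[this.thread.x];
}).setOutput([4]); // 4 output elements, so 4 parallel instances

console.log(square([1, 2, 3, 4])); // roughly: [1, 4, 9, 16]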
GLSL is a language for writing shaders (a fancy term for kernel code, but used for graphics operations). Maybe I was unwilling to go too deep into the topic, but a few Google searches didn't turn up a similar construct in GLSL, and hence in turbo.js. Maybe that's just how things go for a shading language: perhaps a GLSL kernel instance doesn't need to know which core it is running on, or maybe it does; I don't know. So, after emailing the author of turbo.js with similar inquiries in the hope of some resolution, I just stuck with GPU.js.
There seem to have been a couple more attempts at providing GPGPU functionality over WebGL apart from these two libraries, as I found out in the excellent article titled General-Purpose Computation on GPUs in the Browser Using gpu.js, namely WebCLGL.js and WebMonkeys, but I didn't bother with them. All but GPU.js seem to be dead, discontinued, or unable to garner much attention.
GPU.js is great: kernel code can be written in JavaScript, it has nice abstractions and conveniences, relatively good documentation, great examples, it seems to be in (active?) development, and it seems to have more users. Still, there are issues that I didn't have to do much tinkering to run into. One of these is about large loops in kernels: whenever a loop runs for more than 1,100 or so iterations, on my PC at least, the kernel fails and returns all zeroes. The code example from the documentation in the project's GitHub repo reads roughly as follows:
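// Paraphrased from the GPU.js README: multiply two 512x512 matrices a and b.
const gpu = new GPU();
const multiplyMatrix = gpu.createKernel(function (a, b) {
  let sum = 0;
  for (let i = 0; i < 512; i++) {
    sum += a[this.thread.y][i] * b[i][this.thread.x];
  }
  return sum;
}).setOutput([512, 512]);

const c = multiplyMatrix(a, b);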
The above code multiplies two square matrices a and b, each of size 512. If the size of the matrices (and with it the loop bound) is increased to anything greater than ~1100, the code fails, at least on my computer, as I described before. There is a GitHub issue regarding this exact problem. Apparently it's got to do with the way 'loop unrolling' is done by the (graphics?) driver.
One way to deal with it, according to the discussion in that GitHub issue, seems to be to break the loop up into smaller loops. I tried that myself but didn't get it to work. So I had to divide the matrices into blocks of smaller matrices (a process known as partitioning) and operate on each block as if it were an element of the original matrix. That worked: now I could do a 2000x2000 matrix multiplication using a block size of 1000 and GPU.js wouldn't give me all zeroes. Hoorah!
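Roughly, the partitioned approach looks like the sketch below. This is a simplified reconstruction rather than my exact code: multiplyBlock, addBlock and multiplyPartitioned are illustrative names, and the block size S is assumed to divide the matrix size evenly.

const S = 1000; // block size, kept under the ~1100-iteration limit
const gpu = new GPU();

// Multiplies two S x S blocks; the kernel loop never exceeds S iterations.
const multiplyBlock = gpu.createKernel(function (a, b) {
  let sum = 0;
  for (let i = 0; i < this.constants.S; i++) {
    sum += a[this.thread.y][i] * b[i][this.thread.x];
  }
  return sum;
}, { constants: { S: S }, output: [S, S] });

// Adds two S x S blocks, used to accumulate the partial products.
const addBlock = gpu.createKernel(function (a, b) {
  return a[this.thread.y][this.thread.x] + b[this.thread.y][this.thread.x];
}, { output: [S, S] });

// A, B and the result are (nBlocks x nBlocks) grids of S x S blocks;
// C[I][J] = sum over K of A[I][K] * B[K][J].
function multiplyPartitioned(A, B, nBlocks) {
  const C = [];
  for (let I = 0; I < nBlocks; I++) {
    C.push([]);
    for (let J = 0; J < nBlocks; J++) {
      let acc = multiplyBlock(A[I][0], B[0][J]);
      for (let K = 1; K < nBlocks; K++) {
        acc = addBlock(acc, multiplyBlock(A[I][K], B[K][J]));
      }
      C[I].push(acc);
    }
  }
  return C;
}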

Now some benchmarks on my computer:
10x10 matrix: CPU - 0 ms | GPU - 500 ms
100x100 matrix: CPU - 7 ms | GPU - 1,200 ms
256x256 matrix: CPU - 128 ms | GPU - 2,700 ms
512x512 matrix: CPU - 2,200 ms | GPU - 3,000 ms
750x750 matrix: CPU - 9,400 ms | GPU - 3,200 ms
1000x1000 matrix: CPU - 25,000 ms | GPU - 5,000 ms
2000x2000 matrix: CPU - 224,000 ms | GPU - 40,000 ms (using a block size of 1000; smaller block sizes take longer since fewer computations are done in parallel)
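For reference, here's a minimal sketch of how such a comparison can be timed. This is not my exact benchmark code: a and b stand for 512x512 input arrays, and multiplyMatrix is the GPU.js kernel from the earlier snippet.

// Naive CPU baseline: the classic triple loop.
function multiplyCPU(a, b, n) {
  const c = [];
  for (let y = 0; y < n; y++) {
    const row = new Float32Array(n);
    for (let x = 0; x < n; x++) {
      let sum = 0;
      for (let i = 0; i < n; i++) sum += a[y][i] * b[i][x];
      row[x] = sum;
    }
    c.push(row);
  }
  return c;
}

let t0 = performance.now();
multiplyCPU(a, b, 512);
console.log('CPU:', Math.round(performance.now() - t0), 'ms');

t0 = performance.now();
multiplyMatrix(a, b); // the 512x512 GPU.js kernel shown earlier
console.log('GPU:', Math.round(performance.now() - t0), 'ms');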

It goes without saying that this is not the most optimized matrix multiplication program out there, not by a long shot, and the way the partitioning is done here has a lot of say in how the algorithm performs for larger matrices (this can be gauged from the whopping, rather disappointing jump from 5,000 ms to 40,000 ms in going from a matrix size of 1000 to 2000). Notwithstanding this, matrix multiplication seems to be the de facto hello-world program in parallel-computation land.

That's it for this post. I just wanted to share my experience with this amazing piece of technology and the kind of difference it can make versus the CPU in a variety of parallelizable tasks that go far beyond the simple matrix multiplication problem.

Here's the code I used for the benchmarks.

P.S. My browser (Firefox Developer Edition version 79.0b4, 64-bit) seems to use my built-in Intel HD Graphics (GPU0) and not my dedicated AMD Radeon R7 M265 card (GPU1). I don't know if there's a way to change that.