Friday, October 4, 2024

Nepal Companies Scraper

A recent YouTube video led me to this site where we can obtain the details of any company registered in Nepal using its registration number. I found out that the number is enumerable i.e. you can enter any number in there and if one or more companies exist that correspond to the number, a few details such as the name of the company, established date, address, type and last contact date with the government office are shown. The fact that all of it wasn't available in bulk was a problem that I could tackle while getting to learn the latest in parallel programming in C# (it's been a while since I last wrote something in the language, or at least feels like it). So, a few days ago, I fired up Visual Studio and created a new console app targeting .NET 8 - the latest and greatest of the framework.

The first order of business was obviously to figure out the back-end API calls made by the webpage's interface upon clicking the Search button. Chrome devtools were as helpful here as always. I identified four different form parameters for the POST request's body and the cookie that was required to be set in the header. I made it work manually in Postman, then transitioned to code it up as simple as I could in C# - first using a simple for loop. I had to make it work sequentially before I could move on to multi-threading/concurrency. It did work really well. But of course, the speed wasn't all that great. Each POST request took around 130 ms, so if I were to check 1 million registration numbers, it would take 3 months. Obviously, I wanted to do it faster, and I knew there was something that the much acclaimed TPL (Task Parallel Library) of the .NET/Framework ecosystem could do here. I had used threads (extensively) before, and the threadpool's higher level methods for queuing up blocks of code to be executed by different threads as they become available. However, I had never had to dive into the depths of the TPL. Tasks were still relatively new to me. Sure, I knew how to use methods that returned them, await them and so on, but I hadn't ever felt a need to do Task-based (which I now know is _somewhat_ synonymous with asynchronous) programming. Most things I've done are CPU-heavy (if that) rather than extensively network/filesystem-oriented. So, it makes sense now with the luxury of hindsight, that I was able to coast through in my C# journey up to this point without having to learn much about handling repeated inherently I/O-bound operations that require the program to wait for response from an external source , such as performing HTTP requests. I mean I could probably put together something using just plain-old threads to perform somewhat close to the optimal that's achievable using the TPL (basically partition the full range, and synchronize the threads), but I wanted to learn this new thing. So, the first thing I tried was the humble yet powerful Parallel.For() loop. The only experience guiding me about this beast was that one time when I had written a prime number finder to compare the number-crunching speeds of C#, Java and C. This thing had wiped the floor back then. So, I was pretty confident that it would do great this time as well. Upon giving it a whirl, I found a few problems with it. First, there were an unacceptably large number of failed probes; it looked like the server was unable to serve however many threads it spun up. Second, it looked like my CPU was being utilized like crazy for what little it actually delivered. I now know that Parallel.For() doesn't do a great job of working optimally with asynchronous operations. At first glance, it seems reasonable that because HTTP requests fired in each iteration/thread are inherently blocking (and I did not write the request code in the idiomatic asynchronous style using await on every async method such as httpClient.GetAsync(), instead doing .Result and blocking the thread entirely - I did not know the difference!) of course it makes sense for another thread to start while the thread is blocked. However, I later learned that that is not the best way to maximize simultaneous requests. Just increasing the number of threads does not automatically mean better "throughput" (how many registration numbers got probed at a time) because - a) the server might not allow too many requests at the same time from the same IP, b) the threads are getting blocked for no good reason (this is with the .Result approach), taking precious CPU resources away from the availability of actually useful operations. Now I realize that both these drawbacks of the naive Parallel.For() loop approach could be mitigated by the use of encapsulated Tasks within the parallelized block of code and using locking mechanisms such as Semaphores, but it's a bit of a hack. While I have provided explanations on the cons of Parallel.For() with what I now know, back when I was actually testing it out, I did not bother getting into the weeds at all - I just knew that Parallel.For() wasn't working. Any internet search on "C# web scraping simultaneous requests" was yelling that Parallel.For() was for CPU-bound operations as opposed to I/O-bound operations such as network requests, which was better served with concurrent asynchronous programming, namely Task.WhenAll(). So, that's where I went next.

The approach with Task.WhenAll() is that you create an expression-tree using LINQ that when enumerated basically creates a Task for a specific registration number. It goes like: Enumerable.Range(1, 1_000_000).Select(async i => await YourHttpRequestMethodAsync(i);)

The above (pseudo)code basically maps each number from 1 to 1M to a separate asynchronous lambda expression that checks for companies corresponding to the given registration number (and does some side-effective thing updating a ConcurrentBag or whatever). The result is an IEnumerable of Task that can be run simultaneously using Task.WhenAll(). Of course it's possible to do it using a simple for loop instead, to generate a new Task for each invocation of YourHttpRequestMethodAsync(), but the above method appears to be more established because it doesn't end up actually creating all the Tasks at once - it only gives an IEnumerable, not a List or some other collection that requires holding them all in memory for no good reason most of the time - but that's a secondary point here. Okay, enough tangents! Now, this method worked great. The suggested way to limit concurrency here is to make use of a form of shared locks, specifically a Semaphore or SemaphoreSlim, before the actual invocation of the YourHttpRequestMethodAsync() method so that only a defined number of tasks run are active at any given time. Of note here is that a running Task is different from a running Thread, something I found intriguing (it's still something I'm not totally clear on even now). Even the same thread may be home to different tasks and a task may be run in different threads. It's all handled by the runtime, and as you can see, it's a higher level abstraction than threads. It is also what I suspect allows it to be so performant with little effort from the developer's standpoint (the runtime however, I'm sure, needs to do a lot more gymnastics here than when just using threads). I let my code run with this approach overnight and in about 6 hours or so, it had processed through all 1 million registration numbers and yielded just under 400k companies. By this point, I had also graduated from collecting all the harvested companies in memory and dumping them all once the whole process ended, to writing them in batches to a SQLite DB, for which I wrote another class called DBStore for handling the inputs from multiple threads, reporting metrics regarding its internal queue size, fill and drain rates, and to modulate efficiently the offloading rate from the queue and so on, but that's a whole another discussion - one that I'm omitting here since much of the code is self-documenting and the class is heavily commented. Suffice to say, this mini-project was mostly complete. However, Visual Studio's intellisense had shown me another Parallel.For-like method i.e. Parallel.ForAsync(). Given that I had seen discussions on the internet (mostly StackOverflow) regarding Parallel.ForEachAsync(), I was curious how well for ForAsync() would work. So, that's what I tried next.

Parallel.ForAsync() is a brand new method, only added to the language back in November of 2023. But I was targeting .NET 8 - the latest and the greatest, so I had no concerns. While Parallel.For() didn't natively understand the asynchronous programming model and so an asynchronous block of code given to it wouldn't be executed as one might expect unless a Task wrapping around the code was used as a workaround, ForAsync() was built for this purpose! You could plop in the same async i => await YourHttpRequestMethodAsync(i); asynchronous lambda without any change that you used for Task.WhenAll(), and it would work just fine! The upside was you didn't need to manually create Semaphore locks or bother with creating an enumerable of Tasks. You can simply control the "degree of parallelism" (note the level of abstraction in terminology, which I think is telling) using ParallelOptions and MaxDegreeOfParallelism (which is passed in as an argument to the method) and the runtime will do the rest. It's not all that different from Task.WhenAll() but it feels like it's the proper way to do something like this - it handles CPU-intensive works just as the regular Parallel.For() would, while giving first-class support to asynchronous operations such as network requests. Performance wise, comparing the two methods, I felt like I could push Parallel.ForAsync() more than I could Task.WhenAll() even when not placing any limit on concurrency using SemaphoreSlims; it seemed like Task.WhenAll() is pretty self-balancing and self-limiting and generally robust. Contrast this with Parallel.ForAsync()'s MaxDegreeOfParallelism, where it was obvious that it was much easier to overdo "parallelism" and get throttled right away. For the actual numbers (I ran a few simple tests for a period of 2 minutes using the two approaches), take a look at my repo.

There are more details in the readme file in my repo, such as the interesting problem of registration numbers and company mixups due to shared instances of HttpClient (with the same jsession cookie) being used in multiple threads at the same time, how the overall network speed plays a bigger role than the specific parallel programming technique used, and so on.

Here's the rough MDD I jotted down throughout this project:

naive parallel.for loop was implemented without allowing for server-side rate limits/timeouts. performance was poor. the same HttpClient instance with the same jsession cookie was utilised for all parallel iterations, so likely caused server-side session mixups leading to the wrong companies being reported for a requested regNum. also, lots of failed regNums (851744) for 154875 harvested companies at the end. ran overnight, about 12 hours for probing regNums 1 to 1 million. this was with MaxDegreeOfParallelism set to 4. had tested it out with many numbers including no limits. no limits mode basically led to instant request denials. higher numbers led to extremely low yields (harvested:tested ratio). so settled on 4. even then, unsatisfactory performance, as mentioned above.


in about 2 hours, harvested companies stabilised at 339489 using task.whenall(). stabilisation due to no company details being available for queried regNum. rate started out at around 1000 regNums processed (and an equal no. of harvested companies, with no failed regNums) per 10 seconds and slowly dropped to around 500 per 10 seconds prolly due to server-side throttling. failed regNums caused only due to server-side issues/throttling. for about 424k tested regNums, 700 failed regNums. very gradual rise in failed regNums. this approach applied the instantiation of a brand new HttpClient on every task to avoid potential server-side jsession mixups. this approach also has exponential back-off in case of timeouts and 3-times retry logic. processed 999815 regNums in 6 hours. as for the code, i noticed no difference between giving .Select() an async lambda or a normal sync lambda.


12648 harvested in 120 seconds using sqlite offloading (dynamic rate)


354862 harvested in 19785 seconds using sqlite offloading with simple sleep method. 519 failed regNums.


 Note: I stopped at 1 million because I wasn't getting any hits after around the 330k mark of harvested companies (which normally got reached at around the 2 hour mark using either of the two techniques - probing the whole 1 million took another 4 of course, just without many results).