Friday, October 4, 2024

Nepal Companies Scraper

A recent YouTube video led me to this site, where you can obtain the details of any company registered in Nepal using its registration number. I found out that the number is enumerable, i.e. you can enter any number in there, and if one or more companies correspond to it, a few details are shown: the name of the company, its established date, address, type and the date of its last contact with the government office. The fact that none of this was available in bulk was a problem I could tackle while getting to learn the latest in parallel programming in C# (it's been a while since I last wrote something in the language, or at least it feels like it). So, a few days ago, I fired up Visual Studio and created a new console app targeting .NET 8 - the latest and greatest of the framework.

The first order of business was obviously to figure out the back-end API calls made by the webpage's interface upon clicking the Search button. Chrome devtools were as helpful here as always. I identified four different form parameters for the POST request's body and the cookie that needed to be set in the header. I made it work manually in Postman, then transitioned to coding it up as simply as I could in C# - first using a simple for loop. I had to make it work sequentially before I could move on to multi-threading/concurrency. It did work really well. But of course, the speed wasn't all that great. Each POST request took around 130 ms, so checking 1 million registration numbers sequentially would take about 36 hours. Obviously, I wanted to do it faster, and I knew there was something that the much acclaimed TPL (Task Parallel Library) of the .NET/Framework ecosystem could do here. I had used threads (extensively) before, and the threadpool's higher-level methods for queuing up blocks of code to be executed by different threads as they become available. However, I had never had to dive into the depths of the TPL. Tasks were still relatively new to me. Sure, I knew how to use methods that returned them, await them and so on, but I had never felt a need to do Task-based (which I now know is _somewhat_ synonymous with asynchronous) programming. Most things I've done are CPU-heavy (if that) rather than extensively network/filesystem-oriented. So, it makes sense now, with the luxury of hindsight, that I was able to coast through my C# journey up to this point without having to learn much about handling repeated, inherently I/O-bound operations that require the program to wait for a response from an external source, such as performing HTTP requests. I mean, I could probably put together something using just plain old threads that would perform somewhat close to the optimum achievable with the TPL (basically partition the full range and synchronize the threads), but I wanted to learn this new thing. So, the first thing I tried was the humble yet powerful Parallel.For() loop. The only experience guiding me about this beast was that one time when I had written a prime number finder to compare the number-crunching speeds of C#, Java and C. It had wiped the floor with the competition back then. So, I was pretty confident that it would do great this time as well. Upon giving it a whirl, I found a few problems with it. First, there was an unacceptably large number of failed probes; it looked like the server was unable to serve however many threads it spun up. Second, my CPU was being utilized like crazy for what little it actually delivered. I now know that Parallel.For() doesn't do a great job of working optimally with asynchronous operations. At first glance, it seems reasonable: the HTTP request fired in each iteration/thread is inherently blocking (and I did not write the request code in the idiomatic asynchronous style using await on every async method such as httpClient.GetAsync(), instead calling .Result and blocking the thread entirely - I did not know the difference!), so of course it makes sense for another thread to start while the current one is blocked. However, I later learned that this is not the best way to maximize simultaneous requests.
Just increasing the number of threads does not automatically mean better "throughput" (how many registration numbers got probed at a time) because a) the server might not allow too many requests at the same time from the same IP, and b) the threads are getting blocked for no good reason (this is with the .Result approach), taking precious CPU resources away from actually useful work. Now I realize that both these drawbacks of the naive Parallel.For() approach could be mitigated by encapsulating Tasks within the parallelized block of code and using locking mechanisms such as semaphores, but that's a bit of a hack. While I have explained the cons of Parallel.For() with what I now know, back when I was actually testing it out, I did not bother getting into the weeds at all - I just knew that Parallel.For() wasn't working. Every internet search on "C# web scraping simultaneous requests" was yelling that Parallel.For() was for CPU-bound operations, whereas I/O-bound operations such as network requests were better served by concurrent asynchronous programming, namely Task.WhenAll(). So, that's where I went next.
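For the curious, here's a minimal sketch of roughly what that naive first attempt looked like. The endpoint, form field name, cookie value and the "hit" check are placeholders (the real request has four form parameters and a session cookie whose names I'm omitting), and the blocking .Result calls are the exact mistake discussed above:

    // Naive Parallel.For() sketch - placeholder endpoint, form field, cookie and hit-check, NOT the real ones.
    using System.Collections.Concurrent;

    var results = new ConcurrentBag<string>();
    var handler = new HttpClientHandler { UseCookies = false }; // send the manually set Cookie header as-is
    using var httpClient = new HttpClient(handler);
    httpClient.DefaultRequestHeaders.Add("Cookie", "JSESSIONID=placeholder"); // placeholder session cookie

    Parallel.For(1, 1_000_001, regNum =>
    {
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["regNum"] = regNum.ToString(), // hypothetical field name; three more fields omitted
        });

        // Blocking on .Result inside the loop body - the worker thread just sits
        // there waiting on the network, which is the problem described above.
        var response = httpClient.PostAsync("https://example.org/company/search", form).Result;
        var html = response.Content.ReadAsStringAsync().Result;

        if (html.Contains("Established Date")) // naive placeholder check for a hit
            results.Add($"{regNum}: found");
    });

    Console.WriteLine($"Harvested {results.Count} hits.");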

The approach with Task.WhenAll() is that you build a deferred LINQ query that, when enumerated, creates a Task for a specific registration number. It goes like: Enumerable.Range(1, 1_000_000).Select(async i => await YourHttpRequestMethodAsync(i))

The above (pseudo)code basically maps each number from 1 to 1M to a separate asynchronous lambda that checks for companies corresponding to the given registration number (and does some side-effecting thing like updating a ConcurrentBag or whatever). The result is an IEnumerable of Tasks that can be run simultaneously using Task.WhenAll(). Of course, it's possible to do this with a simple for loop instead, generating a new Task for each invocation of YourHttpRequestMethodAsync(), but the above method appears to be more established because it doesn't actually create all the Tasks up front - it only gives an IEnumerable, not a List or some other collection that would hold them all in memory for no good reason most of the time - but that's a secondary point here. Okay, enough tangents! Now, this method worked great. The suggested way to limit concurrency here is to use a form of shared lock, specifically a Semaphore or SemaphoreSlim, around the actual invocation of the YourHttpRequestMethodAsync() method so that only a defined number of tasks are active at any given time. Of note here is that a running Task is different from a running Thread, something I found intriguing (and something I'm still not totally clear on even now). The same thread may be home to different tasks, and a task may run on different threads. It's all handled by the runtime, and as you can see, it's a higher-level abstraction than threads. That is also what I suspect allows it to be so performant with so little effort from the developer's standpoint (the runtime, I'm sure, needs to do a lot more gymnastics here than when just using threads). I let my code run with this approach overnight, and in about 6 hours or so, it had processed all 1 million registration numbers and yielded just under 400k companies. By this point, I had also graduated from collecting all the harvested companies in memory and dumping them once the whole process ended, to writing them in batches to a SQLite DB. For this I wrote another class called DBStore that handles inputs from multiple threads, reports metrics on its internal queue size and fill/drain rates, modulates the offloading rate from the queue, and so on - but that's a whole other discussion, one that I'm omitting here since much of the code is self-documenting and the class is heavily commented. Suffice it to say, this mini-project was mostly complete. However, Visual Studio's IntelliSense had shown me another Parallel.For-like method, namely Parallel.ForAsync(). Given that I had seen discussions on the internet (mostly StackOverflow) regarding Parallel.ForEachAsync(), I was curious how well ForAsync() would work. So, that's what I tried next.
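Here's a minimal sketch of that shape. YourHttpRequestMethodAsync() stands in for the real probe method, the URL is a placeholder, and the concurrency cap of 50 is arbitrary:

    // Task.WhenAll() with a SemaphoreSlim cap - placeholder probe method and URL.
    using System.Collections.Concurrent;

    var harvested = new ConcurrentBag<string>();
    using var throttle = new SemaphoreSlim(50); // at most 50 probes in flight

    var probes = Enumerable.Range(1, 1_000_000).Select(async regNum =>
    {
        await throttle.WaitAsync();
        try
        {
            var company = await YourHttpRequestMethodAsync(regNum); // null when no company corresponds to regNum
            if (company is not null)
                harvested.Add(company);
        }
        finally
        {
            throttle.Release();
        }
    });

    await Task.WhenAll(probes); // enumerating the deferred query starts every probe
    Console.WriteLine($"Harvested {harvested.Count} companies.");

    async Task<string?> YourHttpRequestMethodAsync(int regNum)
    {
        // A fresh HttpClient per probe sidesteps the shared-session mixups mentioned later;
        // the real method does the POST, parses the response, retries on failure, etc.
        using var client = new HttpClient();
        var response = await client.GetAsync($"https://example.org/company/{regNum}");
        return response.IsSuccessStatusCode ? await response.Content.ReadAsStringAsync() : null;
    }

The semaphore wait lives inside the lambda, so enumerating the query itself stays cheap; probes queue up on the semaphore instead of all hammering the server at once.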

Parallel.ForAsync() is a brand new method, only added to the framework with .NET 8 back in November of 2023. But I was targeting .NET 8 - the latest and the greatest, so I had no concerns. While Parallel.For() didn't natively understand the asynchronous programming model, and so an asynchronous block of code given to it wouldn't be executed as one might expect unless a Task wrapping the code was used as a workaround, ForAsync() was built for this purpose! You could plop in essentially the same async i => await YourHttpRequestMethodAsync(i) lambda that you used for Task.WhenAll() (ForAsync()'s version also receives a CancellationToken), and it would work just fine! The upside was you didn't need to manually create semaphore locks or bother with creating an enumerable of Tasks. You simply control the "degree of parallelism" (note the level of abstraction in terminology, which I think is telling) using ParallelOptions.MaxDegreeOfParallelism (the options object is passed in as an argument to the method) and the runtime does the rest. It's not all that different from Task.WhenAll(), but it feels like the proper way to do something like this - it handles CPU-intensive work just as the regular Parallel.For() would, while giving first-class support to asynchronous operations such as network requests. Performance-wise, comparing the two methods, I felt like I could push Parallel.ForAsync() more than I could Task.WhenAll(), even when not placing any limit on concurrency using SemaphoreSlims; it seemed like Task.WhenAll() is pretty self-balancing, self-limiting and generally robust. Contrast this with Parallel.ForAsync()'s MaxDegreeOfParallelism, where it was obvious that it was much easier to overdo the "parallelism" and get throttled right away. For the actual numbers (I ran a few simple tests for a period of 2 minutes using the two approaches), take a look at my repo.
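And the Parallel.ForAsync() equivalent - again with the placeholder probe method, and an arbitrary MaxDegreeOfParallelism rather than whatever the tuned value in the repo is:

    // Parallel.ForAsync() sketch (.NET 8+) - placeholder probe method and parallelism value.
    var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };

    await Parallel.ForAsync(1, 1_000_001, options, async (regNum, cancellationToken) =>
    {
        // Same async work as before; no semaphore needed - the runtime keeps
        // at most MaxDegreeOfParallelism probes in flight.
        await YourHttpRequestMethodAsync(regNum);
    });

    // Same placeholder probe as in the Task.WhenAll() sketch above.
    async Task<string?> YourHttpRequestMethodAsync(int regNum)
    {
        using var client = new HttpClient();
        var response = await client.GetAsync($"https://example.org/company/{regNum}");
        return response.IsSuccessStatusCode ? await response.Content.ReadAsStringAsync() : null;
    }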

There are more details in the readme file in my repo, such as the interesting problem of registration number/company mixups caused by shared instances of HttpClient (with the same jsession cookie) being used in multiple threads at the same time, how the overall network speed plays a bigger role than the specific parallel programming technique used, and so on.

Here's the rough MDD I jotted down throughout this project:

naive parallel.for loop was implemented without allowing for server-side rate limits/timeouts. performance was poor. the same HttpClient instance with the same jsession cookie was utilised for all parallel iterations, so likely caused server-side session mixups leading to the wrong companies being reported for a requested regNum. also, lots of failed regNums (851744) for 154875 harvested companies at the end. ran overnight, about 12 hours for probing regNums 1 to 1 million. this was with MaxDegreeOfParallelism set to 4. had tested it out with many numbers including no limits. no limits mode basically led to instant request denials. higher numbers led to extremely low yields (harvested:tested ratio). so settled on 4. even then, unsatisfactory performance, as mentioned above.


in about 2 hours, harvested companies stabilised at 339489 using task.whenall(). stabilisation due to no company details being available for queried regNum. rate started out at around 1000 regNums processed (and an equal no. of harvested companies, with no failed regNums) per 10 seconds and slowly dropped to around 500 per 10 seconds prolly due to server-side throttling. failed regNums caused only due to server-side issues/throttling. for about 424k tested regNums, 700 failed regNums. very gradual rise in failed regNums. this approach applied the instantiation of a brand new HttpClient on every task to avoid potential server-side jsession mixups. this approach also has exponential back-off in case of timeouts and 3-times retry logic. processed 999815 regNums in 6 hours. as for the code, i noticed no difference between giving .Select() an async lambda or a normal sync lambda.


12648 harvested in 120 seconds using sqlite offloading (dynamic rate)


354862 harvested in 19785 seconds using sqlite offloading with simple sleep method. 519 failed regNums.


Note: I stopped at 1 million because I wasn't getting any hits after around the 330k mark of harvested companies (which normally got reached at around the 2-hour mark using either of the two techniques - probing the whole 1 million took another 4 hours of course, just without many results).

Wednesday, July 31, 2024

ConstDisp

With dark themes nearly everywhere these days, I have been noticing that my eyes have a hard time whenever I see white/near-white renderings on my screen - whatever the specifics (webpages, native applications, videos, etc.). I wanted a program that would automatically adjust the brightness to fit a target brightness I specify. ConstDisp is the result. I first fiddled with a few WMI queries that actually change the screen brightness, as one would using, say, a keyboard combination (on laptops anyway; not sure if that works with most external monitors), but I ditched that idea because it didn't feel granular enough, nor did it seem like it would work with all display hardware. This program works by cheating. All it does is create a topmost window whose background color we choose and whose opacity we modulate to achieve a user-specified target for the "average brightness" of the screen. That's mostly it. Of course, the way it calculates "average brightness" is arbitrary, but it's good enough for me. This quantity is simply the mean of all red, green and blue channel values of the pixels of the screen. There could be other ways to get this metric, but that's how it's done right now. I wanted to do something open source in C++, so I chose it; for the GUI, I used my SimWin library. In the process, I also added to SimWin a few more features that this project needed.
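In symbols, for a W by H capture of the screen, that "average brightness" metric is just the grand mean over the three channels (each channel value in 0-255):

    % Average brightness as described above: the mean of every pixel's R, G and B values.
    B_{\text{avg}} = \frac{1}{3WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( R_{x,y} + G_{x,y} + B_{x,y} \right)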

Creating the GUI for the program was not fun (really missed C#'s RAD) but implementing the core functionality was, and I got to learn a lot about the language.

Here's a screenshot:




Here's the repo and the binary.

Note: There's no app icon, so it appears as a generic executable. The language standard used here is C++14, but most modern Windows versions should have no problem running the statically linked binary.

Wednesday, May 29, 2024

MCQ Scraper

This is a simple .NET 8 console application that scrapes MCQs from different online sources and consolidates them into a single SQLite DB. Check the schema of the table in the attached picture below.

Currently, it only collects Civil Engineering-related questions from IndiaBix for a select few categories, but this can be extended to other sources as well.

The program itself is pretty brittle in its current state - no allowance has been made for exceptions that I didn't encounter during development. Logging is a pretty basic custom implementation as well, and it was an afterthought (the log file is opened and written to every time a new line is logged, so be sure to disable it if you feel it hitting performance too much). Still, in its current form, the whole operation completed in a matter of a couple of minutes.

Regarding dependencies, it uses simple Regex matching when possible and also HtmlAgilityPack for parsing the scraped HTML pages into manageable DOMs for easier filtering.
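For a flavor of what that looks like, here's a rough sketch - the category URL and the XPath selector are illustrative placeholders, not the ones the program actually uses against IndiaBix:

    // Hypothetical HtmlAgilityPack usage - the URL and XPath below are placeholders.
    using HtmlAgilityPack;

    var web = new HtmlWeb();
    HtmlDocument doc = await web.LoadFromWebAsync("https://www.indiabix.com/civil-engineering/surveying/");

    // Grab each question container and keep its inner HTML so the tags survive for later use.
    var questionNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'question')]");
    foreach (var node in questionNodes ?? Enumerable.Empty<HtmlNode>())
    {
        string prompt = node.InnerHtml.Trim();
        Console.WriteLine(prompt);
    }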

The resulting output DB file is available here.

Console output from the program in action


Top few entries from the output DB


UPDATE (31 May, 2024):
I've worked in a few improvements to the program and pushed the code changes to the repository. It can now also create Anki flashcards, so that the scraped content is actually useful. Just as importantly, there's a bugfix/QoL enhancement: <img> tags are now checked for in the question texts/prompts as well - a possibility I had ignored before - and the referenced images are downloaded, base64-encoded and substituted in place, ready for use in HTML as-is.
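The gist of that image-inlining step is simple enough. Here's a rough sketch - the regex is simplified, the image type is assumed to be PNG, and the sample question is made up; see the repo for the actual implementation:

    // Rough sketch of inlining <img> sources as base64 data URIs.
    using System.Text.RegularExpressions;

    using var http = new HttpClient();
    string question = "<p>Identify the instrument: <img src=\"https://example.org/img/theodolite.png\"></p>";
    Console.WriteLine(await InlineImagesAsync(question, http));

    async Task<string> InlineImagesAsync(string html, HttpClient client)
    {
        // Find src="..." attributes inside <img> tags (simplified pattern).
        foreach (Match match in Regex.Matches(html, "<img[^>]+src=\"([^\"]+)\""))
        {
            string url = match.Groups[1].Value;
            byte[] bytes = await client.GetByteArrayAsync(url);
            string dataUri = $"data:image/png;base64,{Convert.ToBase64String(bytes)}"; // assumes PNG
            html = html.Replace(url, dataUri); // swap the remote URL for the embedded image
        }
        return html;
    }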

For the past day or two, I was wondering if I should create a separate app (I was thinking maybe a Vue-powered SPA) to actually make use of the DB, but a couple of hours' worth of internet research led me to believe that it would be a lot more efficient if this DB could be made available to Anki instead of reinventing the wheel all over again. Oh, and this apparently really popular app was first brought to my attention by ChatGPT-4o when I asked it "what is the best strategy to memorize around 4500 mcqs for a competitive exam?" Anki was literally the first option in the list it returned, where it was suggested as a Spaced Repetition System (SRS) for "active learning". I had to download and play around with it for a bit before I could actually understand its workings. I also searched for any and all add-ons the Anki community had for Multiple Choice Questions and tried one or two of them. But I didn't really like them much - and I didn't know how I could export the questions from the DB into a form that these add-ons would understand either. So, taking the advice of a wise Reddit user somewhere, I ditched the idea of using any sort of add-on for MCQs altogether and began exploring the import and export format used by Anki. It turned out that the program can import/export in a few different formats, but I chose the simple plaintext format for obvious reasons and studied its structure. I created a new default deck - a term I have gathered means a collection of cards sharing some similarity; so in this case, a deck can represent a category, such as building-materials, surveying, etc. Then I created two cards manually from inside Anki, each one following the pattern of the question prompt and the multiple options on the front side and the answer on the back side of the card. This is also where I learned that Anki has first-class support for HTML in its flashcard contents, which was perfect for me since I didn't have to worry about all the tags in my scraped questions and options, not to mention the <img> tags. Anyway, I then exported the cards from Anki's File menu (still not sure if it exports just a single deck or all of them to the same file) to a plaintext file and studied the structure of the exported file. It was really intuitive, and I figured it would be easy enough to automate converting the questions from my scraped DB to this format that Anki understood (I had also successfully tried externally modifying the export file and importing it back into Anki, without any hiccup). So, that's what I did. I simply added a class library project to the Visual Studio solution and added in the functionality to produce ready-to-import Anki plaintext files from the SQLite DB produced by MCQer (a rough sketch of the conversion follows the screenshots below). And voila:

The *.txt files are the import-ready Anki plaintext files produced by our program

Confirming an import in Anki

Browsing the imported deck of cards (notice the images)

Another file, another deck of cards imported

A flashcard, when in action
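As promised above, the conversion itself boils down to writing tab-separated lines with a couple of header directives. A sketch, assuming a hypothetical table/column layout (requires the Microsoft.Data.Sqlite package; match the headers and columns to whatever your own test export from Anki looks like):

    // Sketch of producing an Anki-importable plaintext file from the SQLite DB.
    // Table/column names are hypothetical; header lines follow Anki's plaintext format.
    using Microsoft.Data.Sqlite;

    using var connection = new SqliteConnection("Data Source=mcqs.db");
    connection.Open();

    using var command = connection.CreateCommand();
    command.CommandText = "SELECT Question, Options, Answer FROM Mcq WHERE Category = 'surveying'"; // hypothetical schema

    using var writer = new StreamWriter("surveying.txt");
    writer.WriteLine("#separator:tab");
    writer.WriteLine("#html:true"); // fields are raw HTML, including the inlined <img> tags

    using var reader = command.ExecuteReader();
    while (reader.Read())
    {
        // Front of the card = prompt + options, back = answer; a tab separates the two fields.
        string front = $"{reader.GetString(0)}<br>{reader.GetString(1)}";
        string back = reader.GetString(2);
        writer.WriteLine($"{front}\t{back}");
    }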

Sunday, February 4, 2024

Gorkhapatra By Date

I used to visit the official Gorkhapatra epaper site and download the latest epaper to see if there were any PSC vacancies, up until a few months ago. I eventually stopped doing that, but until then, for about two years, every day, I downloaded the daily epaper and uploaded it to DriveHQ. I had managed to exhaust two entire accounts' worth of storage with just Gorkhapatra pdf's. One thing I felt was lacking in the site was the ability to download epapers older than 1 week, and the system was generally clunky and unreliable. That is why Gorkhapatra By Date was conceived.

Initially, I had thought I would have to save the actual pdf's because I wasn't sure the links would remain valid past the 1-week mark, but as I've learned in the past two weeks of developing this service, they do. So, I decided to just save the available links and do away with downloading the actual files.

Architecture-wise, the service is really simple. It's made up of two parts, one of which is just a CodeIgniter-based API server that retrieves the link to the epaper for the requested date (if present, of course) from a table in the database. It also has another endpoint that simply lists the available dates. The links themselves are scraped from the original epaper site by a simple PHP script running as a cron job every 6 hours, and the table is updated with any new link available. The script could technically be run just once a day, typically a few tens of minutes past 12 AM (I've observed over the years that that's around the most likely time for the day's paper to be uploaded to the site, though not always), but I've learned not to rely on this pattern.

The cron script doesn't just scrape the links and blindly insert them into the DB's table either. The original links present on the site are not direct links to the pdf's; however, the direct links are readily available with some simple URL string manipulation, so the script does that as well.

Also, since we need to be sure which date a pdf belongs to, we cannot just rely on its file name (which is supposed to be a combination of the English and Nepali dates of the day, but is very often wrong). As a solution, we actually download the contents of the pdf's, parse them (using smalot's pdf parser library for PHP) and obtain their metadata. Even after going through all this, the date specified in the metadata isn't always reliable. So, we perform regex pattern matching on the first page of each pdf after it has been parsed by the library. Only then do we have a reliable date for each file (unless the pattern of the date printed in the pdf changes as well, in which case we'll need to update our regex).

The parsing itself turns out to be really CPU-intensive (the script finished parsing 8 pdf's comfortably within a minute on my local machine, while it easily took 25-30 on my shared hosting server), but memory is where I hit a wall. Turns out, one of the pdf's was 22 pages, and a simple memory_get_usage() call logged in the script after the call to the parser revealed that the script was eating up ~170MB of memory for that file. My shared hosting config was set to 128MB per script, so, as was expected, the script crashed upon reaching this particular pdf among the scraped list of 8: "Fatal Error: Allowed Memory Size of 134217728 Bytes Exhausted ...". Turns out, in a shared hosting environment, modifying the php.ini file isn't possible. Thankfully, cPanel did have a page where I could change a bunch of different parameters for PHP, including memory_limit, which was initially set to 128MB and which I promptly raised to 512MB (the same as my development environment in XAMPP). Side note: I learnt after much back and forth with the hosting support that the change from cPanel modifies, to my utter befuddlement initially, not the php.ini file (which, as they pointed out, is global across the server and is set to 128MB) but the httpd.conf file (which is separate for each user account on the server), and which takes precedence over php.ini. I learned that the memory limit set inside a script has the highest precedence, then the httpd.conf file, and only then php.ini. I also learned about the php --ini and php --info commands and the phpinfo() function. The first tells you where the php.ini file that the currently installed PHP binary is configured to use is located. The second gives you the actual PHP configuration values, including memory_limit, and the third is just the PHP function that does the same. Neat stuff.

Okay, once we have the date and the link for all the available epapers we scraped (usually 8), we just insert them into the table and let MySQL handle the duplicates.

The front end is hosted on the main domain (ajashra.com) while the API server resides on a subdomain (api.ajashra.com), so the API server needs to allow CORS with the Access-Control-Allow-Origin header set to "https://ajashra.com". This was new to me, and it took about half a day to fully iron out using the "before" filter feature in CodeIgniter4. Some cool stuff I learned there. The Same-Origin Policy that necessitated this is only enforced by web browsers, so while calling the API from front-end code on an origin other than the main domain will fail, the API still works for any non-browser consumer.

The JS library "Vanilla Calendar" was used for the front end to present an intuitive calendar interface where a user can see which dates are available for download. Clicking on any available date on the calendar directly downloads the corresponding epaper.


Here's the URL for the service.