Wednesday, May 29, 2024

MCQ Scraper

This is a simple .NET 8 console application that scrapes MCQs from different online sources and consolidates them into a single SQLite DB. Check the schema of the table in the attached picture below.

Currently, it only collects Civil Engineering-related questions from IndiaBix for a select few categories, but this can be extended to other sources as well.

The program itself is pretty brittle in its current state - no allowance has been made for exceptions that I didn't encounter during development. Logging is a pretty basic custom implementation as well, and it was an afterthought (the log file is opened and written to every single time a new line is logged, so be sure to disable it if you feel it hurting performance too much). Still, in its current form, the whole operation completes in a couple of minutes.
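In spirit, the logger is roughly this (a minimal sketch, not the actual code; the class name and file path are made up):

    // A minimal sketch of the naive logging approach described above: the log
    // file is re-opened and appended to on every single call. The class name
    // and file path here are hypothetical.
    using System;
    using System.IO;

    public static class SimpleLogger
    {
        private const string LogPath = "scraper.log";    // hypothetical log file
        public static bool Enabled { get; set; } = true; // flip off if the I/O cost bites

        public static void Log(string message)
        {
            if (!Enabled) return;
            // Opens, appends and closes the file on every call - cheap to write,
            // not exactly kind to the disk.
            File.AppendAllText(LogPath, $"{DateTime.Now:O} {message}{Environment.NewLine}");
        }
    }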

Regarding dependencies, it uses simple regex matching where possible, and HtmlAgilityPack for parsing the scraped HTML pages into manageable DOMs for easier filtering.
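To give an idea of how the pieces fit together, fetching a category page and pulling out question blocks with HtmlAgilityPack looks roughly like this (a sketch only - the URL and the XPath class name below are placeholders, not the ones the scraper actually uses):

    // Rough sketch: download a page and query it with HtmlAgilityPack.
    // The URL and the XPath class name are placeholders, not the real ones.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class ScrapeSketch
    {
        static async Task Main()
        {
            using var http = new HttpClient();
            // Example category URL; the real list of URLs lives in the scraper.
            var html = await http.GetStringAsync("https://www.indiabix.com/civil-engineering/surveying/");

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Select the question containers; the class name here is a guess.
            var nodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'question-container')]");
            if (nodes == null) return; // SelectNodes returns null when nothing matches

            foreach (var node in nodes)
                Console.WriteLine(node.InnerText.Trim());
        }
    }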

The resulting output DB file is available here.

Console output from the program in action


Top few entries from the output DB


UPDATE (31 May, 2024):
I've worked a few improvements into the program and pushed the code changes to the repository. It can now also create Anki flashcards, so the scraped content is actually useful. Just as importantly, there's a bugfix/QoL enhancement: <img> tags are now also checked for in the question texts/prompts - a possibility I had ignored before - and the referenced images are downloaded, base64-encoded and substituted in place, ready for use in HTML as-is.
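In essence, the <img> handling boils down to something like the following (a simplified sketch, not the actual code - the real version also has to handle relative URLs and proper MIME types, and the names here are illustrative):

    // Simplified sketch of inlining <img> tags: find each src attribute, download
    // the image, and swap the URL for a base64 data URI so the HTML needs no
    // external references.
    using System;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    static class ImageInliner
    {
        static readonly HttpClient Http = new HttpClient();

        public static async Task<string> InlineImagesAsync(string html)
        {
            foreach (Match m in Regex.Matches(html, "<img[^>]+src=\"([^\"]+)\""))
            {
                var url = m.Groups[1].Value;
                var bytes = await Http.GetByteArrayAsync(url);

                // Assumes PNG for brevity; a real converter would pick the MIME
                // type from the response or the file extension.
                var dataUri = $"data:image/png;base64,{Convert.ToBase64String(bytes)}";
                html = html.Replace(url, dataUri);
            }
            return html;
        }
    }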

For the past day or two, I was wondering if I should create a separate app (I was thinking maybe a Vue-powered SPA) to actually make use of the DB, but a couple of hours' worth of internet research led me to believe that it would be a lot more efficient to make this DB available to Anki instead of reinventing the wheel all over again. Oh, and this apparently really popular app was first brought to my attention by ChatGPT-4o when I asked it "what is the best strategy to memorize around 4500 mcqs for a competitive exam?" Anki was literally the first option in the list it returned, where it suggested using it as a Spaced Repetition System (SRS) for "active learning". I had to download and play around with it for a bit before I could actually understand how it works.

I also searched for any and all add-ons the Anki community had for multiple choice questions and tried one or two of them. But I didn't really like them much, and I didn't know how I could export the questions from the DB into a form those add-ons would understand either. So, taking the advice of a wise Reddit user somewhere, I just ditched the idea of using any sort of add-on for MCQs altogether and began exploring the import and export formats used by Anki. It turned out that the program can import/export in a few different formats, but I chose the simple plaintext format for obvious reasons and studied its structure.

I created a new default deck - a term which, as I gather, means a collection of cards sharing some similarity; so in this case, a deck can represent a category, such as building-materials or surveying. Then I created two cards manually from inside Anki, each one following the pattern of the question prompt and the multiple options on the front side and the answer on the back side of the card. This is also where I learned that Anki has first-class support for HTML in its flashcard contents, which was perfect for me since I didn't have to worry about all the tags in my scraped questions and options, not to mention the <img> tags.

Anyway, I then exported the cards from Anki's File menu (still not sure if it exports just a single deck or all of them to the same file) to a plaintext file and studied the structure of the exported file. It was really intuitive, and I figured it would be easy enough to automate converting the questions from my scraped DB into this format that Anki understands (I had also successfully tried externally modifying the export file and importing it back into Anki, without any hiccup). So, that's what I did. I simply added a class library project to the Visual Studio solution and added the functionality to produce ready-to-import Anki plaintext files from the SQLite DB produced by MCQer.
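The conversion itself is conceptually just "one SQLite row in, one tab-separated line out". Roughly along these lines - a sketch only, with made-up table and column names; the "#separator"/"#html" header lines mirror what my own Anki plaintext export contained, so double-check them against yours:

    // Sketch: read scraped questions out of the SQLite DB and emit an
    // Anki-importable plaintext file (tab-separated, HTML allowed).
    // Table, column and file names below are hypothetical, not MCQer's actual ones.
    using System.IO;
    using System.Text;
    using Microsoft.Data.Sqlite;

    class AnkiExportSketch
    {
        static void Main()
        {
            using var conn = new SqliteConnection("Data Source=mcqs.db"); // hypothetical DB file
            conn.Open();

            using var cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT Question, Options, Answer FROM Mcq WHERE Category = 'surveying'";

            var sb = new StringBuilder();
            sb.AppendLine("#separator:tab"); // header lines as seen in a recent Anki export
            sb.AppendLine("#html:true");

            using var reader = cmd.ExecuteReader();
            while (reader.Read())
            {
                // Front: prompt + options; back: answer. A real converter would also
                // escape/quote any field that itself contains tabs or newlines.
                var front = $"{reader.GetString(0)}<br>{reader.GetString(1)}";
                var back = reader.GetString(2);
                sb.AppendLine($"{front}\t{back}");
            }

            File.WriteAllText("surveying.txt", sb.ToString());
        }
    }

And voila: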

The *.txt files are the import-ready Anki plaintext files produced by our program

Confirming an import in Anki

Browsing the imported deck of cards (notice the images)

Another file, another deck of cards imported

A flashcard in action

Sunday, February 4, 2024

Gorkhapatra By Date

Up until a few months ago, I used to visit the official Gorkhapatra epaper site and download the latest epaper to see if there were any PSC vacancies. I eventually stopped doing that, but until then, for about two years, I downloaded the daily epaper every day and uploaded it to DriveHQ. I managed to exhaust two entire accounts' worth of storage with just Gorkhapatra PDFs. One thing I felt was lacking in the site was the ability to download epapers older than a week, and the system was generally clunky and unreliable. That is why Gorkhapatra By Date was conceived.

Initially, I had thought I would have to save the actual PDFs, because I wasn't sure the links would stay valid past the one-week mark, but as I've learned in the past two weeks of developing this service, they do. So, I decided to just save the available links and do away with downloading the actual files.

Architecture-wise, the service is really simple and is made up of two parts. The first is a CodeIgniter-based API server that retrieves the link to the epaper for the requested date (if present, of course) from a table in the database; it also has another endpoint that simply lists the available dates. The second is a simple PHP script, running as a cron job every 6 hours, that scrapes the links from the original epaper site and updates the table with any new link available. The script could technically be run just once a day, typically a few tens of minutes past 12 AM (I've observed over the years that that's around the most likely upload time for the day's paper, though not always), but I've learned not to rely on this pattern.

The cron script doesn't just scrape the links and blindly insert them into the DB's table either. The original links present on the site are not direct links to the PDFs; however, the direct links are readily available with some simple URL string manipulation, so it does that as well.

Also, since we need to be sure which date a PDF belongs to, we cannot just rely on its file name (which is supposed to be a combination of the English and Nepali dates of the day, but is very often wrong). As a solution, we actually download the contents of the PDFs, parse them (using smalot's pdf parser library for PHP) and obtain their metadata. Even after going through all this, the date specified in the metadata isn't always reliable. So, we perform regex pattern matching on the first page of each PDF after it has been parsed by the library. Only then do we have a reliable date for each file (unless the pattern of the date printed in the PDF changes as well, in which case we'll need to update our regex).
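The actual script does all of this in PHP with smalot's parser, but the "regex over the first page" step is simple enough to sketch. Purely as an illustration (in C#, to match the rest of the code on this blog, and with a stand-in date pattern rather than the real one):

    // Illustration only: given the text of the first page (extracted by the PDF
    // parser), pull out a printed date with a regex. The pattern below is a
    // stand-in; the production regex matches whatever format Gorkhapatra prints.
    using System;
    using System.Globalization;
    using System.Text.RegularExpressions;

    static class EpaperDateSketch
    {
        public static DateTime? ExtractDate(string firstPageText)
        {
            // Matches dates like "29 May 2024"; adjust to the paper's real format.
            var m = Regex.Match(firstPageText,
                @"\b(\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4})\b",
                RegexOptions.IgnoreCase);
            if (!m.Success) return null; // log and skip when the printed pattern changes

            return DateTime.ParseExact(m.Groups[1].Value, "d MMMM yyyy",
                                       CultureInfo.InvariantCulture, DateTimeStyles.AllowWhiteSpaces);
        }
    }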

The parsing itself turns out to be really CPU-intensive (the script finished parsing 8 PDFs comfortably within a minute on my local machine, while it easily took 25-30 on my shared hosting server), but memory is where I hit a wall. It turns out one of the PDFs was 22 pages, and a simple memory_get_usage() call logged in the script after the call to the parser revealed that the script was eating up ~170MB of memory for that file. My shared hosting config allowed 128MB per script, so, as expected, the script crashed upon reaching this particular PDF among the scraped list of 8: "Fatal Error: Allowed Memory Size of 134217728 Bytes Exhausted ...". It also turns out that, in a shared hosting environment, modifying the php.ini file isn't possible. Thankfully, cPanel did have a page where I could change a bunch of different PHP parameters, including memory_limit, which was initially set to 128MB and which I promptly raised to 512MB (the same as my development environment in XAMPP).

Side note: I learned after much back and forth with the hosting support that the change from cPanel modifies, to my initial befuddlement, not the php.ini file (which, as they pointed out, is global across the server and set to 128MB) but the httpd.conf file (which is separate for each user account on the server), and that the latter takes precedence over php.ini. The memory limit set inside a script has the highest precedence, then the httpd.conf file, and only then php.ini. I also learned about the php --ini and php --info commands and the phpinfo() function. The first tells you where the php.ini file that the currently installed PHP binary is configured to use is located; the second prints the actual PHP configuration, including memory_limit; and the third is just the PHP function that does the same. Neat stuff.

Okay, once we have the date and the link for all the available epapers we scraped (usually 8), we just insert them into the table and let MySQL handle the duplicates.
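For what it's worth, the "let MySQL handle the duplicates" bit just relies on a unique key on the date (or link) column so that re-scraped rows are silently skipped. The real code is in the PHP cron script; sketched here in C# with MySqlConnector, assuming a hypothetical epapers table with a unique key on paper_date:

    // Sketch of a duplicate-safe insert, assuming a unique key on paper_date.
    // Table and column names are hypothetical; the real logic lives in the PHP
    // cron script.
    using System;
    using MySqlConnector;

    static class LinkStoreSketch
    {
        public static void SaveLink(MySqlConnection conn, DateTime paperDate, string pdfUrl)
        {
            using var cmd = conn.CreateCommand();
            // INSERT IGNORE silently drops rows that would violate the unique key,
            // so running the scraper repeatedly never creates duplicates.
            cmd.CommandText = "INSERT IGNORE INTO epapers (paper_date, pdf_url) VALUES (@date, @url)";
            cmd.Parameters.AddWithValue("@date", paperDate);
            cmd.Parameters.AddWithValue("@url", pdfUrl);
            cmd.ExecuteNonQuery();
        }
    }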

The front end is hosted on the main domain (ajashra.com) while the API server resides on a subdomain (api.ajashra.com), so the API server needs to allow CORS, with the Access-Control-Allow-Origin header set to "https://ajashra.com". This was new for me, and it took about half a day to fully iron out using the "before" filter feature in CodeIgniter4. Some cool stuff I learned there. The Same-Origin Policy that necessitated this is only enforced by web browsers, so while calling the API from front-end code on an origin other than the main domain will fail, the API still works for any non-browser consumer.
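The whole fix really comes down to attaching one header to every API response before it goes out. The actual implementation is a CodeIgniter4 "before" filter in PHP; purely to illustrate the idea (in C#, like the rest of the code on this blog), the equivalent as a minimal-API middleware would look something like this:

    // Illustration of the CORS header logic as generic middleware; the real
    // service does this with a CodeIgniter4 before filter instead.
    var builder = WebApplication.CreateBuilder(args);
    var app = builder.Build();

    app.Use(async (context, next) =>
    {
        // Allow the front end on the main domain to call this API from the browser.
        context.Response.Headers["Access-Control-Allow-Origin"] = "https://ajashra.com";
        await next();
    });

    app.MapGet("/dates", () => new[] { "2024-02-04" }); // placeholder endpoint
    app.Run();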

The JS library - "Vanilla Calendar" was used for the front-end to present an intuitive calendar interface where a user can see which dates are available for download. Clicking on any available date on the calendar directly leads to the corresponding epaper being downloaded.


Here's the URL for the service.