Wednesday, May 29, 2024

MCQ Scraper

This is a simple .NET 8 console application that scrapes MCQs from different online sources and consolidates them into a single SQLite DB. Check the schema of the table in the attached picture below.

Currently, it only collects Civil Engineering-related questions from IndiaBix for a select few categories, but this can be extended to other sources as well.

The program itself is pretty brittle in its current state: no allowance has been made for exceptions that I didn't encounter during development. Logging is a pretty basic custom implementation as well, and it was an afterthought (the log file is reopened and written to every time a new line is logged, so be sure to disable it if you feel it is hurting performance too much). Still, even in its current form, the whole operation completed in a couple of minutes.
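The logging cost mentioned above comes from reopening the file for every line. As a rough illustration only (in Python rather than the project's C#, and not the actual implementation), a logger can keep one handle open for the whole run and support being switched off. The names `BufferedLogger` and `scraper.log` are made up for this sketch.

```python
import os
import tempfile


class BufferedLogger:
    """Keeps one file handle open for the whole run instead of reopening per line."""

    def __init__(self, path, enabled=True):
        self.enabled = enabled
        # Opened once; the default buffering batches the writes to disk.
        self._file = open(path, "a", encoding="utf-8") if enabled else None

    def log(self, message):
        # A disabled logger silently drops the message.
        if self._file is not None:
            self._file.write(message + "\n")

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


def demo():
    # Hypothetical log file in a throwaway directory.
    path = os.path.join(tempfile.mkdtemp(), "scraper.log")
    with BufferedLogger(path) as logger:
        for i in range(3):
            logger.log(f"fetched page {i}")
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()
```

Passing `enabled=False` gives the "disable it" switch without touching the call sites.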

Regarding dependencies, it uses simple Regex matching when possible and also HtmlAgilityPack for parsing the scraped HTML pages into manageable DOMs for easier filtering.
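To give a feel for that split between DOM-style filtering and Regex matching, here is a minimal Python sketch. It uses the standard library's `html.parser` as a stand-in for HtmlAgilityPack, and the markup in `SAMPLE` (the `question` and `option` class names included) is invented for illustration; it is not the actual structure of any scraped page.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, made up for this sketch.
SAMPLE = (
    '<div class="question"><p>12. Sample question text?</p></div>'
    '<div class="option">A. First choice</div>'
    '<div class="option">B. Second choice</div>'
)


class ClassTextCollector(HTMLParser):
    """Collects the text inside every <div> whose class list contains wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching <div>
        self.chunks = []    # text pieces of the current match
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # nested <div> inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def collect(html, wanted_class):
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Regex is enough for small, regular pieces such as a leading question number.
NUMBERED = re.compile(r"\s*(\d+)\.\s*(.+)", re.DOTALL)


def split_number(text):
    match = NUMBERED.match(text)
    return (int(match.group(1)), match.group(2)) if match else (None, text)
```

The parser handles the nesting, and the Regex only touches text that has already been isolated, which keeps the patterns short.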

The resulting output DB file is available here.

Console output from the program in action

Top few entries from the output DB

UPDATE (31 May, 2024):
I've worked a few improvements into the program and pushed the code changes to the repository. It can now also create Anki flashcards, so that the scraped content is actually useful. Just as importantly, there is a bugfix/QoL enhancement: the question texts/prompts are now checked for <img> tags too (a possibility I had ignored before), and the referenced images are downloaded, base64-encoded and substituted in place, ready for use in HTML as-is.
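The image step can be pictured roughly like this. It is a Python sketch of the general technique (rewriting each `src` into a base64 data URI), not the program's actual C# code. The `inline_images` name and the `fetch` callable are assumptions of the sketch, and the pattern only handles double-quoted `src` attributes.

```python
import base64
import mimetypes
import re

# Captures: (everything up to the opening quote of src)(the URL)(the closing quote)
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, fetch):
    """Replace each <img src="..."> URL with a base64 data URI.

    `fetch` is any callable mapping a URL to raw bytes (an HTTP GET in practice).
    """

    def replace(match):
        url = match.group(2)
        if url.startswith("data:"):
            return match.group(0)  # already inlined, leave untouched
        # Guess the MIME type from the file extension in the URL.
        mime = mimetypes.guess_type(url)[0] or "application/octet-stream"
        encoded = base64.b64encode(fetch(url)).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

In real use `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`; injecting it keeps the rewriting logic testable without network access.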

For the past day or two, I had been wondering whether I should create a separate app (I was thinking maybe a Vue-powered SPA) to actually make use of the DB. But a couple of hours' worth of internet research led me to believe that it would be far more efficient to make this DB available to Anki instead of reinventing the wheel. Oh, and this apparently really popular app was first brought to my attention by ChatGPT-4o when I asked it "what is the best strategy to memorize around 4500 mcqs for a competitive exam?" Anki was literally the first option in the list it returned, suggested as a Spaced Repetition System (SRS) for "active learning".

I had to download and play around with it for a bit before I could actually understand its workings. I also searched for any and all add-ons the Anki community had for multiple-choice questions and tried one or two of them. But I didn't really like them too much, and I didn't know how I could export the questions from the DB to a form that these add-ons would understand either. So, taking the advice of a wise Reddit user somewhere, I ditched the idea of using any sort of add-on for MCQs altogether and began exploring the import and export formats used by Anki.

It turned out that the program can import/export in a few different formats, but I chose the simple plaintext format for obvious reasons. I created a new default deck (a term I gather means a collection of cards sharing some similarity; so in this case, a deck can represent a category such as building-materials or surveying). Then I created two cards manually from inside Anki, each with the question prompt and the multiple options on the front side and the answer on the back side. This is also where I learned that Anki has first-class support for HTML in its flashcard contents, which was perfect for me since I didn't have to worry about all the tags in my scraped questions and options, not to mention the <img> tags.

Anyway, I then exported the cards from Anki's File menu (still not sure if it exports just a single deck or all of them to the same file) to a plaintext file and studied the structure of the exported file. It was really intuitive, and I figured it would be easy enough to automate converting the questions from my scraped DB to this format that Anki understood (I had also successfully tried modifying the export file externally and importing it back into Anki, without any hiccup). So, that's what I did. I simply added a class library project to the Visual Studio solution and added the functionality to produce ready-to-import Anki plaintext files from the SQLite DB produced by MCQer. And voila:

The *.txt files are the import-ready Anki plaintext files produced by our program

Confirming an import in Anki

Browsing the imported deck of cards (notice the images)

Another file, another deck of cards imported

A flashcard, when in action
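
For anyone curious what the DB-to-Anki conversion described in the update amounts to, here is a rough Python sketch (the real thing is a C# class library, and this is not its code). The `questions` table and its column names are hypothetical, since the actual schema is only shown in the picture. The `#separator:tab` and `#html:true` header lines are, as far as I understand, what recent Anki versions recognize at the top of a plaintext import file. Each note is one line: front side, a tab, then the back side.

```python
import csv
import io
import sqlite3


def export_deck(conn, category):
    """Return Anki plaintext for one category: header lines, then front<TAB>back rows."""
    out = io.StringIO()
    out.write("#separator:tab\n#html:true\n")
    # csv.writer quotes any field that contains a tab, a quote or a newline.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    rows = conn.execute(
        "SELECT prompt, option_a, option_b, option_c, option_d, answer "
        "FROM questions WHERE category = ? ORDER BY id",
        (category,),
    )
    for prompt, a, b, c, d, answer in rows:
        labelled = [f"{letter}. {text}" for letter, text in zip("ABCD", (a, b, c, d))]
        front = "<br>".join([prompt, *labelled])  # prompt, then one option per line
        writer.writerow([front, answer])
    return out.getvalue()


def demo():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE questions (id INTEGER PRIMARY KEY, category TEXT, prompt TEXT, "
        "option_a TEXT, option_b TEXT, option_c TEXT, option_d TEXT, answer TEXT)"
    )
    conn.execute(
        "INSERT INTO questions "
        "(category, prompt, option_a, option_b, option_c, option_d, answer) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("surveying", "Sample prompt?", "one", "two", "three", "four", "B"),
    )
    return export_deck(conn, "surveying")
```

Writing one file per category maps naturally onto one deck per category, and because the fields are plain HTML, inlined images survive the round trip unchanged.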