Sunday, February 4, 2024

Gorkhapatra By Date

I used to visit the official Gorkhapatra epaper site and download the latest epaper to see if there were any PSC vacancies, up until a few months ago. I eventually stopped doing that, but until then, for about two years, every day, I downloaded the daily epaper and uploaded it to DriveHQ. I had managed to exhaust two entire accounts' worth of storage with just Gorkhapatra PDFs. One thing I felt was lacking in the site was the ability to download epapers older than 1 week, and the system was generally clunky and unreliable. This is why Gorkhapatra By Date was conceived.

Initially, I had thought that I had to save the actual PDFs, because I wasn't sure the links would be valid past the 1-week-old mark, but as I've learned in the past two weeks of developing this service, they are. So, I decided to just save the available links and do away with downloading the actual files.

Architecture-wise, the service is really simple. It's made up of two parts. The first is a CodeIgniter-based API server with two endpoints: one retrieves the link to the epaper for the requested date (if present, of course) from a table in the database, and the other simply lists the available dates (both are sketched below). The second part is a simple PHP script, run as a cron job every 6 hours, that scrapes the links from the original epaper site and updates the table with any new ones. The script could technically be run just once a day, typically a few tens of minutes past 12 AM (I've observed over the years that that's around when the day's paper is most likely to be uploaded to the site, though not always), but I've learned not to rely on this pattern.
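For concreteness, here is a minimal sketch of what the two endpoints could look like in CodeIgniter 4. The route paths, controller name, and table/column names are placeholders of my own, not necessarily what the service actually uses:

<?php
// app/Controllers/Epapers.php -- a hypothetical sketch, not the actual code.
// Assumed routes in app/Config/Routes.php:
//   $routes->get('dates', 'Epapers::dates');
//   $routes->get('epaper/(:segment)', 'Epapers::link/$1');

namespace App\Controllers;

class Epapers extends BaseController
{
    // GET /dates -> every date that has a stored link
    public function dates()
    {
        $rows = db_connect()->table('epapers')->select('date')->get()->getResultArray();

        return $this->response->setJSON(array_column($rows, 'date'));
    }

    // GET /epaper/2024-02-04 -> the stored direct PDF link for that date
    public function link(string $date)
    {
        $row = db_connect()->table('epapers')->where('date', $date)->get()->getRowArray();

        return $row
            ? $this->response->setJSON(['link' => $row['link']])
            : $this->response->setStatusCode(404);
    }
}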

The cron script doesn't just scrape the links and blindly insert them into the DB's table, either. The links present on the site are not direct links to the PDFs; however, the direct links are readily obtained with some simple URL string manipulation, so the script does that as well.
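I won't reproduce the site's real URLs here; both URL shapes below are made-up placeholders. The point is only that going from the scraped viewer link to the direct PDF link is plain string work:

<?php
// Hypothetical URL shapes -- the real patterns on the site differ.
$viewerUrl = 'https://epaper.example.com/view?file=2024-02-04.pdf';

// Pull the file name out of the query string...
parse_str(parse_url($viewerUrl, PHP_URL_QUERY), $query);

// ...and splice it into the direct-download path.
$directUrl = 'https://epaper.example.com/files/' . $query['file'];

echo $directUrl; // https://epaper.example.com/files/2024-02-04.pdf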

Also, since we need to be sure which date each PDF belongs to, we cannot just rely on the file name (which is supposed to be a combination of the English and the Nepali dates of the day, but is very often wrong). As a solution, we actually download the contents of the PDFs, parse them (using smalot's PDF parser library for PHP), and obtain their metadata. Even after going through all this, the date specified in the metadata isn't always reliable. So, we also perform a regex pattern match on the text of the first page after the library has parsed it. Only then do we have a reliable date for each file (unless the pattern of the date printed in the PDF changes as well, in which case we'll need to update our regex).
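The date-extraction step looks roughly like the sketch below. The Parser class and the getDetails(), getPages(), and getText() calls are the library's actual API; the metadata key I check and the regex are illustrative assumptions (the real printed-date format on the paper dictates the real pattern):

<?php
require 'vendor/autoload.php';

$pdfUrl = 'https://epaper.example.com/files/2024-02-04.pdf'; // placeholder

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseContent(file_get_contents($pdfUrl));

// First attempt: the document metadata (often unreliable, as noted above).
$details      = $pdf->getDetails();
$metadataDate = $details['CreationDate'] ?? null;

// What we actually trust: a regex over the first page's text.
// The pattern below assumes a printed date like "4 February 2024".
$firstPageText = $pdf->getPages()[0]->getText();
if (preg_match('/\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b/', $firstPageText, $m)) {
    $date = date('Y-m-d', strtotime("$m[1] $m[2] $m[3]"));
}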

The parsing itself turns out to be really CPU-intensive (the script finished parsing 8 PDFs comfortably within a minute on my local machine, while it easily took 25-30 minutes on my shared hosting server), but memory is where I hit a wall. One of the PDFs was 22 pages, and a simple memory_get_usage() call logged right after the call to the parser revealed that the script was eating up ~170MB of memory for that file alone. My shared hosting config allowed 128MB per script, so, as expected, the script crashed upon reaching this particular PDF in the scraped list of 8: "Fatal error: Allowed memory size of 134217728 bytes exhausted ...".

It turns out that in a shared hosting environment, modifying the php.ini file isn't possible. Thankfully, cPanel did have a page where I could change a bunch of different PHP parameters, including memory_limit, which was initially set to 128MB and which I promptly raised to 512MB (the same as my development environment in XAMPP).

Side note: after much back and forth with the hosting support, I learned that, to my initial befuddlement, the change from cPanel modifies not the php.ini file (which, as they pointed out, is global across the server and set to 128MB) but the httpd.conf file (which is separate for each user account on the server), and that httpd.conf takes precedence over php.ini. A memory limit set inside a script has the highest precedence, then httpd.conf, and only then php.ini. I also learned about the php --ini and php --info commands and the phpinfo() function. The first tells you where the php.ini file that the currently installed PHP binary is configured to use is located. The second prints the actual PHP configuration, including memory_limit, and the third is just the PHP function that does the same. Neat stuff.
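For reference, this is what the precedence hierarchy boils down to in practice: ini_set() inside the script outranks everything (assuming the host permits the override at all), and memory_get_usage() is the kind of call that exposed the spike in the first place:

<?php
// Highest-precedence override: raise the limit for this script only.
ini_set('memory_limit', '512M');

echo ini_get('memory_limit'), PHP_EOL; // confirm what's actually in effect

// ... download and parse the PDF here ...

// The kind of logging that revealed the ~170MB spike on the 22-page issue:
printf(
    "Memory: %.1f MB (peak: %.1f MB)\n",
    memory_get_usage(true) / 1048576,
    memory_get_peak_usage(true) / 1048576
);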

Okay, once we have the date and the link for all the available epapers we scraped (usually 8), we just insert them into the table and let MySQL handle the duplicates.
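"Letting MySQL handle the duplicates" just means putting a unique key on the date column and using an INSERT that tolerates collisions. A sketch, with an assumed schema and made-up credentials:

<?php
// Assumed schema: CREATE TABLE epapers (date DATE PRIMARY KEY, link TEXT);
// INSERT IGNORE silently skips dates that already exist; swapping in
// "... ON DUPLICATE KEY UPDATE link = VALUES(link)" would refresh stale links instead.
$scraped = [
    ['date' => '2024-02-04', 'link' => 'https://epaper.example.com/files/2024-02-04.pdf'],
];

$pdo  = new PDO('mysql:host=localhost;dbname=gorkhapatra', 'user', 'pass');
$stmt = $pdo->prepare('INSERT IGNORE INTO epapers (date, link) VALUES (?, ?)');

foreach ($scraped as $row) {
    $stmt->execute([$row['date'], $row['link']]);
}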

The front end is hosted on the main domain (ajashra.com) while the API server resides on a subdomain (api.ajashra.com), so the API server needs to allow CORS with the Access-Control-Allow-Origin header set to "https://ajashra.com". This was new to me, and it took about half a day to fully iron out using the "before" filter feature in CodeIgniter 4. Some cool stuff I learned there: the Same-Origin Policy that necessitates all this is only enforced by web browsers, so while calling the API from front-end code on an origin other than the main domain will fail, the API still works for any non-browser consumer.
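The filter itself ends up being small. Here's a sketch of what a CodeIgniter 4 "before" filter for this could look like (the class name is my own, and it still has to be registered in app/Config/Filters.php):

<?php

namespace App\Filters;

use CodeIgniter\Filters\FilterInterface;
use CodeIgniter\HTTP\RequestInterface;
use CodeIgniter\HTTP\ResponseInterface;

class CorsFilter implements FilterInterface
{
    public function before(RequestInterface $request, $arguments = null)
    {
        $response = service('response');
        $response->setHeader('Access-Control-Allow-Origin', 'https://ajashra.com');
        $response->setHeader('Access-Control-Allow-Methods', 'GET, OPTIONS');
        $response->setHeader('Access-Control-Allow-Headers', 'Content-Type');

        // Answer preflight requests immediately, skipping the controller.
        if (strtolower($request->getMethod()) === 'options') {
            return $response;
        }
    }

    public function after(RequestInterface $request, ResponseInterface $response, $arguments = null)
    {
        // Nothing to do after the controller runs.
    }
}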

The JS library "Vanilla Calendar" is used on the front end to present an intuitive calendar interface where a user can see which dates are available for download. Clicking any available date on the calendar directly downloads the corresponding epaper.


Here's the URL for the service.
