Friday, April 12, 2019

YouTube audio downloader

This is a Python 3.7 script to download DASH audio files from YouTube.
I was trying to revamp my app, Geetify, which used a third-party website to extract OGG audio from YouTube. I wondered whether I could do the extraction on my own. Sure enough, a quick Google search led me here through this StackOverflow post. I honestly didn't think the task could be so straightforward. From the SO post and youtube-dl's source on GitHub, I gathered the following:
  • YouTube stores video and audio files separately and has a bunch of different formats (mp4, webm, audio, video, 1080p, 720p, 480p and so on) for each video
  • The links for these resources are right there in the page source of the corresponding video. Specifically, this information is stored in a JSON config object, seemingly used by a 'ytplayer' (YouTube's video player?)
    Source of a YouTube video page with ytplayer.config object
    ytplayer.config object from the developer console
  • The "args" key of this object holds a "player_response" key whose value is a JSON string.
    player_response string
    player_response object
  • The player_response object contains "streamingData", which in turn contains the key "adaptiveFormats": an array of objects, each corresponding to a different media format of the YouTube video. Each of these objects contains useful information such as the type and size of the media, dimensions, duration and, of course, the direct URL to the media file.
    adaptiveFormats array
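The nesting described in the list above can be sketched in Python roughly as follows. This is only an illustration: it assumes the ytplayer.config JSON text has already been cut out of the page source, the helper name adaptive_formats and the sample data are invented, and the per-format field names (mimeType, contentLength, url) are assumptions about what each entry looks like.

```python
import json


def adaptive_formats(config_json):
    """Return the adaptiveFormats list from a ytplayer.config JSON string."""
    config = json.loads(config_json)
    # player_response is itself a JSON *string*, so it needs a second parse.
    player_response = json.loads(config["args"]["player_response"])
    # Fall back to an empty list if either key is missing.
    return player_response.get("streamingData", {}).get("adaptiveFormats", [])


# Invented sample data that mimics the shape described above.
sample_player_response = {
    "streamingData": {
        "adaptiveFormats": [
            {"mimeType": 'audio/webm; codecs="opus"',
             "contentLength": "3145728",
             "url": "https://example.invalid/audio"},
            {"mimeType": 'video/mp4; codecs="avc1.640028"',
             "contentLength": "92248685",
             "url": "https://example.invalid/video"},
        ]
    }
}
sample_config = json.dumps(
    {"args": {"player_response": json.dumps(sample_player_response)}})

formats = adaptive_formats(sample_config)
# Keep only the audio entries, which is what an audio downloader cares about.
audio_only = [f for f in formats if f["mimeType"].startswith("audio/")]
print(audio_only[0]["url"])
```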
Once I figured out this structure, I quickly fired up my PyCharm IDE and implemented this extraction logic. All was fine except the download speed. youtube-dl downloaded a media file from YouTube way faster than my simple urllib.request.urlretrieve() did. It would take close to three minutes to download a 3 MB audio file. I tried my code with other direct file download links. Curiously, the problem didn't exist outside of YouTube: I was getting close to my connection speed during those downloads. I was convinced it must be Google throttling my download speed. A couple of SO posts further reinforced this suspicion.
Then I tried adding different headers to my download request, having switched to the requests module for ease of handling headers and stuff. The logic was that Google was probably flagging the download request as a bot/scraper's doing, seeing that it carried none of the additional headers that would have been present were it a human trying to watch a video. So, I added a couple of 'normal' headers:
{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-us,en;q=0.5'}
Needless to say, that didn't work. I knew it was hit-or-miss anyway.
I researched more about how popular download managers such as Internet Download Manager perform these large downloads 'properly'.
I looked to see whether the response from the YouTube server carried any information about this and looked up a few of the response headers. Then somehow, I stumbled upon this and thought perhaps I should try the 'Range' HTTP header to request a fixed range of bytes of the file in each request until the entire file had been downloaded. MDN was my reference. When I was done translating this to code, I hit run - and voila! It worked! The file downloaded almost instantly. So the problem wasn't the lack of a user-agent header or some other anti-bot feature. It was my failure to realize that the server wanted to answer media file queries with an HTTP 206 Partial Content response, which wasn't going to happen unless I specified a 'Range' in the request header.
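The ranged download described above could look something like the sketch below. It is not the script's actual code: it uses urllib.request from the standard library instead of requests, the function names are made up, the 1 MiB chunk size is an arbitrary assumption, and total_size is assumed to come from the size field of the chosen format.

```python
import urllib.request


def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def download_in_ranges(url, total_size, out_path, chunk_size=1024 * 1024):
    """Fetch url piece by piece, sending one Range request per chunk."""
    with open(out_path, "wb") as out:
        for start, end in byte_ranges(total_size, chunk_size):
            request = urllib.request.Request(
                url, headers={"Range": f"bytes={start}-{end}"})
            with urllib.request.urlopen(request) as response:
                # A server that honours the range answers 206 Partial Content.
                out.write(response.read())


# The range arithmetic on its own, for a 10-byte file in 4-byte chunks.
print(list(byte_ranges(10, 4)))  # [(0, 3), (4, 7), (8, 9)]
```

Both ends of an HTTP byte range are inclusive, which is why each chunk ends at start + chunk_size - 1.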
Anyway, it was a good exercise and a good problem to solve. Now to Android Studio for the actual reason this started.
Code is available here

Edit:
ytplayer.config.args.player_response.streamingData doesn't always seem to have the key 'adaptiveFormats'. This path worked fine days ago, but all of a sudden today the key just up and quit being there. I don't know what caused it. Anyway, referring back to the original youtube-dl source from GitHub and with some Google Chrome Developer Tools fiddling, I discovered that ytplayer.config.args.adaptive_fmts holds essentially the same data in URL-encoded string format. The entries for the different adaptive formats are delimited by commas ','. This fact can be used to tokenize the string. Each token can further be split on ampersands '&', which separate the different parameters pertaining to the individual media. The media size is represented by the key 'clen' and the media type by 'type'. The direct link to the media still rests in the key 'url'.
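A rough sketch of that tokenizing, using only urllib.parse from the standard library. The function name parse_adaptive_fmts and the sample string are invented for illustration, and the sketch assumes that any comma inside a value is percent-encoded, so a plain split on ',' is safe.

```python
from urllib.parse import parse_qs


def parse_adaptive_fmts(adaptive_fmts):
    """Split adaptive_fmts into one dict of decoded parameters per format."""
    formats = []
    for token in adaptive_fmts.split(","):
        # parse_qs splits on '&', URL-decodes each value once and returns lists.
        params = parse_qs(token)
        formats.append({key: values[0] for key, values in params.items()})
    return formats


# Invented sample with two formats, shaped like the string described above.
sample = (
    "clen=3145728&itag=251&type=audio%2Fwebm%3B+codecs%3D%22opus%22"
    "&url=https%3A%2F%2Fexample.invalid%2Fa%3Fmm%3D31%252C26,"
    "clen=92248685&itag=137&type=video%2Fmp4%3B+codecs%3D%22avc1.640028%22"
    "&url=https%3A%2F%2Fexample.invalid%2Fv%3Fmm%3D31%252C26"
)
for fmt in parse_adaptive_fmts(sample):
    print(fmt["type"], fmt["clen"], fmt["url"])
```

Note that parse_qs drops parameters with empty values by default, which is fine here since only 'type', 'clen' and 'url' are needed.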

Edit:
Another twist. ytplayer.config.args.adaptive_fmts always seems to contain URLs for the various media formats of a video. However, as I learned the hard way, those URLs do not always work: depending on whatever factors YouTube uses, sometimes videos - even ones that could be downloaded seamlessly before - cannot be downloaded. Attempting to download media using the URLs from adaptive_fmts gives an HTTP 403 'Access is denied' error. It turns out (and of course this is from studying youtube-dl's source) that for some videos, some of the time, YouTube uses a kind of encryption. Say the following is a token - one media file's information - obtained by splitting the ytplayer.config.args.adaptive_fmts string from this video's webpage on commas ',':
sp=signature&xtags=&quality_label=1080p&projection_type=1&fps=24&bitrate=4384085&s=22842844852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7777&lmt=1540713480713402&type=video%2Fmp4%3B+codecs%3D%22avc1.640028%22&size=1920x1080&init=0-714&clen=92248685&index=715-1250&itag=137&url=https%3A%2F%2Fr2---sn-fapo3ox25a-3uhs.googlevideo.com%2Fvideoplayback%3Fid%3Do-AIgcdy13GCdDPGcw75ItSjM5wanZJWPZAxICUV_Ux5hm%26aitags%3D133%252C134%252C135%252C136%252C137%252C160%252C242%252C243%252C244%252C247%252C248%252C278%26itag%3D137%26source%3Dyoutube%26requiressl%3Dyes%26mm%3D31%252C26%26mn%3Dsn-fapo3ox25a-3uhs%252Csn-npoe7ner%26ms%3Dau%252Conr%26mv%3Dm%26pcm2cms%3Dyes%26pl%3D24%26ei%3DbvK6XMbALMiI1AbXkbWgAw%26initcwndbps%3D360000%26mime%3Dvideo%252Fmp4%26gir%3Dyes%26clen%3D92248685%26dur%3D219.719%26lmt%3D1540713480713402%26mt%3D1555755499%26fvip%3D2%26keepalive%3Dyes%26c%3DWEB%26txp%3D5432432%26ip%3D77.94.26.114%26ipbits%3D0%26expire%3D1555777230%26sparams%3Dip%252Cipbits%252Cexpire%252Cid%252Caitags%252Csource%252Crequiressl%252Cmm%252Cmn%252Cms%252Cmv%252Cpcm2cms%252Cpl%252Cei%252Cinitcwndbps%252Cmime%252Cgir%252Cclen%252Cdur%252Clmt%26key%3Dyt8
If we directly use the media URL here (the part after 'url='), after URL decoding of course, we're going to be greeted with an HTTP 403 Access is denied error. Strangely, doing exactly that worked just yesterday. Regardless of how and when YouTube decides to apply encryption to a video, we can infer that it is in use by checking for a parameter 's' in this token - or by trying to download with the URL naively and checking for a 403 status code in the response. youtube-dl checks for the presence of the 's' param, and that's what I do as well. In the above token, we can clearly see the following string:
s=22842844852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7777
The part after 's=' is the encrypted signature. We need to decrypt it - in fact, YouTube just scrambles the characters around - and tack it onto the end of the URL as the parameter 'signature'. To find the decryption function, we need to look at ytplayer.config.assets. If this object contains the key 'js', its value is the path to a JavaScript file responsible for the signature decryption among a ton of other things - probably setting things up for the HTML5 player? If there is no 'js' key, the video probably uses the SWF player (from what I can gather from the youtube-dl source). For this video, at the time of writing, ytplayer.config.assets is the following:
{css: "/yts/cssbin/player-vflJHbzHK/www-player-webp.css", 
js: "/yts/jsbin/player_ias-vfloNowYZ/en_US/base.js"}
I'm not a hundred percent sure, but judging by the way YouTube likes to randomize things, this JSON object could change at any time, even for this video. Anyway, going to youtube.com/yts/jsbin/player_ias-vfloNowYZ/en_US/base.js and downloading the file (~1.2 MB), one thing is immediately clear: it is a mess. youtube-dl uses some regex filters to search for and extract the decryption function in this JavaScript file. youtube-dl also has a light JavaScript interpreter built expressly for executing the JS code frequently employed in decryption functions. Though the decryption functions vary between videos, the range of JS functionality used in them is limited: split(), slice(), splice(), reverse(), join(), operators, function calls and so on. Using the JS interpreter to run the extracted decryption code on the 's' value from above gives the unscrambled signature. In this example, the decrypted signature turns out to be
42852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7
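To give a feel for the kind of character shuffling involved, here is a toy sketch in Python. It does not extract or interpret anything from base.js the way youtube-dl does, and the step list is hypothetical: it is not the sequence that produced the signature above. The operation names ('reverse', 'drop', 'swap') are my own labels for the reverse/splice-style manipulations mentioned earlier.

```python
def unscramble(signature, steps):
    """Apply a list of (operation, argument) steps to a scrambled signature."""
    chars = list(signature)
    for operation, argument in steps:
        if operation == "reverse":
            chars.reverse()
        elif operation == "drop":
            # Like JS splice(0, n): discard the first n characters.
            del chars[:argument]
        elif operation == "swap":
            # Exchange the first character with the one at index n (wrapped).
            index = argument % len(chars)
            chars[0], chars[index] = chars[index], chars[0]
        else:
            raise ValueError(f"unknown operation: {operation}")
    return "".join(chars)


# Hypothetical step list, NOT taken from any real player file.
HYPOTHETICAL_STEPS = [("swap", 3), ("reverse", None), ("drop", 2)]
print(unscramble("abcdefgh", HYPOTHETICAL_STEPS))  # feacbd
```

In a real downloader the step list would have to be derived from the player JavaScript for each player version, which is the part youtube-dl's regex filters and interpreter take care of.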
So, all we need to do to get a valid URL is tack on '&signature=42852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7' to the end of the URL in the token presented earlier:
https%3A%2F%2Fr2---sn-fapo3ox25a-3uhs.googlevideo.com%2Fvideoplayback%3Fid%3Do-AIgcdy13GCdDPGcw75ItSjM5wanZJWPZAxICUV_Ux5hm%26aitags%3D133%252C134%252C135%252C136%252C137%252C160%252C242%252C243%252C244%252C247%252C248%252C278%26itag%3D137%26source%3Dyoutube%26requiressl%3Dyes%26mm%3D31%252C26%26mn%3Dsn-fapo3ox25a-3uhs%252Csn-npoe7ner%26ms%3Dau%252Conr%26mv%3Dm%26pcm2cms%3Dyes%26pl%3D24%26ei%3DbvK6XMbALMiI1AbXkbWgAw%26initcwndbps%3D360000%26mime%3Dvideo%252Fmp4%26gir%3Dyes%26clen%3D92248685%26dur%3D219.719%26lmt%3D1540713480713402%26mt%3D1555755499%26fvip%3D2%26keepalive%3Dyes%26c%3DWEB%26txp%3D5432432%26ip%3D77.94.26.114%26ipbits%3D0%26expire%3D1555777230%26sparams%3Dip%252Cipbits%252Cexpire%252Cid%252Caitags%252Csource%252Crequiressl%252Cmm%252Cmn%252Cms%252Cmv%252Cpcm2cms%252Cpl%252Cei%252Cinitcwndbps%252Cmime%252Cgir%252Cclen%252Cdur%252Clmt%26key%3Dyt8&signature=42852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7
URL-decoding the above (done twice here for readability) gives:
https://r2---sn-fapo3ox25a-3uhs.googlevideo.com/videoplayback?id=o-AIgcdy13GCdDPGcw75ItSjM5wanZJWPZAxICUV_Ux5hm&aitags=133,134,135,136,137,160,242,243,244,247,248,278&itag=137&source=youtube&requiressl=yes&mm=31,26&mn=sn-fapo3ox25a-3uhs,sn-npoe7ner&ms=au,onr&mv=m&pcm2cms=yes&pl=24&ei=bvK6XMbALMiI1AbXkbWgAw&initcwndbps=360000&mime=video/mp4&gir=yes&clen=92248685&dur=219.719&lmt=1540713480713402&mt=1555755499&fvip=2&keepalive=yes&c=WEB&txp=5432432&ip=77.94.26.114&ipbits=0&expire=1555777230&sparams=ip,ipbits,expire,id,aitags,source,requiressl,mm,mn,ms,mv,pcm2cms,pl,ei,initcwndbps,mime,gir,clen,dur,lmt&key=yt8&signature=42852A956D7573EA30727E30F9DE2013D783BFF5.61C376EEF3FEE2A54A57DBC972DAC4E972C270A7
This URL should be downloadable. Of course, the above link doesn't work because I've changed the 'ip' parameter. But done properly for any video, these steps should provide the right URLs for the required media files.
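Put together, the 's' check and the signature append might be sketched like this. The function name media_url, the sample tokens and the stand-in fake_unscramble are all invented for illustration; a real unscrambling function has to come from the player JavaScript as described above.

```python
from urllib.parse import parse_qs


def media_url(token, unscramble_signature):
    """Return the download URL for one adaptive_fmts token."""
    params = {key: values[0] for key, values in parse_qs(token).items()}
    url = params["url"]  # parse_qs has already URL-decoded this once
    if "s" in params:
        # Scrambled signature present: unscramble it and append it to the URL.
        url += "&signature=" + unscramble_signature(params["s"])
    return url


def fake_unscramble(scrambled):
    """Stand-in for the real routine; simply reverses the string."""
    return scrambled[::-1]


# Invented tokens: one without a signature, one with an 's' parameter.
plain_token = "itag=251&url=https%3A%2F%2Fexample.invalid%2Fa%3Fid%3D1"
signed_token = "s=CBA.321&itag=137&url=https%3A%2F%2Fexample.invalid%2Fv%3Fid%3D2"

print(media_url(plain_token, fake_unscramble))
print(media_url(signed_token, fake_unscramble))
```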

Btw, youtube-dl seems to use ytplayer.config.args.adaptive_fmts directly to get hold of the media URLs, checking for encrypted signatures and decrypting them when needed. It doesn't first check whether downloadable links are available in ytplayer.config.args.player_response.streamingData.adaptiveFormats. I might have to switch to using ytplayer.config.args.adaptive_fmts right off the bat myself, since it always seems to have the links to all available media anyway.
Also, interesting lines in youtube-dl's youtube.py:
youtube.py lines 1794, 1831, 1891, 1130 (loads signature from cache), 1144 (loads decryption function from player code)