Finding And Fixing Broken YouTube IDs On GiantBomb.com

Edited By paulwgraham

A while ago I came across a video on giantbomb.com that couldn't play because of a bad YouTube video ID. I decided to dig deeper and ended up finding hundreds of broken YouTube IDs and some misplaced videos.

The Videos

Giant Bomb has over 14,000 videos on the site. When a user visits a video page the video will either be served by Giant Bomb and played in a custom video player or will be served by YouTube and played in an embedded YouTube player. Whether the video is served from Giant Bomb or from YouTube depends on a number of factors but generally speaking logged-out users tend to get the YouTube version.

The API

Giant Bomb has an API that any user can use to get the data associated with videos on the site.

Like any API, the Giant Bomb API has its own set of quirks. I started by typing the endpoints and query parameters into Chrome and seeing what the API sent back. This seemed to work fine, but as soon as I moved on to Python and the popular Requests module I encountered the first API quirk and got nothing but errors. It turns out that the Giant Bomb API doesn't like the User-Agent that the Requests module sends by default. One custom User-Agent header later, I started getting useful data back from the API.
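A minimal sketch of that fix follows. It uses the standard library's urllib rather than Requests so it runs without third-party packages; with Requests the same header would be passed through the `headers=` argument of `requests.get`. The `/videos/` path, the `api_key`, `format`, `offset` and `limit` parameters, and the User-Agent string itself are assumptions for illustration, not details confirmed in this post.

```python
import json
import urllib.parse
import urllib.request

# Base URL as it appears in the api_detail_url values the API returns.
API_ROOT = "https://www.giantbomb.com/api"
# Hypothetical identifier; any descriptive, non-default value is the point.
USER_AGENT = "gb-video-audit/0.1 (hypothetical example script)"


def build_videos_request(api_key, offset=0, limit=100):
    """Build (but do not send) a request for one page of video records."""
    # Parameter names are assumptions about the Giant Bomb API.
    params = urllib.parse.urlencode(
        {"api_key": api_key, "format": "json", "offset": offset, "limit": limit}
    )
    url = f"{API_ROOT}/videos/?{params}"
    # The custom User-Agent replaces the library default the API rejected.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_videos_page(api_key, offset=0):
    """Send the request and return the parsed JSON body."""
    request = build_videos_request(api_key, offset=offset)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Paging through the whole catalogue would then be a loop that raises `offset` by `limit` until a page comes back empty.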

The data that the API returns for an individual video looks something like the following (edited down to the relevant bits):

{

"deck":"Managing a mercenary group of mechs takes a lot of patience, hard work, and jumping.",

"guid":"2300-13060",

"id":13060,

"length_seconds":4508,

"name":"Quick Look: BattleTech",

"publish_date":"2018-05-13 06:00:00",

"site_detail_url":"https:\/\/www.giantbomb.com\/videos\/quick-look-battletech\/2300-13060\/",

"user":"ybbaaabby",

"video_type":"Quick Looks",

"video_show":{"id":3,"title":"Quick Looks"},

"video_categories":[{"id":3, "name":"Quick Looks"}],

"youtube_id":"5nC2PPLl3ec"

}

Notice that the data includes things like the video's guid, the video name, and the ID of the corresponding video on YouTube. Also notice that the above data lists the video's type, the video's show, and the categories the video belongs to.

Misplaced Videos

Putting aside the broken YouTube embed I began to wonder about what other video data could cause errors that wouldn't be caught by the GB CMS.

Looking at the video's type, show, and categories it's obvious that these three fields have something to do with where a video is located on the site. It also stands to reason that values for these fields are manually chosen when a video is added.

Where there is a manual process there will be mistakes. So I set about trying to find some of them. Fortunately for my purposes Giant Bomb videos tend to follow a standard naming convention for each show type. For example Quick Looks titles tend to begin with the words "Quick Look" and Bombcast video titles tend to contain the word "Bombcast". So for each show/category I created a regular expression that would match the expected title. I then ran a search that examined every video title looking for videos that matched a naming convention but weren't in the expected show/category. I also looked for videos that were in a show/category but didn't have a title that matched the expected naming convention. I found a few possibly misplaced videos including this one:

{

'deck': "When you tell the squad that you can't come to the club because you have to defeat the Heartless.",

'guid': '2300-12505',

'id': 12505,

'length_seconds': 4355,

'name': 'Kingdom Heartache: Episode 6: Wind of the Forgotten Sorrow',

'publish_date': '2017-09-10 06:00:00',

'site_detail_url': 'https://www.giantbomb.com/videos/kingdom-heartache-episode-6-wind-of-the-forgotten-/2300-12505/',

'user': 'benpack',

'video_type': None,

'video_show': None,

'video_categories': [],

'youtube_id': 'Jd5VAdPyttY'

}

which was surprising because I would have expected the CMS to warn users when they tried to add a video with no type, show, or category.
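The title check described above can be sketched roughly as follows. The two patterns and the choice to compare against the `video_type` field alone are assumptions for illustration; the actual expressions and the full list of shows and categories are not given in this post.

```python
import re

# Hypothetical conventions: show/type name -> pattern its titles should match.
CONVENTIONS = {
    "Quick Looks": re.compile(r"^Quick Look\b"),
    "Bombcast": re.compile(r"\bBombcast\b"),
}


def find_suspects(videos):
    """Return (guid, reason) pairs for videos whose title and type disagree."""
    suspects = []
    for video in videos:
        title = video["name"]
        actual = video.get("video_type")  # may be None, as in the example above
        for show, pattern in CONVENTIONS.items():
            title_matches = pattern.search(title) is not None
            if title_matches and actual != show:
                # Named like the show, but filed somewhere else (or nowhere).
                suspects.append(
                    (video["guid"], f"titled like {show} but filed under {actual}")
                )
            elif actual == show and not title_matches:
                # Filed under the show, but the title breaks the convention.
                suspects.append(
                    (video["guid"], f"filed under {show} but title does not match")
                )
    return suspects
```

Every hit is only a candidate for manual review, since a naming convention is a habit rather than a rule.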

Broken YouTube IDs

Back to the broken YouTube embed. I wanted to find other videos with broken YouTube IDs. For this I would need two things:

1. The stored YouTube ID for every Giant Bomb video.

2. A way to check if a given YouTube ID is valid.

It was easy enough to get the YouTube IDs using the Giant Bomb API. To check if a given YouTube ID is valid I turned to the YouTube Data API. The first thing I tried was to take the YouTube ID for every video on Giant Bomb and check its validity using the https://www.googleapis.com/youtube/v3/videos endpoint. But with over 14k videos on the site I quickly ran out of quota for YouTube API requests.

So I shifted my approach to searching. To save on quota I first downloaded a list of all videos on giantbomb.com using the Giant Bomb API. Then I downloaded a list of all YouTube IDs on the Giant Bomb YouTube channel using the https://www.googleapis.com/youtube/v3/playlistItems endpoint and the ID of the "uploads" playlist.
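Listing a channel's uploads this way means following page tokens until the playlist runs out. A rough sketch is below, assuming the endpoint's documented `part`, `playlistId`, `maxResults`, `pageToken` and `key` parameters. The `fetch_page` callable is a hypothetical stand-in for whatever performs the HTTP request and returns the parsed JSON, which keeps the paging logic testable on its own.

```python
import urllib.parse

PLAYLIST_ITEMS_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def playlist_page_url(playlist_id, api_key, page_token=None):
    """Build the URL for one page of playlist items."""
    params = {
        "part": "contentDetails",   # includes each item's videoId
        "playlistId": playlist_id,  # the channel's "uploads" playlist
        "maxResults": 50,           # largest page size the endpoint accepts
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return PLAYLIST_ITEMS_URL + "?" + urllib.parse.urlencode(params)


def collect_video_ids(fetch_page):
    """Gather every video ID by following nextPageToken.

    fetch_page(page_token) is a hypothetical callable that returns one
    parsed JSON response as a dict; it receives None for the first page.
    """
    ids, token = [], None
    while True:
        page = fetch_page(token)
        ids.extend(item["contentDetails"]["videoId"] for item in page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            return ids
```

Each page costs one request, so a channel with many thousands of uploads needs only a few hundred calls instead of one call per video.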

I then compared the two lists to find potentially bad YouTube IDs. Any YouTube ID on the list provided by the Giant Bomb API but not on the list grabbed from the YouTube API could be a bad YouTube ID.

Interestingly, it's not sufficient for a YouTube ID to be missing from the YouTube channel for it to be declared bad. This is because on the Giant Bomb site there are videos such as the following that are associated with valid videos on other YouTube channels.

{

'deck': 'Jeff answers your fantastic questions and is finally confronted by a horrible truth.',

'guid': '2300-12968',

'id': 12968,

'length_seconds': 317,

'name': 'Quick Question with Jeff Bakalar: Ep. 10 - Jeff is a Lip Smacker',

'publish_date': '2018-04-10 07:59:00',

'site_detail_url': 'https://www.giantbomb.com/videos/quick-question-with-jeff-bakalar-ep-10-jeff-is-a-l/2300-12968/',

'user': 'vinny',

'video_type': 'Features',

'video_show': None,

'video_categories': [{'api_detail_url': 'https://www.giantbomb.com/api/video_category/2320-8/',

'id': 8,

'name': 'Features',

'site_detail_url': 'https://www.giantbomb.com/videos/features/'}],

'youtube_id': 'NIHmjbi6Dfc'

}

I then used the YouTube API to check every YouTube ID in the reduced pool of candidates. This worked well and in the end I found 205 broken YouTube IDs.
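The final check on the reduced pool might look something like this sketch. It assumes the videos endpoint accepts a comma-separated `id` parameter of up to 50 IDs per request and simply omits IDs that do not exist from its response. `lookup_batch` is a hypothetical callable wrapping the actual HTTP call.

```python
import urllib.parse

VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def videos_url(ids, api_key):
    """Build a lookup URL for a batch of IDs (assumed cap: 50 per request)."""
    params = {"part": "id", "id": ",".join(ids), "key": api_key}
    return VIDEOS_URL + "?" + urllib.parse.urlencode(params)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_broken_ids(candidates, lookup_batch, batch_size=50):
    """Return the candidate IDs that YouTube does not recognise.

    lookup_batch(ids) is a hypothetical callable returning the set of IDs
    from that batch that came back in the API response.
    """
    broken = []
    for batch in chunked(sorted(candidates), batch_size):
        found = lookup_batch(batch)
        broken.extend(video_id for video_id in batch if video_id not in found)
    return broken
```

Batching matters here because quota is spent per request, not per ID.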

Interestingly while looking for bad YouTube IDs I came across some videos like these:

{

'deck': "It's time for a new generation to see the controversial and scandalous horrors that await in Night Trap.",

'guid': '2300-12563',

'id': 12563,

'length_seconds': 1648,

'name': 'Quick Look: Night Trap - 25th Anniversary Edition',

'publish_date': '2017-10-05 06:00:00',

'site_detail_url': 'https://www.giantbomb.com/videos/quick-look-night-trap-25th-anniversary-edition/2300-12563/',

'user': 'ybbaaabby',

'video_type': 'Quick Looks',

'video_show': (removed for brevity),

'video_categories': (removed for brevity),

'youtube_id': 'Night Trap - 25th Anniversary Edition: Quick Look'

}

{

'deck': "A few more Mario Party mini-game pitches for you to ponder: Block Jock, Boo's Cruise, POW WOW, Luigi Squeegee, Blooper Scooper.",

'guid': '2300-11201',

'id': 11201,

'length_seconds': 537,

'name': 'Best of Giant Bomb: 99 - Piranha Pajamas',

'publish_date': '2016-05-28 06:00:00',

'site_detail_url': 'https://www.giantbomb.com/videos/best-of-giant-bomb-99-piranha-pajamas/2300-11201/',

'user': 'turboman',

'video_type': 'Best of Giant Bomb',

'video_show': (removed for brevity),

'video_categories': (removed for brevity),

'youtube_id': 'Y6BeQ4BOnY'

}

Notice the YouTube IDs. The first is obviously invalid. It should look something like 'n6WelVKtDgQ' and not a bunch of words. The second YouTube ID only has 10 characters in it. A typical YouTube ID has 11.

This suggests to me that at some point in Giant Bomb's history inputting a YouTube ID was a manual process and that the IDs weren't checked with a regular expression.
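A check of that kind could be as small as the sketch below. The pattern encodes the commonly observed shape of a YouTube video ID (11 characters drawn from letters, digits, hyphen and underscore); that shape is an observed convention and an assumption here, not a format guaranteed by YouTube.

```python
import re

# Assumed shape of a video ID: exactly 11 URL-safe characters.
YOUTUBE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_youtube_id(value):
    """Return True if value has the expected shape of a YouTube video ID."""
    return isinstance(value, str) and YOUTUBE_ID_PATTERN.fullmatch(value) is not None
```

A shape check only catches malformed input such as a pasted title or a dropped character. A well-formed ID can still point at a video that does not exist, so it complements the API lookup rather than replacing it.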

Orphaned YouTube Videos

It occurred to me that the YouTube videos that were supposed to be pointed to by the broken YouTube IDs on giantbomb.com might actually still exist on YouTube but with different YouTube IDs.

I already had a list of all YouTube IDs stored on Giant Bomb. I also already had a list of all YouTube IDs from the Giant Bomb YouTube channel. So to search for videos that were on YouTube but not associated with a video on the Giant Bomb site I did the reverse of the search I did earlier. I looked for YouTube IDs that were on the YouTube channel but not on the list of YouTube IDs I gathered from the Giant Bomb API.
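Both comparisons, the earlier one and this reverse one, reduce to set differences. A minimal sketch, assuming the Giant Bomb records are dicts with a `youtube_id` key as in the examples above and that the channel IDs are a plain list of strings (the function name is made up for illustration):

```python
def compare_id_lists(gb_videos, channel_ids):
    """Split the two ID lists into possibly-bad IDs and orphaned videos."""
    # Skip records that carry no YouTube ID at all.
    gb_ids = {video["youtube_id"] for video in gb_videos if video.get("youtube_id")}
    channel = set(channel_ids)
    possibly_bad = gb_ids - channel   # stored on the site, absent from the channel
    orphaned = channel - gb_ids       # on the channel, referenced by no site video
    return possibly_bad, orphaned
```

The first set still needs the API check described earlier, because an ID missing from the channel may belong to a valid video on another channel.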

This new search yielded 837 orphaned videos. A few of the videos were clearly intended to be YouTube exclusives. However, a bunch of the videos returned clearly weren't intended to be YouTube exclusives including this one:

https://www.youtube.com/watch?v=enOmD9yL_Hg

This was the video that inspired me to dig into this stuff to begin with.

Possible Matches

I then turned my attention to investigating whether it was possible to automate or semi-automate the process of matching each video with a broken YouTube ID on giantbomb.com with the correct corresponding orphaned video on the Giant Bomb YouTube channel.

The first thought I had was to match the videos based on title. Unfortunately the title that a video has on giantbomb.com doesn't necessarily correspond to the title the video has on YouTube. They are usually close but not exactly the same. It seems like it's up to the CMS users to name the YouTube versions as they see fit. For example video titles often end up like this:

Quick Look: Dragon Ball Z: Kakarot [giantbomb.com]

Dragon Ball Z: Kakarot: Quick Look [YouTube]

So I couldn't rely on an exact title match in order to algorithmically pair giantbomb.com videos with their YouTube counterparts but I could at least use them to narrow down the search.

By finding the Levenshtein distance between the title of an orphaned video and the title of a video on giantbomb.com I could measure how similar their titles were.

For example the Levenshtein distance between this title (found on giantbomb.com):

Quick Look: Harry Potter and the Deathly Hallows: Part 2

and this title (found on YouTube):

Quick Look: Harry Potter and the Deathly Hallows pt. 2

is five.

So I calculated the Levenshtein distance between every video title on giantbomb.com and the title of a given orphaned video. I was then able to rank every video title on giantbomb.com for how similar it is to that given orphaned YouTube video's title. By taking only the lowest scores from this ranked list I then had a pool of possible matches for that orphaned video.
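The ranking step can be sketched with a plain dynamic-programming implementation of Levenshtein distance. This is a generic textbook version written for illustration, not necessarily the implementation or library used for the project, and `closest_titles` and its `top_n` cutoff are made-up names.

```python
def levenshtein(a, b):
    """Edit distance between two strings, keeping one previous row of the table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def closest_titles(orphan_title, site_titles, top_n=5):
    """Return the top_n (distance, title) pairs, most similar first."""
    scored = sorted((levenshtein(orphan_title, title), title) for title in site_titles)
    return scored[:top_n]
```

For the two Harry Potter titles above, this case-sensitive version returns 5. It would struggle with reordered titles like the Dragon Ball Z pair, where the words are the same but moved around, which is one reason the result is only a pool of candidates.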

With a pool of possible matches for each orphaned video, I needed a way to determine which candidate video, if any, was the correct match. I googled around to see if I could find any software or techniques for video fingerprinting. I couldn't find any. It then occurred to me that I could just find matches based on audio and that there are all kinds of services like Shazam that use audio fingerprinting to do song identification. Sure enough, with some more googling I was able to find an open source audio fingerprinting package. It was meant for music identification but I decided to see if I could make it work for my purposes.

Fingerprinting Audio

This turned out to be the hardest part of the entire project. Not for any deep technical reason but because the software package I chose to do the audio fingerprinting had a lot of broken dependencies and was prone to crashing.

Since this analysis would take a fair amount of time and bandwidth the first thing I did was spin up a Digital Ocean droplet on which to do the work. Running in a tmux session I started pulling down the data needed for comparison. For YouTube I was able to pull down just the audio portion of each orphaned video using youtube-dl. But for the candidate videos on giantbomb.com I had to download each entire video, which I then split using ffmpeg, keeping only the audio portion to save disk space. I then used the open source audio fingerprinting package to do the comparisons.
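The two download steps might be driven from Python along these lines. The exact flags used for the project are not given in this post, so the commands below are assumptions: `-f bestaudio` asks youtube-dl for an audio-only stream, and `-vn -acodec copy` tells ffmpeg to drop the video stream and copy the audio without re-encoding, which only works when the output container suits the source codec (for example AAC audio written to an .m4a file). The directory and file names are placeholders.

```python
import subprocess


def youtube_audio_command(youtube_id, out_dir="yt_audio"):
    """Command that downloads only the audio stream of a YouTube video."""
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    # %(id)s and %(ext)s are youtube-dl output-template fields.
    return ["youtube-dl", "-f", "bestaudio", "-o", f"{out_dir}/%(id)s.%(ext)s", url]


def strip_video_command(video_path, audio_path):
    """Command that keeps the audio track of a downloaded video file."""
    # -vn drops the video stream; -acodec copy avoids re-encoding the audio.
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "copy", audio_path]


def run(command):
    """Run one command, raising CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)
```

Building each command as a list and passing it to `subprocess.run` avoids shell quoting problems with titles and paths that contain spaces.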

In the end after dealing with countless crashes and problems I was only able to generate match data for a small slice of the orphaned YouTube videos.

Oddities

During my investigation I came across some of the cruft that accumulates on a big site after it's been around for a while. Such oddities include:

Two Posts For The Same Video:

These two entries seem to share a video and comments section. I don't know why this happened but my best guess was that this was a double post and they wanted to merge the comments sections.

https://www.giantbomb.com/videos/quick-look-puyo-puyo-tetris/2300-11999/

https://www.giantbomb.com/videos/quick-look-puyo-puyo-tetris/2300-9838/

The Shortest Video Title (according to the API):

Shirts!

The Longest Video Title (according to the API):

We Can't Even Remember if We're Supposed to be Working Today, So Here's a Week-Old Lightning Returns: Final Fantasy XIII

The Longest Video (according to the API):

Extra Life: 2017 - Alex Navarro

Number Of Seconds Of Video On The Site (according to the API):

28207774
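Figures like these fall out of the downloaded video list with a few built-in functions. A small sketch, assuming each record has the `name` and `length_seconds` fields shown in the examples above (the function names are made up for illustration):

```python
def summarize(videos):
    """Pick out the title and length extremes from a list of video records."""
    def seconds(video):
        # Treat a missing or null length as zero.
        return video.get("length_seconds") or 0

    return {
        "shortest_title": min(videos, key=lambda v: len(v["name"]))["name"],
        "longest_title": max(videos, key=lambda v: len(v["name"]))["name"],
        "longest_video": max(videos, key=seconds)["name"],
        "total_seconds": sum(seconds(v) for v in videos),
    }


def as_days_hms(total_seconds):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return days, hours, minutes, seconds
```

Applied to the total above, 28207774 seconds works out to 326 days, 11 hours, 29 minutes and 34 seconds of video.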

#1 rorie  Staff

Thanks for all the work here! I'll see if someone can try to fix all these, if possible.

paulwgraham
Honestly, I doubt this stuff actually affects much of anyone. But it was a fun project and in the end I felt like I had to do *something* with the data I had collected. Anyway, thanks for indulging me.