Use internal APIs, not Puppeteer



I know so many people who just want some data from a page and end up reaching for Puppeteer to get it. Let me first explain why using something like Puppeteer is such overkill: it runs a headless copy of Chromium and makes every request through it. You know how browsers take up all of your memory? It's the same with Puppeteer.

There are two kinds of websites when it comes to getting data out of them: server-side rendered and client-side rendered. Server-side rendered websites are the easiest for us, because all of the information we want is already in the HTML the server sends and we don't have to do any digging.

I will use a YouTube search for this example; the results page is server-side rendered, so getting the links is easy.

The way to tell whether a website is server-side rendered is to go to the page you want to scrape, open the network monitor in inspect element, refresh the page, and click the first request. Click on its response tab and it should show you a page that looks roughly like the one you are on. You might be able to spot the information you want, or you might not. Let's copy the ID of a YouTube video, like the FtutLA63Cp8 in https://www.youtube.com/watch?v=FtutLA63Cp8, click on raw in the response, and Ctrl+F that ID. The first match looks like "videoId":"FtutLA63Cp8", which is a bit awkward to search for, so keep hitting next. The match after that is the thumbnail link, but there is something even easier a little further on: /watch?v=FtutLA63Cp8. This is exactly what we want; it's what goes after https://www.youtube.com! Just remember that right after the URL there is a " character; that will be very helpful later.
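You can do the same check straight from the terminal: curl the page and grep for something you expect to be in it (this assumes that video still shows up in the results for that search).

curl -s "https://www.youtube.com/results?search_query=Bad+Apple" | grep -o "FtutLA63Cp8" | head -n 1

If that prints the ID, the data is sitting in the raw HTML, which is all we mean by server-side rendered here.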

Let's make a shell script to get all the results. I am going to use something called grep, mainly because it supports some regex. If you want to learn regex, you can go to regexr. Our shell script now looks like this:

curl "https://www.youtube.com/results?search_query=Bad+Apple" | grep -o "/watch?v=[^\"]*"

Either version gives us every video on the results page, which is not bad at all for one line of shell. If you can't find your data in that first request, the site is client-side rendered.

Let's hope it's something like OSU and its most played beatmaps (or every map a player has ever played, because we can do that), where we control when the page makes its requests, which makes them much easier to find.

Start by going to the website without inspect element open. Once the page has fully loaded, open inspect element and look at the network tab. You should see nothing going on, and this is perfect. Now let's click the Show More button on the OSU website.

Wow, that is a lot of requests, but don't freak out; look at the type of each request: it's all JPEGs, and we know the data can't be a JPEG, so let's just scroll up until we see something that isn't one. Scrolling up a little shows a request labeled "json". We like JSON, and the URL, https://osu.ppy.sh/users/7562902/beatmapsets/most_played?limit=51&offset=5, looks promising. Let's go ahead and click on its response tab. It's a list of 51 items, each with some metadata about a beatmap. This is perfect!
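It's worth checking that the endpoint also works outside the browser. Here I'm assuming you have jq installed, purely to pretty-print the first item; plain curl is enough if you don't.

curl -s "https://osu.ppy.sh/users/7562902/beatmapsets/most_played?limit=51&offset=5" | jq '.[0]'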

Let's also look back at that URL. It has two URL parameters, limit and offset. This is when we can start playing around with those numbers to see how far we can push them. Setting the limit to anything high seems to always cap out at 100 results, and if we set the offset to some high number (more than they have played), it just returns an empty list.
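Both experiments are easy to reproduce from the terminal (again assuming jq, just to count the items in the list that comes back):

# a limit above 100 still caps out at 100 items
curl -s "https://osu.ppy.sh/users/7562902/beatmapsets/most_played?limit=1000&offset=0" | jq 'length'

# an offset past everything they have played gives back an empty list
curl -s "https://osu.ppy.sh/users/7562902/beatmapsets/most_played?limit=100&offset=999999" | jq 'length'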

This can now be converted into any language you want to get every map a person has played: set the limit to 100, start the offset at 0, and keep adding 100 until the list that comes back has fewer than 100 items.
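As a rough sketch in shell (the same idea ports to any language), assuming jq is installed and that the endpoint keeps returning a plain JSON array like it did in the browser:

offset=0
while true; do
  page=$(curl -s "https://osu.ppy.sh/users/7562902/beatmapsets/most_played?limit=100&offset=$offset")
  # print one beatmap object per line
  echo "$page" | jq -c '.[]'
  # stop once a short (or empty) page comes back
  [ "$(echo "$page" | jq 'length')" -lt 100 ] && break
  offset=$((offset + 100))
done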

Now we get to the hard stuff, like Twitch, which tries to make it as hard as possible to scrape. Let's go to twitch.tv and look at someone's videos, do the same thing we did on OSU, and just wait a while for everything to load in. Now say we only want recent broadcasts: let's open inspect element and click the view all button next to recent broadcasts.

OK, if we ignore all the images, we see 6 requests to this URL: https://gql.twitch.tv/gql#origin=twilight. That does not appear to give us much, but it is all we have. We know it can't be the two requests that were made after all the images, so let's look at the four before.

Click on the first one and look at its response; we don't see much here, so let's just go one more down. This one has some data, but a bunch of unrelated data mixed in with it, so let's go one more down again. That one also has nothing useful (if you look at the size column you can see which requests actually return data), and looking through the data in the last one, I see a list of 30 objects, each with a lot of information about the streamer's recent broadcasts.

Let's always test it by copying the request as cURL and pasting it into the terminal. Remember to remove all the header stuff so we can see what is actually needed. With no headers it gives back an error saying that the Client-ID header is needed, and man, that sucks, so let's copy the request as cURL again and this time remove every header except the Client-ID.

Yay, we get all of our data! One thing you should think about is whether this Client-ID can actually be reused if you put it in a program. This is when you can either try it from a different IP, wait a few hours or days, or even get a friend to test it. One other way to check is to pray that yt-dlp supports the site you are using. Thankfully it does have Twitch in its code, so we can look at its source to see if it does anything with a Client-ID, and sure enough there is a variable called _CLIENT_ID. This is perfect, because it means we don't have to do any logic to find the Client-ID.
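Putting that together, the trimmed-down request ends up looking something like the sketch below. The JSON body is whatever your browser actually sent (it's in the cURL command you copied), and the Client-ID is the one from your browser or from yt-dlp; both are placeholders here.

curl -s "https://gql.twitch.tv/gql" \
  -H "Client-ID: <client id from your browser or yt-dlp>" \
  --data '<the JSON body copied from the network tab>'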

Some websites will be like Twitch and require a header, but unlike Twitch they do not tell you which headers are needed; the request just does not work. With all the headers it works, with none of them it does not. This is when you have to remove one header at a time and see if it still responds to you (there's a sketch of that below). Depending on what the header is, you will have to do what I discuss for the last website.
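One way to do that elimination without losing your mind: list the headers the browser sent, then fire the request once per header with that header left out and watch which removals change the status code. Everything in this snippet (the URL, the header names, the values) is a placeholder; fill in whatever you copied from your own network tab.

headers=(
  "Client-ID: abc123"
  "Authorization: OAuth xyz"
  "X-Device-Id: some-device"
)
for i in "${!headers[@]}"; do
  args=()
  for j in "${!headers[@]}"; do
    # keep every header except the one we are currently testing
    [ "$i" -ne "$j" ] && args+=(-H "${headers[$j]}")
  done
  echo "without '${headers[$i]}':"
  curl -s -o /dev/null -w "  HTTP %{http_code}\n" "https://example.com/api" "${args[@]}"
done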

That last kind of website is one where some header eventually expires, so you have to fetch a fresh value and make two requests. This can be a pain, because every site does it differently, and beyond this point I can only give you some tips.

You can try going to the Debugger tab and doing a Ctrl+Shift+F to find where the text inside the header shows up. What you will most likely end up doing is looking at every request your browser sends and, one by one, searching them for that text until you find the URL you have to call first to get a user ID (or token) that you can then use in the next request.
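As a very hand-wavy sketch of that two-step pattern (every URL, field name, and header here is made up; the real ones are whatever you dig out of the network tab):

# step 1: hit whatever endpoint hands out the short-lived value
token=$(curl -s "https://example.com/api/session" | jq -r '.token')
# step 2: use it in the request that actually returns the data
curl -s "https://example.com/api/data" -H "Authorization: Bearer $token"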