Trying to scrape pages - will only do a couple hundred at a time

LB3PTMAN

I am trying to scrape the API for some information. I get about 300 or so records every time I run my code before it stops with a "cannot open the connection" message. I was just wondering if anyone had any idea what is causing this issue and how it could be resolved?

Here is my code in R.

I have confirmed that while it is running, the data it gathers appears to be 100% accurate.

while (i < 76127) {
  urltemp = paste(url1, i, url2, sep = "")
  con = url(urltemp, "rb")
  json = fromJSON(con)
  new = tryCatch(data.frame(json), error=function(e) NULL)
  data = rbind(data, new)
  i = i + 1
}
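One thing worth noting about the loop above: the "cannot open the connection" error is raised when the connection is opened, which happens outside the tryCatch, so a single failed request stops the whole run. A minimal sketch of one way to guard that step is below. The names safe_fetch, parse and open_con are hypothetical and not part of the original code; parse stands in for whatever fromJSON the loop already uses.

```r
# Sketch only: safe_fetch, parse and open_con are illustrative names.
# Opens the connection, always closes it again, and returns NULL
# instead of stopping when either the open or the parse step fails.
safe_fetch <- function(urltemp, parse, open_con = function(u) url(u, "rb")) {
  con <- tryCatch(open_con(urltemp), error = function(e) NULL)
  if (is.null(con)) {
    return(NULL)  # could not open the connection
  }
  on.exit(close(con))  # release the connection when the function exits
  tryCatch(parse(con), error = function(e) NULL)
}

# Possible use inside the loop (assumes fromJSON accepts a connection,
# as it does in the original code):
# json <- safe_fetch(urltemp, fromJSON)
# if (!is.null(json)) data <- rbind(data, data.frame(json))
```

Returning NULL keeps the loop going, but it also hides why a request failed, so it may be worth logging the value of i whenever NULL comes back.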

Indy626

@lb3ptman:

Never thought I’d see a Stack Overflow-like item on GB, but it is cool to see!

Without a URL to test, this is a little hard to make a recommendation on, but there are two things that may be occurring.

If I was debugging it, I’d try to see if there is something with the size of the batches that we’re going through.

You may be having a connection timeout.

So start small and continue to increase the size. Maybe in the 300s there’s an issue with url13000url2.

If you ID the problem as a connection timeout, then run it in batches of 200.
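A rough sketch of what running in batches could look like in R follows. The names make_batches, run_in_batches and process_batch are made up for illustration, and the batch size of 200 and the pause length are just starting points to experiment with, not known limits.

```r
# Sketch only: all function names here are hypothetical.
# Split a vector of ids into consecutive batches of at most batch_size.
make_batches <- function(ids, batch_size = 200) {
  split(ids, ceiling(seq_along(ids) / batch_size))
}

# Call process_batch on each batch in turn, pausing between batches.
run_in_batches <- function(ids, process_batch, batch_size = 200, pause_seconds = 0) {
  batches <- make_batches(ids, batch_size)
  results <- vector("list", length(batches))
  for (b in seq_along(batches)) {
    # list() wrapper keeps the slot even if process_batch returns NULL
    results[b] <- list(process_batch(batches[[b]]))
    if (b < length(batches)) Sys.sleep(pause_seconds)
  }
  results
}

# Offline demonstration: 450 ids, with length() standing in for the
# real per-batch request code.
batches <- make_batches(1:450, batch_size = 200)
sizes <- run_in_batches(1:450, length, batch_size = 200)
```

If the run fails at the same id no matter how small the batch is, that points at a specific URL rather than at a timeout.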

Happy scraping!

Indy626

Also, I didn’t see the above thread. (Sorry about that!)

Let me try a quick implementation of the above with the site’s API (the source of the question - silly me) and see if I can send a script.

rorie

https://www.giantbomb.com/forums/api-developers-3017/api-rate-limiting-1786442/

Might be relevant?

LB3PTMAN

https://www.giantbomb.com/api/game/3030-25577/?api_key=[YOUR_KEY]&format=json&field_list=name,original_release_date

is the URL, though of course in my code it has my API key in it.

I would try to delay the requests by a second, but I am unsure how to do a delay in R. Is anyone aware of how to do that, so I could test whether it works?
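For reference, base R has Sys.sleep(), which pauses execution for a given number of seconds. A minimal sketch of a delayed loop is below; fetch_with_delay and fetch_one are hypothetical names, with fetch_one standing in for the real request code, and the one-second default is only the delay mentioned above, not a documented requirement.

```r
# Sketch only: fetch_with_delay and fetch_one are illustrative names.
# Calls fetch_one on each id and waits delay_seconds between calls.
fetch_with_delay <- function(ids, fetch_one, delay_seconds = 1) {
  results <- vector("list", length(ids))
  for (k in seq_along(ids)) {
    # list() wrapper keeps the slot even if fetch_one returns NULL
    results[k] <- list(fetch_one(ids[k]))
    if (k < length(ids)) Sys.sleep(delay_seconds)  # no wait after the last one
  }
  results
}

# Offline demonstration with a dummy fetcher and a short delay.
out <- fetch_with_delay(1:3, function(id) id * 2, delay_seconds = 0.1)
```

In the original while loop, the same effect comes from adding a single Sys.sleep(1) line just before i = i + 1.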

LB3PTMAN

Also, the number after "3030-" is replaced with the next number in the sequence, of course. I am on 27,00 at this point.