
How to Create an HTML5 Reddit Image Scraper Using Phaser

With the Reddit JSON API, you can get a JSON document by appending /.json to any Reddit URL, which makes it easy to extract data from any subreddit. To show how to do that in HTML5 with the Phaser framework, we will create a Reddit image scraper application.

To start, open the following URL to get a JSON document for the /r/pics subreddit:

https://www.reddit.com/r/pics/.json

The data extracted this way can drive, for instance, an image viewer or browser for any subreddit.
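As a rough illustration of what that extraction might look like, here is a minimal sketch in plain JavaScript. The function names `extractPosts` and `fetchSubreddit` are hypothetical and not part of the scraper itself; the sketch assumes the usual Reddit listing layout, where each post sits under `data.children[i].data`, and that `ups` is the upvote field you want.

```javascript
// Pull the fields of interest out of a Reddit listing object.
// Assumed layout: { data: { children: [ { data: { title, url, thumbnail, ups } } ] } }
function extractPosts(listing) {
  return listing.data.children.map(({ data }) => ({
    title: data.title,
    url: data.url,
    thumb: data.thumbnail,
    upvotes: data.ups,
  }));
}

// Fetch a subreddit listing by appending /.json to its URL (hypothetical helper).
async function fetchSubreddit(name) {
  const response = await fetch(`https://www.reddit.com/r/${name}/.json`);
  if (!response.ok) {
    throw new Error(`Reddit request failed: ${response.status}`);
  }
  return extractPosts(await response.json());
}
```

Calling `fetchSubreddit('pics').then(posts => console.log(posts))` should then log one simplified record per post, provided the request succeeds.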

Furthermore, the images and data scraped from Reddit JSON could also be used to make some simple games. Here are some ideas:

  1. guessing which of two images has the higher or lower upvote score
  2. guessing the correct upvote range for an image, choosing between 2 or 3 ranges
  3. guessing the title of an image as fast as possible, choosing between 2 or 3 titles

For now, we will create a Reddit image scraper tool that generates a database of all scraped images, which will later be used in a game. It is developed in HTML5 using the Phaser framework and the DebugOut script.

Here you can try the application to see what we are going to make. After you tap the screen, the program starts fetching data from the /r/pics subreddit and, at the end of the process, generates an output JSON file named database.json.

 


 

 

The program works as a state machine, fetching and processing data through the following states:

  1. At the start, the program simply waits for a mouse click in STATE_START.
  2. In the next step (STATE_LOAD_JSON), it loads an input JSON file from https://www.reddit.com/r/pics/top.json, which contains the number of input data records defined in the LIMIT variable (line 22).
  3. When the input JSON file is loaded, the program goes to STATE_LOAD_COMPLETE.
  4. The first input record, with the URL of the first image, is retrieved in STATE_GET_IMAGE_URL.
  5. The validity of the image is checked in STATE_CHECK_IMAGE_URL.
  6. If the image is not valid, the program goes to STATE_IMAGE_FAIL and tries to fix its URL, but only if the link points to an Imgur page.
  7. If the image is valid, the program goes to STATE_IMAGE_OK and writes its data record (title, url, thumb and upvotes) to the output using the DebugOut script.
  8. When an image is completely processed, the program goes to STATE_IMAGE_PROCESSED and does the following:
    • If all data records from the input JSON file have been processed, then:
      • either go back to STATE_LOAD_JSON to load the next input JSON file
      • or go to STATE_FINISH to save the output JSON file locally on disk and finish the program.
    • Otherwise, go back to STATE_GET_IMAGE_URL to fetch the next image.
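The transitions above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the scraper.js code itself: the state names come from the list, while `nextState`, the `facts` object, and `fixImgurUrl` are hypothetical. In particular, the sketch assumes that a repaired Imgur link is sent back for another check, and that an Imgur page link of the form imgur.com/ID can be turned into a direct image link at i.imgur.com/ID.jpg.

```javascript
// State names taken from the list above; everything else is illustrative.
const State = Object.freeze({
  START: 'STATE_START',
  LOAD_JSON: 'STATE_LOAD_JSON',
  LOAD_COMPLETE: 'STATE_LOAD_COMPLETE',
  GET_IMAGE_URL: 'STATE_GET_IMAGE_URL',
  CHECK_IMAGE_URL: 'STATE_CHECK_IMAGE_URL',
  IMAGE_FAIL: 'STATE_IMAGE_FAIL',
  IMAGE_OK: 'STATE_IMAGE_OK',
  IMAGE_PROCESSED: 'STATE_IMAGE_PROCESSED',
  FINISH: 'STATE_FINISH',
});

// Assumption: a single-image Imgur page (imgur.com/ID) maps to i.imgur.com/ID.jpg.
// Returns null for anything else, including album and gallery links.
function fixImgurUrl(url) {
  const match = /^https?:\/\/(?:www\.|m\.)?imgur\.com\/([A-Za-z0-9]+)\/?$/.exec(url);
  return match ? `https://i.imgur.com/${match[1]}.jpg` : null;
}

// Given the current state and a few facts about the run, return the next state.
function nextState(state, facts) {
  switch (state) {
    case State.START:
      return facts.clicked ? State.LOAD_JSON : State.START;
    case State.LOAD_JSON:
      return facts.jsonLoaded ? State.LOAD_COMPLETE : State.LOAD_JSON;
    case State.LOAD_COMPLETE:
      return State.GET_IMAGE_URL;
    case State.GET_IMAGE_URL:
      return State.CHECK_IMAGE_URL;
    case State.CHECK_IMAGE_URL:
      return facts.imageValid ? State.IMAGE_OK : State.IMAGE_FAIL;
    case State.IMAGE_FAIL:
      // Assumption: a repaired link is re-checked; an unfixable one is skipped.
      return facts.urlFixed ? State.CHECK_IMAGE_URL : State.IMAGE_PROCESSED;
    case State.IMAGE_OK:
      return State.IMAGE_PROCESSED;
    case State.IMAGE_PROCESSED:
      if (facts.recordsLeft > 0) return State.GET_IMAGE_URL;
      return facts.morePages ? State.LOAD_JSON : State.FINISH;
    default:
      return State.FINISH;
  }
}
```

Keeping the transition logic free of side effects like this makes it easy to check each branch in isolation; in a Phaser game the function could be called once per frame from the update loop, with loading and image checks setting the facts.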

 

And here is the fully commented code of the scraper.js script:

 

Here is an example of the generated output JSON file with 5 records:

In the next part we will see how to use this output JSON file to make a real game! So stay tuned!

 

 

