How to create an HTML5 Reddit image scraper using Phaser

Reddit offers a simple JSON API: you can get a JSON document by adding /.json to any Reddit URL. This can be used to extract various data from any subreddit. To show how to do that in HTML5 with the Phaser framework, we will create a Reddit image scraper application.

To start, try the following URL to get a JSON document for the /r/pics subreddit:

https://www.reddit.com/r/pics/.json

The returned data can be used, for instance, to build an image viewer or browser for any subreddit.
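
As a rough illustration of what such a JSON document contains, here is a minimal sketch that pulls the image-related fields out of a listing. The field names (data.children, title, url, thumbnail, score) are the ones used by the scraper script later in this article; the helper names extractPosts and fetchPosts, and the sample data, are made up for this example.

```javascript
// Sketch: pull the image-related fields out of a Reddit listing object.
// Assumed shape: { data: { after: ..., children: [ { data: {...} }, ... ] } }
function extractPosts(listing) {
  return listing.data.children.map(function (child) {
    return {
      title: child.data.title,
      url: child.data.url,
      thumbnail: child.data.thumbnail,
      score: child.data.score
    };
  });
}

// Hypothetical helper: download a listing with the standard fetch() API
// and convert it with extractPosts(). It is defined here but not called.
function fetchPosts(listingUrl) {
  return fetch(listingUrl)
    .then(function (response) { return response.json(); })
    .then(extractPosts);
}

// Hand-made sample data, not real Reddit content.
var sampleListing = {
  data: {
    after: 't3_example',
    children: [
      { data: { title: 'First', url: 'http://example.com/a.jpg',
                thumbnail: 'http://example.com/a_t.jpg', score: 10 } },
      { data: { title: 'Second', url: 'http://example.com/b.jpg',
                thumbnail: 'http://example.com/b_t.jpg', score: 7 } }
    ]
  }
};

var posts = extractPosts(sampleListing);
console.log(posts.length + ' posts, first title: ' + posts[0].title);
```

In a browser, fetchPosts('https://www.reddit.com/r/pics/.json') should resolve to an array of such objects, assuming the request is not blocked or rate limited.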

Furthermore, the images and their data scraped from the Reddit JSON could also be used to make some simple games. Here are some ideas:

- guessing which of two images has the higher (or lower) upvote score
- guessing the correct upvote range for an image, choosing between 2 or 3 different ranges
- guessing the title of an image, choosing between 2 or 3 different titles as fast as possible
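
To make the first idea a bit more concrete, here is a small sketch of the check such a game would need. It assumes each image record carries a numeric score field; the function name, the 'first'/'second' guess values and the tie rule are assumptions made for this example only.

```javascript
// Decides whether the player's guess was right in a
// "which image has more upvotes" round.
function isHigherScoreGuessCorrect(first, second, guess) {
  if (first.score === second.score) {
    return true; // a tie accepts either answer (a design assumption)
  }
  var winner = first.score > second.score ? 'first' : 'second';
  return guess === winner;
}

// Made-up sample records.
var imageA = { title: 'Image A', score: 120 };
var imageB = { title: 'Image B', score: 45 };

console.log(isHigherScoreGuessCorrect(imageA, imageB, 'first'));  // true
console.log(isHigherScoreGuessCorrect(imageA, imageB, 'second')); // false
```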

For now, we will create a Reddit image scraper tool that generates a database of all scraped images, which will later be used in a game. It is developed in HTML5 using the Phaser framework and the DebugOut script.

In the demo application, after you tap on the screen, the program starts fetching data from the /r/pics subreddit and, at the end of the process, generates an output JSON file named database.json.

The program works as a state machine that fetches and processes data through the following states:

At the start, the program simply waits for a mouse click or tap in STATE_START.
In the next step (STATE_LOAD_JSON), it starts loading an input JSON file from https://www.reddit.com/r/pics/top.json, which contains at most the number of input data records defined in the LIMIT variable (line 22).
The program then waits in STATE_LOAD_COMPLETE until the input JSON file has been loaded.
The first input record, with the URL of the first image, is retrieved in STATE_GET_IMAGE_URL.
The validity of the image is checked in STATE_CHECK_IMAGE_URL.
If the image is not valid, the program goes to STATE_IMAGE_FAIL and tries to fix its URL, but only if the link points to an Imgur page.
If the image is valid, the program goes to STATE_IMAGE_OK and writes its data record (title, url, icon and score) to the output using the DebugOut script.
When an image is completely processed, the program goes to STATE_IMAGE_PROCESSED and does the following:
- If all data records from the input JSON file have been processed, then:
  - either go to STATE_FINISH to save the output JSON file locally on disk and finish the program, once enough output records have been generated (see MIN_OUT_RECORDS),
  - or go back to STATE_LOAD_JSON to load the next input JSON file.
- Otherwise, go back to STATE_GET_IMAGE_URL to fetch the next image.
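
The transitions listed above can be summarized as a single pure function. This is only a simplified sketch for illustration: the state names are plain strings rather than the numeric constants of the real script, and the ctx object with its fields (clicked, loaded, imageValidity, fixedImgurUrl, allRecordsProcessed, enoughOutput) is a hypothetical stand-in for the script's global variables.

```javascript
// Returns the state that follows `state`, given a snapshot of the
// current conditions in `ctx` (all ctx fields are hypothetical).
function nextState(state, ctx) {
  switch (state) {
    case 'START':
      return ctx.clicked ? 'LOAD_JSON' : 'START';
    case 'LOAD_JSON':
      return 'LOAD_COMPLETE';
    case 'LOAD_COMPLETE':
      return ctx.loaded ? 'GET_IMAGE_URL' : 'LOAD_COMPLETE';
    case 'GET_IMAGE_URL':
      return 'CHECK_IMAGE_URL';
    case 'CHECK_IMAGE_URL':
      if (ctx.imageValidity === 1) return 'IMAGE_FAIL';
      if (ctx.imageValidity === 2) return 'IMAGE_OK';
      return 'CHECK_IMAGE_URL'; // still checking
    case 'IMAGE_FAIL':
      return ctx.fixedImgurUrl ? 'IMAGE_OK' : 'IMAGE_PROCESSED';
    case 'IMAGE_OK':
      return 'IMAGE_PROCESSED';
    case 'IMAGE_PROCESSED':
      if (!ctx.allRecordsProcessed) return 'GET_IMAGE_URL';
      return ctx.enoughOutput ? 'FINISH' : 'LOAD_JSON';
    default:
      return 'IDLE'; // FINISH and IDLE both end up in IDLE
  }
}

console.log(nextState('START', { clicked: true }));            // LOAD_JSON
console.log(nextState('CHECK_IMAGE_URL', { imageValidity: 0 })); // CHECK_IMAGE_URL
console.log(nextState('IMAGE_PROCESSED',
  { allRecordsProcessed: true, enoughOutput: false }));        // LOAD_JSON
```

Keeping the transitions in one function like this makes them easy to test without Phaser; the real script instead performs its side effects (loading, logging, writing output) inside each case of the update loop.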

And here is the fully commented code of the scraper.js script:

/***********************************************************************************/
var game;

// define all App states
var STATE_START = 1;
var STATE_LOAD_JSON = 2;
var STATE_LOAD_COMPLETE = 3;
var STATE_GET_IMAGE_URL = 4;
var STATE_CHECK_IMAGE_URL = 5;
var STATE_IMAGE_FAIL = 6;
var STATE_IMAGE_OK = 7;
var STATE_IMAGE_PROCESSED = 8;
var STATE_FINISH = 9;
var STATE_IDLE = 10;
	
// reddit's url for scraping
var REDDIT_URL = 'https://www.reddit.com/r/pics/';

// name of the output data file
var OUTPUT_FILE = 'database.json';
	
var LIMIT = 5; // maximum number of items to return per a reddit's json input file
var MIN_OUT_RECORDS = 5; // minimal number of output data records to be generated

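var imageValidity = 0; // image validity status = {0-image checking, 1-image error, 2-image is valid}

// shared variables, declared here so they are not created as implicit globals
var txtEventLog, output, state, counter, after, items, index, isInputFileLoaded;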

/***********************************************************************************/

window.onload = function () {
	// Google Chrome and Firefox are blocking cross-domain image in WebGL (because of a security issue). 
	// To bypass this use Phaser.CANVAS!
	game = new Phaser.Game(480, 800, Phaser.CANVAS, 'game');
	
	game.state.add('Scraper', Scraper);
	game.state.start('Scraper');
};

/***********************************************************************************/

var Scraper = function(game){};

Scraper.prototype = {
	create : function(){
		// set scale options
		game.scale.scaleMode = Phaser.ScaleManager.SHOW_ALL;
		game.scale.pageAlignVertically = true;
		game.scale.pageAlignHorizontally = true;
		game.scale.setScreenSize(true);
		
		// set stage and world options
		game.stage.backgroundColor = "#ecf0f1";		
		game.world.setBounds(0, 0, game.width, game.height);
		
		// create a text object for displaying event logs
		txtEventLog = game.add.text(10, 60, '\n', {font: "18px Arial", fill: "#222"});
		
		// create a fixed top layer to display App title
		var grpTitleLayer = game.add.group();
		grpTitleLayer.fixedToCamera = true;
		grpTitleLayer.cameraOffset.setTo(0, 0);
		
		var graphics = game.make.graphics(0, 0);
		graphics.beginFill(0x34495e);
		graphics.drawRect(0, 0, game.width, 80);
		
		var title = game.make.text(
			game.width/2, 40, 
			'Reddit\'s Image Scraper 1.0', 
			{font: "30px Arial", fill: "#fff", align: "center"}
		);
		title.anchor.setTo(0.5, 0.5);
		
		grpTitleLayer.add(graphics);
		grpTitleLayer.add(title);
		
		// create an output object for writing data in a json output file
		// see more about DebugOut script on https://github.com/inorganik/debugout.js
		output = new debugout();
		
		updateEventLog("Tap to start scraping data from Reddit’s Json at:\n"+REDDIT_URL+"\n\n");
		
		// set initial App state
		state = STATE_START;
	},
	
	update : function(){		
		switch(state){
			case STATE_START: 
				// start scraping data when mouse is down
				if (game.input.mousePointer.isDown || game.input.pointer1.isDown){
					updateEventLog("Data scraping started.\n");
					
					output.log('{\"data\":');
					output.log('  [');
					
					counter = 0; // counter of currently generated output data records
					after =''; // after field used for loading the next reddit's json input file
					
					state = STATE_LOAD_JSON;
				}
				break;
				
			case STATE_LOAD_JSON:
				// load reddit's json input file
				updateEventLog("    Loading JSON input file...\n");
				
				isInputFileLoaded = false; // flag to know if a reddit's json input file is loaded	
				
				var loader = new Phaser.Loader(game);
				loader.json('input_file', getInputFileName());
				loader.onLoadComplete.addOnce(onInputFileLoaded, this);
				loader.start();
	
				state = STATE_LOAD_COMPLETE;
				break;
				
			case STATE_LOAD_COMPLETE:
				// wait on reddit's json input file to be loaded and then fetch data from it
				if (isInputFileLoaded){
					items = game.cache.getJSON('input_file').data.children;
					after = game.cache.getJSON('input_file').data.after;
					
					index = 0; // set index to the first record in items array
					
					state = STATE_GET_IMAGE_URL;
				}
				break;
				
			case STATE_GET_IMAGE_URL:
				// get image url from the current record in items array and start checking its validity
				updateEventLog("        Checking image... ");
				
				checkImageUrl(items[index].data.url);
				
				state = STATE_CHECK_IMAGE_URL;
				break;
				
			case STATE_CHECK_IMAGE_URL:	
				// wait on checking image validity to be completed
				if (imageValidity == 1) state = STATE_IMAGE_FAIL;
				else if (imageValidity == 2) state = STATE_IMAGE_OK;
				break;
				
			case STATE_IMAGE_FAIL:
				// given a link to a web page, try to get a direct link to the image
				// (currently supporting only imgur page links)
				var isImgur = false;
				var imgurPattern = /imgur\.com/;
				
				if (imgurPattern.test(items[index].data.url)) {
				    // make sure it's not gifv (gifv won't display right)
					var gifvPattern = /\.gifv$/; // the dot is escaped so it matches a literal '.'
					if(!gifvPattern.test(items[index].data.url))
					{
						items[index].data.url = items[index].data.url + '.png';
						isImgur = true;
					}
				}

				if (isImgur){
					state = STATE_IMAGE_OK;
				} else {
					updateEventLog("Error!\n");
					state = STATE_IMAGE_PROCESSED;
				}
				
				break;
				
			case STATE_IMAGE_OK:
				// write json output data for a valid image
				counter++;
				
				updateEventLog("OK > Writing data record "+counter+".\n");
				
				if (counter > 1) output.log('   ,');
				output.log('   {\"title\":\"' + items[index].data.title.replace(/"/g, '\\"') + '\",');
				output.log('    \"url\":\"' + items[index].data.url + '\",');
				output.log('    \"icon\":\"' + items[index].data.thumbnail + '\",');
				output.log('    \"score\":' + items[index].data.score + '}');
				
				state = STATE_IMAGE_PROCESSED;
				break;
				
			case STATE_IMAGE_PROCESSED:
				index++; // set index to the next record in items array
				
				if (index == items.length){ 
					// if all records from items array are fetched 
					// then either finish with scraping or load the next json input file
					// finish once at least MIN_OUT_RECORDS records have been written
					if (counter >= MIN_OUT_RECORDS) state = STATE_FINISH;
					else state = STATE_LOAD_JSON;
					
				} else {
					// else get the next image url
					state = STATE_GET_IMAGE_URL;
				}
				
				break;
				
			case STATE_FINISH:
				// finish scraping data
				updateEventLog("Data scraping finished.\n" + counter + " data records created.\n");
				
				output.log('  ]');
				output.log('}');
				
				// generate the output json file
				output.downloadLog(OUTPUT_FILE);
				
				updateEventLog("Output file "+OUTPUT_FILE+" generated!\n");
				
				state = STATE_IDLE;
				break;
				
			case STATE_IDLE:
				break;
		}
	}
}

// Updates event log with a new message.
function updateEventLog(message){
	txtEventLog.setText(txtEventLog.text + message);
		
	var newBoundWidth = txtEventLog.x + txtEventLog.width + txtEventLog.x;
	var newBoundHeight = txtEventLog.y + txtEventLog.height;
		
	game.world.setBounds(
		0, 0, 
		newBoundWidth > game.width ? newBoundWidth : game.width, 
		newBoundHeight > game.height ? newBoundHeight : game.height
	);
		
	game.camera.y = txtEventLog.height;
}

// Gets the name of a reddit's json input file.
function getInputFileName(){
	return REDDIT_URL + 'top.json?sort=top&t=all&limit=' + LIMIT + '&after='+after;
}
	
// Sets isInputFileLoaded flag to true when a reddit's json input file is loaded.
function onInputFileLoaded(){
	isInputFileLoaded = true;
}
	
// Checks if url points to an actual image.
function checkImageUrl(url) {
	// one of many possible JavaScript regexes for checking url validity
	var urlPattern = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)+([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?/;
				
	if (urlPattern.test(url)){ // if url is valid then start checking if this is an actual image
		imageValidity = 0; // set image validity to checking status
			
		var img = new Image();
		img.onerror = function(){ imageValidity = 1; }; // not an image
		img.onload = function(){ imageValidity = 2; }; // valid image
		img.src = url;
		
	} else {
		imageValidity = 1; // set image validity to error status because url is not valid
	} 
}
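
Two pieces of the script, the page URL built in getInputFileName() and the Imgur fallback in STATE_IMAGE_FAIL, can also be written as pure helper functions that run without Phaser, which makes them easier to try out and test. The following is a sketch under that assumption, not part of the original script; the names buildListingUrl and toDirectImgurUrl are invented here.

```javascript
// Builds the URL of one page of the "top of all time" listing.
// Mirrors getInputFileName(), but leaves out an empty `after` value.
function buildListingUrl(baseUrl, limit, after) {
  var params = new URLSearchParams({ sort: 'top', t: 'all', limit: String(limit) });
  if (after) {
    params.set('after', after);
  }
  return baseUrl + 'top.json?' + params.toString();
}

// Mirrors the STATE_IMAGE_FAIL branch: returns a guessed direct image
// link for an Imgur page, or null when no fix is attempted.
function toDirectImgurUrl(url) {
  if (!/imgur\.com/.test(url)) {
    return null; // not an Imgur link
  }
  if (/\.gifv$/.test(url)) {
    return null; // gifv links are skipped, as in the script
  }
  return url + '.png';
}

console.log(buildListingUrl('https://www.reddit.com/r/pics/', 5, ''));
console.log(toDirectImgurUrl('http://imgur.com/a/J9iOB'));
```

Note that appending '.png' is only a guess, exactly as in the script: the result is not checked again, so for some pages (albums, for example) it may not point to an actual image.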

Here is an example of the generated output JSON file with 5 records:

{"data":
  [
   {"title":"I've spent the past two years shooting drone aerials around the world. Here are 38 images which would be totally illegal today.",
    "url":"http://imgur.com/a/J9iOB.png",
    "icon":"http://b.thumbs.redditmedia.com/EJnhYQkBCyrCrLYqoXVSlRzBI2WhXxVBzpYHwx7UDko.jpg",
    "score":22524}
   ,
   {"title":"Got divorced, lost my job, so me and my buddy got on our motorcycles and rode North to the Alaskan Arctic until the road ran out.",
    "url":"http://imgur.com/a/J7kZJ.png",
    "icon":"http://a.thumbs.redditmedia.com/5647fC_ALfU94wel5BdgM4fj4tl--3pxe2Yadxo1GT8.jpg",
    "score":16822}
   ,
   {"title":"I sent Tom Hanks a 1934 Smith Corona typewriter with a typed invitation to come on my podcast. This was his response.",
    "url":"http://i.imgur.com/ppBPV.jpg",
    "icon":"http://f.thumbs.redditmedia.com/4pC-IVgF1KFnqWp5.jpg",
    "score":14563}
   ,
   {"title":"My grandpa, my dad, and myself. Making three generations of wives worried sick.",
    "url":"http://i.imgur.com/Tgay4hL.jpg",
    "icon":"http://b.thumbs.redditmedia.com/B0p-xKV4Kd6tfANI.jpg",
    "score":12881}
   ,
   {"title":"Garbagemen taking a break.",
    "url":"http://i.imgur.com/EWLWdxc.jpg",
    "icon":"http://b.thumbs.redditmedia.com/nBzQi4NxNoOZWXIS0JndRzVXd7uXUmbAylc2EKmMGAY.jpg",
    "score":12759}
  ]
}
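
The script writes this file line by line and escapes only double quotes in titles, so a title containing a backslash or a line break could produce invalid JSON. As an alternative sketch (not what the script above does), the whole file could be produced with JSON.stringify and checked again when it is read back. The function names serializeDatabase and parseDatabase are assumptions made for this example.

```javascript
// Builds the whole database text in one step; JSON.stringify takes care
// of escaping quotes, backslashes and control characters.
function serializeDatabase(records) {
  return JSON.stringify({ data: records }, null, 2);
}

// Parses the database text and keeps only the records that have all
// four fields with the expected types.
function parseDatabase(text) {
  var parsed = JSON.parse(text);
  if (!parsed || !Array.isArray(parsed.data)) {
    throw new Error('Expected an object with a "data" array');
  }
  return parsed.data.filter(function (record) {
    return record &&
           typeof record.title === 'string' &&
           typeof record.url === 'string' &&
           typeof record.icon === 'string' &&
           typeof record.score === 'number';
  });
}

// Made-up sample record with characters that need escaping.
var databaseText = serializeDatabase([
  { title: 'He said "hello" \\ goodbye', url: 'http://example.com/a.jpg',
    icon: 'http://example.com/a_t.jpg', score: 3 }
]);
var loadedRecords = parseDatabase(databaseText);
console.log(loadedRecords[0].title);
```

With this approach the records would be collected in an array while scraping and written out once at the end, instead of being logged piece by piece.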

In the next part we will see how to use this output JSON file to make a real game! So stay tuned!
