Jump to content

[Perl] dandwiki Monster Grabber


Freak
 Share

Recommended Posts

As part of larger project, I found myself wanting to create a database of as many DnD monsters as possible. So I set out to write a program that would take them off of public wiki pages so that it could do all of the work for me. The wiki I used was dandwiki and specifically their monster list for 5e. The database isn't particularly useful on its own since it would be impractical to try to read these database items yourself, and like I said, it is part of a larger project. But it could still be useful if you're playing without internet and needed a reference, and it's got a lot of cool regex in it to look at and learn from.

Please note that they do have a "terms and conditions for non-human visitors" page here which my program is fully compliant with. I even put a 3 second interval in between page requests so as to be less bothersome. Additionally any database created as a result of running my program will be licensed under the GNU Free Documentation License v1.3 which is available here.

Now that the legal nonsense mumbo jumbo is out of the way, the program works by requesting the page for their monster list, then goes through all hyperlinks within that page and requests the pages for any of them that have "(5e_Creature)" in the URL, then uses a ton of regular expressions to find all of the relevant data. It then places that data into this configuration in a .csv file:

lYiumbd.png

Note that, by the nature of how this program was written and its source, many of the resulting database entries are likely to be improperly formatted or missing elements. There are simply too many entries for me to manually check if they've been properly processed, and many of the wiki entries have inconsistent formatting. That said, I've written the program to be as flexible as possible and fixed many issues while writing the program, so hopefully even inconsistencies that I'm unaware of should be properly processed.

Here is the code: (the indentation messed up a tiny bit)

Spoiler

##!/usr/bin/perl -w
use strict;
use HTML::Entities;
use LWP::UserAgent;
my($ua, $response, $mainContents, $thisLink, $thisContent);

$ua = LWP::UserAgent->new(
	protocols_allowed 	=> ['http', 'https'],
	timeout 			=> 10,
	agent 				=> "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0", #Necessary otherwise 403 forbidden.
);
$response = $ua->get('https://www.dandwiki.com/wiki/5e_Monsters');
if($response->is_success){
	$mainContents = $response->decoded_content;
}else{
	die $response->status_line;
}

my $count = 0;
while($mainContents =~ /<a href=\"(.*?)\"/g){
	$thisLink = $1;
	if($thisLink =~ /\(5e_Creature\)/){
		sleep(3);
		$response = $ua->get('https://www.dandwiki.com'.$thisLink);
		if($response->is_success){
			$thisContent = $response->decoded_content;
			if($thisLink =~ /wiki\/(.*?)_\(5e_Creature\)/){
				my @abilities	= ();
				my @features	= ();
				my @actions		= ();
				my @legendary 	= ();
				my @reactions	= ();
				open(DATA, ">$1.csv");
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/<p><i>(.*?)<\/i>/){print DATA "\"".rmHTML($1)."\",";}else{print DATA ",";}							#Size/Alignment
				if($thisContent =~/Armor Class<\/b>(.*?)(\d+)(\s?|\S?)/){print DATA "\"".$2."\",";}else{print DATA ",";}				#AC
				if($thisContent =~/Hit Points<\/a><\/b>(.*?)(\d+)(\s?|\S?)/){print DATA "\"".$2."\",";}else{print DATA ",";}			#HP
				if($thisContent =~/Speed<\/a><\/b>(.*?)(\n|<)/){print DATA "\"".rmCh($1)."\",\n";}else{print DATA ",\n";}				#Speed
				#-----------------------------------------------------------------------------------------------------------------------
				while($thisContent =~/\n<td>(.*?)\((.*?)\)/g){print DATA "\"".$1."\",";}												#Ability scores
				print DATA "\n";
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Strength(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",";}											#Str save	
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Str|Strength)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",";}
				else{print DATA ",";}
				if($thisContent =~/Dexterity(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",";}										#Dex save
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Dex|Dexterity)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",";}
				else{print DATA ",";}
				if($thisContent =~/Constitution(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",";}										#Con save
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Con|Constitution)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",";}
				else{print DATA ",";}
				if($thisContent =~/Intelligence(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",";}										#Int save
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Int|Intelligence)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",";}
				else{print DATA ",";}
				if($thisContent =~/Wisdom(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",";}											#Wis save
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Wis|Wisdom)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",";}
				else{print DATA ",";}
				if($thisContent =~/Charisma(.*?)<\/a> \+( \d+|\d+)/){print DATA "\"".$2."\",\n";}										#Cha save
				elsif($thisContent =~/Saving Throws<\/b>(.*?)(Cha|Charisma)(\s*?)\+(\s*?)(\d+)/){print DATA "\"".$5."\",\n";}
				else{print DATA ",\n";}
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/>(\s*?)Skills(.*?)((\s*?)|(.?))<b>/){print DATA "\"".rmHTML($2)."\",\n";}else{print DATA ",\n";}		#Skills
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Vulnerabilities(\s*?)<\/a><\/b>(.*?)<br/){print DATA "\"".rmCh($2)."\",\n";}else{print DATA ",\n";}	#Damage Vulnerabilities
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Resistances(\s*?)<\/a><\/b>(.*?)<br/){print DATA "\"".rmCh($2)."\",\n";}else{print DATA ",\n";}		#Damage Resistances
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/(Dmg |Damage )Imm(.*?)<\/b>(.*?)<br/){print DATA "\"".rmCh($3)."\",\n";}else{print DATA ",\n";}		#Damage Immunities
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/(Con |Condition )Imm(.*?)<\/b>(.*?)<br/){print DATA "\"".rmHTML($3)."\",\n";}else{print DATA ",\n";}	#Condition Immunities
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Senses(.*?)<\/b>(.*?)(\s*?)<br/){print DATA "\"".rmHTML($2)."\",\n";}else{print DATA ",\n";}			#Senses
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Languages(.*?)<\/b>(.*?)(\s*?)<br/){print DATA "\"".rmCh($2)."\",\n"}else{print DATA ",\n";}			#Languages
				#-----------------------------------------------------------------------------------------------------------------------
				if($thisContent =~/Challenge(\s*?)<\/b>(.*?)(\s*?)<br/){print DATA "\"".rmCh($2)."\",\n"}else{print DATA ",\n";}		#Challenge
				#-----------------------------------------------------------------------------------------------------------------------
				my $temp = "";																											#Grab all Abilities/Features
				while($thisContent =~/<p><i><b>((.|\s)*?)<\/p>/g){#Good ones
					$temp = $1;
					chomp($temp);
					push(@abilities, $temp);
				}
				$temp = "";
				while(shorten($thisContent) =~/ACTION((.|\s)*?)\Q$temp\E((.|\s)*?)<b>(.+)(\s*)(<b>|<\/p>)((.|\s)+)/g){#Stupid ones
					$temp = $5;
					if(not grep(/\Q$temp\E/, @abilities)){
						push(@abilities, $temp);
					}
				}
				#-----------------------------------------------------------------------------------------------------------------------
				foreach my $i (@abilities){																								#Sort all Abilities/Features
					if($thisContent =~/<\/p>(\s*)<\/td>(\s*)<td((.|\s)+)\Q$i\E/){	
						#Then it's flavor text beneath a picture. Don't include.
						#Not convinced that this will always work.
					}elsif($thisContent =~/LEGENDARY ACTIONS((.|\s)+)\Q$i\E/){
						push(@legendary, rmHTML($i));
					}elsif($thisContent =~/REACTIONS((.|\s)+)\Q$i\E/){
						push(@reactions, rmHTML($i));
					}elsif($thisContent =~/ACTIONS((.|\s)+)\Q$i\E/){
						push(@actions, rmHTML($i));
					}else{
						push(@features, rmHTML($i));
					}
				}
				#-----------------------------------------------------------------------------------------------------------------------
				foreach my $i (@features){print DATA "\"".$i."\",";}																	#Features
				print DATA "\n";
				#-----------------------------------------------------------------------------------------------------------------------
				foreach my $i (@actions){print DATA "\"".$i."\",";}																		#Actions
				print DATA "\n";
				#-----------------------------------------------------------------------------------------------------------------------
				foreach my $i (@reactions){print DATA "\"".$i."\",";}																	#Reactions
				print DATA "\n";
				#-----------------------------------------------------------------------------------------------------------------------
				foreach my $i (@legendary){print DATA "\"".$i."\",";}																	#Legendary Actions
				print DATA "\n";
				#-----------------------------------------------------------------------------------------------------------------------
				close(DATA);
			}
		}else{
			die $response->status_line;
		}
	}
}

sub rmHTML{
	my $newString = $_[0];
	$newString =~s/<(.*?)>//g;
	$newString =~s/<(.*)//g;
	$newString =~s/(.*)>//g;
	return rmCh($newString);
}
sub rmCh{
	my $newString = $_[0];
	$newString =~s/,/;/g;
	$newString =~s/"/'/g;
	$newString =~s/[^[:ascii:]]+//g;
	return decode_entities($newString);
}
sub shorten{
	my $content = $_[0];
	$content =~s/((.|\s)+)Challenge(\s*?)<\/b>(.*?)(\s*?)<br//;
	$content =~s/<li ((.|\s)+)//g;
	$content =~s/<p><i><b>((.|\s)*?)<\/p>//g;
	$content =~s/<(.*?)div(.*?)>//g;
	$content =~s/<(.*?)table(.*?)>//g;
	$content =~s/<(.*?)span(.*?)>//g;
	return $content;
}

 

And here is a link to the download for the database so that you don't have to run this and bother the website owners: https://megaupload.nz/D7WdZbvfn4/Monsters_rar

  • I Like This! 1
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...