How to Scrape Google Play with Perl

I was faced with a peculiar challenge: find all the apps developed by Nigerian developers in the Google Play store. It sounded pretty straightforward at first. Easy peasy... WRONG!

I thought all I had to do was go to Google Play, click a link or a drop-down somewhere, and I would automatically be presented with a list of all the apps from my lovely country. How wrong I was...

By the way, Google doesn't provide this information to regular consumers like me :). My next step was to fire up the developer console in Google Chrome and do some jQuery magic. I noticed that the app name and the developer/publisher name were wrapped in hyperlinks decorated with the title and subtitle CSS classes, i.e.

class='title' and class='subtitle'

So, after searching for the keyword "Nigeria", I tried to pull out some apps with these jQuery snippets:

$('.title').text();

and

$('.subtitle').text();

I got the following output:

” Apps Android AppsAll pricesUpdatedSizeInstallsCurrent VersionRequires AndroidContent Rating Contact Developer Nigerian Movies Nigerian Movies 2014 Mercy Johnson Nollywood Star Nollywood Movies Nigerian Movies HD Nigerian Movie John Okafor/Mr Ibu Nollywood Nigerian Movies Tube Nollywood Movies and News Nkem Owoh Naija Movies Ghana Movies (Ghallywood) Best HD Movies Nonso Diobi Nollywood Star Afrinolly Full HD Movies on Youtube Nigerian Movies Nigerian Gossip Nigerian News Daily Bible Words Toddlers Telly Full Movies 4 Free Zambian News Just Zambia Nollywood “

Pretty impressive, you would say, but then it occurred to me that I would have to repeat this for every search keyword (naija, nigerian, eko, etc.), and jQuery didn't seem so smart after all.

So back to the drawing board again... Then I remembered: Perl should be able to hack this one. Perl is, after all, said to stand for Practical Extraction and Report Language. God bless Larry Wall 🙂

After foraging and stack overflowing for a bit, I arrived at a few scripts that got me a list of all the links matching certain keywords I selected for the job.

I called one of them playcurl.pl

$ cat playcurl.pl
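#!/usr/bin/perl
use strict;
use warnings;

# Search terms to try against the Play Store search page.
my @keywords = qw(nigeria nigerian naija yoruba igbo ibo hausa ng ngn lagos gidi lasgidi africa strika);

foreach my $keyword (@keywords) {

    my $url = "https://play.google.com/store/search?q=$keyword";

    # wget saves the page itself, under a name derived from the URL.
    # system() returns wget's exit status, not the page content,
    # so there is nothing left to write to a file here.
    system('wget', $url) == 0
        or warn "wget failed for $url (exit status $?)\n";

}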

So playcurl was pretty straightforward. All it did was build a search URL for each keyword and use wget (a Linux utility) to download the results page to my file system, in a file named after the search string.
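The keywords I used were all single words, so gluing them straight onto the URL was fine. If you want to search for phrases, the term needs to be percent-encoded first. Here is a minimal sketch of that, assuming plain ASCII keywords; encode_term and search_url are names I made up for illustration, and 'lagos traffic' is just an example phrase, not one of my keywords.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percent-encode a search term for a query string.
# Everything except the RFC 3986 unreserved characters is escaped.
# Assumes ASCII input; wide characters would need UTF-8 encoding first.
sub encode_term {
    my ($term) = @_;
    $term =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $term;
}

# Hypothetical helper: build the Play Store search URL for one keyword.
sub search_url {
    my ($keyword) = @_;
    return 'https://play.google.com/store/search?q=' . encode_term($keyword);
}

# Print the URL for a single-word keyword and for a phrase.
print search_url($_), "\n" for ('naija', 'lagos traffic');
```

The URL each call returns can be handed to wget exactly as in playcurl.pl.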

$ ls -ltr search*
-rw-rw-r--+ 1 Oladipo None 286981 Mar 17 18:46 search@q=nigeria
-rw-rw-r--+ 1 Oladipo None 285532 Mar 17 18:47 search@q=nigerian
-rw-rw-r--+ 1 Oladipo None 270028 Mar 17 18:47 search@q=naija
-rw-rw-r--+ 1 Oladipo None 289716 Mar 17 18:47 search@q=yoruba
-rw-rw-r--+ 1 Oladipo None 309516 Mar 17 18:47 search@q=igbo
-rw-rw-r--+ 1 Oladipo None 304053 Mar 17 18:47 search@q=ibo
-rw-rw-r--+ 1 Oladipo None 297166 Mar 17 18:47 search@q=hausa
-rw-rw-r--+ 1 Oladipo None 293471 Mar 17 18:47 search@q=ng
-rw-rw-r--+ 1 Oladipo None 362197 Mar 17 18:47 search@q=ngn
-rw-rw-r--+ 1 Oladipo None 299289 Mar 17 18:47 search@q=lagos
-rw-rw-r--+ 1 Oladipo None 330012 Mar 17 18:47 search@q=gidi
-rw-rw-r--+ 1 Oladipo None 45168 Mar 17 18:47 search@q=lasgidi
-rw-rw-r--+ 1 Oladipo None 296523 Mar 17 18:47 search@q=africa
-rw-rw-r--+ 1 Oladipo None 37989 Mar 17 18:47 search@q=strika

Let's take a peek into one of the files.

$ cat search@q=nigeria

…..<a href="/store/apps/details?id=com.comviva.mtnnigeriaselfcare" title="MTN Nigeria Selfcare App">   MTN Nigeria Selfcare App  <span></span> </a> </h2>  <div>  <a href="/store/apps/developer?id=MTN+Nigeria+Communications+Limited" title="MTN Nigeria Communications Limited">MTN Nigeria Communications Limited</a>   <span> <span></span> …….

As you can see, this is pure HTML with everything on one line of text. Yes, you heard me: one line of text. The next thing I asked myself was: how do I get all the hyperlinks containing the app names and developer names out of this document? Google, Stack Overflow, PerlMonks... you can guess that another hour went by 🙂

...and then I created another script to do the job; this time it was a shell script. I call it the monster! LOL

Introducing the Monster!

cat search@q=strika | grep -o ‘‘ | sed -e ‘s/<a /\n<a /g’ | sed -e ‘s/> links.out

The monster takes one of the files created by playcurl.pl, pulls out all the links matching a regular expression by piping the file through grep and sed, and writes them to a file called links.out.

I ran this several times, once for each search file, and appended the output to links.out. I am sure I could have done all this with another Perl script, but well, there is always time for refactoring 🙂 Always...
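For the curious, here is roughly what that Perl refactoring could look like. It is only a sketch, not the monster itself: app_links is a name I made up, and the regex assumes the href="..." title="..." attribute order seen in the snippet from search@q=nigeria. Regex matching on HTML is brittle, so a real HTML parser would be sturdier if the markup changes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [href, title] pairs for every app-details link.
# Assumes the attribute order href="..." title="..." from the saved pages.
sub app_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\s+href="(/store/apps/details\?id=[^"]+)"\s+title="([^"]*)"}g) {
        push @links, [$1, $2];
    }
    return @links;
}

# The snippet shown earlier from search@q=nigeria, used as sample input.
my $sample = '<a href="/store/apps/details?id=com.comviva.mtnnigeriaselfcare" '
           . 'title="MTN Nigeria Selfcare App">   MTN Nigeria Selfcare App  <span></span> </a> </h2>  <div>  '
           . '<a href="/store/apps/developer?id=MTN+Nigeria+Communications+Limited" '
           . 'title="MTN Nigeria Communications Limited">MTN Nigeria Communications Limited</a>';

# Print one tab-separated "path, title" line per app link found.
print "$_->[0]\t$_->[1]\n" for app_links($sample);
```

The developer link is skipped on purpose, since only the /store/apps/details links are needed for the next step.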

So now we have all the links to apps matching our search criteria.

$ cat links.out

This file can still be cleaned up to get unique URLs, but I leave that to you guys to sort out 🙂

Now to get the emails...

The email bit was a bit tricky, though. We have links.out, so let's clean it up a bit.

So I ran the following command:

grep '/store/apps/details' links.out | sort --unique >> unique.out

This generates another file “unique.out” containing a cleaner version of links.out
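The same cleanup can be sketched in Perl, with the bonus that each line is reduced to the bare /store/apps/details path. This is an illustration under my own assumptions: unique_detail_paths is a made-up name, the package id is assumed to contain only letters, digits, underscores and dots, and com.example.app is an invented id used only as sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the unique /store/apps/details paths out of a
# list of lines, keeping first-seen order. Lines without such a path are
# skipped.
sub unique_detail_paths {
    my @lines = @_;
    my (%seen, @paths);
    for my $line (@lines) {
        next unless $line =~ m{(/store/apps/details\?id=[\w.]+)};
        push @paths, $1 unless $seen{$1}++;
    }
    return @paths;
}

# Sample input: one real id from the search results plus an invented one.
my @lines = (
    '<a href="/store/apps/details?id=com.comviva.mtnnigeriaselfcare" title="MTN Nigeria Selfcare App">',
    '<a href="/store/apps/developer?id=MTN+Nigeria+Communications+Limited">',
    '<a href="/store/apps/details?id=com.example.app" title="Example">',
    '<a href="/store/apps/details?id=com.comviva.mtnnigeriaselfcare" title="MTN Nigeria Selfcare App">',
);

# Print each distinct details path once.
print "$_\n" for unique_detail_paths(@lines);
```

Keeping a %seen hash instead of sorting means the apps stay in the order the searches returned them.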

So I had to decide how to get my emails. Google, Stack Overflow, and more Google led me to create another Perl script, getapppage.pl, which downloads the app pages to my file system in the hope of extracting the emails from them.

getapppage.pl
—————-
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'unique.out';

open my $handle, '<', $file or die "Could not open $file: $!";

while (my $line = <$handle>) {

    chomp $line;

    # The list form of system() bypasses the shell, so the '?' and '&'
    # in the URL reach wget untouched.
    system('wget', "https://play.google.com/$line") == 0
        or warn "wget failed for $line\n";
    print $line . "\n";
}

close $handle;

These files are saved as 'details?id=appnamespace' on the file system.

Now to get the emails, the real fun part. Introducing emailparse.pl:

#!/usr/bin/perl

use strict;
use warnings;

my $dir = '/home/oladipo/Downloads';

# Single-quoted so the backslash in \. reaches grep intact.
my $pattern = '([[:alnum:]_.]+@[[:alnum:]_]+\.[[:alpha:].]{2,6})';

opendir(my $dh, $dir) or die "Could not open $dir: $!";

while (my $name = readdir($dh)) {

    if ($name eq ".." || $name eq ".") { next; }

    # readdir returns bare names, so prepend the directory.
    my $file = "$dir/$name";

    system("grep -Eio '$pattern' '$file' >> /home/oladipo/emails.out");
}

closedir($dh);

I'd advise you to set $dir to a location containing only the 'details?id=' files. This script reads every file in the directory specified in $dir and, for each one, runs a system command:
    grep -Eio '([[:alnum:]_.]+@[[:alnum:]_]+\.[[:alpha:].]{2,6})' $file >> /home/oladipo/emails.out
This command extracts all the emails found in the current file and appends them to emails.out.
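If you would rather not shell out to grep, the matching can be done in Perl itself, which also makes it easy to drop duplicates. A minimal sketch, assuming a loose pattern is good enough: extract_emails is a name I made up, the regex is my own approximation rather than a full address parser, and the addresses in the sample text are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the unique email-like strings in a block of
# text, lowercased, in first-seen order. The pattern is deliberately loose
# and is not a complete RFC 5322 address parser.
sub extract_emails {
    my ($text) = @_;
    my (%seen, @emails);
    while ($text =~ /([A-Za-z0-9_.+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,})/g) {
        my $email = lc $1;
        push @emails, $email unless $seen{$email}++;
    }
    return @emails;
}

# Invented sample text standing in for the contents of one saved app page.
my $page = 'Contact Developer: Dev@Example.com or dev@example.com, '
         . 'also support@mail.example.org.';

# Print each distinct address once.
print "$_\n" for extract_emails($page);
```

To use it on the downloaded pages, you would read each 'details?id=' file into a string and pass it to extract_emails.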
THERE WE FINALLY DID IT!!!…