Anonymous Web Scraping with Python Selenium PhantomJS Xpath and TOR

Extracting data from metasearch websites with pure Selenium methods is quite tricky and uses a lot of CPU resources. I have spent a significant amount of time with Selenium's built-in methods in Python, and the development felt tedious, time-consuming, and prone to bugs.
Selenium's WebElement objects are not as flexible as I would like them to be.

In my opinion, the HTML source is easier to parse with the lxml library and XPath, like this:

# driver is a selenium webdriver, lh is lxml.html (import lxml.html as lh)
driver.get(link)
# wait a bit
# get the rendered page source
content = driver.page_source
doc = lh.fromstring(content)

# apply your XPath expressions to the doc object
# get price list
best_price_list = doc.xpath(XPATH_HOTEL_LIST + "//div[@class='item__flex-column']//section[@class='item__deal-best']/div/@data-bestprice")
# get hotel names list
hotel_names_list = doc.xpath(XPATH_HOTEL_LIST + "//div[@class='item__flex-column']/div[@class='item__details']/h3/@title")
## get the rest ...

Furthermore, this approach is more performant and less CPU- and memory-intensive.
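As a self-contained illustration of this lxml + XPath style: the HTML fragment and hotel name below are invented, shaped like the listing markup the expressions above target.

```python
import lxml.html as lh

# invented fragment mimicking the metasearch listing structure
html = """
<ol><li><article>
  <div class="item__flex-column">
    <div class="item__details"><h3 title="Hotel Sacher"></h3></div>
    <section class="item__deal-best"><div data-bestprice="310"></div></section>
  </div>
</article></li></ol>
"""

doc = lh.fromstring(html)
prices = doc.xpath("//div[@class='item__flex-column']"
                   "//section[@class='item__deal-best']/div/@data-bestprice")
names = doc.xpath("//div[@class='item__flex-column']"
                  "/div[@class='item__details']/h3/@title")
print(prices, names)  # ['310'] ['Hotel Sacher']
```

Each `@attribute` step returns the attribute values directly as strings, which is why no extra `.get_attribute()` calls are needed.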

Another useful feature for scraping metasearch websites is automatic IP rotation. There are many approaches to anonymity, such as private VPNs or proxy providers, but the useful ones are quite expensive. By surfing over the TOR network, your IP is automatically hidden and changes every few minutes, so you get IP rotation for free. For the TOR & Selenium setup, have a look at this link. You can get it up and running in a few minutes. Some good anonymity guidelines can be found here ;).
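For illustration, here is one possible wiring, sketched under explicit assumptions: Privoxy listening on 127.0.0.1:8118 in front of Tor, Tor's ControlPort enabled on 9051, and the third-party stem package installed. The renew_tor_ip helper is my own addition, not part of the crawler below.

```python
from selenium import webdriver
from stem import Signal
from stem.control import Controller

# route the browser through the local Privoxy -> Tor chain (port is an assumption)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://127.0.0.1:8118')

def renew_tor_ip(password=None):
    """Ask Tor for a fresh circuit, which usually yields a new exit IP."""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
```

Between date ranges you could call renew_tor_ip() instead of (or in addition to) restarting the browser; note that Tor rate-limits NEWNYM signals to roughly one every ten seconds.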

Below are two Python files, logic.py and main.py. Place them in the same directory, change some parameters (see the comments and TODOs), and you're ready to crawl!


__author__="Dzenan Hamzic"
__date__ ="XXX"


import uuid
import datetime
import time
import lxml.html as lh
from random import randint
import os
from selenium import webdriver

### helper method to generate UIDs
def my_random_string(string_length=10):
    """Returns a random string of length string_length."""
    random = str(uuid.uuid4()) # Convert UUID format to a Python string.
    random = random.upper() # Make all characters uppercase.
    random = random.replace("-","") # Remove the UUID '-'.
    return random[0:string_length] # Return the random string.

### used for encapsulating data
class BitDeal:
    meta = 'tvg'
    def __init__(self, deal_ID,hotel_name, hotel_ota, hotel_price, has_breakfast, tvg_hotel_id, hotel_category):
        self.dealID = deal_ID
        self.hname = hotel_name
        self.hota = hotel_ota
        self.hprice = hotel_price
        self.htvgid = tvg_hotel_id
        self.hbreakfast = has_breakfast
        self.hcat = hotel_category

    def show_info(self):
        print(self.dealID, ",", self.hname, "," + self.htvgid + ",", self.hota, ",", self.hprice, ",", self.hbreakfast, ",", self.hcat)

### web-scraping class
class BitCrawly:
    version = 1
    chromedriver = "/path/to/my/chromedriver" # TODO
    os.environ["webdriver.chrome.driver"] = chromedriver
    _source = "http://www.SOME-SITE.com/" # TODO

    ### TOR crawling
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server=http://127.0.0.1:8118')

    ## some page XPATH configuration
    XPATH_HOTEL_LIST = ".//ol/li/article" # TODO - change XPATH

    def __init__(self, browser, starts, ends, roomtype, nights, stars, cityid):
        self.browser = browser
        self.startd = datetime.datetime.strptime(starts, '%Y-%m-%d').date()
        self.endd = datetime.datetime.strptime(ends, '%Y-%m-%d').date()
        self.stars = stars
        self.cid = cityid
        self.delta = datetime.timedelta(days=1)
        self.daySpan = datetime.timedelta(days=nights)
        self.rt = 7 if roomtype=="D" else 1

    def start(self):
        links = self.generate_links()
        crawlCounter = 0
        for link in links:

            ## the web browser is closed and opened for every new date range
            ## in order to acquire a new session
            # can use Phantom and Chromedriver
            #driver = webdriver.PhantomJS()
            # TOR
            #driver = webdriver.Chrome(self.chromedriver, chrome_options=self.chrome_options)
            driver = webdriver.Chrome(self.chromedriver)

            # opening link just to get pagination
            print("opening site...")
            driver.get(link)
            # wait a bit
            self.pause_crawling(1)
            # get web content
            content = driver.page_source
            doc = lh.fromstring(content)
            #get pagination
            pagination_offsets = doc.xpath(".//div[@class='pagination__pages']/button/@data-offset")
            print(pagination_offsets)

            # insert pagination into links and produce new ones
            paginated_links = self.insert_pagination(pagination_offsets,link)
            # go through paginated links ...
            for plink in paginated_links:
                print(plink)
                print("opening site...")
                driver.get(plink)
                self.pause_crawling(3)
                # get web content
                content = driver.page_source
                doc = lh.fromstring(content)

                print("crawl number:", crawlCounter)
                self.do_scrape(doc, plink)
                crawlCounter +=1
            # close browser after pagination completed
            driver.quit()

    def pause_crawling(self, min_wait_time):
        waittime = min_wait_time + randint(2, 5)
        print("waiting for " + str(waittime) + " sec...")
        time.sleep(waittime)

    # this is a QUICK-HACK - should be refactored
    def insert_pagination(self,offsets, link):
        links_with_offsets = []
        for offset in offsets:
            newLink = link
            newLink = newLink.replace("iOffset=0", "iOffset="+str(offset))
            links_with_offsets.append(newLink)
        return links_with_offsets

    ### extract content from website
    def do_scrape(self, doc, link):
        start = time.time()
        html_hotel_list = doc.xpath(self.XPATH_HOTEL_LIST)

        for list_item in html_hotel_list:
            hotel_price = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/@data-bestprice")[0]
            hotel_name = list_item.xpath(".//div[@class='item__flex-column']/div[@class='item__details']/h3/@title")[0]
            hotel_category = list_item.xpath(".//div[@class='item__flex-column']/div[@class='item__details']/h3/span/@data-category")[0]
            hotel_tvg_id = list_item.xpath("./@data-item")[0]
            hotel_best_alternative_prices_raw = list_item.xpath(".//div[@class='item__flex-column']/section/ul[@class='deal-other__top-alternatives']")
            hotel_breakfast = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/em[@class='item__flag item__meal-plan font-tiny fw-bold fs-normal ta-center cur-pointer--hover']/text()")
            hotel_ota = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/div/div/em/text()")[0].strip()

            # check if deal has breakfast included
            hasBreakfast = 0
            if len(hotel_breakfast) > 0:
                hasBreakfast = 1

            deal = BitDeal(my_random_string(10),hotel_name,hotel_ota,hotel_price, hasBreakfast, hotel_tvg_id, hotel_category)
            deal.show_info()

        end = time.time()
        print("execution time:", (end - start))

    ### generate list of links to visit
    def generate_links(self):
        links = []
        while self.startd < self.endd:
            toDate = self.startd + self.daySpan
            #print "start:",self.startd,",to:",toDate,",end:",self.endd
            link = self._source + "?aCategoryRange="+ str(self.stars)+"&aDateRange[arr]="+ self.startd.strftime('%Y-%m-%d') +"&aDateRange[dep]="+ toDate.strftime('%Y-%m-%d') +"&iRoomType="+ str(self.rt) +"&iViewType=0&iPathId="+ str(self.cid) +"&sOrderBy=price&iOffset=0&iLimit=25"
            links.append(link)
            #print link
            self.startd += self.delta

        # return list of generated urls
        return links
### end of BitCrawly class


You can start the crawler with the following few lines.

#! /usr/bin/python
# -*- coding: utf-8 -*-


__author__="Dzenan Hamzic"
__date__ ="XXX"

# imports
from logic import my_random_string
from logic import BitDeal
from logic import BitCrawly
import time

start = time.time()
crawly = BitCrawly("for_driver_object_reference","2016-12-10","2017-02-25","D",2,4,45232)
crawly.start()
end = time.time()
print("crawler done in: ", (end - start))

The first parameter, “for_driver_object_reference”, is unused; you can change it if you want to pass Selenium’s driver object reference in, to control the browser from the main Python file.
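A minimal sketch of that refactoring (the class and driver below are hypothetical stand-ins, not the real BitCrawly or a real webdriver):

```python
class FakeDriver:
    """Stand-in for a selenium webdriver, used only for this sketch."""
    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)

class BitCrawlyWithDriver:
    """Hypothetical BitCrawly variant that actually uses the first argument."""
    def __init__(self, driver, starts, ends):
        self.browser = driver  # store the reference instead of ignoring it

    def open(self, link):
        self.browser.get(link)  # the main file decides which driver is used

crawler = BitCrawlyWithDriver(FakeDriver(), "2016-12-10", "2017-02-25")
crawler.open("http://www.SOME-SITE.com/")
print(crawler.browser.visited)  # ['http://www.SOME-SITE.com/']
```

Passing the driver in from main.py also makes the crawling logic testable without opening a browser at all, as the fake driver shows.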

The output has the following format.

opening site…
waiting for 7 sec…
crawl number: 3
E784F3A404 , K+K Palais Wien ,35572, thomascook.de , 180 , 1 , 4
2815C574D0 , Arion Cityhotel Vienna ,35547, Booking.com , 180 , 0 , 4
2D7A56C042 , Pertschy Palais Hotel ,35707, Booking.com , 182 , 1 , 4
924F181E7B , Spiess & Spiess ,35728, Expedia , 186 , 1 , 4
50D067CE9D , Novotel Wien City ,438046, Booking.com , 189 , 0 , 4
D1063E689F , Renaissance Hotel Wien ,35583, ebookers.de , 189 , 1 , 4
AFC29D4DCC , 25hours beim MuseumsQuartier ,35604, 25hours , 195 , 0 , 4
1320D38631 , Das Tigra ,35562, Amoma.com , 195 , 0 , 4
320F2DA79B , NH Collection Wien Zentrum ,35578, NH Collection , 207 , 0 , 4
47D5E2F132 , Lindner Am Belvedere ,238226, 7ideas , 208 , 1 , 4
C3738F5BE1 , Best Western Premier Kaiserhof Wien ,41710, Expedia , 218 , 1 , 4
35B850A350 , Hotel Am Konzerthaus Vienna MGallery by Sofitel ,41824, Hotels.com , 233 , 0 , 4
0A6292E70D , Amadeus ,18283, Booking.com , 233 , 1 , 4
853F21C6E6 , Hotel Mercure Wien Zentrum ,25843, Booking.com , 238 , 0 , 4
B46DC40351 , Kaiserin Elisabeth ,35650, Booking.com , 246 , 1 , 4
8673760AFA , Wandl ,35745, ZenHotels.com , 252 , 1 , 4
738C57D2B3 , Derag Livinghotel an der Oper ,1394884, Expedia , 256 , 0 , 4
F35C0BEACE , Mercure Vienna First ,3851022, Booking.com , 285 , 1 , 4
execution time: 0.00765299797058


P.S. The collected data was used for research and academic purposes. I hope you will use this source code responsibly as well.

Anyway, have fun!
