Using pure Selenium methods to extract data from metasearch websites is quite tricky and consumes a lot of CPU resources. I have spent a significant amount of time with Selenium's built-in methods in Python, and the development felt tedious, time-consuming, and prone to bugs.
Selenium's WebElement objects are not as flexible as I would like them to be.
In my opinion, the HTML source is easier to parse with the lxml library and XPath, like this:
```python
driver.get(link)
# wait a bit
# get web content
content = driver.page_source
doc = lh.fromstring(content)
# apply your XPATH to the doc object
# get price list
best_price_list = doc.xpath(XPATH_HOTEL_LIST + "//div[@class='item__flex-column']//section[@class='item__deal-best']/div/@data-bestprice")
# get hotel names list
hotel_names_list = doc.xpath(XPATH_HOTEL_LIST + "//div[@class='item__flex-column']/div[@class='item__details']/h3/@title")
## get the rest ...
```
Furthermore, this approach is faster and far less CPU- and memory-intensive.
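To try the lxml + XPath approach in isolation, here is a self-contained sketch. The HTML snippet is hypothetical, a simplified imitation of a metasearch result list, so the XPath expressions are shorter than the ones used against the real site:

```python
import lxml.html as lh

# hypothetical HTML, mimicking (in simplified form) a metasearch result list
html = """<html><body><ol>
  <li><article data-item="35572">
    <div class="item__flex-column">
      <div class="item__details"><h3 title="K+K Palais Wien"></h3></div>
      <section class="item__deal-best"><div data-bestprice="180"></div></section>
    </div>
  </article></li>
  <li><article data-item="35547">
    <div class="item__flex-column">
      <div class="item__details"><h3 title="Arion Cityhotel Vienna"></h3></div>
      <section class="item__deal-best"><div data-bestprice="180"></div></section>
    </div>
  </article></li>
</ol></body></html>"""

XPATH_HOTEL_LIST = ".//ol/li/article"

doc = lh.fromstring(html)
# attribute XPaths return plain Python strings -- no WebElement objects involved
names = doc.xpath(XPATH_HOTEL_LIST + "//div[@class='item__details']/h3/@title")
prices = doc.xpath(XPATH_HOTEL_LIST + "//section[@class='item__deal-best']/div/@data-bestprice")
print(names)   # ['K+K Palais Wien', 'Arion Cityhotel Vienna']
print(prices)  # ['180', '180']
```

Note that a single `doc.xpath(...)` call extracts a whole column of values at once, which is where most of the speed-up over per-element Selenium lookups comes from.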
Another useful feature for scraping metasearch websites is automatic IP rotation. There are many approaches to anonymity, such as private VPNs or proxy providers, but the useful ones are pretty expensive. When surfing over the Tor network, your IP is hidden automatically and changes every few minutes, so you get IP rotation for free. For the Tor & Selenium setup, have a look at this link. You can get it up and running in a few minutes. Some good anonymity guidelines can be found here ;).
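The wiring is essentially just a proxy flag on the browser. A minimal sketch, assuming a local Tor client: the addresses and ports below are assumptions for a default setup (Privoxy forwarding to Tor usually listens on 127.0.0.1:8118, Tor's own SOCKS proxy on 127.0.0.1:9050), so adjust them to your configuration:

```python
from selenium import webdriver

# assumed defaults -- change to match your Tor/Privoxy setup
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=socks5://127.0.0.1:9050')
# or, if routing through Privoxy instead:
# chrome_options.add_argument('--proxy-server=http://127.0.0.1:8118')

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://check.torproject.org/")  # page should confirm you are using Tor
```

Newer Selenium releases renamed the `chrome_options` keyword to `options`, so check which one your installed version expects.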
Below are two Python files, logic.py and main.py. Place them in the same directory, change some parameters (look at the comments and TODOs), and you're ready to crawl!
```python
__author__ = "Dzenan Hamzic"
__date__ = "XXX"

import uuid
import datetime
import time
import os
from random import randint

import lxml.html as lh
from selenium import webdriver


### helper method to generate UIDs
def my_random_string(string_length=10):
    """Returns a random string of length string_length."""
    random = str(uuid.uuid4())       # Convert UUID format to a Python string.
    random = random.upper()          # Make all characters uppercase.
    random = random.replace("-", "") # Remove the UUID '-'.
    return random[0:string_length]   # Return the random string.


### used for encapsulating deal data
class BitDeal:
    meta = 'tvg'

    def __init__(self, deal_ID, hotel_name, hotel_ota, hotel_price,
                 has_breakfast, tvg_hotel_id, hotel_category):
        self.dealID = deal_ID
        self.hname = hotel_name
        self.hota = hotel_ota
        self.hprice = hotel_price
        self.htvgid = tvg_hotel_id
        self.hbreakfast = has_breakfast
        self.hcat = hotel_category

    def show_info(self):
        print(self.dealID, ",", self.hname, "," + self.htvgid + ",",
              self.hota, ",", self.hprice, ",", self.hbreakfast, ",", self.hcat)


### web-scraping class
class BitCrawly:
    version = 1
    chromedriver = "/path/to/my/chromedriver"  # TODO
    os.environ["webdriver.chrome.driver"] = chromedriver
    _source = "http://www.SOME-SITE.com/"  # TODO

    ### TOR crawling
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server=http://127.0.0.1:8118')

    ## some page XPATH configuration
    XPATH_HOTEL_LIST = ".//ol/li/article"  # TODO - change XPATH

    def __init__(self, browser, starts, ends, roomtype, nights, stars, cityid):
        self.browser = browser
        self.startd = datetime.datetime.strptime(starts, '%Y-%m-%d').date()
        self.endd = datetime.datetime.strptime(ends, '%Y-%m-%d').date()
        self.stars = stars
        self.cid = cityid
        self.delta = datetime.timedelta(days=1)
        self.daySpan = datetime.timedelta(days=nights)
        self.rt = 7 if roomtype == "D" else 1

    def start(self):
        links = self.generate_links()
        crawlCounter = 0
        for link in links:
            ## the web browser is closed and reopened for every new date range
            ## in order to acquire a new session
            # can use PhantomJS or chromedriver
            # driver = webdriver.PhantomJS()
            # TOR
            # driver = webdriver.Chrome(self.chromedriver, chrome_options=self.chrome_options)
            driver = webdriver.Chrome(self.chromedriver)
            # opening the link just to get the pagination
            print("opening site...")
            driver.get(link)
            # wait a bit
            self.pause_crawling(1)
            # get web content
            content = driver.page_source
            doc = lh.fromstring(content)
            # get pagination
            pagination_offsets = doc.xpath(".//div[@class='pagination__pages']/button/@data-offset")
            print(pagination_offsets)
            # insert pagination into links and produce new ones
            paginated_links = self.insert_pagination(pagination_offsets, link)
            # go through paginated links ...
            for plink in paginated_links:
                print(plink)
                print("opening site...")
                driver.get(plink)
                self.pause_crawling(3)
                # get web content
                content = driver.page_source
                doc = lh.fromstring(content)
                print("crawl number:", crawlCounter)
                self.do_scrape(doc, plink)
                crawlCounter += 1
            # close browser after pagination completed
            driver.quit()

    def pause_crawling(self, min_wait_time):
        waittime = min_wait_time + randint(2, 5)
        print("waiting for " + str(waittime) + " sec...")
        time.sleep(waittime)

    # this is a QUICK-HACK - should be refactored
    def insert_pagination(self, offsets, link):
        links_with_offsets = []
        for offset in offsets:
            newLink = link.replace("iOffset=0", "iOffset=" + str(offset))
            links_with_offsets.append(newLink)
        return links_with_offsets

    ### extract content from website
    def do_scrape(self, doc, link):
        start = time.time()
        html_hotel_list = doc.xpath(self.XPATH_HOTEL_LIST)
        for list_item in html_hotel_list:
            hotel_price = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/@data-bestprice")[0]
            hotel_name = list_item.xpath(".//div[@class='item__flex-column']/div[@class='item__details']/h3/@title")[0]
            hotel_category = list_item.xpath(".//div[@class='item__flex-column']/div[@class='item__details']/h3/span/@data-category")[0]
            hotel_tvg_id = list_item.xpath("./@data-item")[0]
            hotel_best_alternative_prices_raw = list_item.xpath(".//div[@class='item__flex-column']/section/ul[@class='deal-other__top-alternatives']")
            hotel_breakfast = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/em[@class='item__flag item__meal-plan font-tiny fw-bold fs-normal ta-center cur-pointer--hover']/text()")
            hotel_ota = list_item.xpath(".//div[@class='item__flex-column']//section[@class='item__deal-best']/div/div/div/em/text()")[0].strip()
            # check if the deal has breakfast included
            hasBreakfast = 0
            if len(hotel_breakfast) > 0:
                hasBreakfast = 1
            deal = BitDeal(my_random_string(10), hotel_name, hotel_ota, hotel_price,
                           hasBreakfast, hotel_tvg_id, hotel_category)
            deal.show_info()
        end = time.time()
        print("execution time:", (end - start))

    ### generate list of links to visit
    def generate_links(self):
        links = []
        while self.startd < self.endd:
            toDate = self.startd + self.daySpan
            link = (self._source + "?aCategoryRange=" + str(self.stars)
                    + "&aDateRange[arr]=" + self.startd.strftime('%Y-%m-%d')
                    + "&aDateRange[dep]=" + toDate.strftime('%Y-%m-%d')
                    + "&iRoomType=" + str(self.rt)
                    + "&iViewType=0&iPathId=" + str(self.cid)
                    + "&sOrderBy=price&iOffset=0&iLimit=25")
            links.append(link)
            self.startd += self.delta
        # return the list of generated urls
        return links

### end of BitCrawly class
```
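The sliding date-window logic in generate_links can be tried in isolation. This standalone sketch uses hypothetical start/end dates and a stripped-down URL template (only the date and offset parameters) rather than the full query string:

```python
import datetime

def generate_links(starts, ends, nights, source="http://www.SOME-SITE.com/"):
    # slide a one-day window over the date range; each link covers `nights` nights
    startd = datetime.datetime.strptime(starts, '%Y-%m-%d').date()
    endd = datetime.datetime.strptime(ends, '%Y-%m-%d').date()
    span = datetime.timedelta(days=nights)
    links = []
    while startd < endd:
        toDate = startd + span
        links.append(source + "?aDateRange[arr]=" + startd.strftime('%Y-%m-%d')
                     + "&aDateRange[dep]=" + toDate.strftime('%Y-%m-%d')
                     + "&iOffset=0&iLimit=25")
        startd += datetime.timedelta(days=1)
    return links

links = generate_links("2016-12-10", "2016-12-13", 2)
print(len(links))  # 3 -- one link per arrival day: Dec 10, 11, 12
print(links[0])
```

Each generated link keeps `iOffset=0`; insert_pagination later clones the link once per pagination offset by string replacement on exactly that parameter.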
You can start the crawler with the following few lines.
```python
#! /usr/bin/python
# -*- coding: utf-8 -*-
__author__ = "Dzenan Hamzic"
__date__ = "XXX"

# imports
from logic import my_random_string
from logic import BitDeal
from logic import BitCrawly
import time

start = time.time()

crawly = BitCrawly("for_driver_object_reference", "2016-12-10", "2017-02-25", "D", 2, 4, 45232)
crawly.start()

end = time.time()
print("crawler done in: ", (end - start))
```
The first parameter, "for_driver_object_reference", is unused; it is a placeholder in case you want to pass Selenium's driver object reference for controlling the browser from the main Python file.
The output has the following format.
opening site…
waiting for 7 sec…
crawl number: 3
E784F3A404 , K+K Palais Wien ,35572, thomascook.de , 180 , 1 , 4
2815C574D0 , Arion Cityhotel Vienna ,35547, Booking.com , 180 , 0 , 4
2D7A56C042 , Pertschy Palais Hotel ,35707, Booking.com , 182 , 1 , 4
924F181E7B , Spiess & Spiess ,35728, Expedia , 186 , 1 , 4
50D067CE9D , Novotel Wien City ,438046, Booking.com , 189 , 0 , 4
D1063E689F , Renaissance Hotel Wien ,35583, ebookers.de , 189 , 1 , 4
AFC29D4DCC , 25hours beim MuseumsQuartier ,35604, 25hours , 195 , 0 , 4
1320D38631 , Das Tigra ,35562, Amoma.com , 195 , 0 , 4
320F2DA79B , NH Collection Wien Zentrum ,35578, NH Collection , 207 , 0 , 4
47D5E2F132 , Lindner Am Belvedere ,238226, 7ideas , 208 , 1 , 4
C3738F5BE1 , Best Western Premier Kaiserhof Wien ,41710, Expedia , 218 , 1 , 4
35B850A350 , Hotel Am Konzerthaus Vienna MGallery by Sofitel ,41824, Hotels.com , 233 , 0 , 4
0A6292E70D , Amadeus ,18283, Booking.com , 233 , 1 , 4
853F21C6E6 , Hotel Mercure Wien Zentrum ,25843, Booking.com , 238 , 0 , 4
B46DC40351 , Kaiserin Elisabeth ,35650, Booking.com , 246 , 1 , 4
8673760AFA , Wandl ,35745, ZenHotels.com , 252 , 1 , 4
738C57D2B3 , Derag Livinghotel an der Oper ,1394884, Expedia , 256 , 0 , 4
F35C0BEACE , Mercure Vienna First ,3851022, Booking.com , 285 , 1 , 4
execution time: 0.00765299797058
P.S. The collected data was used for research and academic purposes only. I hope you won't misuse this source code either.
Anyway, have fun!
Hi,
Great article. I am trying to do something very similar to your design, but I'm still learning about Tor and the complete setup. The link you have doesn't work anymore. Do you have instructions or a GitHub repo on setting up Tor/Selenium? That would be great. Thanks in advance
Thanks! Unfortunately, the setup guide is nowhere to be found, and I saved no configuration docs. But maybe this could be helpful to you:
https://github.com/webfp/tor-browser-selenium
Cheers!
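For what it's worth, the basic usage of that repository's tbselenium package is only a few lines. This is a sketch based on the project's README (verify against the current docs, and note the path to a local Tor Browser installation is an assumption you must fill in yourself):

```python
from tbselenium.tbdriver import TorBrowserDriver

# "/path/to/tor-browser/" is a placeholder for your local Tor Browser directory
with TorBrowserDriver("/path/to/tor-browser/") as driver:
    driver.get("https://check.torproject.org/")  # page should confirm Tor is in use
```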
the Tor & Selenium setup link is giving a 404
Unfortunately, yes. But this might be helpful with the setup:
https://github.com/webfp/tor-browser-selenium