How to crawl hotel names and URLs from booking.com using Python and Selenium

You might need a list of all the hotels in your city for one reason or another. Most of them can be found on booking.com (assuming it's a city in Europe).

If you need hotel names, ratings and/or hotel URLs for a city, you can crawl Booking for them. Coding it with Python and Selenium is pretty easy. Below is a script that collects hotel names, booking.com hotel URLs and ratings for the city of Vienna. The list is then saved to a JSON file.

Go crazy with it…

#! /usr/bin/python
# coding: utf-8

__author__="selfconstruct3d"
__date__ ="$Jun 17, 2016 11:41:36 PM$"

# this script is used to collect basic hotel-info from booking.com
# hotel name, url, and user rating are extracted and saved to json file

from selenium import webdriver
import json

driver = webdriver.Firefox()
# output dict
hotelsDict = dict()

# pagination offset
booking_list_offset = 0
CITY_NAME = "Vienna"

# 79 result pages, 15 hotels per page
for i in range(1, 80):

    # paste the booking.com search-results link for your city here; arrival and departure dates are left empty
    driver.get('http://www.booking.com/searchresults.de.html?dcid=1&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaBKIAQGYAQe4AQrIAQzYAQPoAQGoAgM&lang=de&sid=e8b897b588f56aa2e25913117df47bcc&sb=1&src=searchresults&src_elem=sb&error_url=http%3A%2F%2Fwww.booking.com%2Fsearchresults.de.html%3Flabel%3Dgen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaBKIAQGYAQe4AQrIAQzYAQPoAQGoAgM%3Bsid%3De8b897b588f56aa2e25913117df47bcc%3Bdcid%3D1%3Bclass_interval%3D1%3Bdest_id%3D-1746443%3Bdest_type%3Dcity%3Bgroup_adults%3D2%3Bgroup_children%3D0%3Bhlrd%3D0%3Blabel_click%3Dundef%3Bno_rooms%3D1%3Boffset%3D0%3Breview_score_group%3Dempty%3Broom1%3DA%252CA%3Bsb_price_type%3Dtotal%3Bscore_min%3D0%3Bsrc%3Dindex%3Bsrc_elem%3Dsb%3Bss%3DBerlin%252C%2520Berlin%2520%2528Bundesland%2529%252C%2520Deutschland%3Bss_raw%3Dber%3Bssb%3Dempty%26%3B&ss=Wien%2C+Wien+%28Bundesland%29%2C+%C3%96sterreich&ssne=Berlin&ssne_untouched=Berlin&city=-1746443&room1=A%2CA&no_rooms=1&group_adults=2&group_children=0&ss_raw=wien&ac_popular_badge=1&ac_position=0&ac_langcode=de&dest_id=-1995499&dest_type=city&ac_pageview_id=d2db9ad66c2d0283&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=1&rows=15&offset='+str(booking_list_offset))

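    # find_elements("css selector", ...) works in Selenium 3 and 4;
    # the find_elements_by_* helpers were removed in recent Selenium 4 releases
    hotelUrls = driver.find_elements("css selector", "a.hotel_name_link.url")
    hotelRatings = driver.find_elements("css selector", "span.average.js--hp-scorecard-scoreval")

    # note: zip pairs links and ratings by position, so a hotel without a
    # rating on the page shifts the ratings of the hotels that follow it
    for hotelurl, hotelRating in zip(hotelUrls, hotelRatings):
        # get hotel name (the link text)
        name = hotelurl.text
        # get url without the query string
        url = hotelurl.get_attribute("href").split("?")[0]
        # get rating
        rating = hotelRating.text
        print(url, ",", name, ",", rating)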
        # set up dictionary structure
        hotelsDict[url] = {}
        hotelsDict[url]["name"] = name
        hotelsDict[url]["rating"] = rating

    #increase offset
    booking_list_offset += 15

# close the browser once all pages are crawled
driver.quit()

# save to json file
with open("crawlbooking-" + CITY_NAME + "-hotel-urls-ratings.json", "w") as f:
    json.dump(hotelsDict, f)
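
Once the file is written, you will probably want to do something with it. Here is a minimal sketch of reading the JSON back and listing the best-rated hotels. The helper names (parse_rating, top_hotels, load_hotels) are made up for this example, and it assumes the ratings were scraped as text, possibly with a decimal comma (as on the German version of the site), so treat it as a starting point rather than the one right way to do it.

```python
import json


def parse_rating(text):
    """Turn a scraped rating such as '8,5' or '8.5' into a float.

    Returns None when the rating is missing or cannot be parsed.
    """
    try:
        return float(text.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None


def top_hotels(hotels, limit=10):
    """Return up to `limit` (name, rating, url) tuples, best rating first.

    `hotels` has the structure the crawler saves:
    {url: {"name": ..., "rating": ...}}
    """
    rated = []
    for url, info in hotels.items():
        rating = parse_rating(info.get("rating"))
        if rating is not None:  # skip hotels without a usable rating
            rated.append((info.get("name", ""), rating, url))
    rated.sort(key=lambda row: row[1], reverse=True)
    return rated[:limit]


def load_hotels(path):
    """Load the dictionary written by the crawler."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With the Vienna file from the script above, usage would look like this:

```python
hotels = load_hotels("crawlbooking-Vienna-hotel-urls-ratings.json")
for name, rating, url in top_hotels(hotels, limit=5):
    print(rating, name, url)
```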