Click Read More Using Python Selenium Tripadvisor Reviews

This article is part of a series that goes through all the steps needed to write a script that automatically scrapes information from a website. The first article in this series was an Introduction to Python scraping.

This article is about scraping TripAdvisor reviews using Selenium. TripAdvisor is a big travel platform with hundreds of millions of reviews on numerous things, including restaurants. The platform provides an API to read data programmatically. However, the API is tightly controlled: only a limited number of API keys are issued, and the Content API may not be used for data analysis, academic research or any other use not associated with a consumer-facing (B2C) travel website or application.

This leaves us with scraping as the only option to download reviews programmatically. We are going to download the reviews of a particular restaurant. For the sake of this article, we chose Storie & Sapori – La Valletta. That said, the approach and the code described should be applicable to any restaurant on TripAdvisor.

As highlighted in the screenshot below, TripAdvisor only loads part of each review initially and waits until the user clicks More to load the rest.

TripAdvisor review with More button

When More is clicked, further JavaScript is executed (code below), which makes scraping harder.

          <span class="taLnk ulBlueLinks" onclick="widgetEvCall('handlers.clickExpand',event,this);">More</bridge>                  

The net result is that using a scraping library like Beautiful Soup is not enough; we need to launch a browser and control it using Selenium to simulate pressing the More link and executing the JavaScript code.
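In essence, this is what the first half of the full script shown later in the article does. A minimal sketch of the idea (assuming chromedriver sits in the current directory) looks like this:

    from selenium import webdriver

    URL = "https://www.tripadvisor.com/Restaurant_Review-g190328-d8867662-Reviews-Storie_Sapori_La_Valletta-Valletta_Island_of_Malta.html"

    # Launch a real browser and load the restaurant page
    driver = webdriver.Chrome("./chromedriver")
    driver.get(URL)

    # Click a "More" link so the hidden review text is rendered by JavaScript
    driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']").click()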

The score given by the reviewer is shown using an image made up of five circles, some of which are filled in green. The number of green circles indicates the score out of 5. There is no textual indication of the score, and this makes it hard to scrape. Upon closer inspection one can see that the CSS class of the SPAN element holds an indication of the score. Basically, if we split the class name using the '_' character, we can keep the substring at index three as the score. The examples below will result in scores of 40 and 10 respectively:

          <span class="ui_bubble_rating bubble_40"></span> <bridge course="ui_bubble_rating bubble_10"></span>                  

Below is a cleaned HTML DIV element that holds one sample review (use F12 to turn on Developer Mode in your browser and inspect the code). The relevant parts are labelled with HTML comments.

          <div class="ui_column is-9">  <!-- CONTAINER -->   <span course="ui_bubble_rating bubble_30"></bridge>  <!-- SCORE -->   <span class="ratingDate" title="August 2, 2020">Reviewed ane week agone </span>  <!-- DATE -->   <div class="quote">     <a href="/Show.....html" class="championship " onclick="..JavaScript.." id="rn762691785">       <span class="noQuotes">Lunch</bridge>  <!-- TITLE -->     </a>   </div>   <div class="prw_rup">     <div course="entry">       <p class="partial_entry">         ..review here..  <!-- REVIEW -->         <span class="taLnk ulBlueLinks" onclick="..JavaScript..">More</span>  <!-- MORE Push button -->       </p>     </div>   </div>   ... </div>        

Once the information is gathered, it will be saved as a CSV file that can be analysed later. Python has built-in support for handling CSV data and the sample below shows how to use it:

    import csv

    csvFile = open("file.csv", "w", newline='', encoding="utf-8")
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(('Heading1', 'Heading2'))
    csvWriter.writerow(('row1col1', 'row1col2'))
    csvWriter.writerow(('row2col1', 'row2col2'))
    csvFile.close()
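For completeness, reading such a file back for the later analysis is just as simple; this is a small illustrative sketch, not part of the original article:

    import csv

    csvFile = open("file.csv", "r", newline='', encoding="utf-8")
    csvReader = csv.reader(csvFile)
    for row in csvReader:
        print(row)  # each row comes back as a list of strings, e.g. ['Heading1', 'Heading2']
    csvFile.close()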

A first version of the script, kept as simple as possible to get the job done, is shown below:

    import csv
    import time
    from selenium import webdriver

    URL = "https://www.tripadvisor.com/Restaurant_Review-g190328-d8867662-Reviews-Storie_Sapori_La_Valletta-Valletta_Island_of_Malta.html"

    driver = webdriver.Chrome("./chromedriver")
    driver.get(URL)

    # Prepare CSV file
    csvFile = open("reviews.csv", "w", newline='', encoding="utf-8")
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['Score', 'Date', 'Title', 'Review'])

    # Find and click the More link (to load all reviews)
    driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']").click()
    time.sleep(5)  # Wait for reviews to load

    reviews = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")
    num_page_items = min(len(reviews), 10)

    # Loop through the reviews found
    for i in range(num_page_items):
        # Get the score, date, title and review text
        score_class = reviews[i].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class")
        score = score_class.split("_")[3]
        date = reviews[i].find_element_by_xpath(".//span[@class='ratingDate']").get_attribute("title")
        title = reviews[i].find_element_by_xpath(".//span[@class='noQuotes']").text
        review = reviews[i].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "")
        # Save to CSV
        csvWriter.writerow((score, date, title, review))

    # Close CSV file and browser
    csvFile.close()
    driver.close()
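Note that the script uses Selenium 3's find_element_by_xpath / find_elements_by_xpath helpers, which have been removed in recent Selenium 4 releases. If you run it against a newer Selenium, the lookups need to be rewritten with By.XPATH, roughly as sketched below (the XPath expressions themselves stay the same):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    URL = "https://www.tripadvisor.com/Restaurant_Review-g190328-d8867662-Reviews-Storie_Sapori_La_Valletta-Valletta_Island_of_Malta.html"

    driver = webdriver.Chrome()  # recent Selenium releases can locate chromedriver themselves
    driver.get(URL)

    # Same lookups as in the script above, written with the Selenium 4 API
    driver.find_element(By.XPATH, "//span[@class='taLnk ulBlueLinks']").click()
    reviews = driver.find_elements(By.XPATH, "//div[@class='ui_column is-9']")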

Once the scraper runs successfully, it will produce a file named reviews.csv with the structure of the sample below:

    Score,Date,Title,Review
    50,"August 8, 2020",We love your pizzas!,"I went to this restaurant yesterday evening with some friends, ..."
    50,"August 6, 2020",Most enjoyable meal in Malta,"Came to this little restaurant by accident, very glad we did..."
    50,"March 6, 2020",#Osema,Good and reasonable food nice ambient highly recommended excellent service will visit again for sure
    50,"March 6, 2020",#Osems,Excellent service with good food and very nice and kind staff will not be our last time here...

It was decided to keep this first version as simple as possible. The following limitations exist in the current version of the scraper and will be addressed in the next article in the series:

  • Only reviews on the first page are saved
  • A maximum of 10 reviews (or fewer) are saved
  • TripAdvisor filters out any non-English reviews by default
  • The scraper will wait 5 seconds after clicking More, even if the reviews load in less time (a possible improvement is sketched after this list)
  • General script maintainability and efficiency can be improved.
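As an example of how the fixed wait could be improved (a suggestion, not part of the article's script), Selenium's explicit waits poll until a condition holds instead of always sleeping for the full five seconds. Assuming the expansion JavaScript replaces the clicked More element, waiting for it to go stale would look roughly like this; if TripAdvisor only changes the element's text, a different wait condition would be needed:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Click the More link, then wait at most 5 seconds for the clicked element
    # to be removed/replaced by the expansion JavaScript, instead of always
    # sleeping for the full 5 seconds.
    more_link = driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']")
    more_link.click()
    WebDriverWait(driver, 5).until(EC.staleness_of(more_link))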


Source: https://lobeslab.com/2020/09/03/a-selenium-scraper-for-tripadvisor-reviews/
