Web scraping the yellow pages with Python: get the name and location of thousands of businesses in minutes
In the world of business, a dataset that is of great interest but equally hard to obtain is the location of all your competitors. This short tutorial will guide you on how to scrape location data from a yellow-pages-like website.
Other applications of this code include deciding where to open a new store or, if you work in sales, building a list of company leads.
We will be using the Selenium library in Python. Please note that this is not meant to be a 101 on web scraping.
Where are all the Pizzerias?
As a scenario, we will create a dataset of all Pizzerias in the country. We want a data frame that contains two columns: location name and address. Additional data, like website, reviews, or hours of operation, can be collected by adapting the code provided in this blog.
Note that for this tutorial I am selecting the entire country as the search area. This is meant to show how powerful this method is, as we can scrape the data of the 1,579 Pizzerias in the Netherlands within a few minutes.
As the image above shows, the website is built so that each business gets its own “element box” on a scrollable page. Once you reach the bottom of the page, you can click the next-page button to see more companies. This layout is quite common, so you should be able to adapt my code to your case.
At the heart of this method is a loop that scrapes each “element box” (each company box) and knows when it is time to click “go to next page” and continue scraping. Additionally, we need to save the information we scrape into a data frame.
I will first provide the entire code and then break the code into steps that I will explain.
Step 1: Import the libraries, set the path to your ChromeDriver, create an empty data frame, and go to the website of interest. Note that the try block in the screenshot is more complicated than it needs to be and relates to some cases of accepting cookies. For that reason, I exclude it from the code below.
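from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import pandas as pd
import time
PATH = "your location of chromedriver.exe"
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(PATH), options=options)  # Selenium 4 style
columns = ['Address2', 'Zipcode', 'Phone']  # 'Address2' holds the name, 'Zipcode' the address
df = pd.DataFrame(columns=columns)  # empty data frame for the scraped text
driver.get("URL of the website of interest")  # placeholder, like PATH above
driver.maximize_window()
time.sleep(2)  # give the page time to load
click1 = driver.find_element(By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')
click1.click()  # accept the cookie banner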
Step 2: Set two variables that we will use to iterate through each “element box”. “Row” keeps track of the index assigned to each “element box” on the website, while “row” refers to the row in the data frame where we store the scraped text. On the first page, Row and row are the same, but from the second page onward Row must be reset to 1, while row must not be reset, as that would overwrite the rows already stored in the data frame.
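To make the difference between the two counters concrete, here is a minimal sketch in plain Python with no browser involved. The function name `counter_pairs` and its parameters are hypothetical and not part of the original code; it only simulates how Row restarts on each page while row keeps counting.

```python
def counter_pairs(total_results, per_page=25):
    """Yield (Row, row) pairs: Row restarts on each page, row never does."""
    Row = 1  # index of the "element box" on the current page
    row = 1  # row of the data frame we write to
    while row <= total_results:
        yield Row, row
        Row += 1
        row += 1
        if Row > per_page:  # end of page reached: restart the page index only
            Row = 1


# Small example: 7 results shown 3 per page.
pairs = list(counter_pairs(total_results=7, per_page=3))
print(pairs)
```

With 3 results per page, the fourth business is box 1 of the second page but still goes into row 4 of the data frame.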
Step 3: The key concept here is that, from one company to the next, the XPath of the name (and likewise of the address) differs only in one number within the XPath. This number is the index that we have already stored in the variable “Row”. You can think of it as the position of each business in the list of query results on the website. For each object of interest (name or address) we create two variables: the first holds the part of the XPath up to the index, and the second holds the remainder of the XPath.
Step 4: We start a while loop that contains the most important part of the code’s logic. First, we concatenate the part of the XPath before the index with the index in string form, and then we append the remaining part to complete a valid XPath. We do this for both the name and the location. It may seem complicated, but we are just joining pieces of text to build a valid XPath. Building the XPath dynamically lets us scrape the data of each company in a loop.
while Row <26:
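As a sketch of Steps 3 and 4, the snippet below splits an XPath around its index and joins the pieces back together for a given Row. The XPath fragments are made up for illustration and will differ on the real website; the names `name_before`, `name_after`, and `build_xpath` are likewise hypothetical.

```python
# Hypothetical XPath pieces: everything before and after the changing index.
name_before = '//*[@id="results"]/ul/li['
name_after = ']/div/h2'


def build_xpath(before, index, after):
    """Join the fixed XPath pieces around the index of one element box."""
    return before + str(index) + after


# Build the XPath of the name for the first three element boxes.
xpaths = [build_xpath(name_before, Row, name_after) for Row in range(1, 4)]
for xp in xpaths:
    print(xp)
```

The same helper would be called a second time with the address pieces to get the XPath of the location.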
Step 5: We have now built the XPaths that identify the name and location of each business on the website. Next, we pass these XPaths to Selenium to find the elements on the page, and then we copy the text of each found element into the data frame. We achieve this with the following code. We finish this step by adding 1 to both Row and row, so that we scrape the next business and move to the next row of the data frame.
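    # Inside the while loop: 'name' and 'location' are the XPaths built in Step 4
    Location = driver.find_elements(By.XPATH, name)
    for x in range(len(Location)):
        df.loc[row, 'Address2'] = Location[x].text  # business name
    Address = driver.find_elements(By.XPATH, location)
    for x in range(len(Address)):
        df.loc[row, 'Zipcode'] = Address[x].text  # business address
    Row += 1  # next "element box" on the page
    row += 1  # next row in the data frame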
Step 6: In my case, the website displays up to 25 locations per page, after which the user needs to click through to the next page to see additional results. This means that once the index Row reaches 26, we will no longer find any “business box” with this index. For this reason, we add some logic so that when Row is 26, we advance to the next page and reset the “business box” index back to 1. Note that we do not want to reset the data frame row to 1.
click1 = driver.find_element(By.XPATH, '//*[@id="content"]/div/div/div/section/div/div/section/div/ul/li/a')
click1.click()  # go to the next page; afterwards reset Row to 1 (row keeps counting)
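Putting Steps 4 to 6 together, one possible shape for the full loop is sketched below. It is not the original code: the function `scrape_all_pages`, its parameters, and the stopping rule (stop when no next-page link is found) are assumptions, and the rows are collected in a list of dictionaries rather than written directly to a data frame. The `driver` argument is expected to behave like a Selenium WebDriver, and the string "xpath" is the value of Selenium's `By.XPATH`.

```python
import time

XPATH = "xpath"  # same value as selenium.webdriver.common.by.By.XPATH


def scrape_all_pages(driver, name_parts, address_parts, next_xpath,
                     per_page=25, pause=2.0):
    """Collect name/address rows page by page until no next-page link exists.

    name_parts and address_parts are (before, after) tuples holding the XPath
    text on either side of the element-box index. All names are hypothetical.
    """
    rows = []  # keeps growing across pages, like the data frame counter "row"
    while True:
        for Row in range(1, per_page + 1):  # Row restarts at 1 on every page
            name = name_parts[0] + str(Row) + name_parts[1]
            location = address_parts[0] + str(Row) + address_parts[1]
            names = driver.find_elements(XPATH, name)
            addresses = driver.find_elements(XPATH, location)
            if not names:  # fewer boxes than per_page, e.g. on the last page
                break
            rows.append({
                "Name": names[0].text,
                "Address": addresses[0].text if addresses else "",
            })
        next_links = driver.find_elements(XPATH, next_xpath)
        if not next_links:  # assumption: no link means this was the last page
            return rows
        next_links[0].click()
        time.sleep(pause)  # give the next page time to load
```

The returned list can be turned into a data frame with `pd.DataFrame(rows)`. If the website keeps a next-page link on its last page, the stopping rule would need to be adapted, for example by checking whether the page content changed after the click.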
In conclusion, the above method can be used to scrape any website that presents its units of interest (in our case, businesses) as element boxes whose XPaths differ only by an index. We iterate over that index, save the information in a data frame, and move on to the next page when the current one is exhausted.