scrape webpages with xpath and pythonSo, sometimes, it would be great to be able to programatically scrape data from a webpage. In fact, this might even be your first journey into the world of Data Science.
We are going to use PythonA programming language invented to teach programming but now widely used in education, industry and research. and XPathA query language for selecting nodes from an XML document., a syntax for defining parts of an XML document, to access a webpage, locate a piece of data on the page, retrieve the data and display it.
Python needs two libraries to be installed...
First, let's check whether they are installed. I'm working on a Windows machine.
Start an administrator command prompt
Click start and type
then click on "Run as Administrator". Accept the User Account Control warning; you know what you are doing 😄
List all the installed Python packages
At the prompt, type
and press ENTER. You should see a list of all the installed Python packages/libraries.
Install the required Python packages if necessary
If you can't see
, you will need to install them.
> pip install lxml > pip install requests
Everything should end normally with a confirmation message saying that this and that have been installed. Check again using
to make sure they are available.
Choose a cool website to scrape
I'm going to scrape my favourite website, eBay, to find out how many items it finds for a particular search query. As an absolute minimum, eBay requires you to specify the search term in the url parameter
. Note that you should replace any spaces with plus signs if you are going to construct the URL yourself.
So, let's say I want to find out how many Facit calculators are listed on Ebay at the moment. The URL would look like this...
Check it out and make a note of how many results you get. When I did it, I got 33.
Use XPath Helper to find the path to the results
We need to find the "path" to this text in the document tree. You *could* view the source code and construct the path yourself but, for a big page like this, it's tough and error prone. There is another way - if you are using Google Chrome, then install an extension called XPath Helper. Make sure that the extension is active and refresh the page before you start (to give the extension permission to access the page).
Now, when you click on the XPath Helper icon, a black panel should appear. Hold down the SHIFT key and hover over the number of results. You'll see a yellow highlight - as you move around, notice the 'path' in the left hand box which is the journey from the root "html" node to the part of the page you are hovering over. The right hand box shows you the content of that node, so you can check it's correct.
When you have located the right node, let go of the SHIFT key and the path will remain fixed. Now, copy the path - we need to do a little editing of this before we put it into Python. You can express this path as an absolute route from the root to the desired node or you can choose a shorter root, starting at a unique node in the path. Normally, we would look for a "id" attribute in a node if we were going to start with that so that we would be sure that the path is unique. However, in this case, we can't because all of the nodes in our path are "classes" so we can't guarantee the path is unique. Moreover, if there are any other instances of these classes in the page, our path will break.
In this situation, the safest way is to strip the path back to it's skeletal form. We reduce this...
/html[@class='srp-ds6 srp-ds6-phase3 history devicemotion deviceorientation']/body[@class='s-page no-touch skin-large srp--list-view gh-flex']/div[@class='srp-main srp-main--isLarge']/div[@id='mainContent']/div[@class='s-answer-region s-answer-region-center-top']/div[@class='srp-controls srp-controls--with-list srp-controls--with-checkbox']/div[@class='clearfix srp-controls__row-2']/div[@class='srp-controls__row-cells']/div[@class='srp-controls__control srp-controls__count']/h1[@class='srp-controls__count-heading']/span[@class='BOLD']
By removing all the class elements, we are guaranteeing that the path is unique. We then need to grab the content of the node (XPath Helper does not seem to include this) by adding
to the end of the path. So our final XPath expression is...
You can actually see the path if you look at the HTML of the page. There is only one valid path to the last node but there are two
tags to choose from when we get there which is why the path specifies the first one with
Writing a simple Python program
So, lets get programming! This is a simple Python script which will request and download the webpage the analyse the path we provide to give us the result. NOTE: This will only work for a website that you do not need to log into to - we'll covert that scenario later 😄. Create the script using your favourite Python editor, save and run.
# First, we need the requests library to get the webpage import requests # The lxml library is used to convert html to an xml tree from lxml import html # Let's create a session object to handle the retrieval. session = requests.Session() # The most basic URL structure for Ebay. url = 'https://www.ebay.co.uk/sch/i.html?_nkw=facit+calculator' # Get the webpage. The page is returned as a text/html file. page = session.get(url) # Create an XML tree from the page text. tree = html.fromstring(page.text) # The XPath to the required data. x = "/html/body/div/div/div/div/div/div/div/h1/span/text()" # Execute the xpath results = tree.xpath(x) # Print the result. print(results)
If all has gone to plan, you should get the following output... (although I got 33 before 🤪)
Connected: True ['29'] >>>
XPath always returns it's results as a list so we get the item we want, in this case, the first one, by...
>>> print(results) 29
It really is up to you what you do with this data. You could get the script to alert you when new items are listed perhaps?
Last modified: October 5th, 2021