Retrieve source of external webpage

Simple web scraping with Python and requests.

Setup


Before you start, make sure that your Python installation has the necessary libraries installed. The easiest way to do this is via pip, but you need administrator access. On Windows ...

WIN > cmd > (Right Click) 'Command Prompt' > (Left Click) 'Run as administrator'
'pip install requests' ENTER > "Successfully installed requests-X.X.X" (Library to allow Python to send and receive data from websites.)
'pip install lxml' ENTER > "Successfully installed lxml-X.X.X" (Library to allow Python to parse HTML and XML - you probably want this)
'pip install js2xml' ENTER > "Successfully installed js2xml-X.X.X" (Library to allow Python to convert JavaScript variables to XML so that lxml can parse them - you probably want this.)
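Once installed, a quick way to confirm the libraries are importable is a short check like this (a sketch using only the standard library; the list of names is just the three libraries used in this guide):

```python
from importlib.util import find_spec

def missing_libraries(names):
    """Return the subset of names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Prints an empty list if everything from the pip steps above is installed.
print(missing_libraries(["requests", "lxml", "js2xml"]))
```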

There are two situations - a webpage without a login and a webpage with a login. Both require the requests library to be imported ...

import requests


With login


Visit the webpage where you log in, using Google Chrome (or an equivalent browser).
View the page source and find the 'action' attribute of the login form. This is the page which handles the login.
Press F12 to open the developer console.
Switch to 'Network' tab
Pause recording of network log
Clear log
Record network log
Login to the page
When login complete, stop recording
Look through the 'Name' column and find the name of the login handler
Click the name of the login handler page
Look through the 'Headers' panel.
Find 'Form Data' - this lists the data from the forms on the login page which was passed through to the handler.
This tells you the IDs of the fields which the login handler accepts, together with the values that were sent.

▼ Form Data   view source   view URL encoded
  username: johnsmith
  password: Password123
  token: 71569dfb894f2975034b971809cc2714


Make a note of these values.
Create a dictionary like this ...

auth = {
  'username' : 'johnsmith',
  'password' : 'Password123',
  'token' : '71569dfb894f2975034b971809cc2714'
  }


Create a variable to hold the URL of the login handler for the site

login_url = 'https://www.example.com/login_handler'


Create a requests session object. This creates a persistent web session which keeps cookies (and therefore your logged-in state) across requests.

session = requests.Session()


Perform the login by posting the form data to the login handler.

post = session.post(login_url,data=auth)


Check that the login is successful (this step is optional)

print(post.ok) # Returns True on success, False otherwise


Now get the source code of any page on the site by creating a new get request.

page_url = 'https://www.example.com/private_page'
get = session.get(page_url)
page_source = get.text # Note: .text is an attribute, not a method


You can print or save the source, or analyse its structure using the lxml or BeautifulSoup libraries.
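As a sketch of that last step, here is how lxml can pull values out of the source. The HTML string below is a stand-in for the page_source fetched above; the title and table structure are invented for illustration:

```python
from lxml import html

# Stand-in for page_source = session.get(page_url).text
page_source = """
<html><head><title>Private page</title></head>
<body><table id="results"><tr><td>42</td></tr></table></body></html>
"""

tree = html.fromstring(page_source)
title = tree.findtext(".//title")                       # text of the <title> tag
cells = tree.xpath("//table[@id='results']//td/text()") # list of cell contents
print(title, cells)
```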

Without login


Accessing a page source without login is super easy.

import requests
session = requests.Session()
page_url = 'https://www.example.com/public_page'
get = session.get(page_url)
page_source = get.text # Note: .text is an attribute, not a method
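The whole login flow above can be exercised end-to-end without touching the network by pointing the session at a tiny local stand-in server (a sketch: the FakeSite handler, its paths and its login check are all invented here to mirror the hypothetical example.com URLs, not part of any real site):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class FakeSite(BaseHTTPRequestHandler):
    """Minimal stand-in for a site with a login handler and a private page."""
    logged_in = False

    def do_POST(self):
        # Read the URL-encoded form data that session.post() sends.
        body = self.rfile.read(int(self.headers["Content-Length"])).decode()
        FakeSite.logged_in = "username=johnsmith" in body
        self.send_response(200 if FakeSite.logged_in else 403)
        self.end_headers()

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"<html>secret</html>" if FakeSite.logged_in
                         else b"<html>login first</html>")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSite)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

# Same steps as the guide: build auth, open a session, post, then get.
auth = {"username": "johnsmith", "password": "Password123"}
session = requests.Session()
post = session.post(base + "/login_handler", data=auth)
page_source = session.get(base + "/private_page").text
server.shutdown()
print(post.ok, page_source)
```

On a real site the server tracks the login via a session cookie, which requests.Session stores and resends automatically; this stand-in fakes that with a flag, so it only illustrates the client-side sequence of calls.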


Last modified: February 26th, 2022