Connect to an external webpage and retrieve its source code

Before you start, you need to make sure that your Python installation has the necessary libraries installed. The easiest way to do this is with pip, but you may need administrator access. On Windows ...
  • WIN > 'cmd' > (Right Click) 'Command Prompt' > (Left Click) 'Run as administrator'
  • 'pip install requests' [ENTER] > "Successfully installed requests-X.X.X"
    Library to allow Python to send and receive data from websites.
  • 'pip install lxml' [ENTER] > "Successfully installed lxml-X.X.X"
    Library to allow Python to parse HTML and XML - you probably want this.
  • 'pip install js2xml' [ENTER] > "Successfully installed js2xml-X.X.X [plus dependencies]"
    Library to allow Python to convert JavaScript variables to XML so that lxml can parse them - you probably want this (a quick check follows below).
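To confirm the installs worked, import the libraries in an ordinary Python session (no administrator rights needed). The short snippet below also gives a taste of what js2xml does - the JavaScript line in it is made up purely for illustration ...

import requests
import js2xml
from lxml import etree   # confirms lxml is installed; js2xml builds on it

print(requests.__version__)                                   # version of the installed requests library
script = "var config = { user: 'johnsmith', attempts: 3 };"   # made-up JavaScript
tree = js2xml.parse(script)                                   # JavaScript source -> lxml XML tree
print(js2xml.pretty_print(tree))                              # the XML that lxml/XPath can then query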
There are two situations - a webpage without a login and a webpage with a login. Both require the requests library to be imported ...

import requests

With login
  • Visit the page where you log in, using Google Chrome (or an equivalent browser).
  • View the page source and find the 'action' attribute of the login form. This is the page that handles the login.
  • Press F12 to open the developer console.
  • Switch to 'Network' tab
  • Pause recording of network log
  • Clear log
  • Record network log
  • Login to the page
  • When login complete, stop recording
  • Look through the 'Name' column and find the name of the login handler
  • Click the name of the login handler page
  • Look through the 'Headers' panel.
  • Find 'Form Data' - this lists the form data from the login page that was passed through to the handler.
  • This tells you the names of the fields that the login handler expects from the login page, together with the values that were sent.
▼Form Data   view source   view URL encoded
  username: johnsmith
  password: Password123
  token: 71569dfb894f2975034b971809cc2714
  • Make a note of these values.
  • Create a dictionary like this ...
auth = {
    'username': 'johnsmith',
    'password': 'Password123',
    'token': '71569dfb894f2975034b971809cc2714'
}
  • Create a variable to hold the URL of the login handler for the site
login_url = 'https://www.example.com/login_handler'
  • Create a requests Session object. This creates a persistent web session that keeps cookies (such as your login) across requests.
session = requests.Session()
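Note: on many sites the 'token' value is a one-time (CSRF) token that changes every time the login page loads, so a value copied from the developer console may be rejected on the next attempt. If so, fetch a fresh token with the same session just before logging in. A rough sketch - the login-page URL, the field name 'token' and the XPath are assumptions to check against the real form ...

from lxml import html

login_page = session.get('https://www.example.com/login')    # assumed URL of the login page itself
token_value = html.fromstring(login_page.text).xpath('//input[@name="token"]/@value')[0]
auth['token'] = token_value                                   # overwrite the hard-coded token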
  • Perform the login by posting the form data to the login handler.
post = session.post(login_url, data=auth)
  • Check that the login was successful (this step is optional)
print(post.ok)  # True if the HTTP status code is below 400
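Bear in mind that post.ok only reflects the HTTP status code; some sites return a normal 200 page even when the credentials were wrong. A more reliable, if site-specific, check is to look for something that only appears once you are logged in - the marker text below is an assumption ...

if post.ok and 'Log out' in post.text:    # marker text is site-specific - pick something only visible when logged in
    print('Logged in')
else:
    print('Login may have failed - inspect post.text to see what the server returned')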
  • Now get the source code of any page on the site by making a GET request with the same session.
page_url = 'https://www.example.com/private_page'
get = session.get(page_url)
page_source = get.text  # .text is a property, not a method
  • You can print out or save the source, or analyse its structure using the lxml or BeautifulSoup libraries - a small lxml sketch follows below.
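For example, a minimal lxml sketch that pulls the page title and the link targets out of the source retrieved above (the XPath is a generic example, not specific to any site) ...

from lxml import html

tree = html.fromstring(page_source)      # parse the HTML retrieved above
print(tree.findtext('.//title'))         # text of the page's <title> element
print(tree.xpath('//a/@href')[:10])      # first ten link targets on the page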
Without login

Accessing a page's source without a login is super easy.

import requests
session = requests.Session()
page_url = 'https://www.example.com/public_page'
get = session.get(page_url)
page_source = get.text
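For a one-off request you do not even need a Session object - requests.get on its own will do. A minimal sketch using the same placeholder URL ...

import requests

get = requests.get('https://www.example.com/public_page')
get.raise_for_status()     # raise an exception for 4xx/5xx responses
page_source = get.text     # .text is a property, not a method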
