Retrieve source of external webpage

Simple web scraping with Python and requests.

Setup


Before you start, make sure that your Python installation has the necessary libraries installed. The easiest way to do this is via pip, but you need administrator access. On Windows ...

WIN > cmd > (Right Click) 'Command Prompt' > (Left Click) 'Run as administrator'
'pip install requests' ENTER > "Successfully installed requests-X.X.X" (Library to allow Python to send and receive data from websites.)
'pip install lxml' ENTER > "Successfully installed lxml-X.X.X" (Library to allow Python to parse HTML and XML - you probably want this)
'pip install js2xml' ENTER > "Successfully installed js2xml-X.X.X" (Library to allow Python to convert JavaScript variables to XML so that lxml can parse them - you probably want this.)
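Once installed, a quick way to confirm the libraries are importable is a short check like this (a sketch using only the standard library; the list of names is just the three libraries used in this guide):

```python
from importlib.util import find_spec

def missing_libraries(names):
    """Return the subset of names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Prints an empty list if everything from the pip steps above is installed.
print(missing_libraries(["requests", "lxml", "js2xml"]))
```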

There are two situations - a webpage without a login and a webpage with a login. Both require the requests library to be imported ...

import requests


With login


Visit the webpage where you log in, using Google Chrome (or an equivalent browser).
View the page source and find the 'action' attribute of the login form. This is the page which handles the login.
Press F12 to open the developer console.
Switch to 'Network' tab
Pause recording of network log
Clear log
Record network log
Login to the page
When login complete, stop recording
Look through the 'Name' column and find the name of the login handler
Click the name of the login handler page
Look through the 'Headers' panel.
Find 'Form Data' - this lists the data from the forms on the login page which was passed through to the handler.
This tells you the IDs of the fields which the login handler accepts, together with the values that were sent.

▼ Form Data   view source   view URL encoded
  username: johnsmith
  password: Password123
  token: 71569dfb894f2975034b971809cc2714


Make a note of these values.
Create a dictionary like this ...

auth = {
  'username' : 'johnsmith',
  'password' : 'Password123',
  'token' : '71569dfb894f2975034b971809cc2714'
  }


Create a variable to hold the URL of the login handler for the site

login_url = 'https://www.example.com/login_handler'


Create a requests session object. This creates a persistent web session which keeps cookies (and therefore your logged-in state) across requests.

session = requests.Session()


Perform the login by posting the form data to the login handler.

post = session.post(login_url,data=auth)


Check that the login is successful (this step is optional)

print(post.ok) # Returns True on success, False otherwise


Now get the source code of any page on the site by creating a new get request.

page_url = 'https://www.example.com/private_page'
get = session.get(page_url)
page_source = get.text # Note: .text is an attribute, not a method


You can print or save the source, or analyse its structure using the lxml or BeautifulSoup libraries.
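As a sketch of that last step, here is how lxml can pull values out of the source. The HTML string below is a stand-in for the page_source fetched above; the title and table structure are invented for illustration:

```python
from lxml import html

# Stand-in for page_source = session.get(page_url).text
page_source = """
<html><head><title>Private page</title></head>
<body><table id="results"><tr><td>42</td></tr></table></body></html>
"""

tree = html.fromstring(page_source)
title = tree.findtext(".//title")                       # text of the <title> tag
cells = tree.xpath("//table[@id='results']//td/text()") # list of cell contents
print(title, cells)
```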

Without login


Accessing a page source without login is super easy.

import requests
session = requests.Session()
page_url = 'https://www.example.com/public_page'
get = session.get(page_url)
page_source = get.text # Note: .text is an attribute, not a method
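The whole login flow above can be exercised end-to-end without touching the network by pointing the session at a tiny local stand-in server (a sketch: the FakeSite handler, its paths and its login check are all invented here to mirror the hypothetical example.com URLs, not part of any real site):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class FakeSite(BaseHTTPRequestHandler):
    """Minimal stand-in for a site with a login handler and a private page."""
    logged_in = False

    def do_POST(self):
        # Read the URL-encoded form data that session.post() sends.
        body = self.rfile.read(int(self.headers["Content-Length"])).decode()
        FakeSite.logged_in = "username=johnsmith" in body
        self.send_response(200 if FakeSite.logged_in else 403)
        self.end_headers()

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"<html>secret</html>" if FakeSite.logged_in
                         else b"<html>login first</html>")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSite)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

# Same steps as the guide: build auth, open a session, post, then get.
auth = {"username": "johnsmith", "password": "Password123"}
session = requests.Session()
post = session.post(base + "/login_handler", data=auth)
page_source = session.get(base + "/private_page").text
server.shutdown()
print(post.ok, page_source)
```

On a real site the server tracks the login via a session cookie, which requests.Session stores and resends automatically; this stand-in fakes that with a flag, so it only illustrates the client-side sequence of calls.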


Last modified: February 26th, 2022