Scraping the unscrapable in Python using Playwright

Written by terieyenike | Published 2023/07/02

Automating your workflow with scripts is far more efficient than doing the same work painstakingly by hand. Web scraping is all about extracting data in a clean, readable format that developers, data analysts, and scientists can use to read and download an entire web page's data ethically.

In this article, you will explore the benefits of using Bright Data's infrastructure, which connects you to large datasets through a robust proxy network using the Scraping Browser.

Let’s get started.

What is Bright Data?

Bright Data is a web data platform that helps organizations, small businesses, and academic institutions retrieve crucial public web data efficiently, reliably, and flexibly. It also offers ready-to-use datasets that are GDPR- and CCPA-compliant.

What is Playwright?

Playwright is a browser automation library used to navigate target websites; much like Puppeteer, it interacts with a site's HTML to extract the data you need.
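To make that concrete, here is a minimal local Playwright sketch (independent of Bright Data; it assumes Playwright is installed with pip3 install playwright and that its browser binaries have been downloaded with playwright install):

import asyncio
from playwright.async_api import async_playwright

async def demo():
    async with async_playwright() as pw:
        # Launch a local headless Chromium instance.
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        # Read the page title from the live DOM.
        print(await page.title())
        await browser.close()

asyncio.run(demo())

The scraper you will build below follows the same pattern, except that it connects to Bright Data's remote Scraping Browser instead of launching a browser locally.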

Installation

Before writing a single script, check if you have Python installed on your system using this command in the command line interface (CLI) or terminal:

python --version

If the version is not present in the terminal after running the command, go to the official website of Python to download it to your local machine.

Connecting to Scraping Browser

Create a new account on Bright Data to gain access to the admin dashboard of the Scraping Browser for the proxy integration with your application.

On the left pane of the dashboard, click on the Proxies and Scraping Infra icon.

Scroll down the page and select the Scraping Browser, then click the Get started button. If you don't see it, click the Add dropdown button and select Scraping Browser.

The next screen allows you to rename the proxy. Click the Add proxy button; a confirmation prompt appears. Accept the default by clicking the Yes button.

Next, click the </> Check out code and integration examples button to configure the code in Python.

Creating environment variables in Python

Environment variables store secret keys and credentials as values kept outside your source code, so the app can run during development without exposing them and without risking unauthorized access.


As in a Node.js app, create a new file called .env in the root directory. But first, install the Python package python-dotenv:

pip3 install python-dotenv

This package reads the key-value pairs defined in the .env file and makes them available as environment variables.

To confirm that python-dotenv was installed, run this command, which lists all installed packages:

pip3 list
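Once installed, load_dotenv() reads a .env file and merges its key-value pairs into the process environment, where os.getenv() can pick them up. A minimal sketch (the SECRET_KEY name is just an illustrative placeholder, not a variable used later in this tutorial):

import os
from dotenv import load_dotenv

# Reads key-value pairs from .env into the process environment.
load_dotenv()

# Returns the value defined in .env, or None if the key is missing.
print(os.getenv("SECRET_KEY"))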

Next, copy-paste this code into the .env file:

.env

USERNAME="<user-name>"
HOST="<host>"

Replace the placeholder values in quotes with the credentials from your Bright Data dashboard.
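One caveat worth flagging: on some systems (Windows in particular), USERNAME is already defined by the operating system, and load_dotenv() does not override existing environment variables by default. If the script later prints an unexpected username, you can force the .env values to take precedence:

from dotenv import load_dotenv

# override=True makes values from .env win over variables the OS already defines.
load_dotenv(override=True)

Alternatively, choose variable names that are unlikely to collide, such as BRIGHTDATA_USERNAME, and adjust the os.getenv() calls accordingly.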

Creating the web scraper with Playwright

In the project directory, create a new file called app.py to handle scraping the web.

Installing packages

You will need the asyncio and playwright libraries. asyncio ships with Python's standard library (Python 3.4 and later), so only playwright needs to be installed:

pip3 install playwright

  • Asyncio: a standard-library module for writing concurrent code using the async/await syntax (see the short sketch after this list)
  • Playwright: this module provides a method to launch or connect to a browser instance
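As mentioned in the list above, here is a tiny, self-contained sketch of the async/await pattern the scraper relies on (unrelated to scraping itself; it only demonstrates the syntax):

import asyncio

async def fetch_label(name: str, delay: float) -> str:
    # Simulate a slow I/O operation without blocking the event loop.
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Run both coroutines concurrently and collect their results.
    results = await asyncio.gather(fetch_label("first", 0.5), fetch_label("second", 0.2))
    print(results)

asyncio.run(main())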

Now, copy-paste this code:

app.py

import asyncio
import os
from playwright.async_api import async_playwright
from dotenv import load_dotenv

# Read the credentials from the .env file into the environment.
load_dotenv()

auth = os.getenv("USERNAME")
host = os.getenv("HOST")

# Bright Data's Scraping Browser is reached over a WebSocket CDP endpoint.
browser_url = f'wss://{auth}@{host}'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        # Connect to the remote browser instead of launching a local one.
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        # Navigate to the target page, allowing up to 2 minutes (120,000 ms).
        await page.goto('http://lumtest.com/myip.json', timeout=120000)
        print('done, evaluating')
        # Run JavaScript in the page context and print the full document HTML.
        print(await page.evaluate('()=>document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())

The code above does the following:

  • Import the necessary modules: asyncio, async_playwright, load_dotenv, and os
  • load_dotenv() reads the variables from the .env file into the environment
  • The os.getenv() method returns the value of each environment variable key
  • The main() function is asynchronous; within it, Playwright connects to the Scraping Browser zone over CDP
  • The new_page() method opens a new browser page, and the goto method navigates it to the destination site with a timeout of 2 minutes
  • The page.evaluate() method runs JavaScript in the page context and returns the document's HTML, which is then printed
  • Finally, the browser must be closed with the browser.close() method
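Once the connection works, the same pattern extends to pulling out specific elements instead of dumping the whole document. A sketch that reuses the imports and browser_url from app.py above (the target URL and the h1 selector are placeholders; adapt them to the site you actually want to scrape):

async def scrape_headline():
    async with async_playwright() as pw:
        browser = await pw.chromium.connect_over_cdp(browser_url)
        page = await browser.new_page()
        await page.goto('https://example.com', timeout=120000)
        # Wait for the first <h1> element and read its visible text.
        headline = await page.locator('h1').first.inner_text()
        print(headline)
        await browser.close()

asyncio.run(scrape_headline())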

To test the application, run it with the following command:

python app.py

Conclusion

Evaluating and extracting meaningful data is at the heart of what Bright Data offers.

This tutorial showed you how to use the Scraping Browser in Python with the Playwright package to read data from a website.

Try Bright Data today!


Written by terieyenike | I am a software developer focused on creating content through technical writing and documentation.