How to Web Scrape Using Python, Snscrape & HarperDB

by Davis David, August 17th, 2022



Suppose you are searching for information on a website. Let’s imagine a Twitter user writes about cryptocurrency. What do you do? You can copy and paste the tweets about cryptocurrency into your own file.

But what if you want to retrieve massive volumes of information from Twitter, for example for your data science project? In that case, copying and pasting won’t work, and you will need web scraping.

What is Web Scraping?

The term “web scraping” refers to an automated process that can collect significant volumes of data from websites. The majority of this data is unstructured data that is stored in an HTML format. In order for this data to be utilized in a variety of applications, it must first be converted into structured data that is stored in a spreadsheet or a database.
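To make the step from unstructured HTML to structured data concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the "tweet" class, and the data-author attribute are hypothetical markup invented for this illustration; real pages will look different.

```python
from html.parser import HTMLParser


class TweetTextParser(HTMLParser):
    """Collect the text of every <p class="tweet"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.records = []       # structured output: one dict per tweet
        self._in_tweet = False  # True while inside a matching <p> tag
        self._author = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and attrs.get("class") == "tweet":
            self._in_tweet = True
            self._author = attrs.get("data-author")

    def handle_data(self, data):
        # Keep only non-empty text found inside a tweet paragraph.
        if self._in_tweet and data.strip():
            self.records.append({"author": self._author, "text": data.strip()})

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_tweet = False


SAMPLE_HTML = """
<html><body>
  <p class="tweet" data-author="alice">Bitcoin is up today</p>
  <p class="ad">Buy now</p>
  <p class="tweet" data-author="bob">Ethereum merge is coming</p>
</body></html>
"""

parser = TweetTextParser()
parser.feed(SAMPLE_HTML)
records = parser.records
print(records)
```

Each dictionary in records is a row that could be written to a spreadsheet or a database table.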

For many businesses, web scraping can be used to quickly and inexpensively gather data that can then be analyzed in a variety of ways such as news monitoring, sentiment analysis, email marketing, and others.

Web scraping can be carried out in several ways:

  • Application Programming Interfaces (APIs), e.g., from Twitter, StackOverflow & Google.
  • Writing your own code (e.g., in the Python programming language).
  • Online services from different providers, e.g., Octoparse.

In this article, you will learn how to:

  1. Execute web scraping on Twitter using the snscrape Python library.
  2. Store scraped data automatically in a database using HarperDB.
  3. Share your data via an API call by using a Custom Function from HarperDB.

So let’s get started.

What is Snscrape?

snscrape is a scraping tool for social networking services (SNS). It scrapes information like user profiles, hashtags, searches, and threads and returns the discovered items, e.g. the relevant posts. It was released on July 8, 2020, and it is capable of scraping data from a variety of platforms, including the following:

  • Twitter
  • Instagram
  • Reddit
  • Facebook
  • Weibo
  • Telegram
  • Mastodon

You can use snscrape by typing its command-line interface (CLI) commands into the command prompt/terminal. If you don’t feel comfortable using a terminal, you can use snscrape as a Python library, but this is not yet documented.
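As a rough illustration of the CLI route, the sketch below only builds and prints a command that you could paste into a terminal; it does not run snscrape. It assumes the --jsonl and --max-results options and the twitter-search scraper name as I understand the snscrape command line; the helper name build_snscrape_command is made up for this example.

```python
import shlex


def build_snscrape_command(query, max_results=100):
    """Return the argument list for a snscrape CLI Twitter search (assumed flags)."""
    return [
        "snscrape",
        "--jsonl",                          # emit one JSON object per line
        "--max-results", str(max_results),  # stop after this many items
        "twitter-search",
        query,
    ]


command = build_snscrape_command(
    "cryptocurrency since:2022-01-01 until:2022-08-13", max_results=10)

# shlex.join quotes the query so the printed line is safe to paste into a shell.
print(shlex.join(command))
```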

Note: On Twitter, it can scrape users, user profiles, hashtags, searches, tweets (single or surrounding thread), list posts, and trends.

What is HarperDB?

HarperDB is a lightning-fast and versatile platform for managing SQL and NoSQL data. You can put it to work for a wide variety of purposes, some of which include but are not limited to quick application development, distributed computing, edge computing, software as a service (SaaS), and many others.

HarperDB does not duplicate data, is fully indexed, and can run on any device, from the edge to the cloud. Additionally, it can be used with any programming language, such as JavaScript, Java, Python, and others.

The following is a list of a few of the features that can be accessed with HarperDB:

  • Allows JSON and CSV file insertions.
  • Single endpoint API.
  • Supports SQL queries for full CRUD operations.
  • Custom Functions (Lambda-like application development platform with direct access to HarperDB’s core methods).
  • Limited database configuration required.
  • Math.js and GeoJSON are both supported.

HarperDB has a built-in HTTP API, custom functions for user-defined endpoints, and a dynamic schema that can help you easily share your scraped data with your coworkers after storing them in a HarperDB cloud instance.

HarperDB allows you to quickly download scraped data held in the HarperDB instance as a CSV file so that you can perform extra analysis before making a final choice.

Now that you have been introduced to the tools (snscrape and HarperDB) that you will use to automate the process of scraping data and saving it in the database, all you have to do is follow the steps described below.

Step 1: Create a HarperDB Account

We will start by working on the HarperDB database first. Visit https://harperdb.io/ and click the “Start Free” link in the navigation bar to create your account.

If you already have an account, use the following URL https://studio.harperdb.io/ to sign in with your credentials.

Step 2: Create a HarperDB Cloud Instance

After registration, you need to create a cloud instance to store and fetch your scraped data from Twitter. Click the Create New HarperDB Cloud Instance link to add a new instance to your account.

Note: You just need to follow the instructions provided by HarperDB to create your cloud instance, such as:

  • Select Instance Types.
  • Choose Cloud Provider.
  • Add instance information.
  • Select instance specification (RAM size, instance storage size, and instance region).
  • Confirm and create a cloud instance.

When the HarperDB Cloud Instance has been created successfully, you will see the status OK for that particular instance.

Step 3: Configure the HarperDB Schema and Table

To add the scraped Twitter data to the database, you must first create a schema and a table. Load the HarperDB cloud instance you already created from the dashboard and create the schema by giving it a name (like “data_scraping”).

You then have to add a table (e.g., “tweets”). HarperDB will also ask you to specify the hash attribute, which serves as the unique ID of each record.
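If you prefer to script this step instead of clicking through the Studio, the same schema and table can be described as JSON operation bodies. The sketch below only builds and prints those bodies; it assumes the create_schema and create_table operations of HarperDB's HTTP operations API, and the helper name schema_and_table_operations is hypothetical. Sending them (for example with an authenticated POST request to your instance URL) is left out on purpose.

```python
import json


def schema_and_table_operations(schema, table, hash_attribute="id"):
    """Build the two JSON operation bodies that create a schema and a table."""
    return [
        {"operation": "create_schema", "schema": schema},
        {
            "operation": "create_table",
            "schema": schema,
            "table": table,
            "hash_attribute": hash_attribute,  # unique ID column for each record
        },
    ]


operations = schema_and_table_operations("data_scraping", "tweets")
for body in operations:
    print(json.dumps(body))
```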

Step 4: Install the Required Packages

You need to install the following packages on your local machine.

(a) harper-sdk-python
This is the Python package we’ll use to call different HarperDB API functions, such as inserting data into the cloud instance. It also provides wrappers for an object-oriented interface.

pip install harperdb

(b) snscrape
Snscrape requires Python 3.8 or higher. When you install snscrape, the dependencies for the Python package are automatically installed.

pip install snscrape

Step 5: Import Important Packages

The next step is to import the Python packages needed to scrape data from Twitter and automatically store it in the HarperDB cloud instance.

#import packages
#snscrape
import snscrape.modules.twitter as sntwitter
# harperdb
import harperdb
import warnings  # To ignore any warnings
warnings.filterwarnings("ignore")

Step 6: Connect to HarperDB Cloud Instance

You need to connect to the HarperDB cloud instance in order to insert scraped tweets into the table called tweets.

Here you need to provide three parameters:

  • Full URL of the HarperDB instance
  • Your username
  • Your password


# connect to harperdb
URL = "https://1-mlproject.harperdbcloud.com"
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"

db = harperdb.HarperDB(url=URL, username=USERNAME, password=PASSWORD)

# check if you are connected
db.describe_all()

When you execute the above code, you will see output similar to that displayed below, indicating a successful connection to your HarperDB Cloud Instance.

{'data_scraping': {'tweets': {'__createdtime__': 1660390877630,
   '__updatedtime__': 1660390877630,
   'hash_attribute': 'id',
   'id': 'd140645e-3af2-42d7-8594-2195826dabbc',
   'name': 'tweets',
   'residence': None,
   'schema': 'data_scraping',
   'attributes': [{'attribute': '__createdtime__'},
    {'attribute': '__updatedtime__'},
    {'attribute': 'id'}],
   'record_count': 0}}}
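Hard-coding the username and password is fine for a quick test, but you may prefer to keep them out of your script. Here is a small sketch that reads the three connection parameters from environment variables. The variable names HARPERDB_URL, HARPERDB_USERNAME, and HARPERDB_PASSWORD and the helper load_harperdb_settings are my own choices for this example, not something HarperDB requires.

```python
import os


def load_harperdb_settings(environ=os.environ):
    """Read connection settings from environment variables (names are hypothetical)."""
    names = ("HARPERDB_URL", "HARPERDB_USERNAME", "HARPERDB_PASSWORD")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "url": environ["HARPERDB_URL"],
        "username": environ["HARPERDB_USERNAME"],
        "password": environ["HARPERDB_PASSWORD"],
    }


# Demonstration with an in-memory mapping instead of the real environment.
settings = load_harperdb_settings({
    "HARPERDB_URL": "https://1-mlproject.harperdbcloud.com",
    "HARPERDB_USERNAME": "USERNAME",
    "HARPERDB_PASSWORD": "PASSWORD",
})
print(sorted(settings))
```

The returned keys match the keyword arguments used above, so the connection could then be created with harperdb.HarperDB(**settings).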

Step 7: Create a Function to Record the Scraped Tweets

Using the insert function from the harperdb package, the following function will insert the scraped tweets as data (in dictionary format) into the specified table. The insert function receives three parameters:

  • SCHEMA name
  • TABLE name
  • data (scraped tweets)

# define a function to record scraped data into the table
def record_tweets(data):
    # define the schema and table
    SCHEMA = "data_scraping"
    TABLE = "tweets"
    # insert data into the table (insert expects a list of records)
    result = db.insert(SCHEMA, TABLE, [data])
    return result
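
Because insert takes a list of records, one possible variation is to send several tweets per call instead of one at a time. The sketch below shows that idea under stated assumptions: the helper names batches and record_tweets_in_batches are hypothetical, and FakeDB is a stand-in object so the example runs without a HarperDB connection. With a real connection you would pass the db object created in Step 6.

```python
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def record_tweets_in_batches(db, records, schema="data_scraping",
                             table="tweets", size=100):
    """Insert records in groups; returns the number of insert calls made."""
    calls = 0
    for batch in batches(records, size):
        db.insert(schema, table, batch)  # same call shape as record_tweets
        calls += 1
    return calls


class FakeDB:
    """Stand-in for the HarperDB client so the sketch runs offline."""

    def __init__(self):
        self.inserted = []

    def insert(self, schema, table, rows):
        self.inserted.extend(rows)
        return {"inserted": len(rows)}


fake_db = FakeDB()
sample = [{"content": f"tweet {n}"} for n in range(250)]
calls = record_tweets_in_batches(fake_db, sample, size=100)
print(calls, len(fake_db.inserted))
```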

Step 8: Scrape Tweets Using snscrape

Now you can use the TwitterSearchScraper class from the snscrape Python package to scrape tweets that match a particular search query. In this example, I will show you how to scrape 1,000 tweets about “cryptocurrency” from 1st January 2022 to 13th August 2022.

#1 Use TwitterSearchScraper to scrape tweets that match the search query
for i, tweet in enumerate(
        sntwitter.TwitterSearchScraper(
            'cryptocurrency since:2022-01-01 until:2022-08-13').get_items()):
    # stop after 1,000 tweets (i starts at 0)
    if i >= 1000:
        break
    #2 save data automatically to the HarperDB cloud instance
    data = {
        "user_name": tweet.user.username,
        "content": tweet.content,
        "lang": tweet.lang,
        "url": tweet.url,
        "source": tweet.source
    }
    # insert result into the HarperDB table
    result = record_tweets(data)
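
The search string and the 1,000-tweet cap can also be expressed as two small helpers, which makes it easier to change the keyword or the date range later. This is only a sketch: build_search_query and take are hypothetical names, and a plain generator stands in for the scraper's get_items() so the example runs offline.

```python
from datetime import date
from itertools import islice


def build_search_query(keyword, since, until):
    """Compose a Twitter search string with since:/until: date operators."""
    return f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"


def take(items, limit):
    """Return at most `limit` items from any iterable, such as get_items()."""
    return list(islice(items, limit))


query = build_search_query("cryptocurrency", date(2022, 1, 1), date(2022, 8, 13))
print(query)

# A generator stands in for the scraper's get_items() so this runs offline.
fake_items = (f"tweet-{n}" for n in range(5000))
first_thousand = take(fake_items, 1000)
print(len(first_thousand))
```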

As you can see from the code block above (comment #2), each scraped tweet is stored automatically in the tweets table with the following attributes:

  • user_name
  • content
  • lang
  • url
  • source

Step 9: View the Tweets Table

If you open your HarperDB cloud instance, you will be able to see all records of your scraped data from Twitter.

Congratulations 🎉 You have successfully completed all required steps to automate the process of scraping data and saving it in the database.

What if you wish to share the scraped information with your colleagues? Custom Functions in HarperDB provide a straightforward solution to this problem.

What is a Custom Function?

A Custom Function is a brand-new feature included in HarperDB’s 3.1+ release. You can use the feature to add your own API endpoints to HarperDB. Custom functions are powered by Fastify, which is incredibly flexible and makes it simple to interact with your data by using HarperDB core methods.

You will learn how to use the HarperDB studio to create your very own custom function in this section. You can then use an API call to share the outcomes of your scraped data with your coworkers at the office.

Here are the steps you need to follow:

1. Enable Custom Functions
The first step is to enable Custom Functions by clicking “functions” in your HarperDB Studio (they are not enabled by default).

2. Create a Project
The next step is to create a project by specifying a name, for example tweets-api-v1. This also creates the files and folders for the project, including:

  • Routes folder
  • File to add helper functions
  • Static folder

Note: For this article, you will focus on the routes folder.

3. Define a Route
In this step, you will create the first route to fetch some data from the tweets table from the HarperDB Datastore. You also need to know that Route URLs are resolved in the following manner:

[Instance URL]:[Custom Functions Port]/[Project Name]/[Route URL]

It will include:

  • Cloud Instance URL
  • Custom Functions Port
  • Project name you have created
  • The route you have defined
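
As a quick illustration of how those four parts fit together, here is a small sketch that joins them into one URL. The helper build_route_url is hypothetical, and the host name and port 9926 are placeholder values chosen for the example; use the values shown for your own instance.

```python
def build_route_url(instance_url, port, project, route="/"):
    """Join the four parts of a Custom Functions route URL."""
    base = instance_url.rstrip("/")
    # Drop empty segments so a route of "/" does not add a trailing slash.
    path = "/".join(
        part.strip("/") for part in (project, route) if part.strip("/"))
    return f"{base}:{port}/{path}"


# The host and port below are hypothetical placeholders.
url = build_route_url("https://my-instance.example.com", 9926, "tweets-api-v1", "/")
print(url)
```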

In the route file (example.js) from the function page, you will see some template code as an example. You need to replace that code with the following code:

'use strict';

module.exports = async (server, { hdbCore, logger }) => {
  server.route({
    url: '/',
    method: 'GET',
    handler: (request) => {
      // build the SQL operation that the handler passes to HarperDB
      request.body = {
        operation: 'sql',
        sql: 'SELECT user_name,content,lang,url,source FROM data_scraping.tweets ORDER BY __createdtime__'
      };
      // run the query without asking the caller for credentials
      return hdbCore.requestWithoutAuthentication(request);
    }
  });
};

In the code above, the route / is defined with the GET method inside the tweets-api-v1 project, so it resolves to /tweets-api-v1. The handler function sends an SQL query to the database to get user_name, content, lang, url, and source from the tweets table, ordered by the __createdtime__ column.

4. Access data via API Endpoint
Finally, you can now use the route you have defined to get the data from the tweets table. Here you will send an API request by using the requests Python package.

# send an API request
import requests

# api-endpoint
URL = "https://functions-1-mlproject.harperdbcloud.com/tweets-api-v1"

# send a GET request and save the response as a response object
r = requests.get(url=URL)

# extract the data in JSON format
data = r.json()
for tweet in data:
    print(tweet)

Here is the sample output from the above code.

{"user_name": "DailyCryptoTrad","content": "DXY forming a bullish bull flag on the daily - a break out of 106.6 will give crypto red days however if we fail below 105 will give crypto green days - Keep an eye on DXY #DXY #SPY #crypto #btc #eth #bitcoin #crytocurrency #cryptocurrencies https://t.co/AkF8Igf3Uc","lang": "en","url": "https://twitter.com/DailyCryptoTrad/status/1558211511461597188","source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"},{"user_name": "Ariscrypto1970","content": "@scrypto_1977 @Epayme_uae #Saitama will go parabolic when it happens! This is the #WeAreSaitama and the world are waiting for. 🔥🔥🔥🚀🚀🚀🚀#crytocurrency #DeFi","lang": "en","url": "https://twitter.com/Ariscrypto1970/status/1558200674273345537","source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},{"user_name": "dan_nyeche","content": "Cryptocurrency market up 24Hrs. #Bitcoin #Dan_Trades #crytocurrency Emilokan Big Brother Modella FireBoy Giddyfia  GTBank President Obama #gayfish Gen Z Ethereum Chi Chi Obidatti2023 Sapa Lewandoski #HAPPYJAEMINDAY #Jalsa4K #GomoraMzanzi #SheggzOlu𓃵 #ViratKohli𓃵 https://t.co/QbU4ei3MGA","lang": "in","url": "https://twitter.com/dan_nyeche/status/1558188248362467329","source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"},
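
Once the records are back on your side, a first bit of analysis can be done with the standard library alone. The sketch below counts tweets per language. The three records are shortened, made-up stand-ins shaped like the response above, and count_languages is a hypothetical helper name.

```python
from collections import Counter

# Made-up records shaped like the API response (content shortened).
records = [
    {"user_name": "user_a", "content": "DXY forming a bull flag", "lang": "en"},
    {"user_name": "user_b", "content": "Will go parabolic", "lang": "en"},
    {"user_name": "user_c", "content": "Cryptocurrency market up", "lang": "in"},
]


def count_languages(rows):
    """Count how many records use each value of the `lang` field."""
    return Counter(row.get("lang", "unknown") for row in rows)


language_counts = count_languages(records)
for lang, total in language_counts.most_common():
    print(lang, total)
```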

Note: With HarperDB, you can quickly and easily build API endpoints to share the scraped data with your team working on the same data science project.

Conclusion

Congratulations 🎉, you have made it to the end of this article. You have learned:

  • How to execute web scraping on Twitter using the snscrape Python library.
  • How to store scraped data automatically in the database using HarperDB cloud instance.
  • How to create a custom function from the HarperDB cloud instance to share your scraped data with your coworkers working on the project via an API endpoint.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.