paint-brush
How to Scrape Data Off Wikipedia: Three Ways (No Code and Code)by@horosin
118 reads

How to Scrape Data Off Wikipedia: Three Ways (No Code and Code)

by Karol HorosinJuly 31st, 2024
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

For a side project, I turned to Wikipedia tables as a data source. Despite their inconsistencies, they proved quite useful. I explored three methods for extracting this data: - Google Sheets: Easily scrape tables using the =importHTML function. - Pandas and Python: Use pd.read_html to load tables into dataframes. - Beautiful Soup and Python: Handle more complex scraping, such as extracting data from both tables and their preceding headings. These methods simplify data extraction, though some cleanup is needed due to inconsistencies in the tables. Overall, leveraging Wikipedia as a free and accessible resource made data collection surprisingly easy. With a little effort to clean and organize the data, it's possible to gain valuable insights for any project.
featured image - How to Scrape Data Off Wikipedia: Three Ways (No Code and Code)
Karol Horosin HackerNoon profile picture

I needed some data for my side project. Google helped me find only outdated CSVs, expensive closed sources, and APIs hidden behind the “contact sales” button.


Wikipedia was ranking high in my searches, but I quickly found that the tables are as unstructured as the ones you typically create in Word or Google Docs. There was no consistency in columns and naming conventions. This at first discouraged me but after failing to find another data source, I gave it a go. And I ended up using it for my project.


Let me show you how to quickly load Wikipedia data for your data analysis. I was astonished at how easy it was.


Big thanks to all Wikipedia contributors doing the hard work.

Method 1: Loading Tables in Google Sheets

This one feels like magic.


Use this formula in one cell, and it will expand the entire table scraped from Wikipedia.


=importHTML(WIKIPEDIA_URL, "table", NUMBER_OF_TABLE_ON_THE_PAGE);


Let’s say you want to fetch a list of all EU countries with their population count, GDP, language, and other essential data.


There’s a Wikipedia page containing all of this information and citing sources of the information(!): https://en.wikipedia.org/wiki/Member_state_of_the_European_Union.


The table containing this info is the second table on the page (the first one is the little summary table). The formula will look like:


=importHTML("https://en.wikipedia.org/wiki/Member_state_of_the_European_Union", "table", 2);


After running it, you’ll get all of this data in your Google Sheets, ready to analyze or export.


Google Sheets can also read lists! Full docs here.

Method 2: Pandas and Python

I was surprised to find out that the most popular library for data analysis for Python has a feature equivalent to Google Docs.


import pandas as pd

url = "https://en.wikipedia.org/wiki/Member_state_of_the_European_Union"

dfs = pd.read_html(url)

df = dfs[1]


read_html function from pandas takes the site URL as an input and fetches all tables to a list of data frames. Again, the second one is what we are looking for. And it’s ready for further analysis!


Since Google Docs was not something I intended to stick with, I used this method in addition to method 3 to gather data from several pages.


Full documentation here.

Method 3: Python and Beautiful Soup

Well, Beautiful Soup is a go-to library for parsing HTML in the Python ecosystem. I looked for easy solutions, but I needed to fall back on them in the end. The reason was I needed to get not only data from tables but also from headings that preceded them.


Instead of writing the code from scratch, I recommend using ChatGPT and refining its output. I basically pasted a URL, said what I needed to scrape, and got the script without giving any guidance.


If you wanted to do something similar to me - getting all tables from a page and data from multiple levels of headings preceding them - you would do something like this.


import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable'})

all_airports = []

for table in tables:
  continent = table.find_previous('h2').find('span', class_='mw-headline').text
  region = table.find_previous('h3').find('span', class_='mw-headline').text
  country = table.find_previous('h4').find('span', class_='mw-headline').text
  
  rows = table.find_all('tr')
        
  for row in rows[1:]:
      cols = row.find_all('td')
      
      # here push the data wherever you need


It looks pretty neat, doesn’t it? I was worried that scraping “by hand” would be a pain but actually, Wikipedia usually has a pretty clear, predictable HTML structure. It varies from page to page but adheres to the same patterns.


This method in addition to using pandas helped me get all the data I needed.

Summary

As you can see, getting data for your analyses or side projects can be nice and easy.


There is a catch though - there are no data consistency mechanisms in tables inside Wikipedia pages. This means that after using any of the above methods, you’ll likely need to do some cleaning and check for mismatches in naming conventions. Some regular expressions will come in handy as well.


At times, I am cynical about technological progress but discovering that information access can be so straightforward makes me feel that I can really stand on the shoulders of giants when building on my own.


Subscribe to my profile by filling in your email address on the left, and be up-to-date with my articles! I will soon be releasing a crazy interesting project that I needed data from Wikipedia!

Don't forget to follow me on Twitter @ horosin, and subscribe to my blog’s newsletterfor more tips and insights!

If you don't have Twitter, you can also follow me on LinkedIn.