Note to the reader: Python code is shared at the end
This week I had to scrape a website for a client. It went so naturally and quickly that I thought it would be useful to share my approach so you can master this art too. [Disclaimer: this article shows my own scraping practices; if you have better ones, please share them in the comments.]
- Pinpoint your target: a simple html website
- Design your scraping scheme in Python
- Run & let the magic operate
Part I: Finding your target (a website)
In my case, I needed to look up bank names from SWIFT codes (or French BIC codes). The website http://bank-code.net/country/FRANCE-%28FR%29.html has a list of 4000+ SWIFT codes with the associated bank names. The problem is that it shows only 15 results per page. Going through all the pages and copying and pasting 15 results at a time was NOT an option. Scraping came in handy for this task.
First, use Chrome's “Inspect” option to identify the part of the html you need. Move your mouse over the different items in the inspection window (on the right) and watch which part of the website gets highlighted (on the left). Once you've selected the item in the inspection window, use “Copy / Copy element” and paste the html code into your Python coding tool.
In my case, the desired item with 15 SWIFT codes is a “table”.
Part II: Design your scraping scheme in Python
a) Scrape a first page
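Here is a minimal sketch of the download step using only Python's standard library. It is a slightly longer stand-in for the three lines mentioned below, not the article's own code (that is linked at the end). The function name `fetch_html`, the timeout and the User-Agent header are assumptions made for illustration.

```python
import urllib.request

# URL taken from the article; the rest of this block is an illustrative assumption.
URL = "http://bank-code.net/country/FRANCE-%28FR%29.html"


def fetch_html(url, timeout=30):
    """Download a page and return its HTML as text."""
    # Some sites reject the default Python user agent, so send a browser-like one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html(URL)` returns the whole page as one string, ready to be parsed.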
And that’s it, 3 lines of code and Python has received the webpage. Now you need to parse the html properly and retrieve the desired item.
Remember the desired html:
It is a “table” element with id “tableID”. The fact that it has an id attribute is great, because ids are supposed to be unique within a page: no other html element on this webpage should have this id. This means that if I look for this id in the html, I will find nothing other than the desired element. It saves time.
Let’s do that properly in Python
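One way to sketch the lookup by id with the standard library's `html.parser` is shown here. The class `ElementById`, the helper `find_by_id` and the `SAMPLE` markup (including the bank name and code) are invented for illustration; only the id “tableID” comes from the article.

```python
from html.parser import HTMLParser

# Invented markup standing in for the real page.
SAMPLE = """
<p>Other content</p>
<table id="tableID">
  <tr><td>1</td><td>EXAMPLE BANK</td><td>EXAMPFRPP</td></tr>
</table>
<p>Footer</p>
"""


class ElementById(HTMLParser):
    """Collect the text found inside the first element carrying a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.tag = None    # tag name of the matched element, e.g. "table"
        self.depth = 0     # greater than 0 while inside the matched element
        self.texts = []    # non-empty text chunks found inside it

    def handle_starttag(self, tag, attrs):
        if self.tag is None and dict(attrs).get("id") == self.element_id:
            self.tag = tag
            self.depth = 1
        elif self.depth and tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def find_by_id(html, element_id):
    """Return (tag name, list of text chunks) for the element with this id."""
    parser = ElementById(element_id)
    parser.feed(html)
    parser.close()
    return parser.tag, parser.texts
```

With the sample above, `find_by_id(SAMPLE, "tableID")` returns `("table", ["1", "EXAMPLE BANK", "EXAMPFRPP"])`, and the text outside the table is ignored.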
So now we have got the desired html element. But we still need to get the SWIFT codes inside the html, and then store them in Python. I chose to store them in a pandas.DataFrame object, but a simple list of lists works as well.
To do that, go back to the Chrome inspection window, analyse the structure of the html tree, and note how deep you have to go to reach the data. In my case, the required data was in the “tbody” element. Each bank and its SWIFT code were contained in a “tr” element, and each “tr” element had multiple “td” elements. The “td” elements contained the data I was looking for.
I do it in one line with the following:
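The “tr” / “td” walk described above can be sketched with the standard library as follows. This is not a one-liner like the original, and it returns a list of lists instead of a DataFrame. The names `TableRows` and `extract_rows` and the `SAMPLE` markup (headers, bank names, codes) are invented for illustration, and the sketch assumes the table contains no nested tables.

```python
from html.parser import HTMLParser

# Invented markup standing in for the real page.
SAMPLE = """
<table id="tableID">
  <thead><tr><th>ID</th><th>Bank</th><th>Swift Code</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>EXAMPLE BANK</td><td>EXAMPFRPP</td></tr>
    <tr><td>2</td><td>SAMPLE BANK</td><td>SAMPFRPP</td></tr>
  </tbody>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td> in the table with a given id, row by row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])          # start a new row
        elif self.in_table and tag == "td" and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "table":
            self.in_table = False         # assumes no nested tables

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def extract_rows(html, table_id="tableID"):
    """Return the table body as a list of lists of cell text."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    # Strip whitespace and drop rows without <td> cells (such as a <th> header row).
    return [[cell.strip() for cell in row] for row in parser.rows if row]
```

On the sample, `extract_rows(SAMPLE)` gives `[["1", "EXAMPLE BANK", "EXAMPFRPP"], ["2", "SAMPLE BANK", "SAMPFRPP"]]`. If you prefer a DataFrame, passing that list of lists to `pandas.DataFrame` converts it.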
b) Prepare automation
Now that we have scraped the first webpage, we need to think of how to scrape new webpages we haven’t seen yet. My way of doing that is replicating human behavior: storing results from one page, then going to the next. Let’s focus now on going to the next webpage.
At the bottom of the page, there is a menu that lets you go to a specific page of the SWIFT code table. Let’s inspect the “next page” button in the inspector window.
This gives the following html element:
Now getting the url in Python is simple:
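As a rough sketch of this step, assuming the “next page” button is an “a” element whose visible text is “Next” (the real markup may differ), the link can be pulled out and turned into an absolute url like this. `NextLink`, `next_page_url`, the sample markup and the example.com address are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Invented pagination markup; the real button may look different.
SAMPLE = '<ul><li><a href="page-1.html">Prev</a></li><li><a href="page-3.html">Next</a></li></ul>'


class NextLink(HTMLParser):
    """Find the href of the first <a> whose text matches a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.current_href = None   # href of the <a> currently being read
        self.current_text = ""
        self.href = None           # href of the first matching link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = ""

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text += data

    def handle_endtag(self, tag):
        if tag == "a":
            matches = self.current_text.strip().lower() == self.label
            if self.href is None and self.current_href is not None and matches:
                self.href = self.current_href
            self.current_href = None


def next_page_url(html, page_url, label="next"):
    """Return the absolute url of the next page, or None if there is no such link."""
    parser = NextLink(label)
    parser.feed(html)
    parser.close()
    # urljoin resolves a relative href against the url of the current page.
    return urljoin(page_url, parser.href) if parser.href else None
```

For example, `next_page_url(SAMPLE, "http://example.com/list/page-2.html")` returns `"http://example.com/list/page-3.html"`, and a page without a matching link returns `None`, which is a convenient signal to stop.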
And we’re almost there.
So far we have:
- developed the scraping of the table of one page
- identified the url link of the next page
We only need to do a loop, and run the code. Two best practices I recommend following:
1. printing out when you land on a new webpage: to know which stage of the process your code has reached (scraping scripts can run for hours)
2. saving results regularly: to avoid losing all you scraped if there is an error
Since I don’t know in advance when to stop scraping, I loop with the idiomatic “while True:” syntax. I print out the counter value at each step, and I save the results to a csv file at each step as well. This actually wastes time; a better way would be to save the data every 10 or 20 steps, for instance. But I went for a quick implementation.
The code goes like this:
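The loop described above could look roughly like the sketch below, which is not the author's script (see the link that follows for that). The function `scrape_all` and its parameters are hypothetical: `fetch`, `parse_rows` and `find_next` stand for whatever functions you wrote for the download, table extraction and next-url steps. Stopping when `find_next` returns `None`, and the `max_pages` safety cap, are assumptions too.

```python
import csv


def scrape_all(start_url, fetch, parse_rows, find_next, out_path, max_pages=1000):
    """Scrape page after page, printing progress and saving to csv at each step."""
    rows = []
    url = start_url
    counter = 0
    while True:
        counter += 1
        print(f"Page {counter}: {url}")   # best practice 1: show progress
        html = fetch(url)
        rows.extend(parse_rows(html))
        # Best practice 2: rewrite the csv with everything collected so far,
        # so an error on a later page does not lose earlier results.
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        url = find_next(html, url)
        # Assumed stop condition: no next link, or the safety cap is reached.
        if url is None or counter >= max_pages:
            break
    return rows
```

Passing the three steps in as functions keeps the loop itself short and makes it easy to try out on a couple of saved pages before launching a run that may take hours.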
Full code (only 26 lines) can be found here: https://github.com/FelixChop/MediumArticles/blob/master/Scraping_SWIFT_codes_Bank_names.py
Originally posted: https://towardsdatascience.com/a-short-practical-how-to-guide-to-scrape-data-from-a-website-using-python-888373227d4f