How to Scrape Data with Python and BeautifulSoup
Zenva
ACCESS the FULL COURSE here: https://academy.zenva.com/product/data-science-mini-degree/?zva_src=youtube-datascience-md
TRANSCRIPT
Hello, everyone, and welcome to our course on an intro to web scraping. My name is Nimesh, and I'll be guiding you through the next hour to an hour and a half, in which we'll learn how to retrieve data and information from various web pages. So what is web scraping, for a start? Web scraping is simply the act of getting data from an HTML page. Now this HTML page could be loaded through a URL, in which case we have to establish a data connection, download the page and then read it, or we could actually pre-download the page as an HTML file and read from that. So depending on whether or not we're going to have a persistent internet connection, we can choose one of the two methods, but the general idea is the same. Basically, we are just reading the HTML code from that web page and extracting data.
So the first step typically is fetching the web page or choosing a file. We do so, again, through a URL or by downloading the page directly. Then we're going to read through that page and select the data that we want. Now as we're kind of bulk-downloading the entire HTML, usually we don't want most of the page. There may be a specific set of data that we want. Often this is in the form of a table. Then once we find the data that we want and know how the page is structured, we can begin parsing that data. So that typically involves reading through the web page, going into various HTML elements until we find the stuff that we want. We then have to extract it, and often we store this data in variables or perhaps we write it to a CSV. In fact, in this course, we're gonna show you how to get that data and then store it in a CSV sheet.
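The steps just described (get the HTML, pick out a table, parse its rows, store them as CSV) can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the sample HTML, the table id `cities`, the place names, and the helper names `extract_table_rows` and `rows_to_csv` are all made up for the example, and it assumes the `beautifulsoup4` package is installed. An inline string stands in for the downloaded page so the sketch runs without an internet connection.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page. A real page would come
# from a URL or from an .html file saved ahead of time.
SAMPLE_HTML = """
<html><body>
  <h1>Sample page</h1>
  <p>Most of a page is usually not the data we want.</p>
  <table id="cities">
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Springfield</td><td>30000</td></tr>
    <tr><td>Shelbyville</td><td>25000</td></tr>
  </table>
</body></html>
"""


def extract_table_rows(html, table_id):
    """Return the rows of the table with the given id as lists of cell text."""
    # "html.parser" is Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are both collected.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text (in memory here, instead of a file)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_table_rows(SAMPLE_HTML, "cities")
csv_text = rows_to_csv(rows)
print(csv_text)
```

To work from a live page instead, the inline string could be replaced with the bytes returned by `urllib.request.urlopen(url).read()`, or with the contents of a saved HTML file; the parsing and CSV steps would stay the same.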
So what topics are we going to be covering and in which order? The first order of business will be downloading BeautifulSoup. Now BeautifulSoup, interestingly named, is a Python library that will allow us to scrape web pages with relative ease. It basically takes the HTML code, formats it nicely, and then allows us to dive into each individual element. After this, we'll show you how to inspect a web page in a browser, if you don't already know that. This will allow us to determine the structure of the page and find the data we want. After this, we'll be scraping the actual data from the page. I think we'll start by reading from a Wikipedia page. We'll begin with some nicely formatted table data. Then we can parse the data, storing the necessary fields in variables. We'll learn how to write that data to a CSV, or comma-separated values, sheet. This is kind of like an Excel spreadsheet. Then we'll take a look at how to sanitize input. Sanitizing input is a very important step, as particularly when we're writ ... https://www.youtube.com/watch?v=ArVT3DF_TLg