Scrapy Python Tutorial : Yelp Business data | XPATH and 'response.meta.get' for multi level scraping
Dr Pi
Writing a spider to scrape Yelp business information.
Sometimes data collected in one callback needs to be passed to another callback function when using the Scrapy framework. In this video we need to get some data from the main listings page AND go to the details page to get the telephone number and website, so that Scrapy can output everything to a CSV using FEEDS as the export.
So, we use the meta property of the scrapy request to pass our data and get it with "response.meta.get" in the 'fetch_details' method.
I am using Python 3.8 with Scrapy 2.1 and Atom as my IDE/Editor.
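For the CSV export mentioned above, a FEEDS setting might look roughly like this. It is only a sketch: the file name "yelp.csv" and the "phone" and "website" field names are assumptions for illustration, not taken from the video.

```python
# settings.py (the same dict could also go in a spider's custom_settings)
# Assumption: output file name and the "phone"/"website" field names are hypothetical.
FEEDS = {
    "yelp.csv": {
        "format": "csv",
        # Column order for the CSV; each row holds listing + detail data together
        "fields": ["lcompanyname", "link", "logo_url", "phone", "website"],
    },
}
```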
Video Chapters
0:20 Introduction
1:49 start_urls
3:48 begin the scrapy spider
5:00 XPATH (or CSS) selectors
7:08 Scrapy Shell (and Forbidden by robots.txt)
11:34 selectors : tag and class
15:50 pages may not be processed in sync
17:50 example of out of sync output
18:54 'meta'
28:56 response.meta.get
30:34 successful test run
32:00 inspecting the test results
34:53 csv output, in sync
37:47 gratuitous flames
37:53 next_page can be used - just uncomment it (remove #)
38:16 don't forget to....
Many Scrapy web scraping tutorials, especially the "quotes" and "books to scrape" ones, will not show how to pass variables in this way:
⦿ yield Request(absolute_url, callback=self.fetch_detail, meta={'link': link, 'logo_url': logo_url, 'lcompanyname':lcompanyname})
This allows you to have your method called "parse", and send (using yield & 'meta') to the next function - eg "parse_details" or "fetch_details" - where you can then send all of the collected data to the same row in the output.
➤ https://docs.scrapy.org/en/latest/top...
Note! After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions.
*Meta is still widely used and you can find it in some intermediate/advanced tutorials and on Stack Overflow.
** Don't confuse this with " Extracting keywords from metatag using scrapy"
⦿ If you need to know how to phrase this concept then Scrapy documentation says :
"In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback."
I show 2 ways, and illustrate why it's best to use "meta" with yield Request when using callback and passing variables between methods.
This is useful when collecting data from a main page and a detail page.
start url = "https://www.yelp.com/search?find_desc..."
➤ Scrapy reference : https://docs.scrapy.org/en/latest/top...
➤ About Scrapy - https://towardsdatascience.com/a-mini...
➤ A Minimalist End-to-End Scrapy Tutorial (Part II) - https://towardsdatascience.com/a-mini...
➤ About XPATH and using the dot "." https://stackoverflow.com/qu ...
https://www.youtube.com/watch?v=IXFXeefCiVM