Scrapy is a web crawling framework which divides the whole crawling process into small steps so that the crawl is well organized!
Crawl Data (spider.py) -> Rotate proxy or IP (middlewares.py) ->
Clean Data (items.py) -> Store Data (pipelines.py)
With all the settings (settings.py).
The biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy uses non-blocking (aka asynchronous) code for concurrency, which makes spider performance very good. - Michael Yin
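To make that flow a bit more concrete, here is a minimal sketch of the middleware and pipeline stages. The class names, the `PROXIES` list and the example proxy hosts are all made up for illustration, and the in-memory list is only a stand-in for real storage. For brevity the cleaning step lives in the pipeline here. Both classes are plain Python objects using Scrapy's long-standing method names (newer releases may differ slightly), so the snippet needs no Scrapy import to read or try out.

```python
import random

# Hypothetical proxy list; replace with proxies you actually have access to.
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]


class RotateProxyMiddleware:
    """Downloader middleware (middlewares.py): pick a proxy for each request."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # None tells Scrapy to carry on handling this request


class CleanAndStorePipeline:
    """Item pipeline (pipelines.py): tidy each scraped item, then store it."""

    def __init__(self):
        self.rows = []  # stand-in for a file or database (assumption)

    def process_item(self, item, spider):
        # Strip stray whitespace from string fields; leave other values alone.
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}
        self.rows.append(cleaned)
        return cleaned  # returning the item passes it on to the next pipeline
```

In a real project these would be switched on in settings.py through the `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES` settings.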
Selenium is a free automated testing suite for web applications across different browsers and platforms. Although it was created for automated testing of web apps, it is really easy to use for scraping websites! You just need to
Issues I faced using Selenium:
Issues I faced using Scrapy:
Sharing about my experiences:
At first, I learned Selenium because it is much easier to learn and debug, and I needed to render JavaScript websites. When I first used Selenium, it satisfied all my needs, crawling all the web pages in the required time frame. Then I sped it up with multi-threading, and everything went really smoothly.
Yeah Really Smooth!
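The multi-threading speed-up mentioned above can be sketched roughly like this. It is not my exact code: the function names `scrape_all` and `fetch_title_with_selenium` are hypothetical, `max_workers=4` is an arbitrary choice, and it assumes Selenium plus a working Chrome driver are installed. Each call opens its own browser because a WebDriver instance should not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_title_with_selenium(url):
    """Open one page in its own Chrome instance and return the page title."""
    # Imported here so the thread-pool helper below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always close the browser, even if the page fails


def scrape_all(urls, fetch_page, max_workers=4):
    """Run fetch_page over urls in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))
```

Usage would look like `scrape_all(my_urls, fetch_title_with_selenium)`. Starting a browser per page is simple but slow, so reusing one driver per thread is a sensible next step.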
But one day, one particular website blocked me by implementing a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). I was really stuck, but I had to figure out a way around it. So, after trying every way I could think of to solve the CAPTCHA, I thought: why not try another framework and see whether it can get past the CAPTCHA?
Banging my head and hoping something magical comes to mind :(
At last I found the Scrapy framework, which not only solved my CAPTCHA problem but also got me started on learning a really powerful crawling framework! The learning curve for Scrapy is much steeper than Selenium's, but it is definitely worth it, based on the five points below:
Scrapy-Splash is definitely worth trying out for rendering JavaScript-heavy websites, but there are far fewer learning resources for Scrapy-Splash than for Scrapy.
Here are some resources I found useful for learning Scrapy-Splash.
Here are some really useful resources to learn Scrapy.
Here are some really useful resources to learn Selenium.
All resources are based on Python. Happy Learning!
If you are interested in more tutorials on Scrapy-Splash, Scrapy or Selenium, feel free to comment below!
Feel free to reach out to me too :)