Issue
I am using Selenium and Python for a big project. I have to go through 320,000 webpages (320K) one by one, scrape details, then sleep for a second and move on. Like below:
links = ["https://www.thissite.com/page=1","https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]
for link in links:
browser.get(link )
scrapedinfo = browser.find_elements_by_xpath("*//div/productprice").text
open("file.csv","a+").write(scrapedinfo)
time.sleep(1)
The biggest problem: it is too slow!
With this script it will take days or maybe weeks.
- Is there a way to increase the speed, for example by visiting multiple links at the same time and scraping them all at once?
I have spent hours looking for answers on Google and Stack Overflow and only found material about multiprocessing. But I am unable to apply it in my script.
Solution
Threading approach
- You should start with threading.Thread; it will give you a considerable performance boost, and threads are lighter-weight than processes. You can use a futures.ThreadPoolExecutor with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures

from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    driver.get(url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# the default number of threads is derived from the number of cpu cores,
# but you can set it with `max_workers`, e.g. `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    # store the url for each future in a dict, so we know which url failed
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # `timeout` can be used to wait a maximum number of seconds for each result
        except Exception as exc:  # a thread may raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))
- Consider also storing the chrome-driver instance initialized on each thread using threading.local(). This pattern has been reported to give a reasonable performance improvement; a minimal sketch of it is shown after this list.
- Consider whether using BeautifulSoup directly on the page source from Selenium can give a further speed-up. It is a very fast and established package. For example, something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a')
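Below is a minimal sketch of the threading.local() pattern, reusing the same selenium_work setup as above; get_driver is an illustrative helper name, not a library function:

import threading

from selenium import webdriver

# one storage slot per thread: each worker creates its own driver on
# first use and then reuses it for every subsequent url it processes
thread_local = threading.local()

def get_driver():
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(options=chromeOptions)
        thread_local.driver = driver
    return driver

def selenium_work(url):
    driver = get_driver()  # reuses this thread's driver instead of starting a new browser
    driver.get(url)
    # <actual work that needs to be done by selenium>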
Other approaches
Although I personally did not see much benefit from using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows, and on Windows there are many limitations for Python's Process.

Consider whether your use case can be satisfied by arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.
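As a rough idea, a sketch with arsenic could look like the following, based on its documented get_session API; the choice of Geckodriver/Firefox is just one example:

import asyncio

from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver

async def scrape(url):
    # each coroutine drives its own browser session
    async with get_session(Geckodriver(), Firefox()) as session:
        await session.get(url)
        return await session.get_page_source()

async def main(links):
    # run several scrapes concurrently on one event loop
    return await asyncio.gather(*(scrape(url) for url in links))

# results = asyncio.run(main(links))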
Consider whether Requests-HTML solves your problem with JavaScript loading, since it claims full JavaScript support. In that case you could use it together with BeautifulSoup in a standard data-scraping workflow.
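For instance, a sketch of that combination might look like this; the div lookup is a placeholder to adapt to the real markup:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.thissite.com/page=1")
r.html.render()  # executes the page's JavaScript (downloads Chromium on first run)
soup = BeautifulSoup(r.html.html, "lxml")
result = soup.find("div")  # placeholder lookup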
Answered By - iambr