Issue
I have made a Scrapy web crawler which can scrape Amazon. It can scrape by searching for items using a list of keywords and scrape the data from the resulting pages.
However, I would like to scrape Amazon for a large portion of its product data. I don't have a preferred list of keywords with which to query for items. Rather, I'd like to scrape the website evenly and collect X items that are representative of all products listed on Amazon.
Does anyone know how to scrape a website in this fashion? Thanks.
Solution
I'm posting my comment as an answer so that others looking for a similar solution can find it more easily.
One way to achieve this is to go through each category (furniture, clothes, technology, automotive, etc.) and collect a set number of items from each. Amazon has side/top bars with navigation links to the different categories, so you can let the crawler run through those.
The process would be as follows:
- Follow the category URLs from the initial parse of Amazon.com
- Use a different parse function as the callback, one that scrapes a set number of items from that category
- Ensure that the data is written to a file (it will probably be a lot of data)
However, such an approach would not reflect each category's share of Amazon's total product count. To compensate, try looking for an "X number of results" label in each category and scale the number of items you collect from that category accordingly. Good luck with your project!
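To make the compensation concrete, here is a small sketch of how the per-category result counts could be turned into per-category quotas. The function name `allocate_quotas` and the counts in the example are made up for illustration; the rounding uses the largest-remainder method, which is my choice here and just one reasonable way to make the quotas add up to exactly X.

```python
def allocate_quotas(result_counts, total_items):
    """Split total_items across categories in proportion to result_counts."""
    grand_total = sum(result_counts.values())
    if grand_total == 0:
        return {name: 0 for name in result_counts}

    # Integer share for each category, plus the remainder lost to rounding down.
    quotas = {}
    remainders = {}
    for name, count in result_counts.items():
        quotas[name], remainders[name] = divmod(total_items * count, grand_total)

    # Hand the leftover items to the categories with the largest remainders.
    leftover = total_items - sum(quotas.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical result counts read from each category's results label.
counts = {"books": 6000, "clothing": 3000, "automotive": 1000}
print(allocate_quotas(counts, 50))
# {'books': 30, 'clothing': 15, 'automotive': 5}
```

Each quota could then replace the fixed per-category cap in the spider, so that larger categories contribute proportionally more items.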
Answered By - harada