Issue
I'm new to Python and trying to build a web scraper with Scrapy, and I am getting a lot of non-printing characters and blank spaces in the results. I'm attempting to iterate through a dictionary whose values are lists with a for loop, then run the .strip() method to get rid of all the non-printing characters. Only now I get this error instead: "TypeError: list indices must be integers or slices, not str". I know I must be reaching into the object wrong, but after a few days of sifting through docs and similar exceptions I haven't found a way to resolve it yet.
The code I'm using is:
# -*- coding: utf-8 -*-
import scrapy
from ..items import JobcollectorItem
from ..AutoCrawler import searchIndeed

class IndeedSpider(scrapy.Spider):
    name = 'indeed'
    page_number = 2
    start_urls = [searchIndeed.current_page_url]

    def parse(self, response):
        items = JobcollectorItem()
        position = response.css('.jobtitle::text').extract()
        company = response.css('span.company::text').extract()
        location = response.css('.location::text').extract()
        # print(position[0])
        items['position'] = position
        items['company'] = company
        items['location'] = location
        for key in items.keys():
            prestripped = items[key]
            for object in prestripped:
                object = object.strip('\n')
            items[key] = prestripped
        yield items
I'm using Python 3.7.4. Any tips on simplifying the function to get rid of the nested for loops would also be appreciated. The code for the entire project can be found here.
Thanks for the help!
Edit0: The exception is thrown at line 27 reading: " prestripped = items[key][value] TypeError: list indices must be integers or slices, not str"
Edit1: The data structure is items{'key':[list_of_strings]}, where the dictionary name is items, the keys are strings, and each key's value is a list, with each list element being a string.
Edit2: Updated the code to reflect Alex.Kh's answer. Also, here is an approximation of what is currently getting returned: {company: ['\nCompany Name', '\n', '\nCompany Name', '\n', '\n', '\n',], location: ['Some City, US', 'Some City, US'], position: [' ', '\n', '\nPosition Name', ' ', ' Position Name']}
Solution
In addition to my comment, I think I know how to simplify and fix your code as well.
...
for key in items.keys():
    restripped = items[key]
    # BEWARE: a novice mistake here as object is just a copy
    for object in restripped:  # assuming extract() returns a list
        object = object.strip()  # will change a temporary copy
    items[key] = restripped
...
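A quick way to check what the loop above actually does is to run the same pattern on a plain list. This is only a sketch: the sample strings are made up, and the variable names are placeholders rather than anything from the spider.

```python
# Plain list standing in for one of the item's values (sample data is made up).
values = ['\nCompany Name', '\n', ' Position Name ']

for text in values:
    # Rebinds the local name `text` to a new string; `values` itself is untouched.
    text = text.strip()

# The list still holds the original, unstripped strings.
print(values)
```

Strings are immutable, so strip() always returns a new string, and assigning it to the loop variable never writes it back into the list.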
I am not sure why exactly you need a value in your loop, so you could also just say for key in items.keys(): instead. Your main mistake was probably accessing the dictionary incorrectly (items[key][value] -> items[key]), as value is the actual value that corresponds to that key, not an index into the list.
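The error from the question can be reproduced in isolation. The sketch below assumes a plain dict in place of the Scrapy item and uses a made-up string as the second index, purely for illustration.

```python
# Plain dict standing in for the Scrapy item (sample data is made up).
items = {'company': ['\nCompany Name', '\n']}

key = 'company'
value = 'anything'  # hypothetical string used as the second index

try:
    items[key][value]  # items[key] is a list, so a str index is invalid
except TypeError as exc:
    message = str(exc)
    print(message)

# Lists accept integer (or slice) indices only.
first = items[key][0]
```

Indexing with the key alone already returns the whole list, which is why items[key] is enough.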
Edit: I spotted a huge mistake on my part in the for loop. Because the loop variable is only a temporary name for each element, the statement object=object.strip() rebinds that name and will not affect the actual list. Guess not using Python for a while does make you forget certain features. I will leave the incorrect solution as a reminder both to me and others. The correct way to use the strip() method is as follows:
...
# correct solution
for key in items.keys():
    restripped = items[key]
    for i, object in enumerate(restripped):
        # alternatively: restripped[i] = restripped[i].strip()
        restripped[i] = object.strip()
    items[key] = restripped
...
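Since the question also asks about getting rid of the nested loops, here is one possible simplification using a list comprehension. It is a sketch under assumptions: a plain dict stands in for the Scrapy item, clean_values is a hypothetical helper name, and dropping the strings that end up empty is an extra step not covered by the answer above.

```python
def clean_values(values):
    """Strip whitespace from each string and drop the ones left empty."""
    stripped = (text.strip() for text in values)
    return [text for text in stripped if text]


# Plain dict standing in for the Scrapy item, using the approximate
# output shown in the question as sample data.
items = {
    'company': ['\nCompany Name', '\n', '\nCompany Name', '\n'],
    'location': ['Some City, US', 'Some City, US'],
    'position': [' ', '\n', '\nPosition Name', ' ', ' Position Name'],
}

# Build a new cleaned list per key instead of mutating each list in place.
for key in list(items.keys()):
    items[key] = clean_values(items[key])

print(items)
```

If the empty entries need to stay (for example, to keep the lists aligned by position), the `if text` filter can simply be removed.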
Answered By - Alex.Kh