Friday, December 29, 2023

[FIXED] scraping attribute under tag

Tags: css-selectors, scrapy, web-scraping

Issue

The response contains the following cells:

['<td class="V2ligneB" valign="top">\r\n                        LINAIA\r\n                    </td>',
 '<td class="V2ligneB" valign="top" title="[email protected]">\r\n                        PAILLEREAU  Florent \r\n        
            </td>',
 '<td class="V2ligneB" valign="top">\r\n                        35000 RENNES\r\n                    </td>',
 '<td class="V2ligneB" valign="top">\r\n                        \r\n                    </td>',
 '<td class="V2ligneB" valign="top" align="center">\r\n                        \n                    <a href="javascript:void(0)" onclick="window.open(\'index.cfm?fuseaction=mEnt.ficheEntAW&amp;uuid=2f89094e-4da1-4e1b-9ada-c16cea5e25f9&amp;affDoc=false\',\'ficheEntreprise\',\'scrollbars=yes,width=700,height=750\')">Fiche</a>\n                \r\n                    </td>']

I want to extract the value of the title attribute, "[email protected]".

I am using the following CSS selector:

email = response.css('td::attr(title)')[1].get()

However, this does not work. I get the error below and I don't understand why:

IndexError                                Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 all.css('td::attr(title)')[1].get().strip()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\parsel\selector.py:70, in SelectorList.__getitem__(self, pos)
     69 def __getitem__(self, pos):
---> 70     o = super(SelectorList, self).__getitem__(pos)
     71     return self.__class__(o) if isinstance(pos, slice) else o

IndexError: list index out of range

Solution

The structure of your HTML is unusual, but I recreated your problem and used Python with BeautifulSoup to get an answer. Each fragment is parsed on its own, and a try/except picks out the tag that has a 'title' attribute while reporting the ones that lack it:

from bs4 import BeautifulSoup

resp  = ['<td class="V2ligneB" valign="top">\r\n                        LINAIA\r\n                    </td>',
 '''<td class="V2ligneB" valign="top" title="[email protected]">\r\n                        PAILLEREAU  Florent \r\n        
            </td>''',
 '<td class="V2ligneB" valign="top">\r\n                        35000 RENNES\r\n                    </td>',
 '<td class="V2ligneB" valign="top">\r\n                        \r\n                    </td>',
 '<td class="V2ligneB" valign="top" align="center">\r\n                        \n                    <a href="javascript:void(0)" onclick="window.open(\'index.cfm?fuseaction=mEnt.ficheEntAW&amp;uuid=2f89094e-4da1-4e1b-9ada-c16cea5e25f9&amp;affDoc=false\',\'ficheEntreprise\',\'scrollbars=yes,width=700,height=750\')">Fiche</a>\n                \r\n                    </td>']

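for row, html in enumerate(resp):
    # Parse each <td> fragment separately
    soup = BeautifulSoup(html, 'html.parser')
    try:
        # Subscripting a tag raises KeyError when the attribute is missing
        email = soup.find('td')['title']
        print(email)
    except KeyError:
        print(f'Not found in row: {row}')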

Answered By - childnick

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
