Wednesday, November 23, 2022

[FIXED] How to solve extracting data with scrapy because from contacts doesn't do anything?

Tags: python, scrapy

Issue

    import scrapy
    import pycountry
    from locations.items import GeojsonPointItem
    from locations.categories import Code
    from typing import List, Dict

    import uuid

Creating the metadata:

    # class
    class TridentSpider(scrapy.Spider):
        name: str = 'trident_dac'
        spider_type: str = 'chain'
        spider_categories: List[str] = [Code.MANUFACTURING]
        spider_countries: List[str] = [pycountry.countries.lookup('in').alpha_3]
        item_attributes: Dict[str, str] = {'brand': 'Trident Group'}
        allowed_domains: List[str] = ['tridentindia.com']

        # start script
        def start_requests(self):
            url: str = "https://www.tridentindia.com/contact"

            yield scrapy.Request(
                url=url,
                callback=self.parse_contacts
            )

Parsing data from the website using XPath:

        def parse_contacts(self, response):

            email: List[str] = [
                response.xpath(
                    "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[2]/text()").get()
            ]

            phone: List[str] = [
                response.xpath(
                    "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[1]/text()").get(),
            ]

            address: List[str] = [
                response.xpath(
                    "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[1]/div/div[2]/div/ul/li[1]/address/text()").get(),
            ]

            dataUrl: str = 'https://www.tridentindia.com/contact'

            yield scrapy.Request(
                dataUrl,
                callback=self.parse,
                cb_kwargs=dict(email=email, phone=phone, address=address)
            )

Parsing data from above:

        def parse(self, response, email: List[str], phone: List[str], address: List[str]):
            '''
            @url https://www.tridentindia.com/contact
            @returns items 1 6
            @cb_kwargs {"email": ["[email protected]"], "phone": ["0161-5038888 / 5039999"], "address": ["E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India"]}
            @scrapes ref addr_full website
            '''
            responseData = response.json()

            # response from data
            for row in responseData['data']:
                data = {
                    "ref": uuid.uuid4().hex,
                    'addr_full': address,
                    'website': 'https://www.tridentindia.com',
                    'email': email,
                    'phone': phone,
                }

                yield GeojsonPointItem(**data)

I want to extract the address (location), phone number, and email of the 6 offices from the HTML, because I couldn't find a JSON source with the data. After the extraction, I want to save the result as JSON so that I can load it on a map and check whether the extracted addresses match their real locations. I am using Scrapy because I want to learn it; I am new to web scraping with Scrapy.

Solution

There are 6 offices, and none of them lists an email address, so there is no reason to include an email field. The way you extract the data is also not correct: every absolute XPath goes through `li[1]`, so it can only ever match the first office, and the second request to the same URL is dropped by Scrapy's default duplicate filter, so `parse` is never called. You can try the following example instead.

Code:

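import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        url = 'https://www.tridentindia.com/contact'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Each <li> under the correspondence block is one office card.
        for card in response.xpath('//*[@class="cp-correspondence typ-need-asst"]/ul/li'):
            # Join all text nodes of the second span; the number is cleaned below.
            phone_text = ''.join(card.xpath('.//*[@class="address"]/span[2]//text()').getall())
            yield {
                # Keep what follows the last ':' and drop soft hyphens (U+00AD).
                'phone': phone_text.split(':')[-1].replace('\xad', '').strip(),
                # The leading '.' keeps each XPath relative to the current card.
                'address': card.xpath('.//*[@class="address"]/span[1]/text()').get(),
                'url': response.url,
            }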
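
The chained string operations used for the phone field can be easier to follow when pulled out into a small helper. The sketch below is only an illustration: `clean_phone` and `sample_nodes` are hypothetical names, and the sample text nodes are an assumption about what `.getall()` might return for one card, not values captured from the live page.

```python
from typing import Iterable

SOFT_HYPHEN = "\xad"  # invisible hyphenation hint (U+00AD)


def clean_phone(text_nodes: Iterable[str]) -> str:
    """Join text nodes, keep what follows the last ':', drop soft hyphens."""
    joined = "".join(text_nodes)
    # If there is no ':' at all, split() returns the whole string unchanged.
    after_label = joined.split(":")[-1]
    return after_label.replace(SOFT_HYPHEN, "").strip()


# Hypothetical text nodes for a single office card (assumed shape).
sample_nodes = ["Phone", ": ", "0124\xad - 2350399 "]
print(clean_phone(sample_nodes))
```

One caveat of this approach: because only the part after the last `:` is kept, a number that itself contains a colon would be truncated.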

Output in JSON format:

[
    {
        "phone": "+91 - 161 - 5039999",
        "address": "E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "1800 180 2999",
        "address": "Trident Group, Sanghera – 148101, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0124 - 2350399",
        "address": "25, A, 15 Shahtoot Marg, DLF Phase-1, Sector 26A, Gurugram, Haryana-122002",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0172 - 4602593 / 2742612",
        "address": "SCO 20 - 21, Sector 9D, Madhya Marg, Chandigarh - 160009",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0755 - 2660479",
        "address": "Trident Limited, H.NO. - 3, Nadir Colony, Shyamla Hills, Bhopal - 462013",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "01679 - 244700 - 703 - 707",
        "address": "Trident Limited, Sanghera Complex, Raikot Road, Barnala - 148101, Punjab",
        "url": "https://www.tridentindia.com/contact"
    }
]
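
If the records should use the field names from the question (`ref`, `addr_full`, `website`, `phone`), they can be renamed after scraping. This is a minimal sketch under that assumption: `to_item_fields` is a hypothetical helper, the two records are copied from the output above, and the email field is left out because the offices list none.

```python
import json
import uuid
from typing import Dict, List

WEBSITE = "https://www.tridentindia.com"  # same value as in the question


def to_item_fields(record: Dict[str, str]) -> Dict[str, str]:
    """Rename scraped keys to the field names used in the question."""
    return {
        "ref": uuid.uuid4().hex,  # random identifier, as in the question
        "addr_full": record["address"],
        "phone": record["phone"],
        "website": WEBSITE,
    }


# Two records copied from the output above.
records: List[Dict[str, str]] = [
    {
        "phone": "+91 - 161 - 5039999",
        "address": "E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India",
        "url": "https://www.tridentindia.com/contact",
    },
    {
        "phone": "0124 - 2350399",
        "address": "25, A, 15 Shahtoot Marg, DLF Phase-1, Sector 26A, Gurugram, Haryana-122002",
        "url": "https://www.tridentindia.com/contact",
    },
]

items = [to_item_fields(record) for record in records]
print(json.dumps(items, indent=4, ensure_ascii=False))
```

Inside a spider, the same dictionary could presumably be passed on as `GeojsonPointItem(**fields)`, as the question does, provided the item class accepts these fields.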

Answered By - Fazlul

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
