Issue
import scrapy
import pycountry
from locations.items import GeojsonPointItem
from locations.categories import Code
from typing import List, Dict
import uuid

# creating the metadata
# class
class TridentSpider(scrapy.Spider):
    name: str = 'trident_dac'
    spider_type: str = 'chain'
    spider_categories: List[str] = [Code.MANUFACTURING]
    spider_countries: List[str] = [pycountry.countries.lookup('in').alpha_3]
    item_attributes: Dict[str, str] = {'brand': 'Trident Group'}
    allowed_domains: List[str] = ['tridentindia.com']

    # start script
    def start_requests(self):
        url: str = "https://www.tridentindia.com/contact"
        yield scrapy.Request(
            url=url,
            callback=self.parse_contacts
        )
    # parse data from the website using xpath
    def parse_contacts(self, response):
        email: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[2]/text()").get()
        ]
        phone: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[2]/div/div[2]/div/ul/li[1]/a[1]/text()").get(),
        ]
        address: List[str] = [
            response.xpath(
                "//*[@id='gatsby-focus-wrapper']/main/div[2]/div[1]/div/div[2]/div/ul/li[1]/address/text()").get(),
        ]
        dataUrl: str = 'https://www.tridentindia.com/contact'
        yield scrapy.Request(
            dataUrl,
            callback=self.parse,
            cb_kwargs=dict(email=email, phone=phone, address=address)
        )
    # Parsing data from above
    def parse(self, response, email: List[str], phone: List[str], address: List[str]):
        '''
        @url https://www.tridentindia.com/contact
        @returns items 1 6
        @cb_kwargs {"email": ["[email protected]"], "phone": ["0161-5038888 / 5039999"], "address": ["E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India"]}
        @scrapes ref addr_full website
        '''
        responseData = response.json()

        # response from data
        for row in responseData['data']:
            data = {
                "ref": uuid.uuid4().hex,
                'addr_full': address,
                'website': 'https://www.tridentindia.com',
                'email': email,
                'phone': phone,
            }
            yield GeojsonPointItem(**data)
I want to extract the address (location), phone number, and email of the 6 offices from the HTML, because I couldn't find a JSON source with the data. At the end of the extraction I want to save the results as JSON so I can load them on a map and check whether the extracted addresses match their real locations. I am using Scrapy because I want to learn it; I am new to web scraping with Scrapy.
Solution
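There are 6 offices and none of them lists an email address, so it doesn't make sense to include an email item. The way you are extracting the data also isn't correct: every XPath is pinned to li[1], so it can only ever match the first office, and response.json() will fail because the contact page is HTML, not JSON. You can try the following example instead, which loops over each office card and uses XPaths relative to it.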
Code:
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        url = 'https://www.tridentindia.com/contact'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # one <li> per office card
        for card in response.xpath('//*[@class="cp-correspondence typ-need-asst"]/ul/li'):
            yield {
                # join the text nodes, keep the part after the last colon, drop soft hyphens
                'phone': ''.join(card.xpath('.//*[@class="address"]/span[2]//text()').getall()).split(':')[-1].replace('\xad', '').strip(),
                'address': card.xpath('.//*[@class="address"]/span[1]/text()').get(),
                'url': response.url
            }
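The phone expression in the spider packs several string operations into one line. A minimal sketch of the same chain as a standalone function, assuming the text nodes look roughly like a label followed by the number (the `clean_phone` name and the sample input are hypothetical, not taken from the page):

```python
def clean_phone(text_nodes):
    """Mimic the spider's inline phone-cleaning chain on a list of text nodes."""
    joined = "".join(text_nodes)        # merge the fragments returned by getall()
    value = joined.split(":")[-1]       # keep only what follows the last colon
    value = value.replace("\xad", "")   # remove soft hyphens (U+00AD)
    return value.strip()                # trim surrounding whitespace


# Hypothetical text nodes, shaped like what getall() might return
sample = ["Phone", ": ", "+91 - 161\xad - 5039999 "]
print(clean_phone(sample))
```

If the text contains no colon, `split(':')[-1]` returns the whole string, so a bare number passes through unchanged. The spider applies this same chain inline to each office card.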
Output in JSON format:
[
    {
        "phone": "+91 - 161 - 5039999",
        "address": "E-212, Kitchlu Nagar Ludhiana - 141001, Punjab, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "1800 180 2999",
        "address": "Trident Group, Sanghera – 148101, India",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0124 - 2350399",
        "address": "25, A, 15 Shahtoot Marg, DLF Phase-1, Sector 26A, Gurugram, Haryana-122002",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0172 - 4602593 / 2742612",
        "address": "SCO 20 - 21, Sector 9D, Madhya Marg, Chandigarh - 160009",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "0755 - 2660479",
        "address": "Trident Limited, H.NO. - 3, Nadir Colony, Shyamla Hills, Bhopal - 462013",
        "url": "https://www.tridentindia.com/contact"
    },
    {
        "phone": "01679 - 244700 - 703 - 707",
        "address": "Trident Limited, Sanghera Complex, Raikot Road, Barnala - 148101, Punjab",
        "url": "https://www.tridentindia.com/contact"
    }
]
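Since the goal is to save the items as JSON and check them later, a quick sanity check on the exported file may help before loading it anywhere. The sketch below assumes the items were exported with something like `scrapy runspider <your_spider_file> -o offices.json`; the `offices.json` filename and the `check_offices` helper are hypothetical names used only for illustration.

```python
import json


def check_offices(raw_json, expected_count=6):
    """Return a list of problems found in the exported items (empty means OK)."""
    items = json.loads(raw_json)
    problems = []
    if len(items) != expected_count:
        problems.append(f"expected {expected_count} offices, got {len(items)}")
    for i, item in enumerate(items):
        for field in ("phone", "address"):
            # treat a missing key, None, or a blank string as a missing value
            if not (item.get(field) or "").strip():
                problems.append(f"item {i}: missing {field}")
    return problems


# Two-item sample in the same shape as the spider output; the second is deliberately empty
sample = json.dumps([
    {"phone": "0124 - 2350399",
     "address": "25, A, 15 Shahtoot Marg, DLF Phase-1, Sector 26A, Gurugram, Haryana-122002",
     "url": "https://www.tridentindia.com/contact"},
    {"phone": "", "address": None, "url": "https://www.tridentindia.com/contact"},
])
print(check_offices(sample, expected_count=2))
```

On a real export this could be called as `check_offices(open("offices.json", encoding="utf-8").read())`. It only checks that each item has a phone and an address; it does not verify that an address matches a real location.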
Answered By - Fazlul