Issue
I'm using a CSS class selector to help me with a spider. In the Scrapy shell, running the following commands returns all the elements I need:
scrapy shell "https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"
response.css(".acta-table:nth-child(3) .tc::text , .acta-table a::text").extract()
Now I need to build a JSON file from the information in the 12 tables that make up the page. The JSON I'm trying to build should look something like this:
{
  "DadesPartit":
  {
    "Temporada": "2021-2022",
    "Categoria": "Cadet",
    "Divisio": "Primera",
    "Grup": 2,
    "Jornada": 28
  },
  "TitularsCasa":
  [
    {
      "Nom": "IGNACIO",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Link": "https://.."
    },
    {
      "Nom": "JAIME",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Link": "https://.."
    },
    {
      "Nom": "BRUNO",
      "Cognom": "FERRÉ CORREA",
      "Link": "https://.."
    }
  ],
  "SuplentsCasa":
  [
    {
      "Nom": " MARC",
      "Cognom": "GIMÉNEZ ABELLA",
      "Link": "https://.."
    }
  ],
  "CosTecnicCasa":
  [
    {
      "Nom": " JORDI",
      "Cognom": "LORENTE VILLENA",
      "Llicencia": "E"
    }
  ],
  "TargetesCasa":
  [
    {
      "Nom": "IGNACIO",
      "Cognom": "FERNÁNDEZ ARTOLA",
      "Tipus": "Groga",
      "Minut": 65
    }
  ],
  "Arbitres":
  [
    {
      "Nom": "ALEJANDRO",
      "Cognom": "ALVAREZ MOLINA",
      "Delegacio": "Barcelona1"
    }
  ],
  "Gols":
  [
    {
      "Nom": "NATXO",
      "Cognom": "MONTERO RAYA",
      "Minut": 5,
      "Tipus": "Gol de penal"
    }
  ],
  "Estadi":
  {
    "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
    "Direccio": "C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
  },
  "TitularsFora":
  [
    {
      "Nom": "MARTI",
      "Cognom": "MOLINA MARTIMPE",
      "Link": "https://.."
    },
    {
      "Nom": " XAVIER",
      "Cognom": "MORA AMOR",
      "Link": "https://.."
    },
    {
      "Nom": " IVAN",
      "Cognom": "ARRANZ MORALES",
      "Link": "https://.."
    }
  ],
  "SuplentsFora":
  [
    {
      "Nom": "OLIVER",
      "Cognom": "ALCAZAR SANCHEZ",
      "Link": "https://.."
    }
  ],
  "CosTecnicFora":
  [
    {
      "Nom": "RAFAEL",
      "Cognom": "ESPIGARES MARTINEZ",
      "Llicencia": "D"
    }
  ],
  "TargetesFora":
  [
    {
      "Nom": "ORIOL",
      "Cognom": "ALCOBA LAGE",
      "Tipus": "Groga",
      "Minut": 34
    }
  ]
}
I would like some guidance on how to build it.
Thanks, Joan
Solution
It is much simpler than this with requests and pandas. You can do the following:
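from io import StringIO
import pandas as pd
import requests as r

a = r.get("https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b")
# read_html returns a list of DataFrames, one per <table> on the page.
# Wrapping the HTML in StringIO avoids the literal-string deprecation in recent pandas.
table_fb = pd.read_html(StringIO(a.text))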
You just have to index table_fb to get the individual tables.
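As a rough illustration of what indexing looks like, here is a minimal sketch that runs offline. The inline HTML, its column names (Num, Jugador, Minut, Tipus) and the variable names lineup and cards are made up for the example; the real page's tables and headers will differ, so inspect each DataFrame before relying on a position.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the match page: two <table> elements with made-up rows.
SAMPLE_HTML = """
<table>
  <tr><th>Num</th><th>Jugador</th></tr>
  <tr><td>1</td><td>PLAYER ONE</td></tr>
  <tr><td>2</td><td>PLAYER TWO</td></tr>
</table>
<table>
  <tr><th>Minut</th><th>Tipus</th></tr>
  <tr><td>65</td><td>Groga</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table>, in page order.
table_fb = pd.read_html(StringIO(SAMPLE_HTML))

lineup = table_fb[0]  # first table on the page
cards = table_fb[1]   # second table on the page

# to_dict(orient="records") turns each row into a plain dict, ready for JSON.
lineup_rows = lineup.to_dict(orient="records")
cards_rows = cards.to_dict(orient="records")
```

Each entry in lineup_rows is a dict keyed by column name, which is already close to the per-player objects in the target JSON.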
Here is the scrapy alternative:
import json
from io import StringIO

import pandas as pd
import scrapy


class stack(scrapy.Spider):
    name = 'test'
    start_urls = ["https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"]

    # No start_requests override is needed: by default Scrapy requests
    # every URL in start_urls and sends the response to parse().
    def parse(self, response):
        # One DataFrame per <table> element on the page.
        tables = pd.read_html(StringIO(response.text))
        # DataFrames are not JSON-serializable, so convert each table to a
        # list of row dicts. Enumerating avoids an IndexError if the page
        # has fewer tables than expected.
        yield {
            f'table{i}': json.loads(table.to_json(orient='records'))
            for i, table in enumerate(tables, start=1)
        }
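To get from raw table rows to the Nom/Cognom layout in the question, something along these lines may help. It is only a sketch: it assumes the page lists people as "COGNOMS, NOM" (check the actual cells), and split_player, build_match and the sample input are hypothetical names and data introduced for illustration.

```python
import json


def split_player(full_name, link=None):
    """Split 'COGNOMS, NOM' into the Nom/Cognom fields used in the question."""
    # If there is no comma, the whole string ends up in Cognom and Nom is empty.
    cognom, _, nom = full_name.partition(",")
    record = {"Nom": nom.strip(), "Cognom": cognom.strip()}
    if link is not None:
        record["Link"] = link
    return record


def build_match(rows_by_section):
    """Map each section name to a list of Nom/Cognom records."""
    return {
        section: [split_player(name, link) for name, link in rows]
        for section, rows in rows_by_section.items()
    }


# Hypothetical input: (full name, link) pairs already pulled out of each table.
sample = {
    "TitularsCasa": [("FERNÁNDEZ ARTOLA, IGNACIO", "https://..")],
    "SuplentsCasa": [("GIMÉNEZ ABELLA, MARC", "https://..")],
}

match = build_match(sample)
# ensure_ascii=False keeps accented characters readable in the output file.
match_json = json.dumps(match, ensure_ascii=False, indent=2)
```

Sections with extra fields (Minut, Tipus, Llicencia) would need their own small helper, since those values come from other columns.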
Answered By - joe_bill.dollar