Issue
I have the following Python Code:-
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import xlrd
import re
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-
airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content,'lxml')
BBMF_2022 = []
#BBMF_elem = soup3.find_all('a', string=re.compile(r'between|Flypast'))
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
li1 = item.find_parent().text
#li2 = li1.find_previous().font
#print(link)
print(li1)
#print(li2)
#BBMF_2022.append(li1)
#df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
#df
The issue I have is when I run the Code, the Data is printed, for the 15 Entries from May 28th to May 29th, 15 times. I am not sure why that is the case? Could someone suggest for me the reason why? And tell me what I need to change in the Code, so that that Data is printed only once and not the 15 times? I have tried to Scrape Data from a Website, where entries contain the word between or Flypast in 'a' tags. The Printed Data is correct, i.e. for the entries for the 21st May they are printed only the once, and the appearance of the Data is correct.
I have inspected the Page data and have noticed <i>
tags are not present in the 28th-29th May Data that are, in places in the 21st May data etc.
When I use these lines of Code instead:
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
li1 = item.find_parent().text
#li2 = li1.find_previous().font
#print(link)
#print(li1)
#print(li2)
BBMF_2022.append(li1)
df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
df
The first entry for the 28th May, is printed out in the output DataFrame 15 times! Instead of 15 separate Entries from the 28th May to 29th May I mentioned before. I am confused as to where, I am going wrong with this? I am using a web.archive.org link, as the Data from a week ago was removed the other day, from the Website.
For the First used Python Code, the output I wan't is :-
21 May - S - Rickmansworth Festival, Hertfordshire Flypast - 3.35pm
21 May - S - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
21 May - Moira Canal Festival, Leicestershire Flypast - between 1.25 & 2.10pm
21 May - L - Wild West & American History Photography Day, Abbots Bromley, Staffs Flypast - between 1.10 & 1.55pm
21 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.40 & 1.30pm
21 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 1.15 & 2.05pm
21 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 3.25 & 4.15pm
21 May - S - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 3.59pm
21 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 2.45 & 3.30pm
22 May - L - Norfolk & Suffolk Aviation Museum, Flixton, Suffolk Flypast - between 10.00 & 10.45am
22 May - S or H - VE Day Event (Royal Air Force Association), Bridlington, E.Yorkshire Flypast - between 11.45 & 12.30pm
22 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.10 & 1.00pm
22 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 11.55 & 12.40pm
22 May - Moira Canal Festival, Leicestershire Flypast - between 11.50 & 12.30pm
22 May - L - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
22 May - L - Rickmansworth Festival, Hertfordshire Flypast - 11.20am
22 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 10.40 & 11.30am
22 May - L - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 11.38am
22 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 11.20 & 12.05pm
28 May - Vintage Rally, Smallwood, Cheshire - between 1.45 & 2.30pm
28 May - Carrington Steam & Heritage Show, Lincolnshire - between 1.15 & 2.00pm
28 May - H - The Shropshire County Show - between 2.05 & 2.45pm
28 May - The Ironbridge WW2 Weekend, Shropshire - between 2.05 & 2.50pm
28 May - H - Middy in the 1940s, Wetheringsett, Suffolk - between 2.15 & 3.00pm
28 May - S - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 3.25 & 4.10pm
28 May - Prescott Historique, Bishops Cleeve, Gloucestershire - between 11.45 & 12.30pm
28 May - S - WARAG Weekend, Somerset - between 2.45 & 3.30pm
28 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
28 May - H - Heathfield Agricultural Show, East Sussex - between 1.45 & 2.30pm
29 May - Carrington Steam & Heritage Show, Lincolnshire - between 4.15 & 5.00pm
29 May - Vintage Rally, Smallwood, Cheshire - between 3.45 & 4.30pm
29 May - SH - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 12.10 & 12.55pm
29 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
29 May - SH - Classic Wings & Wheels, Bidford Gliding Club, Warwickshire - between 12.30 & 1.00pm
02 June - L - Lanc, Tank and Military Machines, East Kirkby, Lincs. Flypast
02 July - S - Hollowell Steam and Vintage Rally Flypast - 12.48pm
03 July - SH - Hollowell Steam and Vintage Rally Flypast - 2.01pm
And I want the Same Output, when using the DataFrame lines of Code.
I tried the Latest Webpage for June. I would like the Output to be the Same format as that I posted for June. The problem with June's Data, is that the between and Flypast text is not in a 'a' href tag this time, so I am not sure how to combine re=compile line of Code with which relevant tag, it seems to be in a font tag ?
I used this line of Code for June instead :-
for item in soup3.find_all('b', string=re.compile(r'June')):
But as I am not including between and Flypast, in the line of Code, alot of unwanted Data is output. And as before, June's Data is repeated, as many times as there are entries.
Solution
Well, that wasn't easy but here we are: (I did it for June because couldn't access May well but the code should work for May too)
1. Import modules, get url and html code:
from bs4 import BeautifulSoup
from lxml import etree
import requests
URL = "https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm"
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
2. Get all the text()
from the descendant
after the first i
:
all = dom.xpath('/html/body/div[6]/div/div[1]/div/div[2]/i[4]/descendant::text()')
2.1 For convenience I did here a first clean:
all = [i for i in all if i != '\n' and i != ' ']
3. I wrote a small function to let us delimit lines/rows every time a '•' occurs:
def split_list(input_list, delimiter):
result_list = []
while len(input_list) > 0:
elem = input_list.pop(0)
if elem == delimiter:
if 'sub_list' in locals():
result_list.append(sub_list)
sub_list = [elem]
elif len(input_list) == 0:
sub_list.append(elem)
result_list.append(sub_list)
else:
sub_list.append(elem)
return result_list
a = split_list(all, '•')
This function works for any separator you want, you can use it somewhere else ;)
4. Now we can use a for loop on our clean list to create a dataframe:
rows = []
for i in a:
date = (i[1])
event = (','.join(i[2:])).replace(',', '')
rows.append([date,event])
df = pd.DataFrame(rows, columns=["Date", "Event"])
df
5. Output:
index | Date | Event |
---|---|---|
0 | 02 June | - BBMF aircraft will take part in the Queen's Platinum Jubilee Flypast over Buckingham Palace at 1.00pm |
1 | 02 June | - Kingston-Upon-Hull E.Yorkshire - between 6.50 & 7.35pm |
2 | 02 June | - Hessle N.Yorkshire - between 6.45 & 7.30pm |
...
index | Date | Event |
---|---|---|
105 | 05 June | - Ingatestone Essex - between 12.45 & 1.30pm |
106 | 05 June | - Maidstone Kent - between 3.45 & 4.30pm |
107 | 05 June | - H - The Overlord Show Denmead Hampshire - between 3.10 & 4.00pm |
Answered By - Drakax
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.