Issue
I want to scrape the bus schedule times from the website https://www.redbus.in/. Entering the locations I am interested in into the search fields takes me to a link like the following, which is an example of the pages I want to scrape: https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any
When I manually save this page and open the HTML file I can find the search results including Bus operator names, departure times, fare etc. But when I do the same using Python that part of the page is not saved. The code I am using is the following:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source)
browser.quit()
The soup object that is created this way has all the other content of the page in HTML format except the search results showing the bus route and time information. I am not sure why that is the case.
I am new to web scraping so any help here will be really appreciated.
Solution
The main issue is that the data you expect needs a moment to be loaded and rendered by the browser, so the simplest fix is to pause with time.sleep() for a second or two, or better, to use Selenium waits.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url = "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(2)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()
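A fixed sleep either waits too long or not long enough. As a rough sketch of the waiting idea, the helper below re-reads the page source until a marker string shows up or a timeout passes. The function name `page_source_when_ready` and the `marker` argument are assumptions made for this illustration; the marker has to be some text or markup that you have confirmed only appears once the results are rendered.

```python
import time


def page_source_when_ready(browser, marker, timeout=10.0, poll=0.5):
    """Return browser.page_source once it contains `marker`.

    `marker` is a hypothetical piece of text or markup that only appears
    after the search results have rendered; pick one by inspecting the page.
    Raises TimeoutError if it does not show up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = browser.page_source  # re-read the live DOM on every pass
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} s")
        time.sleep(poll)
```

With this in place, the `time.sleep(2)` line could be dropped and the soup built from `page_source_when_ready(browser, marker)` instead. Selenium's own `WebDriverWait` together with `expected_conditions` covers the same need when you would rather wait on a specific element locator.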
But Selenium is not necessary: you can also get the JSON with all the data via requests, and it is well structured:
import requests
url = "https://www.redbus.in/search/SearchResults?"
headers = {
'content-type': 'application/json',
'origin': 'https://www.redbus.in',
'user-agent': 'Mozilla/5.0'
}
data={
'fromCity':979,
'toCity':313,
'src':'Bhopal',
'dst':'Indore',
'DOJ':'18-Sep-2022',
'sectionId':0,
'groupId':0,
'limit':0,
'offset':0,
'sort':0,
'sortOrder':0,
'meta':'true',
'returnSearch':0}
# POST the search; requests sends the values above as the URL query string (params=).
response = requests.post(url, params=data, headers=headers)
[e['bpData'] for e in response.json()['inv']]
Output
[[{'Id': 23400197,
'Name': 'ISBT Bhopal (Verma Travels)',
'Vbpname': 'ISBT Bhopal (Verma Travels)',
'BpTm': '07:00',
'bpTminmin': 420,
'eta': None,
'Address': 'ISBT Bhopal (Verma Travels)',
'BpFullTime': '2022-09-18 07:00:00'},
{'Id': 23400198,
'Name': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'Vbpname': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'BpTm': '07:00',
'bpTminmin': 420,
'eta': None,
'Address': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'BpFullTime': '2022-09-18 07:00:00'},
{'Id': 23400196,
'Name': 'Lalghati (Verma Travels)',
'Vbpname': 'Lalghati (Verma Travels)',
'BpTm': '07:30',
'bpTminmin': 450,
'eta': None,
'Address': 'Lalghati (Verma Travels)',
'BpFullTime': '2022-09-18 07:30:00'},
{'Id': 23408191,
'Name': 'Sehore Bypass (Near Crescent Hotel)',
'Vbpname': 'Sehore Bypass (Near Crescent Hotel)',
'BpTm': '08:10',
'bpTminmin': 490,
'eta': None,
'Address': 'Sehore Bypass (Near Crescent Hotel)',
'BpFullTime': '2022-09-18 08:10:00'}],
[{'Id': 23400197,
'Name': 'ISBT Bhopal (Verma Travels) (Pickup Van)',
'Vbpname': 'ISBT Bhopal (Verma Travels) (Pickup Van)',
'BpTm': '18:30',
'bpTminmin': 1110,
'eta': None,
'Address': 'ISBT Bhopal (Verma Travels) (Pickup Van)',
'BpFullTime': '2022-09-18 18:30:00'},
{'Id': 23400198,
'Name': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'Vbpname': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'BpTm': '18:30',
'bpTminmin': 1110,
'eta': None,
'Address': 'Bhopal Railway Station (Verma Travels) (Pickup Van)',
'BpFullTime': '2022-09-18 18:30:00'},
{'Id': 23400196,
'Name': 'Lalghati (Verma Travels)',
'Vbpname': 'Lalghati (Verma Travels)',
'BpTm': '19:00',
'bpTminmin': 1140,
'eta': None,
'Address': 'Lalghati (Verma Travels)',
'BpFullTime': '2022-09-18 19:00:00'},
{'Id': 23408191,
'Name': 'Sehore Bypass (Near Crescent Hotel)',
'Vbpname': 'Sehore Bypass (Near Crescent Hotel)',
'BpTm': '19:40',
'bpTminmin': 1180,
'eta': None,
'Address': 'Sehore Bypass (Near Crescent Hotel)',
'BpFullTime': '2022-09-18 19:40:00'}],...]
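To turn that nested list into a simple schedule, something like the following could work. It is only a sketch: it assumes every boarding point carries the 'Name' and 'BpFullTime' keys in the format shown in the output above, and the helper name `first_departures` is made up for this example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches 'BpFullTime' in the output above


def first_departures(bp_data):
    """Return (boarding point name, departure datetime) for each bus.

    `bp_data` is the nested list built by the list comprehension above:
    one inner list of boarding-point dicts per bus. Buses without any
    boarding points are skipped.
    """
    result = []
    for points in bp_data:
        if not points:
            continue
        # min() keeps the first entry when two points share the same time.
        earliest = min(
            points, key=lambda p: datetime.strptime(p["BpFullTime"], TIME_FORMAT)
        )
        departs = datetime.strptime(earliest["BpFullTime"], TIME_FORMAT)
        result.append((earliest["Name"], departs))
    return result


# Two trimmed entries copied from the output above.
sample = [
    [{"Name": "ISBT Bhopal (Verma Travels)", "BpFullTime": "2022-09-18 07:00:00"},
     {"Name": "Lalghati (Verma Travels)", "BpFullTime": "2022-09-18 07:30:00"}],
    [{"Name": "ISBT Bhopal (Verma Travels) (Pickup Van)", "BpFullTime": "2022-09-18 18:30:00"},
     {"Name": "Lalghati (Verma Travels)", "BpFullTime": "2022-09-18 19:00:00"}],
]

for name, departs in first_departures(sample):
    print(f"{departs:%H:%M}  {name}")
# prints:
# 07:00  ISBT Bhopal (Verma Travels)
# 18:30  ISBT Bhopal (Verma Travels) (Pickup Van)
```

On the live response you would pass the list comprehension result directly, for example `first_departures([e['bpData'] for e in response.json()['inv']])`.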
Answered By - HedgeHog