Issue
I am trying to scrape a website for its listings.
I am running into an issue where I can't access the listings page from a script, while the homepage is accessible normally.
import os
import random
import time

import requests

# Load a pool of User-Agent strings, one per line
USER_AGENTS = []
with open('user-agents.txt', 'r') as file:
    USER_AGENTS = [line.strip() for line in file]

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(USER_AGENTS),
    'Referer': 'https://www.google.com/search?q=autoplius',
    'Origin': 'https://en.m.autoplius.lt',
    'DNT': '1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document'
}

def main():
    base_url = "https://en.m.autoplius.lt/"
    not_accessible = "https://en.m.autoplius.lt/ads/used-cars?qt="

    response = requests.get(not_accessible, headers=headers)
    if response.status_code != 200:
        print("Not today chap...")
        time.sleep(5)
        return False

    print("I got in!")
    return True

while True:
    main()
Is there a way to improve the headers I'm sending so that I can access the listings page? What other ways are there to reach the site's other pages? I also checked robots.txt; it doesn't seem to have any rules for that URL path.
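(For reference, robots.txt rules can also be checked programmatically. This is just a sketch using the standard library's urllib.robotparser against the same site; the "*" user agent is a placeholder and the URL is the one from the snippet above.)

from urllib import robotparser

# Sketch: ask the site's robots.txt whether a generic crawler
# may fetch the listings path.
rp = robotparser.RobotFileParser()
rp.set_url("https://en.m.autoplius.lt/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://en.m.autoplius.lt/ads/used-cars?qt="))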
Solution
I managed to resolve this issue by using a third-party proxy provider, a so-called "Web Unblocker". The website seems to have strong bot protection, so my requests were frequently recognised as coming from a script. I did manage to get several 200 responses by rotating the User-Agent on every request from a larger list, but after some time they still ended up being flagged as bot traffic. This is the code that currently works for me with the proxy provider:
import requests

# Route all traffic through the proxy provider's endpoint
# (USERNAME, PASSWORD and PORT are placeholders for your own credentials).
proxies = {
    'http': 'http://USERNAME:[email protected]:PORT',
    'https': 'http://USERNAME:[email protected]:PORT',
}

# verify=False disables TLS certificate verification for this request.
response = requests.request(
    'GET',
    'https://autoplius.lt/skelbimai/naudoti-automobiliai',
    verify=False,
    proxies=proxies,
)

print(response.status_code)
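For reference, the User-Agent rotation mentioned above can be sketched roughly like this (picking a fresh User-Agent from the question's user-agents.txt on each request; the retry count is arbitrary):

import random
import requests

with open('user-agents.txt', 'r') as file:
    user_agents = [line.strip() for line in file if line.strip()]

url = 'https://en.m.autoplius.lt/ads/used-cars?qt='

# Choose a different User-Agent per attempt instead of fixing one
# at import time.
for _ in range(5):
    headers = {'User-Agent': random.choice(user_agents)}
    response = requests.get(url, headers=headers)
    print(response.status_code)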
Answered By - dovexz12323