Issue
I am trying to scrape a website for its listings.
I am running into an issue where I can't access the listings page from a script, while the homepage is accessible normally.
import os
import random
import time

import requests

# Load a pool of User-Agent strings, one per line
USER_AGENTS = []
with open('user-agents.txt', 'r') as file:
    USER_AGENTS = [line.strip() for line in file]

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(USER_AGENTS),
    'Referer': 'https://www.google.com/search?q=autoplius',
    'Origin': 'https://en.m.autoplius.lt',
    'DNT': '1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document'
}

def main():
    base_url = "https://en.m.autoplius.lt/"
    not_accessible = "https://en.m.autoplius.lt/ads/used-cars?qt="

    response = requests.get(not_accessible, headers=headers)
    if response.status_code != 200:
        print("Not today chap...")
        time.sleep(5)
        return False

    print("I got in!")
    return True

while True:
    main()
Is there a way to improve the headers I'm sending so that I can access the listings page? What other ways are there to reach the site's other pages? I also checked robots.txt; it doesn't seem to have any rules for that URL path.
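(For reference, robots.txt rules can also be checked programmatically. This is just a sketch using the standard library's urllib.robotparser against the same site; the "*" user agent is a placeholder and the URL is the one from the snippet above.)

from urllib import robotparser

# Sketch: ask the site's robots.txt whether a generic crawler
# may fetch the listings path.
rp = robotparser.RobotFileParser()
rp.set_url("https://en.m.autoplius.lt/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://en.m.autoplius.lt/ads/used-cars?qt="))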
Solution
I managed to resolve this issue by using a third-party proxy provider, a so-called "Web Unblocker". The website seems to have strong bot protection, so my requests were frequently recognised as coming from a script. I did manage to get several 200 responses by rotating the User-Agent on every request from a larger list, but after some time they still ended up being flagged as bot traffic. This is the code that currently works for me with the proxy provider:
import requests

# Route all traffic through the proxy provider's endpoint
# (USERNAME, PASSWORD and PORT are placeholders for your own credentials).
proxies = {
    'http': 'http://USERNAME:[email protected]:PORT',
    'https': 'http://USERNAME:[email protected]:PORT',
}

# verify=False disables TLS certificate verification for this request.
response = requests.request(
    'GET',
    'https://autoplius.lt/skelbimai/naudoti-automobiliai',
    verify=False,
    proxies=proxies,
)

print(response.status_code)
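For reference, the User-Agent rotation mentioned above can be sketched roughly like this (picking a fresh User-Agent from the question's user-agents.txt on each request; the retry count is arbitrary):

import random
import requests

with open('user-agents.txt', 'r') as file:
    user_agents = [line.strip() for line in file if line.strip()]

url = 'https://en.m.autoplius.lt/ads/used-cars?qt='

# Choose a different User-Agent per attempt instead of fixing one
# at import time.
for _ in range(5):
    headers = {'User-Agent': random.choice(user_agents)}
    response = requests.get(url, headers=headers)
    print(response.status_code)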
Answered By - dovexz12323