Thursday, October 6, 2022

[FIXED] Trying to scrape a page with Selenium and ChromeDriver. It loads the page but then times out

Labels: c#, selenium, selenium-chromedriver

Issue

I'm trying to scrape all that's inside the html tag.

Execution reaches the GoToUrl line and the page opens in the browser, but the code never continues past that call.

It just times out after 60 seconds.

Here's the error:

fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.

Update: edited for privacy reasons.

Solution

I made an example for your scenario.

Let's say we want to scrape the posts listed on the search results page, so we need a model to store our data:

using System.Text.Json;

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    public override string ToString()
    {
        return JsonSerializer.Serialize(this, 
              new JsonSerializerOptions { WriteIndented = true });
    }
}
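
As a quick check of the model, the sketch below builds a Post by hand and prints it. The field values are made-up placeholders, not data from the site; the only point is to show that the overridden ToString emits indented JSON.

```csharp
using System;
using System.Text.Json;

// Same model as above, repeated so the sketch compiles on its own.
public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    public override string ToString()
    {
        return JsonSerializer.Serialize(this,
            new JsonSerializerOptions { WriteIndented = true });
    }
}

public static class PostDemo
{
    public static void Main()
    {
        // Placeholder values for illustration only.
        var post = new Post
        {
            ImageSrc = "https://example.com/image.jpg",
            Category = "Sample category",
            Title = "Sample title",
            Description = "Sample description",
            Date = "6 October 2022"
        };

        // Console.WriteLine calls ToString, so this prints the JSON form.
        Console.WriteLine(post);
    }
}
```

Properties that are never assigned are serialized as null, which makes missing fields easy to spot in the output.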

Next, we need to initialize the Selenium WebDriver and run the scrape:

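using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();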
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);

// Here we setup a fluent wait
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));

// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");

// Accept cookies
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();

// Scroll to end
int count = 0; 
await driver.ScrollToEndAsync(d =>
{
    // Determine when we are at the end of the page
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }       
    
    return true;
});

// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});

// Print Posts in json format 
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}

// Sample only: keep the console open so the results stay visible
Console.ReadLine();
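
The loop above calls FindElement for every field, which throws NoSuchElementException as soon as one result lacks, say, an image. Here is a minimal sketch of a more forgiving lookup, assuming a missing field should become null rather than abort the run. TextOrNull and AttributeOrNull are hypothetical helper names of mine, not part of Selenium.

```csharp
using System.Linq;
using OpenQA.Selenium;

public static class SearchContextExtensions
{
    // Hypothetical helper: text of the first match, or null when nothing matches.
    public static string TextOrNull(this ISearchContext context, By by)
    {
        // FindElements returns an empty collection instead of throwing.
        return context.FindElements(by).FirstOrDefault()?.Text;
    }

    // Hypothetical helper: attribute of the first match, or null when nothing matches.
    public static string AttributeOrNull(this ISearchContext context, By by, string name)
    {
        return context.FindElements(by).FirstOrDefault()?.GetAttribute(name);
    }
}
```

With these in place, a field such as ImageSrc could be filled with e.AttributeOrNull(By.XPath(".//img"), "src"), and Title with e.TextOrNull(By.XPath(".//div/h2")).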

To use ScrollToEndAsync as shown above, you must define it as an extension method:

using System;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class WebDriverExtensions
{
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        while (!pageEnd.Invoke(driver))
        {
            var js = (IJavaScriptExecutor)driver;
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            
            // Arbitrary delay between scrolling
            await Task.Delay(200);
        }
    }
}
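
Because the loop above only stops when pageEnd returns true, a page that keeps loading items, or a predicate that never settles, would scroll forever. The following is a hedged variant that also gives up after a deadline. The name ScrollToEndOrTimeoutAsync, the timeout parameter, and the bool return value are my own additions for illustration and are not part of the original answer.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class WebDriverScrollExtensions
{
    // Returns true if pageEnd reported the end of the page,
    // false if the timeout elapsed first.
    public static async Task<bool> ScrollToEndOrTimeoutAsync(
        this IWebDriver driver,
        Func<IWebDriver, bool> pageEnd,
        TimeSpan timeout)
    {
        var js = (IJavaScriptExecutor)driver;
        var clock = Stopwatch.StartNew();

        while (!pageEnd(driver))
        {
            // Stop scrolling once the deadline has passed.
            if (clock.Elapsed > timeout)
            {
                return false;
            }

            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");

            // Same arbitrary delay between scrolls as the original method.
            await Task.Delay(200);
        }

        return true;
    }
}
```

A caller could pass the same predicate as before plus, for example, TimeSpan.FromSeconds(30), and then check the returned flag to decide whether the collected results are complete.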

Answered By - ggeorge

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
