Issue
I'm trying to scrape everything inside the html tag.
Execution reaches the GoToUrl line and the page opens in the browser, but the code doesn't go any further.
It just times out after 60 seconds.
Here's the error:
fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
An unhandled exception has occurred while executing the request.
Update: edited for privacy reasons.
Solution
I made an example for your scenario.
Let's say we want to scrape the posts on the home page, so we need a model to store our data:
using System.Text.Json;

public class Post
{
    public string ImageSrc { get; set; }
    public string Category { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public string Date { get; set; }

    // Serialize the post as indented JSON so it prints readably
    public override string ToString()
    {
        return JsonSerializer.Serialize(this,
            new JsonSerializerOptions { WriteIndented = true });
    }
}
Next, we need to initialize the Selenium WebDriver:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
using var driver = new ChromeDriver(options);
// Set up a fluent wait: poll every 250 ms for up to 20 seconds
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(20))
{
    PollingInterval = TimeSpan.FromMilliseconds(250)
};
wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));
// Navigate to the target url
driver.Navigate().GoToUrl("https://www.rtlnieuws.nl/zoeken?q=Philips+fraude");
// Accept cookies
var cookieBtn = wait.Until(d => d.FindElement(By.Id("onetrust-accept-btn-handler")));
cookieBtn.Click();
// Scroll to end
int count = 0;
await driver.ScrollToEndAsync(d =>
{
    // Determine when we are at the end of the page:
    // if the number of results stopped growing, nothing new was loaded
    var tempCount = d.FindElements(By.XPath("//a[@class = 'search-item search-item--artikel']")).Count;
    if (tempCount != count)
    {
        count = tempCount;
        return false;
    }
    return true;
});
// List of post elements
var elements = wait.Until(d =>
{
    return d.FindElements(By.XPath("//div[@class = 'search-items']//a[contains(@class, 'search-item')]"));
});
// Print Posts in json format
foreach (var e in elements)
{
    var post = new Post
    {
        ImageSrc = e.FindElement(By.XPath(".//img")).GetAttribute("src"),
        Category = e.FindElement(By.XPath(".//div/span")).Text,
        Title = e.FindElement(By.XPath(".//div/h2")).Text,
        Description = e.FindElement(By.XPath(".//div[@class = 'search-item__content']/p[@class = 'search-item__description']")).Text,
        Date = e.FindElement(By.XPath(".//div[@class = 'search-item__content']//span[@class = 'search-item__date']")).Text,
    };
    Console.WriteLine(post);
}
// For this sample only: keep the console open so we can see the results
Console.ReadLine();
To use ScrollToEndAsync as shown above, you must create an extension method:
using System;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class WebDriverExtensions
{
    public static async Task ScrollToEndAsync(this IWebDriver driver, Func<IWebDriver, bool> pageEnd)
    {
        var js = (IJavaScriptExecutor)driver;
        // Keep scrolling to the bottom until the caller reports the end of the page
        while (!pageEnd.Invoke(driver))
        {
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            // Arbitrary delay between scrolling
            await Task.Delay(200);
        }
    }
}
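The extension method above keeps scrolling until pageEnd returns true, so it would not stop on its own if the page kept loading new results. A possible variant with an upper bound on the number of scrolls is sketched below. The names ScrollToEndBoundedAsync, maxScrolls and delay are hypothetical and not part of Selenium or of the original answer; the sketch assumes the Selenium.WebDriver NuGet package is referenced.

```csharp
using System;
using System.Threading.Tasks;
using OpenQA.Selenium;

public static class BoundedScrollExtensions
{
    // Hypothetical variant of ScrollToEndAsync: gives up after maxScrolls
    // scroll attempts. Returns true if pageEnd reported the end of the page,
    // false if the attempt limit was hit first.
    public static async Task<bool> ScrollToEndBoundedAsync(
        this IWebDriver driver,
        Func<IWebDriver, bool> pageEnd,
        int maxScrolls,
        TimeSpan delay)
    {
        var js = (IJavaScriptExecutor)driver;
        for (var i = 0; i < maxScrolls; i++)
        {
            if (pageEnd(driver))
            {
                return true;
            }
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            // Give the page time to load more results before checking again
            await Task.Delay(delay);
        }
        // One last check after the final scroll
        return pageEnd(driver);
    }
}
```

It could be called with the same predicate as before, for example with maxScrolls set to 50 and delay set to TimeSpan.FromMilliseconds(200); both values are arbitrary and would need tuning for the target page.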
Answered By - ggeorge