Issue
I am trying to scrape a little chunk of information from a site, but it keeps printing "None" as if the title, or any tag I replace it with, doesn't exist.
The project: collect the meta-data of WordPress plugins. Approximately 50 plugins are of interest, but the challenge is that I want to fetch the meta-data of all existing plugins. What I subsequently want to filter out after the fetch are the plugins with the newest timestamp, i.e. those that were updated most recently. It is all about actuality...
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database ....and so on and so forth.
We have the following set of meta-data for each WordPress plugin:
Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, members, sign-up form, volunteer
Last updated: 19 hours ago
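For later processing it helps to picture this as one record per plugin. A hypothetical Python representation (the field names and the example plugin name are my own, not taken from the site):

plugin_meta = {
    "name": "Participants Database",           # assumed example, matching the tags above
    "version": "1.9.5.12",
    "active_installations": "10,000+",
    "wordpress_version": "5.0 or higher",
    "tested_up_to": "5.4",
    "php_version": "5.6 or higher",
    "tags": ["database", "members", "sign-up form", "volunteer"],
    "last_updated": "19 hours ago",
}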
The project consists of two parts: the looping part (which seems to be pretty straightforward) and the parser part, where I have some issues, see below. I'm trying to loop through an array of URLs and scrape the data above from a list of WordPress plugins. See my loop below:
from bs4 import BeautifulSoup
import requests
#array of URLs to loop through, will be larger once I get the loop working correctly
plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']
Parsing the meta box of a single plugin page can be done like so (shown here for the first URL in the list):
page_soup = BeautifulSoup(requests.get(plugins[0]).content, 'html.parser')
ttt = page_soup.find("div", {"class": "plugin-meta"})
text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]
The output of text_nodes:
['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']
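Regarding the "None" output: soup.find() returns None whenever the requested tag is not present in the HTML that was actually downloaded, so it is worth checking both the response and the lookup before using them. A minimal defensive sketch of the loop, reusing the plugins list above (the browser-like User-Agent header is only a precaution, not something the site is known to require):

import requests
from bs4 import BeautifulSoup

for url in plugins:                                               # plugins is the URL list defined above
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed precaution: browser-like UA
    r.raise_for_status()                                          # fail loudly on HTTP errors instead of parsing an error page
    page_soup = BeautifulSoup(r.content, "html.parser")
    meta = page_soup.find("div", {"class": "plugin-meta"})
    if meta is None:                                              # the tag simply is not in the fetched HTML
        print(f"plugin-meta box not found on {url}")
        continue
    print([node.text.strip() for node in meta.ul.findChildren("li")[:-1:2]])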
But if we want to fetch the data of all the WordPress plugins and subsequently sort them to show, let us say, the latest 50 updated plugins, this would be an interesting task:
First of all we need to fetch the URLs.
Then we fetch the information and have to sort out the newest timestamp, i.e. the plugins that were updated most recently.
Finally, list the 50 newest items, that is, the 50 plugins that were updated most recently...
Challenge: how to avoid overloading the RAM while fetching all the URLs (see "How to extract all URLs in a website using BeautifulSoup" for interesting insights, approaches and ideas); a rough sketch follows below.
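One way to keep memory use flat is to treat the URL collection as a generator and handle the plugins one listing page at a time instead of building one huge list first. A rough sketch under that assumption (the listing URL pattern and the rel="bookmark" selector are taken from the solution below):

import requests
from bs4 import BeautifulSoup

def plugin_urls(pages=49):
    # Yield plugin URLs one "popular" listing page at a time, so no big list is held in RAM.
    listing = "https://wordpress.org/plugins/browse/popular/{}"
    with requests.Session() as session:
        for num in [""] + [f"page/{x}/" for x in range(2, pages + 1)]:
            soup = BeautifulSoup(session.get(listing.format(num)).content, "html.parser")
            for a in soup.find_all("a", rel="bookmark"):
                yield a.get("href")

# usage: for url in plugin_urls(): parse that single plugin page, then move on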
At the moment I am trying to figure out how to fetch all the URLs and how to parse them:
a. how to fetch the meta-data of each plugin,
b. how to sort out the range of the newest updates (see the sketch after this list),
c. and afterwards how to pick out the 50 newest.
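For b. and c.: the "Last updated" value comes back as a relative string such as "Last updated: 19 hours ago", so it has to be turned into something sortable before the 50 newest can be selected. A minimal sketch, built around a hypothetical parse_age helper of my own (the month/year approximations are assumptions, not anything the site provides):

import heapq
import re
from datetime import timedelta

def parse_age(text):
    # Turn 'Last updated: 19 hours ago' into a timedelta; smaller means more recently updated.
    m = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)", text)
    if not m:
        return timedelta.max                      # unparseable: sort it last
    value, unit = int(m.group(1)), m.group(2)
    if unit == "month":
        value, unit = value * 30, "day"           # rough approximation
    elif unit == "year":
        value, unit = value * 365, "day"          # rough approximation
    return timedelta(**{unit + "s": value})

# plugin_records: an iterable of dicts with a "last_updated" field, e.g. built while scraping
# newest_50 = heapq.nsmallest(50, plugin_records, key=lambda p: parse_age(p["last_updated"]))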
Solution
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    # Collect the plugin links from one page of the "popular" listing.
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""] + [f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    # Extract the plugin title plus the meta-data list items from one plugin page.
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())
Output: view-online
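The code above stops at printing the raw meta-data rows. To get from there to the 50 most recently updated plugins (the original goal), the rows could be ranked with the hypothetical parse_age helper sketched earlier; the assumed row layout is [title, 'Version: ...', ..., 'Last updated: ...']:

import heapq

rows = [future.result() for future in futures1]       # results are cached, so calling result() again is cheap

def last_updated_age(row):
    # Find the 'Last updated: ...' entry and convert it with the parse_age sketch above.
    for field in row:
        if field.startswith("Last updated"):
            return parse_age(field)
    return parse_age("")                               # no such field: treated as oldest

newest_50 = heapq.nsmallest(50, rows, key=last_updated_age)
for row in newest_50:
    print(row)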
Answered By - αԋɱҽԃ αмєяιcαη