Issue
I am currently working on a project to improve my Python knowledge. It is an attempt to use BeautifulSoup to find specific data held in a <p> with no class in the page's HTML.
I have pasted the modules I am using, as well as the section of code that I'm having trouble fixing, below.
Any help is appreciated!
import requests
from bs4 import BeautifulSoup
import csv
import re

mitigation = []
for id in id_list:  # id_list is defined elsewhere (not shown)
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    paragraph = soup.find('p', class_ = '')
    status_code = page.status_code
    mitigation.append(paragraph)
I have attempted to use:
    paragraph = soup.select('p')[4].text
instead of:
    paragraph = soup.find('p', class_ = '')
in order to find the correct <p>.
Solution
The question could be clearer, which makes it hard to give a complete solution, so this answer addresses only part of it.
In my view, the approaches suggested in @Barmar's comments would have solved the problem as asked.
However, to pick out one specific piece of content, let's step back from the <p> without a class and look at the bigger picture. What other context can we use to locate this content? Base your selection on the HTML structure and on unique, less dynamic attributes.
You are looking for mitigations, so select that section by its id and proceed from there. CSS selectors are used here for chaining:
    soup.select_one('#mitigations + div table p')
If you need the text from multiple rows, use select() instead of select_one() and iterate over its ResultSet.
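To illustrate the difference between select_one() and select(), here is a minimal sketch that runs against an inline HTML string instead of the live site. The markup is a hypothetical, simplified stand-in for the structure the selector assumes (an element with id="mitigations" immediately followed by a div that contains a table); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the real page.
html = """
<h2 id="mitigations">Mitigations</h2>
<div>
  <table>
    <tr><td>M1</td><td><p>First mitigation text.</p></td></tr>
    <tr><td>M2</td><td><p>Second mitigation text.</p></td></tr>
  </table>
</div>
<p>Unrelated paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")
selector = "#mitigations + div table p"

# select_one() returns only the first matching tag (or None).
first = soup.select_one(selector)

# select() returns every match, so each table row can be collected.
all_rows = [p.get_text(strip=True) for p in soup.select(selector)]
```

The unrelated <p> outside the table is not matched, because the selector only descends from the div that directly follows the element with id="mitigations".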
Example
from bs4 import BeautifulSoup
import requests

mitigation = []
for id in ['T1588/001/', 'T1129/']:
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    # first <p> inside the table that follows the element with id="mitigations"
    paragraph = soup.select_one('#mitigations + div table p')
    mitigation.append(paragraph.text)
mitigation
['This technique cannot be easily mitigated with preventive controls since it is based on behaviors performed outside of the scope of enterprise defenses and controls.','Identify and block potentially malicious software executed through this technique by using application control tools capable of preventing unknown modules from being loaded.']
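One thing to keep in mind: select_one() returns None when nothing matches, so calling .text on the result raises an AttributeError for any page that lacks the expected section. A possible guard is sketched below; the helper name first_mitigation and both HTML snippets are hypothetical and only serve to illustrate the idea.

```python
from bs4 import BeautifulSoup

def first_mitigation(html):
    """Return the first mitigation paragraph's text, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("#mitigations + div table p")
    # select_one() yields None when the selector matches nothing.
    return node.get_text(strip=True) if node is not None else None

# Hypothetical snippets: one with the assumed structure, one without it.
with_section = (
    '<h2 id="mitigations">Mitigations</h2>'
    "<div><table><tr><td><p>Block it.</p></td></tr></table></div>"
)
without_section = "<h2>Other</h2><p>No mitigations here.</p>"

results = [first_mitigation(with_section), first_mitigation(without_section)]
```

In the loop above, the same check could be applied before appending, for example by storing None for pages where no paragraph was found.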
Answered By - HedgeHog