Issue
I'm trying to scrape the contents of the left-hand menu, under azurerm_provider, from this webpage (the URL is in the code below) using the requests module.
I've gone through the requests in dev tools looking for any link that returns the expected results, but couldn't locate one. I've also searched the page source in case the content sits inside a script tag, and found nothing there either.
I can already grab the content with Selenium, so I don't want to go that route.
This is my failed attempt using the requests module:
import requests
from bs4 import BeautifulSoup

link = 'https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/api_management'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    res = session.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("ul.provider-docs-menu-list .menu-list-category"):
        category_name = item.select_one("a.menu-list-category-link span.menu-list-category-link-title").get_text(strip=True)
        category_content = [i.get_text(strip=True) for i in item.select("li.menu-list-link > a")]
        print(category_name, category_content)
Expected output:
Azure Provider: Authenticating via a Service Principal and a Client Certificate
Azure Provider: Authenticating via a Service Principal and a Client Secret
Azure Provider: Authenticating via a Service Principal and OpenID Connect
Azure Provider: Authenticating via Managed Identity
Azure Provider: Authenticating via the Azure CLI
Solution
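HashiCorp builds the registry documentation dynamically from API calls, which is why the menu never shows up in the HTML that requests receives.
You need to get the provider version first and then fetch the provider docs for that version. Finally, you can request each doc's body through its provider-docs link.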
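The version ID is hard-coded in the example that follows, so it helps to know where such an ID might come from. A minimal sketch, assuming the registry also answers `https://registry.terraform.io/v2/providers/hashicorp/azurerm?include=provider-versions` with a payload whose `included` entries have `type`, `id` and `attributes.version` fields (the endpoint and the shape are assumptions here, not something confirmed in this answer). `version_key` and `latest_version_id` are hypothetical helpers that work on the already-parsed JSON:

```python
def version_key(version):
    """Turn '3.40.0' into (3, 40, 0); return None for pre-releases such as '4.0.0-beta1'."""
    pieces = version.split(".")
    if not all(piece.isdigit() for piece in pieces):
        return None
    return tuple(int(piece) for piece in pieces)


def latest_version_id(provider_payload):
    """Return the ID of the highest plain numeric version in the payload, or None."""
    candidates = []
    for item in provider_payload.get("included", []):
        # Assumed shape: {"type": "provider-versions", "id": "...", "attributes": {"version": "..."}}
        if item.get("type") != "provider-versions":
            continue
        key = version_key(item["attributes"]["version"])
        if key is not None:
            candidates.append((key, item["id"]))
    if not candidates:
        return None
    # Tuples compare element by element, so (3, 10, 0) ranks above (3, 9, 0)
    return max(candidates)[1]


# Made-up sample payload for illustration; these IDs and versions are not real registry data.
sample = {
    "included": [
        {"type": "provider-versions", "id": "101", "attributes": {"version": "3.9.0"}},
        {"type": "provider-versions", "id": "102", "attributes": {"version": "3.10.0"}},
        {"type": "provider-versions", "id": "103", "attributes": {"version": "4.0.0-beta1"}},
    ]
}
print(latest_version_id(sample))
```

In the flow of this answer you would pass `session.get(url).json()` to the helper and place the returned ID into the `provider-versions/<id>?include=provider-docs` URL. With a version ID in hand, the docs listing is a single request.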
For example:
import requests
from tabulate import tabulate

provider_versions = "https://registry.terraform.io/v2/provider-versions/38614?include=provider-docs"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)
    provider_docs = session.get(provider_versions).json()
    # Build [title, absolute link] pairs from the docs included in the response
    docs = [
        [
            doc['attributes']['title'],
            f"https://registry.terraform.io{doc['links']['self']}",
        ]
        for doc in provider_docs['included']
    ]
    print(tabulate(docs, headers=['Title', 'Link']))
This should output:
Title                                                                            Link
-------------------------------------------------------------------------------  ------------------------------------------------------
private_dns_resolver_dns_forwarding_ruleset                                      https://registry.terraform.io/v2/provider-docs/2530275
sentinel_threat_intelligence_indicator                                           https://registry.terraform.io/v2/provider-docs/2530362
proximity_placement_group                                                        https://registry.terraform.io/v2/provider-docs/2529517
bot_channel_sms                                                                  https://registry.terraform.io/v2/provider-docs/2529741
elastic_cloud_elasticsearch                                                      https://registry.terraform.io/v2/provider-docs/2529924
key_vault_certificate_contacts                                                   https://registry.terraform.io/v2/provider-docs/2530018
netapp_volume                                                                    https://registry.terraform.io/v2/provider-docs/2530213
site_recovery_hyperv_replication_policy                                          https://registry.terraform.io/v2/provider-docs/2530388
subscription_policy_assignment                                                   https://registry.terraform.io/v2/provider-docs/2530499
automation_webhook                                                               https://registry.terraform.io/v2/provider-docs/2529715
dev_test_linux_virtual_machine                                                   https://registry.terraform.io/v2/provider-docs/2529897
express_route_connection                                                         https://registry.terraform.io/v2/provider-docs/2529945
kusto_iothub_data_connection                                                     https://registry.terraform.io/v2/provider-docs/2530039
logz_sub_account_tag_rule                                                        https://registry.terraform.io/v2/provider-docs/2530094
private_dns_txt_record                                                           https://registry.terraform.io/v2/provider-docs/2529511
log_analytics_cluster                                                            https://registry.terraform.io/v2/provider-docs/2530063
machine_learning_workspace                                                       https://registry.terraform.io/v2/provider-docs/2529442
dev_test_virtual_network                                                         https://registry.terraform.io/v2/provider-docs/2529900
private_endpoint                                                                 https://registry.terraform.io/v2/provider-docs/2530284
and much more ...
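The listing above mixes every kind of doc, while the question only asks for the guide titles from the sidebar. A minimal sketch of narrowing it down, assuming each entry in `included` carries a `category` attribute and that guides are tagged `guides` (an assumption about the payload, worth checking against a real response). `titles_in_category` is a hypothetical helper:

```python
def titles_in_category(provider_docs, category="guides"):
    """Return the titles of the included docs whose category matches, in payload order."""
    return [
        doc["attributes"]["title"]
        for doc in provider_docs.get("included", [])
        # .get() keeps the filter safe if an entry has no category attribute
        if doc["attributes"].get("category") == category
    ]


# Tiny hand-written stand-in for session.get(provider_versions).json()
sample_docs = {
    "included": [
        {"attributes": {"title": "netapp_volume", "category": "resources"}},
        {"attributes": {"title": "Azure Provider: Authenticating via the Azure CLI", "category": "guides"}},
        {"attributes": {"title": "Azure Provider: Authenticating via Managed Identity", "category": "guides"}},
    ]
}
for title in titles_in_category(sample_docs):
    print(title)
```

Called with the real `provider_docs` dictionary, the same helper would print whatever the registry files under that category, one title per line.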
Then, using the links you can get the documentation's content.
For example:
# Get the first doc and its body
first_doc = session.get(docs[0][1]).json()
print(first_doc['data']['attributes']['content'])
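This should give you the doc's raw Markdown, front matter included. The first entry may not match the table above; in this run it was iothub_device_update_instance: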
---
subcategory: "IoT Hub"
layout: "azurerm"
page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"
description: |-
  Manages an IoT Hub Device Update Instance.
---
# azurerm_iothub_device_update_instance
Manages an IoT Hub Device Update Instance.
## Example Usage
>> truncated <<
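The content comes back as Markdown with a front matter block between the two `---` lines, which is handy if you want the subcategory or page title without rendering the page. A minimal sketch that separates the two parts with plain string handling; it only reads one-line `key: "value"` entries and skips block scalars such as `description: |-`, so a YAML library would be the more robust choice. `split_front_matter` is a hypothetical helper:

```python
def split_front_matter(content):
    """Split a doc into (metadata dict, Markdown body)."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, content  # no front matter present
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, content  # opening marker without a closing one
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only one-line scalars; continuation lines and "|-" markers are skipped
        if sep and not line.startswith(" ") and value and not value.startswith("|"):
            meta[key.strip()] = value.strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


sample = '''---
subcategory: "IoT Hub"
layout: "azurerm"
page_title: "Azure Resource Manager: azurerm_iothub_device_update_instance"
description: |-
  Manages an IoT Hub Device Update Instance.
---

# azurerm_iothub_device_update_instance
'''
meta, body = split_front_matter(sample)
print(meta["subcategory"])
print(body.splitlines()[0])
```

In the flow of this answer, `content` would be `first_doc['data']['attributes']['content']`.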
Answered By - baduker