Monday, December 4, 2023

[FIXED] Web Scraping Satellite Image from publicly available data

beautifulsoup, python-3.x, web-scraping

Issue

import re
import requests
from bs4 import BeautifulSoup

webpage = 'https://xgis.maaamet.ee/xgis2/page/app/ristipuud'

response = requests.get(webpage)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|tif|png))$', url)
    if not filename:
        print("didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(webpage, url)
        response = requests.get(url)
        f.write(response.content)

#code for Lithuania

import time
import requests
from bs4 import BeautifulSoup
import os

def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

def get_file_name(url):
    tokens = url.split("/")
    file_name = tokens[-1].split("?")[0]
    return file_name

# Start timer
start_time = time.time()
print("Start time: ", start_time)

# Create image directory
image_directory = 'images'
isExist = os.path.exists(image_directory)
if not isExist:
    os.makedirs(image_directory)

template = "https://www.geoportal.lt/map/webapp/rest/mapgateway/6100e156c755e15f6e46a8820824d8c595d30ae51?f=json"

response = requests.get(template)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.find("a")
    if link is not None:
        url = 'https://www.geoportal.lt/' + link['href']
        file_name = get_file_name(url)
        print(file_name)
        # Save zip file
        download_url(url, './' + image_directory + '/' + file_name)

# End timer
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)

Link: https://www.geoportal.lt/map/index.jsp?lang=en

I want to download satellite images from this website (link: https://xgis.maaamet.ee/xgis2/page/app/ristipuud). There are about 6000 satellite images in tif format, and I want about 500 of them for my research. I have to repeat the same process frequently, so I want to do it by scraping, but I am having a problem: when I run this code, it does not show any error, but it also does not download any data. The images on the website are divided into tiles, and each tile can be downloaded separately by searching for its tile number at this link: https://geoportaal.maaamet.ee/eng/Maps-and-Data/Orthophotos/Download-Orthophotos-p662.html . The RGB orthophotos come in a zip file in .tif format. There are multiple versions of each image depending on the year, and I want the latest one. Could you please help me identify the mistakes in my code or share your experience? I am a novice in coding and trying to learn more.

Solution

This code can download the zipped map files.

import time
import requests
from bs4 import BeautifulSoup
import os

def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

def get_file_name(url):
    tokens = url.split("&")
    for token in tokens:
        if(token[:2] == 'f='):
            return token[2:]
    return ''

# Start timer
start_time = time.time()
print("Start time: ", start_time)

# create image directory
image_directory = 'images'
isExist = os.path.exists(image_directory)
if not isExist:
    os.makedirs(image_directory)


# get zip URL and file name
start_sheet = 44744
end_sheet = 44844 # test range of 100 sheets only; raise this toward 74331 for the full range
total_download = 0
for index in range(start_sheet, end_sheet):
    template = "https://geoportaal.maaamet.ee/index.php?lang_id=2&plugin_act=otsing&page_id=662&&kaardiruut={sheet_number:n}&andmetyyp=ortofoto_eesti_rgb"
    webpage = template.format(sheet_number = index)
    response = requests.get(webpage)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        link = soup.find("a")
        if link is not None:
            url = 'https://geoportaal.maaamet.ee/' + link['href']
            file_name = get_file_name(url)
            print(file_name)
            # save zip file
            download_url(url, './' + image_directory + '/' + file_name)
            total_download = total_download + 1
# End timer
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)
print("Total Download zip files: ", total_download)
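The `get_file_name` helper above looks for a token that starts with `f=`. As a minimal alternative sketch, the same value can be read with the standard library's `urllib.parse`. The function name `file_name_from_query` and the sample URL are hypothetical and only illustrate the shape of such a link; they are not an actual link returned by the site.

```python
from urllib.parse import urlparse, parse_qs

def file_name_from_query(url, key="f"):
    """Return the first value of query parameter `key`, or '' if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get(key, [])
    return values[0] if values else ""

# Hypothetical download link, for illustration only
sample = ("https://geoportaal.maaamet.ee/index.php"
          "?lang_id=2&plugin_act=otsing&f=example_44744.zip&page_id=662")
print(file_name_from_query(sample))
```

Unlike splitting on `&`, this also finds the parameter when it is the first one after the `?`.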

Result after finish

After unzipping, you will find the GeoTIFF file.

Main Idea

As you pointed out, this page

https://geoportaal.maaamet.ee/eng/Maps-and-Data/Orthophotos/Download-Orthophotos-p662.html

states the sheet number range:

Map sheet numbers of 1:10000 scale are between 44744 to 74331.

In Chrome (or Firefox), press the F12 key to open Dev Tools.

The 'Network' tab lists each HTTPS call; the request details are under its 'Headers' tab.

With this panel open, search for a sheet number (44744) by pressing the search button, and the request URL appears.

This is the template URL:

https://geoportaal.maaamet.ee/index.php?lang_id=2&plugin_act=otsing&page_id=662&&kaardiruut=44744&andmetyyp=ortofoto_eesti_rgb&_=1686945341505

The kaardiruut parameter is the key: it selects the sheet number.

kaardiruut=44744

To download another area, the program increments this value.
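The idea above can be sketched as a small URL builder. This is a minimal sketch, assuming the parameter names seen in the captured request; the function name `build_search_url` is hypothetical. The trailing `_=...` parameter from the captured URL is left out, as it is in the answer's own template.

```python
from urllib.parse import urlencode

BASE_URL = "https://geoportaal.maaamet.ee/index.php"

def build_search_url(sheet_number, data_type="ortofoto_eesti_rgb"):
    """Build the search URL for one map sheet (kaardiruut)."""
    params = {
        "lang_id": 2,
        "plugin_act": "otsing",
        "page_id": 662,
        "kaardiruut": sheet_number,   # the only value that changes per sheet
        "andmetyyp": data_type,
    }
    return BASE_URL + "?" + urlencode(params)

# Print the URLs for three consecutive sheets
for sheet in range(44744, 44747):
    print(build_search_url(sheet))
```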

Update for Lithuania with Orthophoto 2021-2023

The Lithuanian map does not support zip downloads; it supports downloading map images directly.

This map server is a good example of a tile map:

https://www.maptiler.com/google-maps-coordinates-tile-bounds-projection/#10/24.70/56.21

`https://www.geoportal.lt/map` format

"https://www.geoportal.lt/map/webapp/rest/mapgateway/{year_id:s}/tile/{scale:n}/{y:n}/{x:n}"

year_id example

Orthophoto 2021-2023 is '6100e156c755e15f6e46a8820824d8c595d30ae50'

Orthophoto 2018-2020 is '8ddf422a20f8a22fd7c116ef7d6a46eec4126d521'

scale = 8 # (1 : 10 000)

x (longitude) range for 2021-2023

start_x = 8263 (minimum)

end_x = 8510 (maximum)

y (latitude) range for 2021-2023

start_y = 5524 (minimum)

end_y = 5839 (maximum)
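To get a feel for the size of the full area, the ranges above can be turned into a tile count and a URL generator. This is a minimal sketch, assuming both ranges are inclusive (the source lists them as min and max); the helper name `tile_urls` is hypothetical.

```python
TEMPLATE = ("https://www.geoportal.lt/map/webapp/rest/mapgateway/"
            "{year_id}/tile/{scale}/{y}/{x}")

def tile_urls(year_id, scale, x_range, y_range):
    """Yield one tile URL per (x, y) pair; both ranges are (min, max), inclusive."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            yield TEMPLATE.format(year_id=year_id, scale=scale, y=y, x=x)

YEAR_ID = "6100e156c755e15f6e46a8820824d8c595d30ae50"  # Orthophoto 2021-2023
full_x = (8263, 8510)
full_y = (5524, 5839)

# Number of tiles if every (x, y) pair in the full range exists
total = (full_x[1] - full_x[0] + 1) * (full_y[1] - full_y[0] + 1)
print("tiles in full range:", total)

# First URL of the full range
first = next(tile_urls(YEAR_ID, 8, full_x, full_y))
print(first)
```

Note that Python's `range(start, end)` stops before `end`, so a loop written as `range(start_x, end_x)` skips the last column unless `end_x` is set one past the maximum.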

Demo code

import time
import os
import imghdr  # deprecated since Python 3.11, removed in Python 3.13
import requests

def download_url(image_url, save_path):
    # copy from Chrome's Network Tab/Headers/Request Headers/Cookie
    cookies = {'JSESSIONID_MWEB': '26F90E44851C5CC9CD41E7A1AE056C54;'}
    response = requests.get(url=image_url, cookies=cookies)
    if response.status_code == 200:
        # detect the image type from the content; fall back to 'bin' if unknown
        extension = imghdr.what(file=None, h=response.content) or 'bin'
        print(save_path + '.' + extension)
        with open(save_path + '.' + extension, 'wb') as handler:
            handler.write(response.content)
        return True
    return False

def get_file_name(url):
    file_name = url.rsplit('/',1)[1] # file name
    return file_name

def get_directory_name(url):
    # URL ends with .../tile/{scale}/{y}/{x}, so the second-to-last part is y
    y = url.rsplit('/',2)[1]
    scale = url.rsplit('/',3)[1]
    return scale + '/' + y

def create_directory_name(directory_name):
    isExist = os.path.exists(directory_name)
    if not isExist:
        os.makedirs(directory_name)

start_x = 8330 # min number 8263
end_x = 8334   # max number 8510 

start_y = 5635 # min number 5524
end_y = 5640   # max number 5839

total_download = 0

year_id = '6100e156c755e15f6e46a8820824d8c595d30ae50' # Orthophoto 2021-2023
scale = 8 # (1 : 10 000)
# Start timer
start_time = time.time()
print("Start time: ", start_time)

for x_number in range(start_x, end_x):
    for y_number in range(start_y, end_y):
        # ~/{year_id}/tile/{scale}/{y}/{x}
        template = "https://www.geoportal.lt/map/webapp/rest/mapgateway/{year_id:s}/tile/{scale:n}/{y:n}/{x:n}"
        url = template.format(year_id=year_id, scale=scale, y=y_number, x=x_number)
        # tiles are saved as ./{scale}/{y}/{x}.{extension}
        directory = get_directory_name(url)
        create_directory_name('./' + directory + '/')
        success = download_url(url, './' + directory + '/' + get_file_name(url))
        if success:
            total_download = total_download + 1

# End timer
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time, " Secs")
print("Total Download tile files: ", total_download)
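The demo relies on `imghdr` to pick a file extension, and that module is no longer in the standard library as of Python 3.13. A minimal stand-in, assuming only PNG, JPEG, GIF and TIFF need to be recognized, checks the leading magic bytes of the response. The function name `sniff_image_extension` is hypothetical, and it returns `jpg`/`tif` where `imghdr` would return `jpeg`/`tiff`.

```python
def sniff_image_extension(data):
    """Guess an image extension from the first bytes; None if unrecognized."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "tif"
    return None

# Example with hand-made byte strings (not real tiles)
print(sniff_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_image_extension(b"<html>not an image</html>"))
```

A `None` result is also a useful signal that the server returned an error page instead of a tile, for example when the session cookie has expired.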

Result

I got the URL and cookies from Dev Tools.

Update distance, geo location and image resolution

The map shows a grid over the tiles, and the cursor location, in meters, appears in the bottom-left corner. (I found a defect: X and Y are switched.)

https://www.geoportal.lt/map/index.jsp?lang=en

I measured the sides of the red rectangle by hovering the mouse over its corners, capturing the displayed x, y values, and calculating the distance between the points. (Again, X and Y need to be switched because of that bug.)

X distance = 3,909 m (yellow color)
Y distance = 4,637 m (green color)

Back to our program, to figure out the resolution of a tile.

Every tile is a 256 * 256 pixel image, which is a small file.

I re-ran my program to fetch that area:

start_x = 8416 # min number 8263
end_x = 8424   # max number 8510 

start_y = 5604 # min number 5524
end_y = 5611   # max number 5839

I got this result.

Next, how much real-world distance does one pixel cover? The width marked by the red cross is 1092 meters (measured by hovering the mouse and reading the tool in the bottom-left corner): delta X = 582123 - 581031 = 1092 m

total pixels = 56 + 256 + 106 = 418 pixels
meters per pixel = 1092 m / 418 px = 2.61 m/pixel

So my estimate is roughly 2.5 m/pixel (a guess).

1 tile = 256 pixels * 256 pixels = 640 m * 640 m

If you combine many tiles, for example into a 20000 * 20000 pixel image (like Estonia's GeoTIFF), that is about 78 tiles * 78 tiles, and the result is a high-resolution image.

I hope my guess matches the real size. Good luck! I have no more time to spend on this question, so please investigate other areas yourself.
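The arithmetic above can be checked in a few lines. This is a minimal sketch using the measured delta of 1092 m and the 418-pixel width; the variable names are hypothetical, and the 2.5 m/pixel figure is the answer's own rounded guess.

```python
# Measured values from the map
delta_x_m = 582123 - 581031          # real-world width in meters
width_px = 56 + 256 + 106            # same width in pixels, across three tiles

m_per_px = delta_x_m / width_px      # measured ground resolution
tile_px = 256

print("meters per pixel (measured):", round(m_per_px, 2))
print("tile side at 2.5 m/pixel:", tile_px * 2.5, "m")
print("tiles per side for 20000 px:", 20000 / tile_px)
```

With the measured value instead of the rounded 2.5 m/pixel, a tile side comes out somewhat larger than 640 m, so the 640 m figure should be read as an approximation.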

Answered By - Bench Vue

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
