Issue
To scrape one site, I have to send duplicate lines in the request body to get the JSON data. I tested this method with requests, but it doesn't work when I use Scrapy. There are no duplicates in the body of the request:
import json

import scrapy
from scrapy.shell import inspect_response


class MainSpider(scrapy.Spider):
    name = 'main'
    allowed_domains = ['ukonlinestores.co.uk']
    # start_urls = ['https://ukonlinestores.co.uk/amazon-uk-sellers/']
    search_url = 'https://ukonlinestores.co.uk/wp-admin/admin-ajax.php?action=get_wdtable&table_id=9'
    handle_httpstatus_list = [400]

    def parse_search(self, response):
        # Open a Scrapy shell on the response for inspection.
        inspect_response(response, self)

    def start_requests(self):
        data = {
            'draw': '2',
            'columns[0][data]': '0',
            'columns[0][name]': 'wdt_ID',
            'columns[0][searchable]': 'true',
            'columns[0][orderable]': 'true',
            'columns[0][orderable]': 'true',
            'columns[0][search][value]': '',
            'columns[0][search][value]': '',
            'columns[0][search][regex]': 'false',
            'columns[0][search][regex]': 'false',
            'columns[1][data]': '1',
            'columns[1][data]': '1',
            'columns[1][name]': 'sellerid',
            'columns[1][name]': 'sellerid',
            'columns[1][searchable]': 'true',
            'columns[1][searchable]': 'true',
            'columns[1][orderable]': 'true',
            'columns[1][orderable]': 'true',
        }
        yield scrapy.Request(
            self.search_url,
            callback=self.parse_search,
            method='POST',
            headers=headers,  # headers dict is not shown in the question
            body=json.dumps(data))
>>> request.body
b'{"columns[0][data]": "0", "columns[0][name]": "wdt_ID", "columns[0][orderable]": "true", "columns[0][search][regex]": "false", "columns[0][search][value]": "", "columns[0][searchable]": "true", "columns[10][data]": "10", "columns[10][name]": "positive12months", "columns[10][orderable]": "true", "columns[10][search][regex]": "false", "columns[10][search][value]": "", "columns[10][searchable]": "true", "columns[11][data]": "11", "columns[11][name]": "positivelifetime", "columns[11][orderable]": "true", "columns[11][search][regex]": "false", "columns[11][search][value]": "", "columns[11][searchable]": "true", "columns[12][data]": "12", "columns[12][name]": "count30day", "columns[12][orderable]": "true", "columns[12][search][regex]": "false", "columns[12][search][value]": "", "columns[12][searchable]": "true", "columns[13][data]": "13", "columns[13][name]": "count90day", "columns[13][orderable]": "true",
How can I bypass this behavior?
Solution
Here's how you can get the data via requests. You have to reverse engineer the HTTP requests. To gain access to https://ukonlinestores.co.uk/wp-admin/admin-ajax.php, you have to recreate the POST request. Sometimes a bare request is enough; other times you have to include parameters, cookies, and headers. I tend to start with a simple request and build up. Here I didn't need the headers, but the params and data are necessary to get the JSON data you require.
I tend to use Chrome DevTools and copy the request into http://curl.trillworks.com. That way I get nicely formatted headers, cookies, and params.
You could also use the same params and data in a Scrapy script.
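As a rough sketch of the Scrapy side, assuming the endpoint accepts the same form-encoded fields that requests sends: scrapy.FormRequest takes formdata as a dict or as an iterable of (key, value) tuples and form-encodes it, defaulting to POST. With tuples, repeated keys are kept. The helper names build_form_pairs and make_search_request, the column list, and the nonce argument are illustrative and not from the original post.

```python
# Hypothetical helpers; names and structure are assumptions for illustration.
SEARCH_URL = ('https://ukonlinestores.co.uk/wp-admin/admin-ajax.php'
              '?action=get_wdtable&table_id=25')


def build_form_pairs(column_names, nonce):
    """Return DataTables-style form fields as a list of (key, value) tuples."""
    pairs = [('draw', '1')]
    for i, name in enumerate(column_names):
        prefix = f'columns[{i}]'
        pairs += [
            (f'{prefix}[data]', str(i)),
            (f'{prefix}[name]', name),
            (f'{prefix}[searchable]', 'true'),
            (f'{prefix}[orderable]', 'true'),
            (f'{prefix}[search][value]', ''),
            (f'{prefix}[search][regex]', 'false'),
        ]
    pairs += [
        ('order[0][column]', '2'),
        ('order[0][dir]', 'asc'),
        ('start', '0'),
        ('length', '50'),
        ('search[value]', ''),
        ('search[regex]', 'false'),
        ('wdtNonce', nonce),
    ]
    return pairs


def make_search_request(callback, pairs):
    """Build a form-encoded POST request for the table endpoint."""
    # Imported here so build_form_pairs can be used without Scrapy installed.
    import scrapy
    # FormRequest form-encodes formdata and uses POST when no method is given.
    return scrapy.FormRequest(SEARCH_URL, formdata=pairs, callback=callback)
```

Inside a spider, start_requests could then yield make_search_request(self.parse_search, build_form_pairs(column_names, nonce)), with the column names and nonce taken from the request captured in the browser.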
Note that, looking at the data payload, you hadn't included a lot of it, which is probably why you weren't getting the response you needed. Here's an example of using requests to do it.
Code Example
import requests

# Query-string parameters for the admin-ajax endpoint.
params = (
    ('action', 'get_wdtable'),
    ('table_id', '25'),
)

# Form fields sent in the POST body.
data = {
    'draw': '1',
    'columns[0][data]': '0',
    'columns[0][name]': 'storeurl',
    'columns[0][searchable]': 'true',
    'columns[0][orderable]': 'true',
    'columns[0][search][value]': '',
    'columns[0][search][regex]': 'false',
    'columns[1][data]': '1',
    'columns[1][name]': 'positivefeedback',
    'columns[1][searchable]': 'true',
    'columns[1][orderable]': 'true',
    'columns[1][search][value]': '',
    'columns[1][search][regex]': 'false',
    'columns[2][data]': '2',
    'columns[2][name]': 'rank',
    'columns[2][searchable]': 'true',
    'columns[2][orderable]': 'true',
    'columns[2][search][value]': '',
    'columns[2][search][regex]': 'false',
    'columns[3][data]': '3',
    'columns[3][name]': 'storemarketplace',
    'columns[3][searchable]': 'true',
    'columns[3][orderable]': 'true',
    'columns[3][search][value]': '',
    'columns[3][search][regex]': 'false',
    'columns[4][data]': '4',
    'columns[4][name]': 'maincategory',
    'columns[4][searchable]': 'true',
    'columns[4][orderable]': 'true',
    'columns[4][search][value]': '',
    'columns[4][search][regex]': 'false',
    'columns[5][data]': '5',
    'columns[5][name]': 'noofproducts',
    'columns[5][searchable]': 'true',
    'columns[5][orderable]': 'true',
    'columns[5][search][value]': '',
    'columns[5][search][regex]': 'false',
    'columns[6][data]': '6',
    'columns[6][name]': 'fulfilmenttype',
    'columns[6][searchable]': 'true',
    'columns[6][orderable]': 'true',
    'columns[6][search][value]': '',
    'columns[6][search][regex]': 'false',
    'columns[7][data]': '7',
    'columns[7][name]': 'countlifetime',
    'columns[7][searchable]': 'true',
    'columns[7][orderable]': 'true',
    'columns[7][search][value]': '',
    'columns[7][search][regex]': 'false',
    'order[0][column]': '2',
    'order[0][dir]': 'asc',
    'start': '0',
    'length': '50',
    'search[value]': '',
    'search[regex]': 'false',
    'wdtNonce': '78ce0f8f66'
}

# No custom headers are passed: they were not needed for this endpoint.
response = requests.post('https://ukonlinestores.co.uk/wp-admin/admin-ajax.php', params=params, data=data)
json_data = response.json()
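A side note on why the duplicates vanish in the question's request body, shown with a minimal standard-library sketch: a Python dict literal keeps only the last value for a repeated key, so the duplicates are gone before json.dumps ever runs. A list of (key, value) tuples keeps every pair, and requests.post also accepts such a list for data. The variable names below are illustrative only.

```python
import json
from urllib.parse import urlencode

# A dict literal with a repeated key silently keeps a single entry.
as_dict = {
    'columns[0][orderable]': 'true',
    'columns[0][orderable]': 'true',
}

# A list of tuples preserves every pair, duplicates included.
as_pairs = [
    ('columns[0][orderable]', 'true'),
    ('columns[0][orderable]', 'true'),
]

json_body = json.dumps(as_dict)  # JSON text containing the key once
form_body = urlencode(as_pairs)  # form-encoded text containing the key twice

print(len(as_dict), len(as_pairs))
print(json_body)
print(form_body)
```

Note also that json.dumps produces a JSON body, whereas passing a dict or a list of tuples as data to requests produces a form-encoded body, so the two requests in this thread are not sending the same thing.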
Answered By - AaronS