import json

import scrapy


class SouthwestSpider(scrapy.Spider):
    name = 'southwest'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['https://www.southwest.com']
    url = 'https://www.southwest.com/api/air-booking/v1/air-booking/page/air/booking/shopping'

    def start_requests(self):
        post_data = {
            "adultPassengersCount": "1",
            "application": "air-booking",
            "departureDate": "2020-10-01",
            "departureTimeOfDay": "ALL_DAY",
            "destinationAirportCode": "BDL",
            "fareType": "USD",
            "int": "HOMEQBOMAIR",
            "originationAirportCode": "LAX",
            "passengerType": "ADULT",
            "reset": "true",
            "returnDate": "2020-11-06",
            "returnTimeOfDay": "ALL_DAY",
            "site": "southwest",
            "tripType": "roundtrip",
        }
        yield scrapy.FormRequest(self.url, formdata=json.dumps(post_data), callback=self.parse)

    def parse(self, response):
        print(response)
Error message:
2020-09-30 20:14:37 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: southwestPro)
2020-09-30 20:14:37 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 17.9.0, Python 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.0, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-09-30 20:14:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-09-30 20:14:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'southwestPro',
'NEWSPIDER_MODULE': 'southwestPro.spiders',
'SPIDER_MODULES': ['southwestPro.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 '
'Safari/537.36'}
2020-09-30 20:14:37 [scrapy.extensions.telnet] INFO: Telnet Password: 6c139fcac3ae306c
2020-09-30 20:14:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-09-30 20:14:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-30 20:14:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-30 20:14:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-30 20:14:37 [scrapy.core.engine] INFO: Spider opened
2020-09-30 20:14:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-30 20:14:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-30 20:14:37 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/core/engine.py", line 129, in _next_request
request = next(slot.start_requests)
File "/Users/PycharmProjects/爬虫练手/2.100个简单练手的网站/30.机票/西南航空/southwestPro/southwestPro/spiders/southwest.py", line 26, in start_requests
yield scrapy.FormRequest(self.url,formdata=json.dumps(post_data),callback=self.parse)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/http/request/form.py", line 31, in __init__
querystr = _urlencode(items, self.encoding)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/http/request/form.py", line 72, in _urlencode
for k, vs in seq
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/http/request/form.py", line 72, in <listcomp>
for k, vs in seq
ValueError: not enough values to unpack (expected 2, got 1)
2020-09-30 20:14:37 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-30 20:14:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.012494,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 30, 12, 14, 37, 712716),
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 52596736,
'memusage/startup': 52596736,
'start_time': datetime.datetime(2020, 9, 30, 12, 14, 37, 700222)}
2020-09-30 20:14:37 [scrapy.core.engine] INFO: Spider closed (finished)
Could anyone tell me where the problem is? Please take a look, thanks~~
Pass post_data directly as the formdata argument; there is no need for json.dumps. json.dumps turns the dict into a single string, and FormRequest then iterates that string while trying to unpack (key, value) pairs, which is exactly the ValueError: not enough values to unpack (expected 2, got 1) in your traceback.
A snippet of FormRequest's internal implementation:

    if formdata:
        items = formdata.items() if isinstance(formdata, dict) else formdata
        querystr = _urlencode(items, self.encoding)

_urlencode serializes the dict into a URL query string, e.g. a=1&b=2.
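So the fix is just to drop json.dumps. A minimal sketch of the corrected request, with the field values taken from your spider:

    # Hand the dict itself to formdata; FormRequest urlencodes it and
    # sets Content-Type: application/x-www-form-urlencoded for you.
    yield scrapy.FormRequest(
        self.url,
        formdata=post_data,   # plain dict, not json.dumps(post_data)
        callback=self.parse,
    )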
Er, before I added json.dumps the error was Crawled (400) <POST ...>, and then I read online that adding the json module would make that stop happening.
2020-09-30 21:01:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-30 21:01:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-30 21:01:01 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.southwest.com/api/air-booking/v1/air-booking/page/air/booking/shopping> (referer: https://www.southwest.com/air/booking/index.html?adultPassengersCount=1&departureDate=2020-09-30&departureTimeOfDay=ALL_DAY&destinationAirportCode=BDL&fareType=USD&int=HOMEQBOMAIR&originationAirportCode=LAX&passengerType=ADULT&reset=true&returnDate=2020-10-03&returnTimeOfDay=ALL_DAY&tripType=roundtrip&validate=true)
This is a foreign website; could it be an IP or DNS problem? Although I can open the site normally in my browser.
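A 400 response means the request reached the server and was rejected, so it should not be an IP or DNS problem. Judging by the /api/ endpoint, it most likely expects a JSON body rather than form-encoded data, in which case json.dumps belongs in a plain Request's body together with a Content-Type header. A minimal sketch under that assumption (the real API may also require cookies or extra headers that this bare request does not send):

    # Sketch: POST the payload as JSON instead of form data.
    # Assumption: the endpoint accepts application/json; it may also
    # check cookies/headers that this minimal request omits.
    yield scrapy.Request(
        self.url,
        method='POST',
        body=json.dumps(post_data),
        headers={'Content-Type': 'application/json'},
        callback=self.parse,
    )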