Code:
import scrapy
from scrapy import Spider, Request


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'
    follows_url = 'https://www.zhihu.com/api/v4/members/{}/followees?includ={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20), callback=self.parse_follows)

    def parse_user(self, response):
        print(response.text)

    def parse_follows(self, response):
        print(response.text)
Run output:
F:\pycode\zhihuuser>scrapy crawl zhihu
2019-04-08 20:59:59 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: zhihuuser)
2019-04-08 20:59:59 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.10586-SP0
2019-04-08 20:59:59 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'zhihuuser', 'NEWSPIDER_MODULE': 'zhihuuser.spiders', 'SPIDER_MODULES': ['zhihuuser.spiders']}
2019-04-08 21:00:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-08 21:00:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-08 21:00:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-08 21:00:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-08 21:00:00 [scrapy.core.engine] INFO: Spider opened
2019-04-08 21:00:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-08 21:00:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-08 21:00:00 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "e:\python\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "F:\pycode\zhihuuser\zhihuuser\spiders\zhihu.py", line 20, in start_requests
yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20), callback=self.p
arse_follows)
IndexError: tuple index out of range
2019-04-08 21:00:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/api/v4/members/excited-vczh?include=allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics> (referer: None)
{"id":"0970f947b898ecc0ec035f9126dd4e08","url_token":"excited-vczh","name":"vczh","avatar_url":"https://pic1.zhimg.com/v2-1bea18
837914ab5a40537d515ed3219c_is.jpg","avatar_url_template":"https://pic1.zhimg.com/v2-1bea18837914ab5a40537d515ed3219c_{size}.jpg"
,"is_org":false,"type":"people","url":"https://www.zhihu.com/people/excited-vczh","user_type":"people","headline":"专业造轮子,
拉黑抢前排。gaclib.net","gender":1,"is_advertiser":false,"vip_info":{"is_vip":false,"rename_days":"60"},"badge":[],"allow_messag
e":true,"is_following":false,"is_followed":false,"is_blocking":false,"follower_count":789253,"answer_count":21994,"articles_coun
t":128,"employments":[{"job":{"id":"19578588","type":"topic","url":"https://www.zhihu.com/topics/19578588","name":"Developer","a
vatar_url":"https://pic4.zhimg.com/e82bab09c_is.jpg"},"company":{"id":"19557307","type":"topic","url":"https://www.zhihu.com/top
ics/19557307","name":"Microsoft Office","avatar_url":"https://pic4.zhimg.com/v2-d3a9ee5ba3a2fe711087787c6169dcca_is.jpg"}}]}
2019-04-08 21:00:00 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-08 21:00:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 483,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1068,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 8, 13, 0, 0, 811918),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 8, 13, 0, 0, 328233)}
Expected result:
JSON data from both requests; the followees JSON was never retrieved.
Also, PyCharm shows "Overrides method in Spider" on lines 6 and 18, and I don't know what that means.
Your follows_url has no 'user' parameter; the first pair of braces is empty.
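That empty {} is a positional placeholder: str.format() tries to look up positional argument 0, but only keyword arguments were passed, and that is exactly what raises IndexError: tuple index out of range. A minimal sketch of the fix, naming the placeholder {user} (and also correcting the includ typo in the query string, on the assumption that the API expects include):

template = 'https://www.zhihu.com/api/v4/members/{}/followees'
try:
    # An empty {} means positional argument 0; format() received only
    # keyword arguments, so the positional lookup fails.
    template.format(user='excited-vczh')
except IndexError as exc:
    print(exc)  # on Python 3.6 this prints: tuple index out of range

# Naming the placeholder lets every argument be matched by keyword:
follows_url = ('https://www.zhihu.com/api/v4/members/{user}/followees'
               '?include={include}&offset={offset}&limit={limit}')
print(follows_url.format(user='excited-vczh',
                         include='data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics',
                         offset=0, limit=20))

As for the PyCharm hint: "Overrides method in Spider" is just an informational gutter annotation indicating that a method overrides one defined in the scrapy Spider base class (here, start_requests); it is not an error.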
Oops, silly me, I completely missed that.