Following the book's example, I'm practicing scraping JD.com's Python book listings. JD first loads 30 products, and scrolling down loads 30 more, for a total of 60 per page. Running the following in Chrome's console, I can trigger the scroll-loading:
```
$("ul.gl-warp>li").length
=> 30
e = document.getElementsByClassName('page clearfix')[0]
=> <div class="page clearfix">…</div>
e.scrollIntoView(true)
=> undefined
$("ul.gl-warp>li").length
=> 60
```
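To check on the Python side whether the lazily loaded second batch actually made it into the HTML that Splash returns, the `li.gl-item` nodes can be counted much like `$("ul.gl-warp>li").length` does in the console. A minimal stdlib sketch against a synthetic page (the real check would use `response.css(...)` inside the spider):

```python
import re

def count_items(html):
    # Count product <li> nodes whose class list contains "gl-item",
    # mirroring $("ul.gl-warp>li").length from the browser console.
    return len(re.findall(r'<li[^>]*class="[^"]*gl-item', html))

# Synthetic page with only the first batch of 30 items rendered.
fake_page = ('<ul class="gl-warp clearfix">'
             + '<li class="gl-item">x</li>' * 30
             + '</ul>')
print(count_items(fake_page))  # 30 -> the scroll-loaded half is missing
```

If this prints 30 for a real Splash response, the `runjs` scroll never fired and only the initial batch was rendered.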
But with scrapy-splash I still only get 30 products per page. I also noticed that the book's own results actually only scraped half as well: the dynamically loaded half was never scraped successfully. My code:
```python
import scrapy
from scrapy import Request
from scrapy_splash import SplashRequest
import re

lua_script = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(5)
    splash:runjs("document.getElementsByClassname('page')[0].scrollIntoView(true)")
    splash:wait(5)
    return splash:html()
end
'''

class JDBookSpider(scrapy.Spider):
    name = 'jd_book'
    allowed_domains = ['search.jd.com']
    base_url = 'https://search.jd.com/Search?keyword=python&enc=utf-8&wq=python'

    def start_requests(self):
        yield Request(self.base_url, callback=self.parse_url, dont_filter=True)

    def parse_url(self, response):
        num_char = response.css('span#J_resCount::text').extract_first()
        total = int(re.findall('\d*', num_char, re.S)[0])
        page_num = total * 10000 // 60 + (1 if total % 60 else 0)
        for i in range(2):
            url = '%s&page=%s' % (self.base_url, 2 * i + 1)
            yield SplashRequest(url, endpoint='execute',
                                args={'lua_source': lua_script, 'url': url},
                                cache_args=['lua_source'],
                                callback=self.parse)

    def parse(self, response):
        for sel in response.css('ul.gl-warp.clearfix>li.gl-item'):
            yield {
                'name': sel.css('div.p-name').xpath('string(.//em)').extract_first(),
                'price': sel.css('div.p-price i::text').extract_first(),
            }
```
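As a side note on the page-count arithmetic in `parse_url`: a standalone sketch of how the regex and the ceiling division behave, assuming `J_resCount` renders a string like `'4.5万+'` (that exact text is an assumption here; 万 means 10,000, which is why the code multiplies by 10000). Note the sketch applies the remainder check to the item count, whereas the original checks `total % 60` on the 万-count:

```python
import re

def page_count(num_char, per_page=60):
    # re.findall(r'\d*', ...) returns the leading digit run first,
    # so '4.5万+' yields '4' -> total is in units of 万 (10,000)
    total = int(re.findall(r'\d*', num_char)[0])
    items = total * 10000
    # ceiling division: one extra page for any remainder of items
    return items // per_page + (1 if items % per_page else 0)

print(page_count('4.5万+'))  # 40000 items -> 667 pages
```

Note the fractional part ('.5') is silently dropped by taking only the first digit run, so the page estimate is a lower bound.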
What is going wrong here, and could someone help me fix it?