Example page:
finance.eastmoney.com/news/1345,20181129995332038.html
item['Content'] = response.xpath('//div[@class="Body"]').extract_first()
or
item['Content'] = response.xpath('//div[@class="Body/p"]').extract_first()
The problem: on the example page, with this XPath, the content I get looks like this (every line is wrapped in single quotes):
'<div id="ContentBody" class="Body">\r\n'
' <div class="abstract">摘要</div>\r\n'
' <div '
' <!--浪客直播-->\r\n'
'\r\n'
' <!--文章主体-->\r\n'
' <p>\u3000\u3000'
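As a result, when I use sub I can only write one replacement rule per line, instead of applying a single regex to the whole content.
I checked the type, and it is str.
Code: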
# -*- coding: utf-8 -*-
import scrapy
from Eastmoney.items import EastmoneyItem
import re


class EastmoneySpider(scrapy.Spider):
    name = 'eastmoney'
    allowed_domains = ['eastmoney.com']
    start_urls = ['http://finance.eastmoney.com/news/1345,20181129995332038.html']

    def parse(self, response):
        # response is the result of crawling the links in start_urls
        item = EastmoneyItem()
        item['Title'] = response.xpath('//div[@class="newsContent"]/h1/text()').extract_first()
        item['Cratedtime'] = response.xpath('//div[@class="time-source"]/div/text()').extract_first()
        item['Source'] = response.xpath('//div[@class="source data-source"]/span/text()').extract_first()
        item['Content'] = response.xpath('//div[@class="Body"]').extract_first()
        yield item
I ran your code, and what I get back is the whole content as a single string.
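If the value really is one str, a single regex can be run over all of it. Here is a minimal sketch of that idea, not the asker's actual cleanup rules: the `html` value is a made-up stand-in for what `extract_first()` returns, and the three patterns are only illustrative. The `re.S` flag lets `.` match across the `\r\n` line breaks.

```python
import re

# Hypothetical stand-in for the string returned by extract_first().
html = (
    '<div id="ContentBody" class="Body">\r\n'
    '    <div class="abstract">summary</div>\r\n'
    '    <!--comment-->\r\n'
    '\r\n'
    '    <p>\u3000\u3000First paragraph.</p>\r\n'
    '</div>'
)

# Drop HTML comments; re.S makes "." match newlines too,
# so a comment spanning several lines would also be removed.
no_comments = re.sub(r'<!--.*?-->', '', html, flags=re.S)

# Drop the remaining tags, applied to the whole string at once.
no_tags = re.sub(r'<[^>]+>', '', no_comments)

# Collapse runs of whitespace (including \r\n and \u3000) to one space.
text = re.sub(r'\s+', ' ', no_tags).strip()

print(text)
```

For real pages, selecting the text nodes with XPath is usually more robust than stripping tags with a regex; the regex version is shown only because that is what the question is about.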
Hmm, so why does mine come out piece by piece? That makes it hard to do the replacement with a regex.
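One likely explanation, offered as a guess since I cannot see how the value was printed: the pieces are only how the string is displayed. As far as I know, Scrapy shows scraped items through `pprint`, and `pprint` wraps a long string at its line breaks, printing each piece as its own quoted literal. The sketch below uses a plain dict and a made-up `html` value (both are assumptions, not the real item or page) to show the same display with one underlying str.

```python
import ast
from pprint import pformat

# Hypothetical stand-in for the HTML returned by extract_first().
html = (
    '<div id="ContentBody" class="Body">\r\n'
    '    <div class="abstract">summary</div>\r\n'
    '    <!--article body-->\r\n'
    '    <p>\u3000\u3000First paragraph of the article.</p>\r\n'
    '</div>'
)

# A plain dict standing in for the scraped item.
item = {'Content': html}

# pformat wraps the long string and shows each piece as a
# separate quoted literal on its own output line.
shown = pformat(item)
print(shown)

# The value itself is still a single str.
print(type(item['Content']))

# Adjacent string literals are concatenated when evaluated, so the
# wrapped display round-trips to exactly the original string.
round_trip = ast.literal_eval(shown)['Content']
print(round_trip == html)
```

If that is what is happening, `re.sub` on `item['Content']` already sees the full content, and printing it with `print(item['Content'])` instead of printing the item should show it without the quotes.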