Content scraped with scrapy comes back in pieces instead of as one whole

Bounty: 5 园豆 (garden beans) [Unresolved question]

Example page:

finance.eastmoney.com/news/1345,20181129995332038.html

item['Content'] = response.xpath('//div[@class="Body"]').extract_first()
    or
item['Content'] = response.xpath('//div[@class="Body"]/p').extract_first()

The problem: on the example page, this XPath returns content that looks like this (every line is wrapped in single quotes):

'<div id="ContentBody" class="Body">\r\n'
            '            <div class="abstract">摘要</div>\r\n'
            '            <div '
            '        <!--浪客直播-->\r\n'
            '\r\n'
            '        <!--文章主体-->\r\n'
            '             <p>\u3000\u3000'

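As a result, when I use sub I can only write one replacement rule per line, instead of applying a single regex to the whole content.

I checked the type, and it is str.
Code: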

# -*- coding: utf-8 -*-
import scrapy
from Eastmoney.items import EastmoneyItem
import re


class EastmoneySpider(scrapy.Spider):
    name = 'eastmoney'
    allowed_domains = ['eastmoney.com']
    start_urls = ['http://finance.eastmoney.com/news/1345,20181129995332038.html']

    def parse(self, response):  # response is the result of crawling the links in start_urls
        item = EastmoneyItem()
        item['Title'] = response.xpath('//div[@class="newsContent"]/h1/text()').extract_first()
        item['Cratedtime'] = response.xpath('//div[@class="time-source"]/div/text()').extract_first()
        item['Source'] = response.xpath('//div[@class="source data-source"]/span/text()').extract_first()
        item['Content'] = response.xpath('//div[@class="Body"]').extract_first()
        yield item

python scrapy

会发光 | Rookie level 2 | 园豆: 258
Asked: 2018-11-30 10:20


All answers (1)

I ran your code, and what I got back was the entire content as a single string.
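
One possible explanation for the difference, offered as an assumption rather than something confirmed in this thread: the quoted "lines" may only be how the item is displayed. Python's pprint wraps a long str into adjacent quoted literals, and Scrapy items are, as far as I know, rendered through pprint when logged. The sketch below uses a made-up dict and HTML snippet (both hypothetical) to show that the value stays one str even though its printed form spans several quoted lines.

```python
import ast
from pprint import pformat

# Hypothetical stand-in for the extracted HTML: ONE str that contains line breaks.
html = (
    '<div id="ContentBody" class="Body">\r\n'
    '            <div class="abstract">Summary</div>\r\n'
    '        <!--article body-->\r\n'
    '             <p>first paragraph</p>\r\n'
    '</div>'
)

# A plain dict stands in for the item here (assumption for illustration).
item = {'Content': html}

# pformat wraps the long value into several adjacent quoted literals.
rendered = pformat(item)
print(rendered)

# The value itself is still a single str ...
print(type(item['Content']))

# ... and the adjacent literals concatenate back into exactly the same value.
assert ast.literal_eval(rendered) == item
```

If this is what is happening, printing item['Content'] directly, rather than the whole item, should show the raw text without the per-line quotes.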

Masako | 园豆: 1893 (Shrimp level 3) | 2018-11-30 18:28

Hmm, so why does mine come out in pieces? It makes regex replacement really awkward.

会发光 | 园豆: 258 (Rookie level 2) | 2018-12-03 09:02
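
For reference, if the value really is one str, a single pattern can span its line breaks. This is a minimal sketch using only the standard re module; the sample HTML and the variable names are hypothetical, and stripping tags with regexes is shown only for illustration, not as a robust way to parse HTML.

```python
import re

# Hypothetical sample; in the spider this would be the value of item['Content'].
content = (
    '<div id="ContentBody" class="Body">\r\n'
    '    <div class="abstract">Summary</div>\r\n'
    '    <!--live stream-->\r\n'
    '\r\n'
    '    <!--article body-->\r\n'
    '    <p>First paragraph.</p>\r\n'
    '    <p>Second paragraph.</p>\r\n'
    '</div>'
)

# re.S makes "." match line breaks too, so one pattern covers multi-line spans.
no_comments = re.sub(r'<!--.*?-->', '', content, flags=re.S)

# Drop the remaining tags, then collapse all runs of whitespace (including \r\n).
no_tags = re.sub(r'<[^>]+>', '', no_comments)
text = re.sub(r'\s+', ' ', no_tags).strip()

print(text)
```

Each re.sub call here works on the whole string at once, so no per-line rules are needed.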
