Scrape the 25 entries from the 2019 annual box-office ranking page of China Box Office (www.cbooo.cn) at http://www.cbooo.cn/year?year=2019. Each entry contains 7 fields: film title, genre, total box office, average ticket price, average admissions per screening, country/region, and release date. Finally, save all 25 entries to a text file, with the 7 fields of each entry separated by English commas.
#!/usr/bin/python
import os

import requests
from bs4 import BeautifulSoup

web_url = r'http://www.cbooo.cn/year?year=2019'
filepath = os.path.split(os.path.realpath(__file__))[0] + '\\cbooodata.txt'
fid = open(filepath, 'w', encoding='utf-8')
headers = {
    "Host": "www.cbooo.cn",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}
r = requests.get(web_url, headers=headers)
if r.status_code == 200:
    contentHtml = BeautifulSoup(r.text, 'html.parser')
    # The ranking data sits in the table with id "tbContent"
    tbcontent = contentHtml.find(name='table', attrs={"id": "tbContent"})
    trHtml = list(filter(lambda x: x != "\n", tbcontent.contents))
    count = 0
    liste = []
    for tr in trHtml:
        count += 1
        if count == 1:  # skip the header row
            continue
        tdcontents = list(filter(lambda x: x != "\n", tr.contents))
        # Join the 7 fields of one film with English commas
        tdinfo = ",".join(td.get_text(strip=True).replace("\n", "") for td in tdcontents)
        liste.append(tdinfo)
        fid.write(tdinfo + "\n")
    print(liste)
fid.close()
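The same scrape can be written a bit more compactly with find_all() and the standard csv module, which handles the comma separation for you. This is only a sketch under the assumption that the page still serves a table with id "tbContent" whose first row is the header; the site layout may have changed since then.

#!/usr/bin/python
# Alternative sketch: same scrape, rows collected with find_all()
# and written with the csv module. Assumes the page still contains
# a <table id="tbContent"> whose first <tr> is the header row.
import csv
import os

import requests
from bs4 import BeautifulSoup

web_url = 'http://www.cbooo.cn/year?year=2019'
out_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'cbooodata.txt')
headers = {
    "Host": "www.cbooo.cn",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
}

r = requests.get(web_url, headers=headers)
r.raise_for_status()

soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', attrs={"id": "tbContent"})

with open(out_path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for tr in table.find_all('tr')[1:]:      # skip the header row
        row = [td.get_text(strip=True) for td in tr.find_all('td')]
        if row:                              # ignore empty rows
            writer.writerow(row)
            print(row)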
The scraped data is written to cbooodata.txt in the same directory as the current Python script.
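As a quick sanity check (a minimal sketch, assuming the file was written by the script above with one record per line), the file can be read back and each line split on the commas:

# Read the saved records back and confirm there are 25, each with 7 fields.
import os

path = os.path.split(os.path.realpath(__file__))[0] + '\\cbooodata.txt'
with open(path, encoding='utf-8') as f:
    records = [line.rstrip('\n').split(',') for line in f if line.strip()]

print(len(records))        # expected: 25
for rec in records[:3]:    # title, genre, total box office, ...
    print(rec)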