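This URL loads fine in a browser, but when I request it from code (I have tried jsoup, crawler4j, and phantomjs), all I get back is JavaScript, not the data I want. What is the cause?
After a lot of trial and error, I found that the cookie has a very short expiry and becomes invalid almost immediately. How can I work around this?
Here is my jsoup code: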
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public Document connect(String id) throws IOException {
    Connection conn = Jsoup.connect("http://app1.sfda.gov.cn/datasearch/face3/content.jsp");

    // Query parameters.
    conn.data("tableId", "41");
    conn.data("tableName", "TABLE41");
    // This value is already percent-encoded (it decodes to 药品经营企业);
    // jsoup encodes data() values itself, so it may end up double-encoded.
    conn.data("tableView", "%E8%8D%AF%E5%93%81%E7%BB%8F%E8%90%A5%E4%BC%81%E4%B8%9A");
    conn.data("Id", id);

    // Request headers copied from the browser.
    conn.header("Host", "app1.sfda.gov.cn");
    conn.header("Connection", "keep-alive");
    conn.header("Cache-Control", "max-age=0");
    conn.header("Upgrade-Insecure-Requests", "1");
    conn.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36");
    conn.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    conn.header("Accept-Language", "zh-CN,zh;q=0.8");
    conn.header("Accept-Encoding", "gzip, deflate, sdch");
    // Cookie copied from a browser session; it expires quickly.
    conn.header("Cookie", "FSSBBIl1UgzbN7N80S=BtvfD2PFPAIu.XbuOfux4IcW8ktQDN49zpYsp08n72Px9zOY4YszFH1je4WTD7Wy; FSSBBIl1UgzbN7N80T=10NcPySs4h5GbNLSPd8WENhq3zbz6klmHCXAcNfW4JlariHbuONX7qATt3iPSLFQ0VxXzishMq1xRVktpgZjAnvyM9qcVrxkigiNQFjG9hqYbOG8bIQlWJXJMOCTNB8NRQbT8B1FrOBgZ8HXBmt5KrqwVMlATIP5ge2lCMwIxpbcjSs4kkdo7Ha_TO4yOa1kmTBg7xaQ2n9Aaj9IxoQB1rcWmwBodt1yp2YmoSi5xWgVtRjgAIyO7AfNDJKe5V92mNtWBHdd4gUZKgWuJS_iuKKqg_.8GtrMyuZsI9KxlzsO9iu..sXAFDG9CSnl7hp.6LGp81qKDovJllyEVepnCcW_u");
    conn.timeout(10000);

    // Fetch the whole page content.
    Document doc = conn.get();
    return doc;
}
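The page regenerates the cookie, so you have to analyze the JS. Clear all cookies for the site and you will see that one of the requests on the first visit returns a 302. That response carries encrypted (obfuscated) JS code. You cannot understand it just by reading it; you have to step through it in a debugger to learn anything, and it is written to be processed by the browser's window object.
I tried simulating a browser, and that does get the data.
Fetching it purely in code is more trouble: you have to deal with FSSBBIl1UgzbN7N80T, which is regenerated on almost every page. I have not solved that yet. If you make any progress, please reply. Thanks.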
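To see that first-request 302 from code rather than from the browser's dev tools, something like the following might help. It is only a sketch using the JDK's own HttpURLConnection; the class and method names (CookieProbe, parseSetCookie, probe) are made up for illustration. It shows the status code, the redirect target, and whichever cookies the server sends in Set-Cookie headers. It does not run the JS, so any cookie that the script computes in the browser will not show up here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of jsoup or any library.
class CookieProbe {

    /** Extracts name=value pairs from raw Set-Cookie header values, dropping attributes such as Path or Expires. */
    static Map<String, String> parseSetCookie(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (setCookieHeaders == null) {
            return cookies;
        }
        for (String header : setCookieHeaders) {
            // Everything before the first ';' is the name=value pair.
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    /** Requests the URL once without following redirects and prints what came back. */
    static Map<String, String> probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 302 visible instead of following it
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try {
            System.out.println("status   = " + conn.getResponseCode());
            System.out.println("location = " + conn.getHeaderField("Location"));
            // Looked up by the exact header name "Set-Cookie".
            Map<String, String> cookies = parseSetCookie(conn.getHeaderFields().get("Set-Cookie"));
            System.out.println("cookies  = " + cookies.keySet());
            return cookies;
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling CookieProbe.probe("http://app1.sfda.gov.cn/datasearch/face3/content.jsp") with no cookies should show whether the server itself hands out any of the FSSBBIl1UgzbN7N80* cookies, or whether they only appear after the script has run.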
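Clearly the request is not sending everything the server expects. The missing piece is usually in the cookies, for example a session ID or a token.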
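One way to check which cookie is missing is to compare the Cookie header copied from the browser with the cookies your own code actually holds. The sketch below does only that comparison; CookieDiff, parseCookieHeader, and missingNames are hypothetical names, and it assumes you already have your code's cookies in a map.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for comparing cookie sets; not part of any library.
class CookieDiff {

    /** Splits a Cookie request header such as "a=1; b=2" into a name -> value map. */
    static Map<String, String> parseCookieHeader(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null) {
            return cookies;
        }
        for (String part : header.split(";")) {
            String pair = part.trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return cookies;
    }

    /** Returns cookie names the browser sent that are absent from the cookies our code holds. */
    static Set<String> missingNames(String browserCookieHeader, Map<String, String> ours) {
        Set<String> missing = new LinkedHashSet<>(parseCookieHeader(browserCookieHeader).keySet());
        missing.removeAll(ours.keySet());
        return missing;
    }
}
```

A name that shows up in the result is one the browser obtained some other way, for example from a script, and is a candidate for what the server is rejecting the request over.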