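When I scrape a web page I cannot get its complete raw source. I have searched online for a long time without finding a workable approach, so I hope someone in the community can help. Thanks in advance!
For example, the URL I want to extract is: https://gemaer.1688.com/page/offerlist_72115887_72115886.htm?spm=a2615.2177701.0.0.ZwRR4k&sortType=wangpu_score
I want to get its raw source code (that is, every character shown when you right-click the page and choose "View source").
When I extract it with the following code: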
// Requires: using System.IO; using System.Net; using System.Text;
private string GetHtmlCode(string url)
{
    string htmlCode;
    HttpWebRequest webRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
    webRequest.Timeout = 30000;
    webRequest.Method = "GET";
    webRequest.UserAgent = "Mozilla/4.0";
    webRequest.Headers.Add("Accept-Encoding", "gzip, deflate");
    HttpWebResponse webResponse = (System.Net.HttpWebResponse)webRequest.GetResponse();
    if (webResponse.ContentEncoding.ToLower() == "gzip")
    {
        // Response is gzip-compressed: decompress it before decoding the text.
        using (System.IO.Stream streamReceive = webResponse.GetResponseStream())
        {
            using (var zipStream = new System.IO.Compression.GZipStream(streamReceive, System.IO.Compression.CompressionMode.Decompress))
            {
                // Encoding.Default is the system ANSI code page.
                using (StreamReader sr = new System.IO.StreamReader(zipStream, Encoding.Default))
                {
                    htmlCode = sr.ReadToEnd();
                }
            }
        }
    }
    else
    {
        // Any other content encoding is read as-is.
        using (System.IO.Stream streamReceive = webResponse.GetResponseStream())
        {
            using (System.IO.StreamReader sr = new System.IO.StreamReader(streamReceive, Encoding.Default))
            {
                htmlCode = sr.ReadToEnd();
            }
        }
    }
    return htmlCode;
}
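The extracted data is incomplete: the code inside the iframe does not show up. I thought some of the data might be loaded via AJAX, so I switched to the following extraction code: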
private void button1_Click(object sender, EventArgs e)
{
    WebBrowser web = new WebBrowser();
    // Subscribe before navigating so the completion event cannot be missed.
    web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
    web.Navigate(this.rtb_Url.Text);
    // Keep pumping UI messages while the control is still loading.
    while (web.IsBusy)
    {
        Application.DoEvents();
        Thread.Sleep(100);
    }
}

void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser web = (WebBrowser)sender;
    // Body.OuterHtml is the rendered <body> markup only, not the raw page source.
    string mystr = web.Document.Body.OuterHtml;
}
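The source I get back is always incomplete. I hope someone in the community who knows about this can point me in the right direction. Thanks!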
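One possible explanation for the missing iframe content: "View source" shows only the top-level document, and an iframe's markup is a separate document loaded from its own src URL, so a single request for the outer page does not return it. A minimal sketch, assuming the missing content really lives in iframes (it may instead be generated by script), could pull the src URLs out of the HTML already fetched and request each one separately. IframeFinder and ExtractIframeUrls are hypothetical names, and the regex is a rough heuristic rather than a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class IframeFinder
{
    // Heuristic: an <iframe ...> tag with a quoted src attribute (case-insensitive).
    // Unquoted src values and iframes created by script are not matched.
    private static readonly Regex IframeSrc = new Regex(
        @"<iframe\b[^>]*?\ssrc\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of every iframe found in html,
    // resolving relative src values against pageUrl.
    public static List<string> ExtractIframeUrls(string html, string pageUrl)
    {
        var baseUri = new Uri(pageUrl);
        var urls = new List<string>();
        foreach (Match m in IframeSrc.Matches(html))
        {
            // Attribute values may contain entities such as &amp;.
            string src = WebUtility.HtmlDecode(m.Groups[1].Value);
            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
            {
                urls.Add(absolute.AbsoluteUri);
            }
        }
        return urls;
    }
}
```

Each returned URL could then be passed to GetHtmlCode in turn, for example `foreach (string u in IframeFinder.ExtractIframeUrls(html, url)) { string inner = GetHtmlCode(u); }`.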
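Just say which part of the data you want to scrape. You wrote a lot and I still did not follow.
It is not a part of it. I want every character shown by right-click > "View source" on the open page, that is, the raw HTML source.
@奇迹太来: Try a different way of reading it. I do not have .NET installed on this machine, so I cannot test it for you.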
http://www.cnblogs.com/hantianwei/archive/2010/11/06/1870802.html
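One reading of "a different way of reading it" is to let HttpWebRequest decompress the response itself and to decode it with the charset the server reports, instead of hand-rolling the gzip branch and relying on Encoding.Default. The GetHtmlCode method in the question advertises "gzip, deflate" but only decompresses gzip, so a deflate response would be read as raw compressed bytes. The following is only a sketch and not necessarily what the linked post does; PageFetcher, Fetch, and ResolveEncoding are hypothetical names, and the UTF-8 fallback is an assumption.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Maps a charset name from the Content-Type header to an Encoding.
    // Falls back to UTF-8 (an assumption) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000;
        request.UserAgent = "Mozilla/5.0";
        // The framework sends Accept-Encoding and decompresses gzip/deflate itself.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        // CharacterSet comes from the response header; a charset declared
        // only in a <meta> tag inside the page is not reflected here.
        using (var reader = new StreamReader(stream, ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

This changes only how the top-level response is read; content that the browser loads afterwards (iframes, script-generated markup) would still be absent from the result.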
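@奇迹太来: I can fetch it with Python without any trouble.
@Rich.T: I am not very familiar with Python.
You may have to try something more advanced, such as mshtml.dll.
How do I use mshtml.dll?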
@奇迹太来:http://www.cnblogs.com/hfzsjz/archive/2012/11/21/2780367.html
I suggest trying it with Firefox.
/// <summary>
/// Gets all of the HTML of a page.
/// </summary>
/// <param name="virtualPath"></param>
/// <returns></returns>
public string GetHtmlData(string virtualPath)
{
    StringWriter writer = new StringWriter();
    string path = Request.Url.Scheme + "://" + Request.Url.Authority + VirtualPathUtility.ToAbsolute(virtualPath);
    // Split off the query string, if any.
    string[] parts = virtualPath.Split('?');
    string query = string.Empty;
    if (parts.Length > 1)
        query = parts[1];
    virtualPath = parts[0];
    // Run the page inside a fresh HttpContext whose response writes into 'writer'.
    HttpContext context1 = new HttpContext(new HttpRequest(virtualPath, path, query), new HttpResponse(writer));
    IHttpHandler handler = System.Web.UI.PageParser.GetCompiledPageInstance(virtualPath, Server.MapPath(virtualPath), context1);
    handler.ProcessRequest(context1);
    return writer.ToString();
}
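I have used this before. The principle is the same, so give it a try. I do not know whether it will work for you.
Thanks. I am using WinForms; I will try it later.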