Can anyone provide source code for an ASP.NET crawler?


Bounty: 50 Yuandou [Resolved] Resolved on 2008-12-08 11:20

I'm looking for the source code of an ASP.NET crawler that can generate sitemap.xml and sitemap.html. Any help would be appreciated!

Web development

achao | Beginner, level 1 | Yuandou: 125
Asked: 2008-12-04 18:25


Best answer

I once wrote a cheating program (heh, sorry about that), and I hope it helps you.

Here is the main code:
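// Fetch the source code of the page at the given URL.
// Requires: using System.IO; using System.Net; using System.Text;
// httpSource and header are assumed to be string fields of the containing class.
private void GetSource(string url)
{
    httpSource = "";
    try
    {
        // Disabled variant: fetch a random, previously collected site instead of url.
        //if (sites.Count > 0)
        //{
        //    num = sites.Count;
        //    string site = sites[(new Random()).Next(0, num)];
        //    stream = webClient.OpenRead(site);
        //    header = site.Substring(0, site.LastIndexOf("/") + 1);
        //}
        //else
        using (WebClient webClient = new WebClient())
        using (Stream stream = webClient.OpenRead(url))
        using (StreamReader sr = new StreamReader(stream, Encoding.UTF8))
        {
            // Everything up to the last '/' is used later to resolve relative links.
            header = url.Substring(0, url.LastIndexOf("/") + 1);
            // Read the whole page as a string; the page is assumed to be UTF-8.
            httpSource = sr.ReadToEnd();
        }
    }
    catch // (Exception err)
    {
        // Errors are swallowed; httpSource stays empty when the download fails.
        //MessageBox.Show(err.Message);
        return;
    }
}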

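// Process the page source returned by the function above, using a regular expression to extract the link addresses.
// Requires: using System.Text.RegularExpressions;
// siteList is assumed to be a List<string> field of the containing class.
private void GetSitesFromSource(string source)
{
    // Matches href="..." or href='...' and captures the value in the "url" group.
    string regexPattern = @"(href\s*=\s*)[""'](?<url>[^'""]+)[""']";
    // Earlier attempt: @"<a\s(.*?)+href=(\"")?(.+?)(\"")?\s*(.*?)>(.+?)</a>";
    Regex regex = new Regex(regexPattern, RegexOptions.IgnoreCase);
    Match match = regex.Match(source);
    siteList.Clear();
    while (match.Success)
    {
        string temp = match.Groups["url"].Value;
        // Skip images, stylesheets, scripts, bare anchors, links containing '+',
        // and javascript:/mailto: pseudo-links.
        if (!temp.EndsWith(".ico") && !temp.EndsWith(".jpg") && !temp.EndsWith(".png")
            && !temp.EndsWith(".gif") && !temp.EndsWith(".css") && !temp.EndsWith(".js")
            && !temp.EndsWith("#") && !(temp.IndexOf("+") > 0)
            && !temp.StartsWith("javascript") && !temp.StartsWith("mailto"))
        {
            // Turn relative links into absolute ones.
            if (!temp.StartsWith("http:") && !temp.StartsWith("https:"))
            {
                if (temp.StartsWith("../"))
                {
                    // Site root hard-coded for the site this program targeted;
                    // replace it with the root of the site you are crawling.
                    temp = @"http://henu.2008.163.com/" + temp.Substring(3);
                }
                else
                {
                    temp = header + temp;
                }
            }
            // Keep each address only once.
            if (!siteList.Contains(temp))
            {
                siteList.Add(temp);
            }
            //listBox1.Items.Add(temp);
        }
        match = match.NextMatch();
    }
}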
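
The code above stops at collecting links, while the question also asks for a sitemap.xml. The following is only a rough sketch of how those two pieces might fit together, not part of the original program: the class name SitemapSketch and the methods ResolveLink and WriteSitemap are hypothetical names made up for illustration. It assumes the page URL is absolute, uses System.Uri to resolve relative links (including "../") instead of string concatenation, and writes the URLs in the sitemaps.org 0.9 format with XmlWriter.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper class; the names here are illustrative assumptions.
public static class SitemapSketch
{
    private const string SitemapNamespace = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Resolves a link found on pageUrl to an absolute URL.
    // Returns null when the link cannot be resolved or is not http/https
    // (for example mailto: or javascript: links).
    public static string ResolveLink(string pageUrl, string href)
    {
        Uri result;
        if (!Uri.TryCreate(new Uri(pageUrl), href, out result))
        {
            return null;
        }
        if (result.Scheme != Uri.UriSchemeHttp && result.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }
        return result.AbsoluteUri;
    }

    // Writes one <url><loc>...</loc></url> entry per URL to the file at path.
    public static void WriteSitemap(IEnumerable<string> urls, string path)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.Encoding = new UTF8Encoding(false); // UTF-8 without a byte order mark

        using (XmlWriter writer = XmlWriter.Create(path, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("urlset", SitemapNamespace);
            foreach (string url in urls)
            {
                writer.WriteStartElement("url", SitemapNamespace);
                writer.WriteElementString("loc", SitemapNamespace, url);
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
    }
}
```

Under these assumptions, each captured href could be passed through ResolveLink together with the URL of the page it came from, and the resulting list handed to WriteSitemap, for example SitemapSketch.WriteSitemap(siteList, "sitemap.xml"). A sitemap.html could be produced from the same list in a similar loop.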

上不了岸的鱼 | Veteran, level 4 | Yuandou: 4613 | 2008-12-04 20:02

Other answers (3)

Take a look at this one, which I read a while ago: http://jerry-blog.blogcn.com/diary,206362384.shtml

Astar | Yuandou: 40805 (Expert, level 7) | 2008-12-04 21:59

Just passing by.

XiaoChun | Yuandou: 205 (Novice, level 2) | 2008-12-05 09:44

Passing by to learn...

Jared.Nie | Yuandou: 1940 (Small fry, level 3) | 2008-12-05 14:35
