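The usual approach is to request the URL, get the page's source, and then parse it with regular expressions. Here is a sample method that fetches a page's source: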
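using System;
using System.IO;
using System.Net;
using System.Text;

/// <summary>
/// Fetches the HTML source of a web page.
/// </summary>
/// <param name="Url">The address of the page to download.</param>
/// <returns>The page source, or an empty string if the URL is malformed or the request fails.</returns>
public static string GetHtml(string Url)
{
    string strResult = "";
    try
    {
        Uri uri = new Uri(Url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "GET";
        request.AllowAutoRedirect = true;
        request.UserAgent = "Googlebot/2.1 ( http://www.google.com/bot.html)";
        request.Referer = string.Concat("http://", uri.Host);

        // Dispose the response, stream, and reader so the connection is released.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream streamReceive = response.GetResponseStream())
        // UTF-8 is assumed; a page in another encoding (e.g. GB2312) needs a matching Encoding.
        using (StreamReader streamReader = new StreamReader(streamReceive, Encoding.UTF8))
        {
            strResult = streamReader.ReadToEnd();
        }
    }
    catch (UriFormatException)
    {
        // The URL is malformed; fall through and return the empty string.
    }
    catch (WebException)
    {
        // The request failed (DNS error, timeout, HTTP error status, ...).
    }
    catch (IOException)
    {
        // The connection broke while the response was being read.
    }
    return strResult;
}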
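For the regular-expression step mentioned above, here is a rough sketch of pulling the title and the body text out of the downloaded source. The class name `PageParser` and its two methods are made up for illustration, and the patterns assume reasonably well-formed HTML; regular expressions will not cope with every page, so treat this as a starting point rather than a finished parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names here are illustrative, not from the original post.
public static class PageParser
{
    // Contents of the first <title> element, across line breaks and in any letter case.
    private static readonly Regex TitleRegex = new Regex(
        @"<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Everything between <body ...> and </body>.
    private static readonly Regex BodyRegex = new Regex(
        @"<body[^>]*>(.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Whole <script> and <style> elements, whose contents are not page text.
    private static readonly Regex ScriptStyleRegex = new Regex(
        @"<(script|style)[^>]*>.*?</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag.
    private static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    // Returns the decoded page title, or "" if no <title> element is found.
    public static string ExtractTitle(string html)
    {
        Match match = TitleRegex.Match(html);
        return match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value).Trim() : "";
    }

    // Returns the body as plain text with tags removed, or "" if no <body> element is found.
    public static string ExtractBodyText(string html)
    {
        Match match = BodyRegex.Match(html);
        if (!match.Success)
        {
            return "";
        }
        string body = ScriptStyleRegex.Replace(match.Groups[1].Value, " ");
        body = TagRegex.Replace(body, " ");
        // Decode entities such as &amp; and collapse runs of whitespace into single spaces.
        return Regex.Replace(WebUtility.HtmlDecode(body), @"\s+", " ").Trim();
    }
}
```

You would feed it the string returned by GetHtml, for example `PageParser.ExtractTitle(GetHtml(someUrl))`, where `someUrl` is whatever page you want to scrape. For a real site you will probably need patterns written against that site's own markup (a specific div or class) instead of the whole body.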
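Just learn regular expressions yourself and you will be able to handle all of this. Someone already explained it to you once and you still can't do it.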
Do it in Perl. Take a look at http://szedwin.gotoip1.com/read.php?tid-1035.html , which extracts exactly the title and the body text.
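Writing a scraping program is not a simple job. Try one of the many ready-made scrapers, or get someone who does scraping professionally to build it for you, for example 数据农场; search for it on Baidu and you will find it.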