http://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html
I am scraping the hotel list data from this page.
For example: Casablanca Hotel Times Square
string strContent = client.GetProxyServerContent(url).Replace (@"
","");
private string hotelAllInfo = "<div\\s*class=\"listing(?<hotelAllInfo>.*?)<div";
Match match = Regex.Match(strContent, hotelAllInfo, RegexOptions.IgnoreCase);
ArrayList alist = new ArrayList();
while (match.Success) // scrape the first page of the list
{
....
}
<div class="listing first" id="hotel_113317">
<div class="attnBar">
<div class="inner">
<span class="tc"><b>Travelers' Choice® 2011 Winner</b>
<a class="js_popup" onclick="setPID(4045)" rel=nofollow href=/TravelersChoice-g191-cBestService-United_States.html>Best Service</a> | <a class="js_popup" onclick="setPID(4045)" rel=nofollow href=/TravelersChoice-g191-cTop25-United_States.html>Top 25</a>
</span>
<span class="tv"><b>Top Value!</b> Save vs. similar hotels in New York City</span>
</div>
private string hotelAllInfo = "<div\\s*class=\"listing(?<hotelAllInfo>.*?)>";
first" id="hotel_113317"
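The first time, it only worked after I replaced the line breaks in strContent as shown above. Now the same approach no longer works, and I cannot tell what I changed. Damn, I have spent a whole day on this.
Could some expert point me in the right direction?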
Match match = Regex.Match(strContent, hotelAllInfo, RegexOptions.Multiline | RegexOptions.IgnoreCase);
string strContent = client.GetProxyServerContent(url).Replace (@"
","").Replace("\r\n","");
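There is indeed a \r\n problem. I spent some time on data scraping a while ago: extracting with regular expressions is one way to do it, but parsing the DOM works better, and there are open-source projects you can use for that.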
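On the \r\n point, one possible explanation (an assumption, not something confirmed in this thread) is that the page is now served with bare \n line endings, so a Replace that looks only for "\r\n" removes nothing. Note that a verbatim string containing a literal line break holds whatever line ending the source file was saved with. A minimal sketch that strips every kind of line break in one pass; the class and method names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreaks
{
    // Removes CRLF, lone CR and lone LF in a single pass.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"\r\n?|\n", "");
    }

    public static void Main()
    {
        // Hypothetical sample that uses LF-only line endings.
        string unixStyle = "<div class=\"listing first\">\n<div class=\"attnBar\">";

        // Replace("\r\n", "") finds nothing to remove, so the LF survives.
        Console.WriteLine(unixStyle.Replace("\r\n", "").Contains("\n")); // True

        // The regex-based version removes it.
        Console.WriteLine(Strip(unixStyle).Contains("\n")); // False
    }
}
```

With the line breaks gone, the original pattern no longer depends on how the server or the source file encodes them.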
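\s matches whitespace characters, including line breaks.
To make . match every character including line breaks, you need to set single-line mode: RegexOptions.Singleline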
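The difference between the options can be checked with a small sketch. It reuses the first pattern from the question on a shortened two-line sample of the posted HTML; the class and method names are hypothetical. RegexOptions.Multiline only changes how ^ and $ behave, so it does not let .*? cross a line break, while RegexOptions.Singleline does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ListingRegexDemo
{
    // First pattern from the question: capture everything between
    // <div class="listing and the next <div.
    public const string Pattern = "<div\\s*class=\"listing(?<hotelAllInfo>.*?)<div";

    public static bool Matches(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, Pattern, options | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Two lines of the sample HTML, separated by a line break.
        string html = "<div class=\"listing first\" id=\"hotel_113317\">\n<div class=\"attnBar\">";

        Console.WriteLine(Matches(html, RegexOptions.None));       // False: . stops at \n
        Console.WriteLine(Matches(html, RegexOptions.Multiline));  // False: only ^ and $ change
        Console.WriteLine(Matches(html, RegexOptions.Singleline)); // True: . also matches \n
    }
}
```

If the options cannot be changed, writing [\s\S]*? instead of .*? should have the same effect, since \s together with \S covers every character.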