Regex for matching HTML tags

Bounty: 20 beans [Resolved] Resolved on 2012-02-21 09:36

<li>
	<div class="info">
	<a class="title_n" href="http://www.tootoomart.com/product-4134493-Men+the+new+Pure+paul+Sminth+Smith+sweater+16+colors+sweaters/" target="_blank"><h2>Men the new Pure paul Sminth Smith sweater 16 colors...</h2></a>
	<p class="info">Men the new Pure paul Sminth Smith sweater 16 colors sweaters</p>
	<div class="infoBox wrapClear">
	<div class="supplier">Wholesaler: <a href="/seller-jingjinghaoyun888/">jingjinghaoyun888</a> <b class="star6"><img src="http://img.ttmimg.com/www/img/090514/transparent-pixel.gif" width="16" height="16" title="Feedback Score is 1-50"/></b></div>
	<span id="isSellerOnline_jingjinghaoyun888_1"></span>
	</div>
	</div>
	<div class="priceArea">
	<p class="price1" id="price">$16.36~$16.58/Piece</p>
	<div class="items">
	<p class="price2">$163.59~$165.73/Lot</p>
	<p class="priceInfo">(10 s per lot)</p>
	</div>
	</div>
	<p class="images"><a href="http://www.tootoomart.com/product-4134493-Men+the+new+Pure+paul+Sminth+Smith+sweater+16+colors+sweaters/" title="Men the new Pure paul Sminth Smith sweater 16 colors sweaters" alt="Men the new Pure paul Sminth Smith sweater 16 colors sweaters" target="_blank"><img src="http://img.ttmimg.com/images/product/images/091219/90/0912190632381314441bmzaz_130.jpg" alt="Men the new Pure paul Sminth Smith sweater 16 colors sweaters" onload="if((this.width>=this.height)&&(this.width>=130)) {this.resized=true; this.width=130;} if((this.height>this.width)&&(this.height>=130)) {this.resized=true; this.height=130;}" onerror='this.onerror="";this.src="/images/www/nophoto_small.gif"' ptype="photo"/></a></p>
	</li>

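I want to match the text inside all the P tags here, except (10 s per lot) and $163.59~$165.73/Lot, plus the src of the image. I'm looking for a regex that does this.

Regular expressions

Additional details:

I need the value inside <h2> captured under the name name1, and the value inside <p class="info"> as name2.

The value of the <a> tag inside <div class="supplier"> as name3, the value inside <p class="price1" id="price"> as name4, and the src of the <img> inside <p class="images"> as name5. All of these are inside the <li>.

miloss | Rookie Level 2 | Beans: 254
Asked: 2012-02-16 17:31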


Best answer

Regex for the p tags: (?<=<p[^>]*>)(?!(\(10 s per lot\))|(\$163.59~\$165.73\/Lot)).*(?<!</p\s*>)(?=</p\s*>)

Regex for the src: (?<=img(?!src).*src=['"])[^'"]*(?=[^>]*>)

Beans awarded: 20

pmars | Rookie Level 2 | Beans: 250 | 2012-02-17 09:47

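Please narrow the scope of the regex. I need each value inside the <li> extracted separately, each one named with <name> so I can pick it out easily. I also need the value in the <h2> tag near the top.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 09:55

@miloss: I don't quite follow your reply. Is this what you mean? Take the p tags inside the li, and use each p tag's class value as the name of that tag, captured via <name>.

Then the other part is to get the content of the <h2> tag, and the h2 tag also has to be inside the li tag.

Is that right?

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 10:08

@pmars: I need the value inside <h2> captured under the name name1, and the value inside <p class="info"> as name2.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 10:28

@pmars: Any solution?

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 13:10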

<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>.*?<p(?!class).*class=['"]info['"]>(?<name2>.*?)</p>.*?<div\s*class=['"]supplier['"]>.*?<a[^>]*>(?<name3>.*?)</a>.*?<p\s*class=['"]price1['"]\s*id=['"]price['"]>(?<name4>.*?)</p>.*?<p\s*class=['"]images['"]>.*?<img\s*src=['"](?<name5>.*?)['"].*?</li>

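That should be about right!

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 14:15

@pmars: Did you test it? I copied the regex over and it throws errors.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 14:40

@miloss: What error? Be specific. Do you think I'm good enough to write that out correctly without testing it?

Also, if you put it into code, change every " into "". That is the difference between using it in a regex tester and using it in code!

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 16:11

@pmars: How do I get it to work in code? Take a look: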

string url = "http://www.tootoomart.com/allcategories/";
            string charset = "utf-8";
            string html = GetHtmlSource(url, charset); // fetch the page HTML
            string pattern_Apparel = "<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>.*?<p(?!class).*class=['"]info['"]>(?<name2>.*?)</p>.*?<div\s*class=['"]supplier['"]>.*?<a[^>]*>(?<name3>.*?)</a>.*?<p\s*class=['"]price1['"]\s*id=['"]price['"]>(?<name4>.*?)</p>.*?<p\s*class=['"]images['"]>.*?<img\s*src=['"](?<name5>.*?)['"].*?</li>";<a[^>]*>(?<c>[^<]*)</a>([\w\W]*?)";
            Regex regex = new Regex(pattern_Apparel, RegexOptions.None);
            MatchCollection matchCollection = regex.Matches(html);
            for (int i = 0; i < matchCollection.Count; i++)
            {
                Match match = matchCollection[i];
                string name = match.Groups["name1"].Value;

　　　　　　........

　　　　　　}

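With this it fails. The regex line alone produces a whole string of errors, things like "too many characters" and "invalid expression".

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 16:35

@miloss: Did you try changing every " into ""?

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 16:53

@pmars: Change what into what? What do you mean?

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 16:58

@miloss: OK, here is what I mean. Change it to this: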

<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>.*?<p(?!class).*class=['""]info['""]>(?<name2>.*?)</p>.*?<div\s*class=['""]supplier['""]>.*?<a[^>]*>(?<name3>.*?)</a>.*?<p\s*class=['""]price1['""]\s*id=['""]price['""]>(?<name4>.*?)</p>.*?<p\s*class=['""]images['""]>.*?<img\s*src=['""](?<name5>.*?)['""].*?</li>"";<a[^>]*>(?<c>[^<]*)</a>([\w\W]*?)

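Compare the two regexes carefully and you will see what I mean!

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 17:02

@miloss: C# has an escaping issue here: inside a verbatim (@"...") string, a doubled quote "" is what stands for a single quote character.

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 17:04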
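Editor's note: the escaping rule described above can be illustrated with a minimal sketch. It assumes only the `<p class="info">` part of the pattern from this thread; the class name `QuoteEscapingDemo` and the sample input are hypothetical, chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteEscapingDemo
{
    // Regular literal: a double quote is written as \" and a backslash as \\.
    public const string Regular = "<p\\s*class=['\"]info['\"]>(?<name2>.*?)</p>";

    // Verbatim literal (@ prefix): a double quote is written as "" and
    // backslashes are kept as-is, which is usually easier to read for regexes.
    public const string Verbatim = @"<p\s*class=['""]info['""]>(?<name2>.*?)</p>";

    // Returns the text captured by the named group, or "" when nothing matches.
    public static string ExtractInfo(string html)
    {
        Match m = Regex.Match(html, Verbatim);
        return m.Success ? m.Groups["name2"].Value : "";
    }

    public static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Regular == Verbatim);
        Console.WriteLine(ExtractInfo("<p class=\"info\">Men sweater</p>"));
    }
}
```

Pasting a pattern from a regex tester into a plain "..." literal without either form of escaping ends the string at the first quote, which is consistent with the compile errors reported earlier in the thread.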

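@pmars: It reports an error: "; expected".

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 17:21

@miloss: No more errors now, but it did not match what I need.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 17:33

@miloss: What do you mean? I matched everything in the sample you gave before I posted the regex. Are you matching other content whose format differs from this sample? Or did you change the regex?

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 17:36

@pmars: No. I'm saying that with your regex I did not match what I need.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 17:41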

@miloss:

string url = "http://www.tootoomart.com/wholesale-hair+removal/";
            string charset = "utf-8";
            string html = GetHtmlSource(url, charset);
            string reg = @"<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>.*?<p(?!class).*class=['""]info['""]>(?<name2>.*?)</p>.*?<div\s*class=['""]supplier['""]>.*?<a[^>]*>(?<name3>.*?)</a>.*?<p\s*class=['""]price1['""]\s*id=['""]price['""]>(?<name4>.*?)</p>.*?<p\s*class=['""]images['""]>.*?<img\s*src=['""](?<name5>.*?)['""].*?</li>"";<a[^>]*>(?<c>[^<]*)</a>([\w\W]*?)";
             Regex regex = new Regex(reg, RegexOptions.None);
            MatchCollection matchCollection = regex.Matches(html);
            for (int i = 0; i < matchCollection.Count; i++)
            {
                Match match = matchCollection[i];
                string name = match.Groups["name1"].Value;
            }

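// Requires: using System; using System.IO; using System.Net; using System.Text;
public static string GetHtmlSource(string url, string charset)
        {
            // Fall back to the system default encoding when no charset is given.
            Encoding nowCharset = string.IsNullOrEmpty(charset)
                ? Encoding.Default
                : Encoding.GetEncoding(charset);
            string html = "";

            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                // The using blocks close the response, stream and reader even if reading fails.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream, nowCharset))
                {
                    html = reader.ReadToEnd();
                }
            }
            catch (Exception e)
            {
                // Report the failure instead of swallowing it; an empty page yields no matches.
                Console.Error.WriteLine(e.Message);
            }
            return html;
        }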

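miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 17:42

@pmars: Run this code and check in the watch window whether you can see the values I need.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 17:43

@miloss: When I ran it, I added (?is) at the front. This turns on case-insensitive matching (i) and single-line mode (s), so that . also matches newlines.

Regex: (?is)<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>.*?<p(?!class).*class=['""]info['""]>(?<name2>.*?)</p>.*?<div\s*class=['""]supplier['""]>.*?<a[^>]*>(?<name3>.*?)</a>.*?<p\s*class=['""]price1['""]\s*id=['""]price['""]>(?<name4>.*?)</p>.*?<p\s*class=['""]images['""]>.*?<img\s*src=['""](?<name5>.*?)['""].*?</li>

I removed the part you appended at the end, since I don't know what you were trying to capture with it. With that removed it works. Copy it over and run it; if you need to capture other data, add that part back.

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 18:01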
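Editor's note: the inline `(?is)` prefix has the same effect as passing `RegexOptions.IgnoreCase | RegexOptions.Singleline`. The sketch below shows this on a shortened pattern; the class name `SinglelineDemo` and the two-line HTML fragment are hypothetical stand-ins for the downloaded page, not data from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical fragment: upper-case tag names and line breaks between tags.
    public const string Html = "<LI>\n<h2>Sweater</h2>\n</LI>";

    // Shortened version of the thread's pattern, without inline options.
    public const string Pattern = @"<li[^>]*>.*?<h2>(?<name1>.*?)</h2>.*?</li>";

    public static bool MatchesWith(RegexOptions options)
    {
        return Regex.IsMatch(Html, Pattern, options);
    }

    public static void Main()
    {
        // Without options, "<li" is case-sensitive and "." stops at "\n".
        Console.WriteLine(MatchesWith(RegexOptions.None));
        // Option flags passed to the Regex API.
        Console.WriteLine(MatchesWith(RegexOptions.IgnoreCase | RegexOptions.Singleline));
        // The same flags written inline at the start of the pattern.
        Console.WriteLine(Regex.IsMatch(Html, "(?is)" + Pattern));
    }
}
```

Since the code in this thread constructs the regex with `RegexOptions.None`, a pattern without `(?is)` cannot span the line breaks of a real page, which may explain why nothing matched before the prefix was added.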

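@pmars: I need the value inside <h2> as name1, the value inside <p class="info"> as name2, the value of the <a> tag inside <div class="supplier"> as name3, the value inside <p class="price1" id="price"> as name4, and the src of the <img> inside <p class="images"> as name5. Those five values are all I want, but your regex matches a whole pile of stuff. I don't know how you tested it.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 18:08

@pmars: Look at my code. I want to loop over the matches, get one set of values per product, and then insert them into a database.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 18:10

@pmars: I'm off work now. I'll look at it again later.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-17 18:14

@miloss: I can't get online once I get home. You said you wanted the values put into name1 through name5, and it turns out you don't know how to read them out... awkward!

Something like this; work out the details yourself: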

for (int i = 0; i < matchCollection.Count; i++)
{
    Match match = matchCollection[i];
    string name = match.Groups["name1"].Value;
    string name2 = match.Groups["name2"].Value;
    string name3 = match.Groups["name3"].Value;
    string name4 = match.Groups["name4"].Value;
    string name5 = match.Groups["name5"].Value;
}

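It's all already in your code...

pmars | Beans: 250 (Rookie Level 2) | 2012-02-17 18:20

@pmars: The problem is that your regex did not match. In string name = match.Groups["name1"].Value;

name does not get the value I want.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-20 09:00

@pmars: name1 and name2 come out slightly wrong; the others are fine. Also, the HTML above is the info for just one product, and a page usually has many products. Why does it go straight to the last product? It can't loop over them.

miloss | Beans: 254 (Rookie Level 2) | 2012-02-20 09:42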
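Editor's note: one likely explanation for the symptoms above is the greedy `.*` inside the name1 group and in the `<p(?!class).*class=` step. Under `(?is)` a greedy `.*` runs to the end of the page and backtracks to the last product, so the whole page becomes a single match. The sketch below is an assumption-laden illustration, not the thread's accepted fix: it makes those two spots lazy and compares both patterns on a hypothetical, simplified two-product page. `ProductRegexDemo`, `Item`, `Extract` and the sample values are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ProductRegexDemo
{
    // Pattern as posted in the thread (greedy .* in name1 and before class="info").
    public const string Greedy = @"(?is)<li[^>]*>.*?<h2>(?<name1>(?!</h2>).*)</h2>"
        + @".*?<p(?!class).*class=['""]info['""]>(?<name2>.*?)</p>"
        + @".*?<div\s*class=['""]supplier['""]>.*?<a[^>]*>(?<name3>.*?)</a>"
        + @".*?<p\s*class=['""]price1['""]\s*id=['""]price['""]>(?<name4>.*?)</p>"
        + @".*?<p\s*class=['""]images['""]>.*?<img\s*src=['""](?<name5>.*?)['""]"
        + @".*?</li>";

    // Assumed variant: every quantifier is lazy, so each match stops at the nearest closing tag.
    public const string Lazy = @"(?is)<li[^>]*>.*?<h2>(?<name1>.*?)</h2>"
        + @".*?<p\s+class=['""]info['""]>(?<name2>.*?)</p>"
        + @".*?<div\s+class=['""]supplier['""]>.*?<a[^>]*>(?<name3>.*?)</a>"
        + @".*?<p\s+class=['""]price1['""]\s+id=['""]price['""]>(?<name4>.*?)</p>"
        + @".*?<p\s+class=['""]images['""]>.*?<img\s+src=['""](?<name5>.*?)['""]"
        + @".*?</li>";

    // Builds a hypothetical product item shaped like the HTML in the question.
    public static string Item(string title, string seller, string price, string src)
    {
        return "<li>\n<div class='info'>\n<a class='title_n' href='#'><h2>" + title + "</h2></a>\n"
             + "<p class='info'>" + title + "</p>\n"
             + "<div class='supplier'>Wholesaler: <a href='#'>" + seller + "</a></div>\n</div>\n"
             + "<p class='price1' id='price'>" + price + "</p>\n"
             + "<p class='images'><a href='#'><img src='" + src + "'/></a></p>\n</li>\n";
    }

    // Returns one row (name1..name5) per match.
    public static List<string[]> Extract(string html, string pattern)
    {
        var rows = new List<string[]>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            rows.Add(new[]
            {
                m.Groups["name1"].Value, m.Groups["name2"].Value, m.Groups["name3"].Value,
                m.Groups["name4"].Value, m.Groups["name5"].Value
            });
        }
        return rows;
    }

    public static void Main()
    {
        string html = Item("Sweater A", "s1", "$1.00/Piece", "a.jpg")
                    + Item("Sweater B", "s2", "$2.00/Piece", "b.jpg");
        Console.WriteLine("greedy matches: " + Extract(html, Greedy).Count);
        foreach (string[] row in Extract(html, Lazy))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Lazy quantifiers can still bleed into the next `<li>` when a product is missing one of the five fields, so for real pages an HTML parser is generally the more robust choice; the regex route is kept here only because it is what the thread asks for.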
