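My bookmarks have gotten out of hand (several hundred of them), so I want to clean them up. The bookmarks file exported from Chrome is an HTML file.
For now I want to extract the URLs in the href attributes and remove duplicates.
How can I parse the HTML file in C# and pull out the parts I need? Should I use regular expressions?
And what if I also want to extract the values after add_date and icon? For example: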
<DT><A HREF="http://wenku.baidu.com/view/13f8dac4bb4cf7ec4afed0c9.html" ADD_DATE="1339383239" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACW0lEQVQ4jZ2Ty0vVURDHP/O7D7PsQfQgKBHa5C6yxypaWEIEQbSphAxq08PIHgRRRPYgov6CIIhqUUQEQVTLFpUgQiS+7vWGejWv1u1qof5+v3POtLjeh7pr4MB8Z84Zvuc7M8LNLgXAKagDB1ibx9aBc3lsCn4ZtoYogF6t5X9MGt/j4XRBItHn653Wce38NqMAyYSvqWSw8KI1eDg3JxYEyqOHE/L0yST372ZJJQNONI1y9lSGVDKYV8DNZZD7bYnHBWdhRVVEKmIi2V+W6Sknxqhkf9m5BYwlis0zGBwIab02rosqPeobFksua9lcV8HWHZUcO768+GZoMGRDdazIAC53qKrqi2c5rVnb53Zu/+5UVfsTvs5MOy1Yf8LXwwfTeujAkPsxEqqqKntfaFGDmo1x1m+Iyeo1UXn75i8ALWdGOd88SioZ8OrlH1L9Ae1tM5IeDGdb74ji8v/asrWSc5dW6ljGsKk2Li2nM9rT7ROv8GR8zOqN26vly+cp1q2L6qo1kTINTJ5BJAL79lfJSNpwoTlDb7ePCRETOj59nJYrF8f03oO1kh4KqamJS6kLpqRsfyLk7MkMXZ0BYYiUC97TFcit6z9ZXx2TYsZavEIXVOHV80nt/OprGC6cmVzOaXvbDEMDZUnniGJLDGJxwfcVzwMUdU4RTyQaQRWYmnKMDJuyQdKSBiJwpGm5DKcNfb0By5Z6smv3YgBev/wjEzlH3bYK6huWSDkD4egH1cd7CiEFOoB5M1u0OFBXALL9IULjO8WY0ppaN7uqs76zYDS/ysWjxfX/BzktaPzSodo8AAAAAElFTkSuQmCC">04.ACL与包过滤_百度文库</A>
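That is, the 1339383239 after add_date, and the 04.ACL与包过滤_百度文库 that follows the icon attribute.
If you only need the URLs in href, a regular expression is the more convenient option.
Then how would I extract the content after add_date and icon?
For example: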
<DT><A HREF="http://wenku.baidu.com/view/13f8dac4bb4cf7ec4afed0c9.html" ADD_DATE="1339383239" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACW0lEQVQ4jZ2Ty0vVURDHP/O7D7PsQfQgKBHa5C6yxypaWEIEQbSphAxq08PIHgRRRPYgov6CIIhqUUQEQVTLFpUgQiS+7vWGejWv1u1qof5+v3POtLjeh7pr4MB8Z84Zvuc7M8LNLgXAKagDB1ibx9aBc3lsCn4ZtoYogF6t5X9MGt/j4XRBItHn653Wce38NqMAyYSvqWSw8KI1eDg3JxYEyqOHE/L0yST372ZJJQNONI1y9lSGVDKYV8DNZZD7bYnHBWdhRVVEKmIi2V+W6Sknxqhkf9m5BYwlis0zGBwIab02rosqPeobFksua9lcV8HWHZUcO768+GZoMGRDdazIAC53qKrqi2c5rVnb53Zu/+5UVfsTvs5MOy1Yf8LXwwfTeujAkPsxEqqqKntfaFGDmo1x1m+Iyeo1UXn75i8ALWdGOd88SioZ8OrlH1L9Ae1tM5IeDGdb74ji8v/asrWSc5dW6ljGsKk2Li2nM9rT7ROv8GR8zOqN26vly+cp1q2L6qo1kTINTJ5BJAL79lfJSNpwoTlDb7ePCRETOj59nJYrF8f03oO1kh4KqamJS6kLpqRsfyLk7MkMXZ0BYYiUC97TFcit6z9ZXx2TYsZavEIXVOHV80nt/OprGC6cmVzOaXvbDEMDZUnniGJLDGJxwfcVzwMUdU4RTyQaQRWYmnKMDJuyQdKSBiJwpGm5DKcNfb0By5Z6smv3YgBev/wjEzlH3bYK6huWSDkD4egH1cd7CiEFOoB5M1u0OFBXALL9IULjO8WY0ppaN7uqs76zYDS/ysWjxfX/BzktaPzSodo8AAAAAElFTkSuQmCC">04.ACL与包过滤_百度文库</A>
In other words, how do I extract the bookmark name "04.ACL与包过滤_百度文库" that comes after the icon attribute?
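@混沌奇迹:
Use this pattern to match each A element (enable case-insensitive matching, since the exported attribute names are uppercase):
<A[^>]*href="([^"]*)"[^>]*add_date="([^"]*)"[^>]*icon="([^"]*)"[^>]*>(.*?)</A>
Then read the capture groups:
$1 is the href value
$2 is the add_date value
$3 is the icon value
$4 is the text between the A tags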
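The capture-group approach above could be sketched in C# roughly as follows. This is only a minimal illustration, not a definitive implementation: the `Bookmark` and `BookmarkExtractor` names are hypothetical, quoted values are matched with `[^"]*`, and the code assumes the attributes appear in the order HREF, ADD_DATE, ICON as in the sample line. Bookmarks without an ICON attribute are skipped by the full pattern, so a separate href-only method handles the plain "extract and deduplicate URLs" case.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical container for one extracted bookmark.
public sealed class Bookmark
{
    public string Href = "";
    public string AddDate = "";   // kept as the raw attribute string
    public string Icon = "";
    public string Title = "";
}

// Hypothetical helper; assumes attribute order HREF, ADD_DATE, ICON.
public static class BookmarkExtractor
{
    // Groups: 1 = href, 2 = add_date, 3 = icon, 4 = text between the A tags.
    private static readonly Regex FullPattern = new Regex(
        @"<A\b[^>]*?\bHREF=""([^""]*)""[^>]*?\bADD_DATE=""([^""]*)""[^>]*?\bICON=""([^""]*)""[^>]*>(.*?)</A>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches only the href value, so it also covers bookmarks without an icon.
    private static readonly Regex HrefPattern = new Regex(
        @"<A\b[^>]*?\bHREF=""([^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns one Bookmark per distinct href, keeping the first occurrence.
    public static List<Bookmark> Extract(string html)
    {
        var seen = new HashSet<string>();
        var result = new List<Bookmark>();
        foreach (Match m in FullPattern.Matches(html))
        {
            string href = m.Groups[1].Value;
            if (!seen.Add(href)) continue;   // duplicate URL, skip it
            result.Add(new Bookmark
            {
                Href = href,
                AddDate = m.Groups[2].Value,
                Icon = m.Groups[3].Value,
                // Decode entities such as &amp; in the bookmark name.
                Title = WebUtility.HtmlDecode(m.Groups[4].Value)
            });
        }
        return result;
    }

    // Returns the distinct href values in their original order.
    public static List<string> ExtractUniqueHrefs(string html)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            string href = m.Groups[1].Value;
            if (seen.Add(href)) result.Add(href);
        }
        return result;
    }
}
```

A caller would typically read the export with `File.ReadAllText` and pass the resulting string to `BookmarkExtractor.Extract` or `BookmarkExtractor.ExtractUniqueHrefs`. Deduplication here compares URLs as exact strings, so two URLs that differ only in case or in a trailing slash are treated as different bookmarks.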
http://htmlagilitypack.codeplex.com/
Just use a regular expression to extract the hyperlinks.
Regular expressions are the right answer!