Filtering HTML tags to extract the body text of an HTML page

This is definitely a very, very useful thing to have!

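This problem belongs to the field of data mining, and there is still a long way to go before it can be solved really well. Microsoft Research Asia currently has related research projects.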

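However, if the structure of the HTML file is fairly simple, you can try the regular-expression filtering approach that is already circulating on the web. It will basically give you the result you need.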

The key code is posted below:

……

using System.Text.RegularExpressions;

……

  public static string StripHTML(string strHtml)
  {
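   // Patterns are applied in order; aryRep holds the replacement for each one.
   string [] aryReg ={
          @"<script[^>]*?>.*?</script>",   // script blocks on a single line ('.' does not match \n here)
          @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",   // opening and closing tags
          @"([\r\n])[\s]+",    // a line break followed by whitespace
          @"&(quot|#34);",
          @"&(amp|#38);",
          @"&(lt|#60);",
          @"&(gt|#62);",
          @"&(nbsp|#160);",
          @"&(iexcl|#161);",
          @"&(cent|#162);",
          @"&(pound|#163);",
          @"&(copy|#169);",
          @"&#(\d+);",         // any other numeric entity (dropped)
          @"-->",
          @"<!--.*\n"
         };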

   string [] aryRep = {
           "",
           "",
           "",
           "\"",
           "&",
           "<",
           ">",
           " ",
           "\xa1",//chr(161),
           "\xa2",//chr(162),
           "\xa3",//chr(163),
           "\xa9",//chr(169),
           "",
           "\r\n",
           ""
          };

   string strOutput=strHtml;
   for(int i = 0;i<aryReg.Length;i++)
   {
    Regex regex = new Regex(aryReg[i],RegexOptions.IgnoreCase );
    strOutput = regex.Replace(strOutput,aryRep[i]);
   }

   // Strings are immutable, so the result of Replace must be assigned back.
   strOutput = strOutput.Replace("<","");
   strOutput = strOutput.Replace(">","");
   strOutput = strOutput.Replace("\r\n","");

   return strOutput;
  }
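
For comparison, here is a minimal sketch of the same idea that leans on the framework instead of a hand-written entity table. It assumes .NET 4.0 or later, where System.Net.WebUtility.HtmlDecode is available. The class name HtmlTextSketch and the method ToPlainText are hypothetical names made up for this illustration, and the tag pattern is deliberately much simpler than the one above (it will break on attribute values that contain ">"), so treat it as a starting point for simple pages only.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Remove script blocks first. Singleline lets '.' match line breaks,
        // so scripts that span several lines are removed as well.
        string text = Regex.Replace(html, @"<script[^>]*?>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove comments, then every remaining tag.
        text = Regex.Replace(text, @"<!--.*?-->", "", RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode named and numeric entities (&amp;, &#169;, ...) in one call.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into a single space.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Tom &amp; Jerry &#169; 2008</p><script>var x = 1;</script>";
        Console.WriteLine(ToPlainText(html));   // prints: Tom & Jerry © 2008
    }
}
```

The order matters here: entities are decoded only after the tags are gone, so an escaped "&lt;b&gt;" in the text ends up as the literal characters "<b>" instead of being stripped as a tag.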

……

A C# version of the source file is also available here for your reference!