[Java] How to Use Pattern and Matcher - 風吹乾了我就走

Applications of Pattern and Matcher

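When working with text files, we often need to extract specific information. For example, we may want to count how many times a particular word appears in a text, or process the content returned by a web page and pull out what we need (e.g. a URL). For problems like these, we usually turn to regular expressions.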

    For what regular expressions are and how to use them, see:
   (1) http://caterpillar.onlyfun.net/Gossip/JavaGossip-V1/RegularExpression.htm
   (2) The Wikipedia explanation

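Java also supports regular expressions. A Pattern object holds the regular expression, and a Matcher object searches for the data you need. What follows is a brief, basic introduction to reaching that goal with Pattern and Matcher.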

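The Pattern API exposes no public constructor; a Pattern object is created through the static compile method. Suppose you want to count how many times the word "love" appears in a piece of text. You would set up the Pattern like this: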

Pattern pattern = Pattern.compile("love");

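Next, use a Matcher to set the text to search. Because a matcher searches according to the condition held by a pattern object, you create the matcher from the pattern you just built, for example: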

Matcher matcher = pattern.matcher( text );

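Here text is the string to search. With this in place, call find() (matcher.find()) repeatedly: each call that returns true has located the next match, so counting the successful calls gives the number of times the word occurs in the text. Note that groupCount() does not give this number. It returns the number of capturing groups in the pattern, which is 0 for "love".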
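The counting loop described above can be sketched as follows. This is a minimal illustration, not the only way to do it; the class name WordCounter, the method countMatches, and the sample sentence are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class WordCounter {
    // Counts the non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) { // each successful find() is one more match
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample text.
        String text = "I love Java. You love regex. We all love coffee.";
        System.out.println(countMatches("love", text)); // prints 3
    }
}
```

The plain pattern "love" also matches inside longer words such as "lovely". If only the whole word should count, the pattern "\\blove\\b" adds word boundaries on both sides.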

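Here is another example, for comparison.

Example: handling the redirect that occurs when data is sent to a web server. The server replies with a page like the following, and we want the URL it redirects to.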

<title> Results of Secondary Structure Prediction </title>
<meta http-equiv="refresh" content="20;url=http://www.imtech.res.in/cgibin/chkres?11667">
<center>
<h1> Results of Secondary Structure Prediction </h1></center>
<hr>Thanks for using this server. Please contact author 
<a href='/raghava'> Dr G P S Raghava</a>, email <a href='mailto:raghava@imtech.res.in'>
raghava@imtech.res.in </a>, if you face any problem or want to comment<br><h5>
<br> 
Your job number: 11667 <br>
Please check your result from URL: <a href="http://www.imtech.res.in/cgibin/chkres?11667">
http://www.imtech.res.in/cgibin/chkres?11667</a><br>
</body> </html>

We can extract the URL in two ways:

1. Using Pattern and Matcher

2. Using String's split method plus an if check

I. Using Pattern and Matcher

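/* In the sample, the URL starts with the http:// protocol, followed by
 * "/", ".", "?", letters, and digits, so the regular expression is
 * "http://[\\w?./]+". Inside a character class "|" is a literal
 * character, not alternation, so it is left out.
 */
Pattern pattern = Pattern.compile("http://[\\w?./]+");
Matcher matcher = pattern.matcher(input); // input: the response text as a String

while (matcher.find()) {
    System.out.println(matcher.group()); // the URL that was just matched
}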
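For reference, a self-contained version of this approach might look like the sketch below. The class name UrlExtractor and the method extractUrls are hypothetical names chosen for this example, and the input is cut down to the three lines of the sample response that contain the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class UrlExtractor {
    // "http://" followed by word characters, "?", "." or "/".
    private static final Pattern URL = Pattern.compile("http://[\\w?./]+");

    // Returns every URL found in input, in order of appearance.
    static List<String> extractUrls(String input) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(input);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Three lines taken from the sample response above.
        String input =
            "<meta http-equiv=\"refresh\" content=\"20;url=http://www.imtech.res.in/cgibin/chkres?11667\">\n"
          + "Please check your result from URL: <a href=\"http://www.imtech.res.in/cgibin/chkres?11667\">\n"
          + "http://www.imtech.res.in/cgibin/chkres?11667</a><br>\n";
        for (String url : extractUrls(input)) {
            System.out.println(url);
        }
    }
}
```

Because the pattern requires "http://", the attribute name http-equiv is not picked up, and each match stops at the first character outside the class, such as a quote or "<".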

II. Using String's split method plus an if check

/* Pick out the lines that contain "http", then use split.
 * Here input is a single line of the response. */
if (input.contains("http")) {
    String[] temp = input.split("\\s+"); // split the line on whitespace
    /**
     * If the goal is to extract from the <meta http-equiv=....> line, process
     * content="20;url=http://www.imtech.res.in/cgibin/chkres?11667">
     * and take out the http://www.imtech.res.in/cgibin/chkres?11667 part
     */


    /**
     * If you want to extract from the "Please check your result..." line, process
     * href="http://www.imtech.res.in/cgibin/chkres?11667"> or
     * http://www.imtech.res.in/cgibin/chkres?11667</a><br>
     * and strip the extra characters to get http://www.imtech.res.in/cgibin/chkres?11667
     */
}
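One possible way to fill in the steps left as comments above is sketched here. It assumes the URL ends at the first quote or angle bracket, which holds for the sample response but is not a general rule; the class name SplitUrlExtractor and its methods are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
class SplitUrlExtractor {
    // Returns every URL found in input using contains/split/substring only.
    static List<String> extractUrls(String input) {
        List<String> urls = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (!line.contains("http://")) {
                continue; // skip lines without a URL
            }
            for (String token : line.split("\\s+")) {
                int start = token.indexOf("http://");
                if (start < 0) {
                    continue; // e.g. <meta or http-equiv="refresh"
                }
                // Drop what precedes the URL, e.g. content="20;url= or href="
                String url = token.substring(start);
                // Drop what follows it, e.g. "> or </a><br>
                int end = firstDelimiter(url);
                urls.add(end < 0 ? url : url.substring(0, end));
            }
        }
        return urls;
    }

    // Index of the first quote or angle bracket in s, or -1 if there is none.
    // Assumption: these characters never occur inside the URL itself.
    static int firstDelimiter(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'' || c == '<' || c == '>') {
                return i;
            }
        }
        return -1;
    }
}
```

Compared with the Pattern and Matcher version, this needs separate handling for the text before and after the URL, which is why the regular expression is usually the shorter solution.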