Regular expression search

Question

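I want to search for a specific word in Java source code,
subject to the following conditions.

(1) Ignore matches inside block comments
(2) Ignore matches inside line comments
(3) Ignore matches inside string literals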


01 /* 
02  * this abc is inside a block comment, so ignore it
03  * 
04  */
05 public class Foo() {
06 　private int abc = 0;
07 
08 　public Foo() {
09 　　// inside a line comment, so ignore this abc
10 　　abc = 1;
11 　　String s = "abc inside a string literal, ignore this too";
12 　}
13 
14 　public String get() {
15 　　return " 1'23\" abc " + abc; // here only the later abc should hit
16 　}
17 }


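For example, when searching for abc in the text above,
I want only three places to hit: line 6, line 10, and the later occurrence on line 15.
How should this be written as a regular expression?

A block comment starts with /* and runs until */ appears.
A line comment starts at // and runs to the end of that line.
A string literal is whatever is enclosed in ". A \" inside a string does not end it.

Thank you in advance.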

sakusaker7 · Accepted Answer

To slurp a file into a single string, manipulating $/ as the answer in #6 does is faster and wastes less memory than reading one line at a time and concatenating. Also, using $` $& $' in regular expression matching carries a speed penalty.

From the documentation, perlvar.pod:

    $MATCH
    $&  The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: like & in some editors.) This variable is read-only and dynamically scoped to the current BLOCK.
        The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
        See "@-" for a replacement.

    $PREMATCH
    $`  The string preceding whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK). (Mnemonic: "`" often precedes a quoted string.) This variable is read-only.
        The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
        See "@-" for a replacement.

    $POSTMATCH
    $'  The string following whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: "'" often follows a quoted string.) Example:

            local $_ = 'abcdefghi';
            /def/;
            print "$`:$&:$'\n";    # prints abc:def:ghi

        This variable is read-only and dynamically scoped to the current BLOCK.
        The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
        See "@-" for a replacement.

#!/usr/bin/perl
use strict;
use warnings;

#slurp the file
my $content = do { local $/ = undef; <DATA> };

# Offset at which each line starts
my @linestarts = (0);
push @linestarts, pos($content) while ($content =~ m/\n/g);

my $searchword   = $ARGV[1] ? $ARGV[1] : 'abc';
my $comment      = qr<(?: /\* .*? \*/ )>xms;
my $line_comment = qr<(?: // [^\n]* )>xms;
my $string       = qr<(?: " [^"\\]* (?: \\. [^"\\]* )* " )>xms;
my $skip         = qr<(?: $comment | $line_comment | $string )>xms;

# Index into @linestarts of the line that contains offset $pos
sub get_line_index {
    my $pos = shift;
    my $idx = 0;
    $idx++ while $idx < $#linestarts && $linestarts[$idx + 1] <= $pos;
    return $idx;
}

# Either consume a comment or string, or capture the search word in $1
while ($content =~ m/(?: $skip | ($searchword) )/xmsg) {
    next unless defined $1;
    my $idx = get_line_index($-[1]);
    printf "%4d:%4d %s\n", $idx + 1, $-[1] - $linestarts[$idx] + 1, $1;
}

__END__
/*
 * this abc is inside a block comment, so ignore it
 *
 */
public class Foo() {
    private int abc = 0;

    public Foo() {
        // inside a line comment, so ignore this abc
        abc = 1;
        String s = "abc inside a string literal, ignore this too";
    }

    public String get() {
        return " 1'23\" abc " + abc; // here only the later abc should hit
    }
}
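
The perlvar excerpt above points to @- as the replacement for $` $& $'. As a minimal sketch of what that looks like in practice (the helper name match_parts is made up for this illustration), the three pieces can be rebuilt with substr from the match offsets in @- and @+:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (prematch, match, postmatch) for the first match of $re in $text,
# computed from @- and @+ instead of $`, $& and $'.
# match_parts is a hypothetical helper name, not part of Perl.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match
    return (
        substr($text, 0, $start),              # what $` would hold
        substr($text, $start, $end - $start),  # what $& would hold
        substr($text, $end),                   # what $' would hold
    );
}

my ($pre, $match, $post) = match_parts('abcdefghi', qr/def/);
print "$pre:$match:$post\n";   # prints abc:def:ghi
```

This reproduces the abc:def:ghi example from the documentation without touching the three penalized variables.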

kumoz · Answer

The regular expression written in the supplement to #9 is beyond me, so this may differ
from what the asker wants. It is a very short, simple piece of code.

use strict;
my $code = join '', <DATA>;

# Record the offset of every abc in the source
my @abc_idx;
push @abc_idx, length $` while $code =~ /abc/g;

foreach my $i (@abc_idx) {
  # Rename every abc except the one at offset $i to xyz
  my $pre = substr $code, 0, $i;
  $pre =~ s/abc/xyz/g;
  my $aft = substr $code, $i + 3;
  $aft =~ s/abc/xyz/g;
  my $line_no = $pre =~ tr/\n// + 1;
  $pre =~ /([^\n]*)\z/; my $line_pos = length($1) + 1;
  # X if the one remaining abc sits in a block comment, line comment or string
  if ("${pre}abc$aft" =~ m#/\*.*?abc.*?\*/|//[^\n]*?abc|[^\\]"([^\\"]|\\")*?abc([^\\"]|\\")*?"#s) {
    print "X: line $line_no, pos $line_pos\n";
  } else {
    print "O: line $line_no, pos $line_pos\n";
  }
}

__DATA__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
　}
}


The output is as follows.

X: line 2, pos 10
O: line 6, pos 15
X: line 9, pos 33
O: line 10, pos 5
X: line 11, pos 32
X: line 15, pos 21
O: line 15, pos 29
X: line 15, pos 52

moon_piyo · Answer

Hi there.


#!perl

use strict;
my $target = "abc";   # search word

my $file = "";
my $p1 = 0;
my $row = 0;
my @line = ('');
my @rc = ();

while (<DATA>) {
    $row++;
    # Store the whole file in a single string
    $file .= $_;
    foreach my $col (1..length($_)) {
        # Build a table that maps an offset in the string to [row, column]
        $rc[$p1+$col-1] = [$row, $col];
    }
    $p1 += length($_);
    chomp;
    # Store each line in an array
    push(@line, $_);
}

my $p2 = 0;   # where the current search starts
while ($file =~ m%/\*(?:.*?)\*/|//.*?$|"(?:[^"\\]|\\.)*"|\z%smg) {
    # Find the next excluded part (/*..*/, //... to end of line, "...", or end of input).
    # Look at the text from the search start up to just before that excluded part.
    # When the search word is found there, convert to row and column in the original file and print.
    my $str = substr($`, $p2);
    while ($str =~ /$target/og) {
        my ($row, $col) = @{$rc[$p2 + length($`)]};
        print "line $row, column $col: $line[$row]\n";
    }
    $p2 = pos($file);
}

__DATA__
01 /* 
02 * this abc is inside a block comment, so ignore it
03 * 
04 */
05 public class Foo() {
06 　private int abc = 0;
07 
08 　public Foo() {
09 　　// inside a line comment, so ignore this abc
10 　　abc = 1;
11 　　String s = "abc inside a string literal, ignore this too";
12 　}
13 
14 　public String get() {
15 　　return " 1'23\" abc " + abc; // here only the later abc should hit
16 　}
17 }

sakusaker7 · Answer

I haven't tormented this with very tricky data, so there are probably holes, but how about something like this?
(For one thing, I think it breaks when /* or */ appears inside a string.)

#!/usr/bin/perl
use strict;
use warnings;

my $searchword    = $ARGV[1] ? $ARGV[1] : 'abc';
my $comment_start = qr</\*>x;
my $comment_end   = qr<\*/>x;
my $line_comment  = qr<//.*>x;
my $string        = qr<" [^"\\]* (?: \\. [^"\\]* )* ">x;
my $incomment     = 0;

while (my $line = <DATA>) {
    my $start_pos = 0;
    chomp $line;
    # Remove a one-line comment
    $line =~ s/$line_comment//;
    # Work out whether we are inside a multi-line comment
    if ($line =~ m/$comment_start/) {
        $start_pos = $-[0];
        $incomment = 1;
    }
    if ($line =~ m/$comment_end/) {
        $incomment = 0;
        # Real code may follow the multi-line comment,
        # so overwrite only the comment part with spaces
        my $replace_length = $+[0];
        substr $line, 0, $+[0], " " x $replace_length;
    }
    if ($incomment==1) {
        next if $start_pos == 0;
        my $replace_length = length($line) - $start_pos;
        substr $line, $start_pos, $replace_length, (" " x $replace_length);
    }
    # Skip over complete string literals, then capture the search word
    while ($line =~ m{\G [^"]*? (?: $string [^"]*? )*? ($searchword)}gx) {
        print "'$searchword' found at line $., column ", $-[1]+1, " : ", $line, "\n";
    }
}

__END__
/*
 * this abc is inside a block comment, so ignore it
 *
 */
public class Foo() {
　private int abc = 0;
　abc = 3;/*
　*/abc=1;

　public Foo() {
　　// inside a line comment, so ignore this abc
　　abc = 1;
　　String s = "abc inside a string literal, ignore this too";
　}

　public String get() {
　　return " 1'23\" abc " + abc + abc; // here only the later abc should hit
　}
}

Output:

'abc' found at line 6, column 16 : private int abc = 0;
'abc' found at line 7, column 4 : abc = 3;
'abc' found at line 8, column 6 : abc=1;
'abc' found at line 12, column 8 : abc = 1;
'abc' found at line 17, column 32 : return " 1'23\" abc " + abc + abc;
'abc' found at line 17, column 38 : return " 1'23\" abc " + abc + abc;

The line number comes from the special variable $. and the column from the special array variable @-. For a detailed description of these variables, see the manual with perldoc perlvar. Note that a character such as 'あ' is not counted as one character. Depending on the encoding in use, it counts as 2 or 3.
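
The remark that 'あ' counts as 2 or 3 is about bytes versus characters. The following is a small sketch, assuming the core Encode module, that shows how the column of abc shifts depending on whether the line is held as UTF-8 bytes, Shift_JIS bytes, or decoded characters. The helper name column_of and the sample line are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1-based column of the first occurrence of $word in $line, or 0 if absent.
# column_of is a hypothetical helper name used only in this sketch.
sub column_of {
    my ($line, $word) = @_;
    return index($line, $word) + 1;
}

my $chars = "\x{3000}\x{3000}abc = 1;";   # two full-width spaces, then code
my $utf8  = encode('UTF-8',    $chars);   # bytes as read raw from a UTF-8 file
my $sjis  = encode('shiftjis', $chars);   # bytes as read raw from a Shift_JIS file

printf "UTF-8 bytes:     column %d\n", column_of($utf8, 'abc');                   # 7
printf "Shift_JIS bytes: column %d\n", column_of($sjis, 'abc');                   # 5
printf "characters:      column %d\n", column_of(decode('UTF-8', $utf8), 'abc');  # 3
```

If columns should be counted in characters, decoding the input first (for example with decode, as here) is one way to get there.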

sakusaker7 · Answer

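No, I don't have enough tricks up my sleeve to be holding anything back.
Saving the original data separately is the first thing
that comes to mind, but
I'm still racking my brain over whether there is some better way.

I had overlooked the issue with what follows */.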

dontoittem · Answer

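Right, #5 is probably more efficient than #1-#4. It bothers me, though, that the line number is lost when something is written after */. You may also want to use the comment and string information when searching, so the best approach is probably to keep the original information and search against it. #5 should just go ahead and answer without holding back.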

#!/usr/bin/perl

# Read everything at once
undef $/;
$_ = <DATA>;

# Remove comments but keep their newlines
s%(//.*?$|/\*.*?\*/)%"\n" x $1=~ tr/\n//%egms;

# Remove strings
s/"[^"\\]*(?:\\.[^"\\]*)*"//g;

my $pat = qr/\babc\b/;
my @lines = split "\n";

# Keyword search
for ($i = 0; $i < @lines; ++$i) {
  my @ids = $lines[$i] =~ m/$pat/g;
  print $i + 1 , ": @ids\n" if (@ids);
}

__END__
/*
* this abc is inside a block comment, so ignore it
* // the line number information is lost
*/ abc = cde;
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// inside a line comment, so ignore this abc
　　abc = 1;
　　String s = "abc inside a string literal, ignore this too";
　}

　public String get() {
　　return " 123\" abc " + abc + abc; // here only the later abc should hit
　}
}
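
One way to keep the original information, as suggested above, is to overwrite comments and strings with spaces of the same length instead of deleting them, so that every offset, line number and column stays where it was. This is only a sketch under that assumption: the function names and the sample source are made up for illustration, and Java character literals such as '"' are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every comment and string literal with spaces of the same length,
# keeping newlines, so offsets, line numbers and columns are unchanged.
sub mask_comments_and_strings {
    my ($src) = @_;
    my $skip = qr{
          /\* .*? \*/                          # block comment
        | // [^\n]*                            # line comment
        | " [^"\\\n]* (?: \\. [^"\\\n]* )* "   # string literal with escapes
    }xs;
    (my $masked = $src) =~ s{($skip)}{ (my $blank = $1) =~ tr/\n/ /c; $blank }ge;
    return $masked;
}

# Return [line, column] pairs (both 1-based) for each whole-word hit.
sub find_word {
    my ($src, $word) = @_;
    my $masked = mask_comments_and_strings($src);
    my @hits;
    while ($masked =~ /\b\Q$word\E\b/g) {
        my $before = substr($masked, 0, $-[0]);
        my $line   = 1 + ($before =~ tr/\n//);
        my $col    = $-[0] - (rindex($before, "\n") + 1) + 1;
        push @hits, [$line, $col];
    }
    return @hits;
}

my $java = <<'END_JAVA';
/*
 * abc in a block comment
 */ abc = cde;
String s = "abc \" abc"; // abc
return s + abc;
END_JAVA

printf "%d:%d\n", @$_ for find_word($java, 'abc');   # prints 3:5 then 5:12
```

Because the masked text has the same length as the original, the abc written after */ keeps its line number, and the untouched original remains available for display.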

sakusaker7 · Answer

You want to search for a given word (pattern?), but
is the output format shown in #1-#4 acceptable
for the search results?
It just nagged at me, so I'm asking.

#!/usr/bin/perl
use strict;
use warnings;

my $searchword = $ARGV[1] ? $ARGV[1] : 'abc';

# Prefix every line with its line number
my $i;
my $contents = join '', map {++$i . ":$_"} <DATA>;

# Replace each line comment with a newline and each block comment with a space
$contents =~ s{(// .*? $)|/\* .*? \*/}{defined $1 ? "\n" : ' '}xmseg;
# Remove string literals
$contents =~ s/" [^"\\]* (?: \\. [^"\\]* )* "//gx;

foreach my $line (split "\n",  $contents) {
    my ($ln) = $line =~ m/^(\d+)/;
    my @ids = $line =~ m/\b$searchword\b/g;
    print "$ln: @ids\n" if (@ids);
}

__END__
/*
* this abc is inside a block comment, so ignore it
*
*/
public class Foo() {
    private int abc = 0;

    public Foo() {
        // inside a line comment, so ignore this abc
            abc = 1;
        String s = "abc inside a string literal, ignore this too";
    }
    
    public String get() {
        return " 1'23\" abc " + abc + abc; // here only the later abc should hit
        }
}

I only meant to fix the part that removes strings,
but I ended up changing the whole thing...

mikaemi · Answer

Ah, right, we were supposed to search for abc.
my $pat = qr/\babc\b/;          # search for the identifier abc
Change it to this and it will find them, more or less ^^

＝＝＝
$ ./cprog.pl
6: abc
10: abc
15: abc

mikaemi · Answer

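Since we are using \G anyway,

# Strip the comments
while (m%(//|/\*)%g) {
  my $p = pos;
  s%//\G.*\n%\n% if $1 eq "//";
  s%/\*\G.*?\*/% %s if $1 eq "/*";
  pos = $p - 1; # back up 2 characters, then forward 1 because the removed comment becomes 1 character
}

would have been more efficient. If the position is reset to 0 anyway, there was no need to use /g ^^;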
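
As a related sketch (not the code from this thread), \G combined with the /gc flags can also drive a single left-to-right scan that never modifies the source, so offsets stay valid throughout. The function name scan_for_word and the sample input are assumptions made for this illustration, and Java character literals are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the offsets of every whole-word occurrence of $word in $src that
# lies outside block comments, line comments and string literals.
# \G anchors each attempt at the current pos(); /c keeps pos() on failure.
sub scan_for_word {
    my ($src, $word) = @_;
    my @offsets;
    return @offsets unless length $word;
    for ($src) {            # alias $_ to $src so m// and pos() act on it
        pos($_) = 0;
        while (pos($_) < length($_)) {
            if    (m{\G /\* .*? \*/}gcxs)                    { }  # block comment
            elsif (m{\G // [^\n]*}gcx)                       { }  # line comment
            elsif (m{\G " [^"\\]* (?: \\. [^"\\]* )* "}gcxs) { }  # string literal
            elsif (m{\G \b \Q$word\E \b}gcx)                 { push @offsets, $-[0] }
            elsif (m{\G \w+}gcx)                             { }  # other identifier
            else                                             { m{\G .}gcs }  # any other character
        }
    }
    return @offsets;
}

my $java = <<'END_JAVA';
int abc = 0; /* abc */
String s = "abc \" abc"; // abc
abc = 1;
END_JAVA

print join(',', scan_for_word($java, 'abc')), "\n";
```

Each branch consumes at least one character, so the loop always advances. The offsets it returns can be converted to line and column numbers in the same way as in the other answers.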

mikaemi · Answer

Oh, sorry. The output shown earlier assumes the script is saved in a file named cprog.pl ^^

＝＝＝ cprog.pl
#!/usr/bin/perl

# Write the line number into each line (not needed if you don't want line numbers)
# and read everything into $file
my $file;
$file .= $. . ":" . $_ while <DATA>;
$_ = $file;

# Strip the comments
while (m%(//|/\*)%g) {
  s%//\G.*\n%\n% if $1 eq "//";
  s%/\*\G.*?\*/% %s if $1 eq "/*";
  pos($_) = 0;# probably not needed?
}

# Strip the strings
s/"(?:\\.|[^"\\])*"//g;

my $pat = qr/(p[\w\d]+)/;# search for identifiers starting with p
my @lines = split "\n", $_;

# Keyword search
for (@lines) {
  my ($ln) = m/^(\d+)/;
  my @ids = m/$pat/g;
  print "$ln: @ids\n" if (@ids);
}

__END__
/*
* this abc is inside a block comment, so ignore it
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// inside a line comment, so ignore this abc
　　abc = 1;
　　String s = "abc inside a string literal, ignore this too";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // here only the later abc should hit
　}
}
＝＝＝＝
$ ./cprog.pl
5: public
6: private
8: public
14: public
