I want to count the number of English words, etc. in the txt files in a directory, but I don't know how

Closed

Asked by: GMCoufs
Asked: 2010/08/09 18:05
Answers: 3

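I want to count the number of English words, etc. in the txt files in a directory, but I don't know how.

I want to use Perl to count the number of paragraphs, sentences, and words in a text file. I have been trying with split, but I can't figure it out.
The text file looks like this: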
I stand here today humbled by the task before us, grateful for the trust you've bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often, the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because we, the people, have remained faithful to the ideals of our forebears and true to our founding documents.

As shown, paragraphs are separated by newlines, sentences by a period followed by two half-width spaces, and words by a single space.

while($data = <>){
chomp($data);
@paragraph = split(/\n/, $data);
}
With the code above I can split the whole file into elements at each newline.
But I have no idea how to go on and count things from there.
Any guidance would be appreciated.

Answers (3)

No.3

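Answered by: toraneko75
Answered: 2010/08/17 17:29

Basically, I think you can count newlines to get the paragraphs, periods to get the sentences, and spaces to get the words.

Add 1 to the paragraph count to adjust it.
For the words, adjust for the fact that a sentence ends with two spaces and that there is no space where there is a newline. If the text ends with a period and no trailing newline, the numbers should work out, but the word count might be off by one or so.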

#!/usr/bin/perl
use strict;
use warnings;

my $sentence =
"I stand here today humbled by the task before us, grateful for the trust you've bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often, the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because we, the people, have remained faithful to the ideals of our forebears and true to our founding documents.";

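# Count newlines, periods and spaces one character at a time.
my ( $sent, $word, $para ) = ( 0, 0, 0 );
my @letters = split //, $sentence;
foreach (@letters) {
    if (m/\n/) { $para++ }    # newline between paragraphs
    if (m/\./) { $sent++ }    # period ending a sentence
    if (m/ /)  { $word++ }    # space between words
}
# Adjust: a sentence break inside a paragraph is assumed to be two spaces,
# and a paragraph break has no space at all.
$word = $word + $para * 2 - $sent + 2;
$para++;    # paragraphs = newlines + 1
print "$para,$sent,$word\n";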
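
The adjustment can be checked on a small input. The following is only a sketch, not the answerer's code: count_by_chars is a hypothetical helper, the sample text is made up, and it assumes the format described in the question (no trailing newline, every sentence ends with a period, two spaces after a period inside a paragraph, one space between words). Under those assumptions the spaces number (words - paragraphs) + (sentences - paragraphs), which rearranges to the formula used here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count newlines, periods and
# spaces with tr///, then apply the adjustment.
# Assumes: no trailing newline, each sentence ends with ".", two spaces after
# a period inside a paragraph, one space between words.
sub count_by_chars {
    my ($text) = @_;
    my $newlines = ( $text =~ tr/\n// );    # tr with an empty replacement only counts
    my $periods  = ( $text =~ tr/.// );     # "." is a literal character in tr
    my $spaces   = ( $text =~ tr/ // );

    my $paragraphs = $newlines + 1;
    my $sentences  = $periods;
    # spaces = (words - paragraphs) + (sentences - paragraphs)
    my $words = $spaces + 2 * $paragraphs - $sentences;
    return ( $paragraphs, $sentences, $words );
}

# Made-up sample: 2 paragraphs, 3 sentences, 7 words.
my $sample = "One two.  Three four five.\nSix seven.";
my ( $p, $s, $w ) = count_by_chars($sample);
print "$p,$s,$w\n";    # prints 2,3,7
```

If the file has only one space after each period, the word count from this formula will come out too low, so the two-space assumption matters.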

No.2

Answered by: kmee
Answered: 2010/08/10 03:06

Unless it is a copying mistake:

> @word=split(/ /,$sentence);
Nothing is ever assigned to $sentence, so $sentence should be empty.

@word=split(/ /,$in1);
$tango = $tango + $#word +1;
How about this instead?
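
Put together, the loop from the supplement with this fix applied might look like the sketch below. It is not the asker's final code: count_lines_sentences_words is a hypothetical wrapper, the sample text is made up, and an in-memory filehandle stands in for the real file. One further change is an assumption about intent: in the pattern /. / the dot matches any character, so the sketch escapes it and splits on a period followed by one or more spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the loop from the supplement, with two changes:
# split each sentence ($in1) instead of the unset $sentence, and escape the
# period in the sentence pattern.
sub count_lines_sentences_words {
    my ($fh) = @_;
    my ( $lines, $bun, $tango ) = ( 0, 0, 0 );
    while ( my $data = <$fh> ) {
        chomp $data;
        $lines++;
        my @sentence = split /\. +/, $data;    # period followed by spaces
        foreach my $in1 (@sentence) {
            my @word = split / /, $in1;        # words inside one sentence
            $bun++;
            $tango += scalar @word;            # same value as $#word + 1
        }
    }
    return ( $lines, $bun, $tango );
}

# Made-up sample read through an in-memory filehandle.
my $sample = "One two.  Three four five.\nSix seven.\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ( $lines, $bun, $tango ) = count_lines_sentences_words($fh);
close $fh;
print "$lines,$bun,$tango\n";    # prints 2,3,7
```

The final period of a line stays attached to the last word, which does not affect the word count.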

No.1

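Answered by: kmee
Answered: 2010/08/09 18:54

Hm?

$data = <>
reads only one line, and
chomp($data);
removes the newline character from it, so
@paragraph = split(/\n/, $data);
has no newline left to split on, and I don't think it splits anything.
If you have changed $/ so that everything is read at once, then the while loop no longer serves any purpose.

Also, @paragraph is overwritten on every iteration, so only the last line remains.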
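
For reference, if the intent really was to read the whole file and then split it on newlines, a minimal sketch without the while loop could look like this. read_paragraphs is a hypothetical helper and the sample text is made up; an in-memory filehandle stands in for the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in one go and
# return one list element per line.
sub read_paragraphs {
    my ($fh) = @_;
    local $/;               # undefined $/ means "read to end of file" in this scope
    my $all = <$fh>;        # a single read returns the whole content
    return () unless defined $all;
    return split /\n/, $all;    # trailing empty fields are dropped by split
}

# Made-up sample read through an in-memory filehandle.
my $sample = "First paragraph.\nSecond paragraph.\nThird paragraph.\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @paragraph = read_paragraphs($fh);
close $fh;
print scalar(@paragraph), "\n";    # prints 3
```

Here no chomp is needed, because split removes the newlines it splits on.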

First, set paragraphs and so on aside and think about how to count the number of lines in a text file.
$lines =0;
while($data=<>){
$lines ++ ;
}
Do you understand why this counts the lines?

Next, if you want the total of some value computed from each $data you read in:
$nanka = 0;
while($data=<>){
$nanka += &KEISAN($data) ;
}
Do you understand this as well?

So, what kind of KEISAN would give you the "sentences"? And the "words"?
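
One possible shape for such a KEISAN is sketched below. This is only an illustration, not the answerer's solution: count_sentences and count_words are hypothetical names, the sample text is made up, and the sketch assumes that a sentence ends with a period followed by whitespace or the end of the line and that a word is any run of non-space characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical KEISAN for sentences: count periods that are followed by
# whitespace or by the end of the line.
sub count_sentences {
    my ($line) = @_;
    my $n = () = $line =~ /\.(?=\s|$)/g;    # list assignment in scalar context gives the match count
    return $n;
}

# Hypothetical KEISAN for words: count runs of non-space characters.
sub count_words {
    my ($line) = @_;
    my $n = () = $line =~ /\S+/g;
    return $n;
}

# Made-up sample read line by line through an in-memory filehandle,
# accumulating totals in the same way as the $nanka loop.
my $sample = "One two.  Three four five.\nSix seven.\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ( $lines, $bun, $tango ) = ( 0, 0, 0 );
while ( my $data = <$fh> ) {
    $lines++;
    $bun   += count_sentences($data);
    $tango += count_words($data);
}
close $fh;
print "$lines,$bun,$tango\n";    # prints 2,3,7
```

Counting matches instead of splitting means the number of spaces after a period does not affect the result.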

Supplement to this answer

$lines =0;
$bun=0;
$tango=0;
while($data=<>){
$lines ++ ;
@sentence=split(/. /,$data);
foreach $in1(@sentence){
@word=split(/ /,$sentence);
$bun++;
}
}

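With this I can now count the number of lines and the number of sentences.
Each time a sentence is split into words on a single space, 1 is added to the sentence count.
However, this does not count the words.
$bun++;
$tango = $tango + $#word +1;
I thought this would do it, but it didn't.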