parsing user-entered content for rude words
on 16.04.2005 16:43:21 by Greg
Hi,
I would like to make life easier for myself by automating (as best as
possible) the removal of messages supplied by users.
If my incoming string is $input, I originally thought of searching as
follows:
foreach my $rudeword (@RudeWordList) {
    if ($input =~ /$rudeword/i) {
        # REJECT the message
    }
}
However, this seems a rather unoptimised method of searching. Is there a
more optimised way of doing this?
Cheers,
Greg
Re: parsing user-entered content for rude words
on 17.04.2005 03:28:34 by Joe Smith
Greg wrote:
> I would like to make life easier for myself by automating (as best as
> possible) the removal of messages supplied by users.
> if ($input =~ /$rudeword/i) {
The above code would improperly reject these:
Scunthorpe Hospital Radio (www.shronline.co.uk)
Sussex and Essex in England
going off half-cocked
Matsushita is the parent corporation of Panasonic
farther and farther
Failing to take context into account would reject these:
breast cancer survivor
Tom, Dick, and Harry
cute little pussy cat
prize-winning bitch and her puppies
Re: parsing user-entered content for rude words
on 17.04.2005 10:32:58 by Greg
Joe Smith wrote:
> Greg wrote:
>
>> I would like to make life easier for myself by automating (as best as
>> possible) the removal of messages supplied by users.
>> if ($input =~ /$rudeword/i) {
>
>
> The above code would improperly reject these:
Hi, In order to reply, I've re-ordered your (very good) examples:
> Scunthorpe Hospital Radio (www.shronline.co.uk)
> Matsushita is the parent corporation of Panasonic
Word boundaries - very good point!
> Sussex and Essex in England
> going off half-cocked
> breast cancer survivor
> farther and farther
> Tom, Dick, and Harry
"Sex, cocked, breast, fart and dick" - these were not the type of words
I was planning to look for. I wouldn't call these words particularly
rude :P They are certainly "acceptable" for where this code will be deployed.
> Failing to take context into account would reject these:
> cute little pussy cat
> prize-winning bitch and her puppies
VERY good point. Perhaps a better solution would be:
foreach my $rudeword (@RudeWordList) {
    if ($input =~ /\b$rudeword\b/i) {
        # PLACE MESSAGE ON ICE
        # FLAG MESSAGE "Awaiting acceptance from moderator"
    }
}
Thanks Joe!
Re: parsing user-entered content for rude words
on 05.05.2005 10:40:54 by Tim X
Greg writes:
> Hi,
>
> I would like to make life easier for myself by automating (as best as
> possible) the removal of messages supplied by users.
>
> If my incoming string is $input, I originally thought of searching as
> follows:
>
>
> foreach my $rudeword (@RudeWordList) {
>     if ($input =~ /$rudeword/i) {
>         # REJECT the message
>     }
> }
>
>
>
> However, this seems a rather unoptimised method of searching. Is there a
> more optimised way of doing this?
>
> Cheers,
>
>
Hi Greg,
You're right, it's not a very good search approach. The problem is, you
will be comparing the input against every rude word in the list. So, if
you had 1000 rude words, you would do 1000 comparisons for each input.
Something that may help is to use a hash instead of a list for
your rude words. Use the rude word as the key and just put a 1 in for
the value. This would allow you to do a single lookup for each
word, rather than multiple comparisons with the whole list. If
performance is still not good enough, you could then look at other
optimizations - for example, you may be able to skip any input word
with fewer than 4 characters, as there are not many rude words within
that set. You could also eliminate any words with more characters than
your longest 'rude' word.
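A minimal sketch of that hash-lookup idea (the sample data and the
%rude hash name are just illustrative stand-ins for Greg's variables):

    use strict;
    use warnings;

    # Stand-ins for Greg's @RudeWordList and $input
    my @RudeWordList = ('badword1', 'badword2');
    my $input        = 'some user-supplied text containing badword1 perhaps';

    # Build the lookup hash once: rude word => 1
    my %rude = map { lc($_) => 1 } @RudeWordList;

    foreach my $word (split /\W+/, $input) {
        next if length $word < 4;            # skip short words, as suggested above
        if ($rude{ lc $word }) {
            print "flag for moderator: $word\n";   # or place the message on ice
            last;
        }
    }

Each input word then costs one hash lookup instead of a pass over the
whole rude-word list.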
Tim
--
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you
really need to send mail, you should be able to work it out!
Re: parsing user-entered content for rude words
on 05.05.2005 11:58:25 by Tim X
Joe Smith writes:
> Greg wrote:
>
>> I would like to make life easier for myself by automating (as best as
>> possible) the removal of messages supplied by users.
>> if ($input =~ /$rudeword/i) {
>
> The above code would improperly reject these:
> Scunthorpe Hospital Radio (www.shronline.co.uk)
> Sussex and Essex in England
> going off half-cocked
> Matsushita is the parent corporation of Panasonic
> farther and farther
>
> Failing to take context into account would reject these:
> breast cancer survivor
> Tom, Dick, and Harry
> cute little pussy cat
> prize-winning bitch and her puppies
This is a common problem with any filtering approach. In Australia,
the government passed legislation which required ISPs to block access
to sites considered offensive (whatever that is). All the technical
people, professors of computing science, programmers etc, tried to
explain the problems. The response from the senator pushing this
through was to accuse these people of being pornographers.
The problem comes down to more than context; it comes down to an area
of computing science called natural language processing (NLP). The aim
here is to try and write software which can understand natural
language. This is a very difficult problem, particularly in languages
such as English, because the rules have so many exceptions and are
difficult to specify in a concise way. However, unless the computer
can 'understand' what is being expressed, there is no solution that is
guaranteed to work 100% of the time. You can reduce the number of
false positives by extending the match criteria to look for more
information - for example, if the banned word was 'breast' (which
isn't really a rude word - unless you're one of those prudes who finds
penis and vagina rude), you could also look for the word 'cancer'
within x number of words, and you would then be less likely to flag
the message as 'rude' content, i.e. to get a false positive. However,
this is still what the AI world calls a heuristic rule - more commonly
known as a 'rule of thumb'.
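A rough sketch of that kind of proximity check (the window of 5 words
and the breast/cancer pair are just assumed values for illustration):

    use strict;
    use warnings;

    # If a flagged word has its 'context' word within 5 words, treat it as innocent.
    my $input = 'She is a breast cancer survivor and a volunteer';
    my @words = split /\W+/, lc $input;
    my %context_for = ( breast => 'cancer' );   # flagged word => context word

    for my $i (0 .. $#words) {
        my $ctx = $context_for{ $words[$i] } or next;
        my $lo  = $i - 5 < 0       ? 0       : $i - 5;
        my $hi  = $i + 5 > $#words ? $#words : $i + 5;
        if (grep { $_ eq $ctx } @words[$lo .. $hi]) {
            print "'$words[$i]' looks innocent in this context\n";
        } else {
            print "'$words[$i]' should go to a moderator\n";
        }
    }

It is still only a rule of thumb - it lowers the false-positive rate,
it does not understand the sentence.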
Note that if you are trying to eliminate spam with ads for porn sites
etc., you would be better off looking at some of the quite good
anti-spam algorithms. There are a number of interesting approaches
currently being developed. One approach is to use a collection of
weighted rules and apply some basic statistical calculations which
give you a probability score for the likelihood of the message being
spam. Some of these systems use a training process to adjust the
weights of each rule. Another very interesting network approach to
spam detection is the use of a centralised database and a server.
Users send copies of spam to this server, which computes md5 checksums
on the message and puts this info in a database. You then have a
client which calculates md5 sums on the incoming message and queries
the remote anti-spam database to see if it knows of any other messages
with the same md5 checksum. The theory here is that most spam is sent
to a large number of users and has the same message body contents. As
md5 checksums are based on the actual content of the message, the odds
are very high that if you have the same md5 checksum you have the same
message, and therefore you can be fairly confident it is spam. Of
course, if this begins to work, the spammers will just begin to add
random characters or blank spaces to each message, which will change
the md5 checksum for that message.
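A toy sketch of that checksum lookup (the Digest::MD5 module is real;
the %seen_spam hash is just a stand-in for the shared remote database):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Stand-in for the shared anti-spam database: digest => times reported
    my %seen_spam = ( md5_hex('Buy now! Cheap pills!!!') => 42 );

    my $incoming = 'Buy now! Cheap pills!!!';
    my $digest   = md5_hex($incoming);

    if ($seen_spam{$digest}) {
        print "Probably spam - same body reported $seen_spam{$digest} times\n";
    } else {
        print "No match in the database - let it through\n";
    }

A single changed character gives a completely different md5 digest,
which is exactly the weakness spammers exploit.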
Tim
--
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you
really need to send mail, you should be able to work it out!