Extenuating Circumstances is a weblog by Dan Hon

  • Posted
    6 June 2007 @ 4pm

    Tagged
    8bar, ai, filter, filtering, ianhughes, ibm, language, linguistics, naturallanguage, rooreynolds

    A Smarter Naughty Words Filter

    In my last few days at Mind Candy I’ve taken part in a meeting with the inestimable Roo and Ian of IBM’s eightbar (Steff, upon learning of the job titles as Metaverse Evangelists, instantly declared a competition to naturally work in “the greatest swordfighter in the metaverse” into casual conversation).

    For whatever reason, we got talking about the seemingly easy problem (at least, when presented by management) of putting together a “naughty words filter” or, in Ian’s case, a “death threat filter”. Which, when you think about it, normally goes something like this:

    Requirements:

    • A naughty words filter

    Spec:

    • A filter that will filter out a list of naughty words

    Which, as many people know, is rather hard to do properly, and very easy to do badly, or such that it’s a waste of time. In Ian’s case, he told us of a rather amusing incident where, if you’re filtering/looking for death threats, an international (let’s say German) audience may not be particularly helpful when one of the definite articles is die.
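
    To make the problem concrete, here is a minimal sketch of the filter that spec describes, in Python. The word list and the function name are made up for illustration; nothing here is what Ian or anyone at IBM actually built.

```python
import re

# Hypothetical blacklist, for illustration only.
BLACKLIST = {"die", "kill"}

def naive_filter(message, blacklist=BLACKLIST):
    """Mask every word that appears in the blacklist, ignoring case."""
    def mask(match):
        word = match.group(0)
        # Replace a blacklisted word with asterisks of the same length.
        return "*" * len(word) if word.lower() in blacklist else word
    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", mask, message)

print(naive_filter("I will kill you"))
print(naive_filter("Die Katze schläft"))
```

    The second call masks the first word of a perfectly innocent German sentence about a sleeping cat, which is exactly the false positive described above.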

    Anyway.

    We were talking about various strategies for maintaining an up-to-date list of naughty words when I struck upon the idea of having Urban Dictionary simply publish a list of all of their words, and blacklisting against that. Or, at least, having a human at Urban Dictionary quickly scan through for the slightly more offensive words, and publish those as a blacklist, charging for the list, if they felt like it.

    Well, I thought it was a good idea…

    Update: it’s eightbar, not 8bar.


    5 Comments

    Posted by
    Brian Enigma
    6 June 2007 @ 9pm

    Urban Dictionary has a surprisingly easy API (http://wiki.urbandictionary.com/index.php/Method_documentation) and multiple calls to “nearby” to traverse the list of words would be simple enough to do. I’d worry that a lot of words in there are not “bad” per se. For instance, I believe that UF forum user and PXC player MasterCheese has his own definition, which would make automatic redacting of the term a little confusing.

    When I have had to do stuff like this before (wayyyy back in the BBS days), I found that a good initial wordlist seed plus a moderator that can add to the wordlist and talk-to/ban bad users tends to be the best solution in most cases. Kids can get pretty creative when it comes to bypassing word filters. From simple letter substitutions (“ph” in place of “f” in the F-word) to high-ASCII (“Ÿ” in place of “f”) to inserting invisible color codes between letters, they always seem to find a way. An automated filter is simply a challenge, whereas a live human being helps reinforce what is good versus bad.
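
    As a rough sketch of the arms race, a filter can try to undo the simplest substitutions before checking the word list. The substitution table and the stand-in "rude" words below are assumptions made for illustration, and this does nothing about invisible color codes or more inventive tricks.

```python
import re
import unicodedata

# Hypothetical look-alike table; a real one would never be complete.
SUBSTITUTIONS = str.maketrans(
    {"$": "s", "(": "c", "@": "a", "0": "o", "1": "i", "3": "e"}
)

def normalise(text):
    """Undo a few common disguises: case, 'ph' for 'f', symbols, accents."""
    text = text.lower().replace("ph", "f")
    text = text.translate(SUBSTITUTIONS)
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def contains_blocked(message, blacklist):
    """Return True if any normalised word is in the blacklist."""
    words = re.findall(r"[a-z]+", normalise(message))
    return any(word in blacklist for word in words)

# Stand-in words, so the example stays polite.
blacklist = {"fudge", "sugar"}
print(contains_blocked("oh PHUDGE", blacklist))
print(contains_blocked("$ugar!", blacklist))
```

    Every rule added this way also widens the net for innocent words (the "ph" rule turns "phone" into "fone"), which is part of why a human moderator still ends up doing the real work.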


    Posted by
    danhon
    6 June 2007 @ 10pm

    Yeah, it was pointed out to me later in the day that not all of Urban Dictionary is “bad” words, so it’s perhaps not as useful as I initially thought it would be.

    I completely agree on the moderation point - people are incredibly inventive, and automation is always seen as a challenge to be surmounted, rather than a human who can actually enforce and explain why moderation rules are required.


    Posted by
    matlock
    6 June 2007 @ 11pm

    I had to do this once for a public art project back in 2000 that involved people SMS-ing a 15m LED screen on the outside of a building. I found a dictionary of rude words on the web that included foreign languages (important, as the project was in Huddersfield, so we needed Urdu and Hindi as well as English).

    I then came in every morning to see what the local kids had done to evade the content filters. Letter substitution was first - $hit and fu(k, etc - and then more creative uses of language. It was like a daily battle between me and the local hoodies, until finally, one day I came into work to see the huge LED screen at the front of our building repeating the message:

    BOOTHY IS A LADYBOY

    At that point, I admitted defeat!


    Posted by
    Guy P
    7 June 2007 @ 7pm

    Yeah, blacklisting is really hard, in that you can easily wipe out 99% of the Naughty Possibility Space (which will now be the name of my next album) but as soon as people work out where the 1% is, they’ll all go over there and start effyouseakaying again.

    It’s good for two things:

    * avoiding *some* drudge work
    * looks like you at least *tried* to do something about it

    …but, yes, overall quite useless without an Actual Human to keep people in line. And I have to say I think that’s much more honorable a tactic than the attempts at whitelisting I’ve seen, which are frankly utterly depressing.


    Posted by
    Roo
    7 June 2007 @ 9pm

    Hi Dan

    (By the way, it’s “eightbar” rather than “8bar”. My fault for wearing a t-shirt displaying the shorthand version though.)

    I’m so excited to hear that Urban Dictionary has an API. Perhaps now I can begin to automagically parse sentences only previously accessible to da yoof.

    I do think a feed of rude words would be funny though. But then, I think ‘chutney ferret’ is funny.

