VulgarDetector – an application for detecting vulgar language in text

Foreword

Is it possible to automatically recognize comments that contain vulgar language and flag them as spam? How do you implement an application that looks after your WordPress website and protects it from vulgar comments?

Why this topic?

  • there are no similar solutions
  • to learn how to develop a WordPress plugin
  • to learn about microservices
  • to learn how to use memcache
  • it is a good introduction to artificial intelligence

The project consists of three parts

  1. Backend – a REST API application built on top of the Symfony 3 microframework.
  2. Frontend – a simple static page (HTML, CSS, JS, jQuery) that presents the functionality of the application.
  3. WordPress plugin – checks each comment using the backend application.

How the application recognizes vulgar text

Checking the text is simple and relies on a dictionary of vulgar words. The whole process can be divided into five steps:

  1. Tokenization – the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens
  2. Lowercasing – convert every token to lowercase
  3. Stopword removal – drop stopwords, i.e. very commonly used words that carry little meaning on their own
  4. Duplicate removal – keep each token only once
  5. Lookup – search for the remaining tokens in the database of vulgar words
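
The five steps above can be sketched in a few lines of PHP. This is only an illustration, not the code from the backend repository: the function name `findVulgarWords`, the tiny stopword list, and the in-memory array standing in for the vulgar-word database are all assumptions made for the example.

```php
<?php
// Hypothetical stand-ins for the real stopword list and the vulgar-word database.
const STOPWORDS = ['the', 'a', 'is', 'you', 'are', 'and'];
const VULGAR_WORDS = ['damn', 'crap'];

/**
 * Returns the vulgar words found in $text, following the five steps.
 */
function findVulgarWords(string $text): array
{
    // 1. Tokenization: split on every run of characters that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // 2. Lowercase tokens (strtolower handles ASCII; mb_strtolower would be
    //    the choice for non-ASCII text if the mbstring extension is available).
    $tokens = array_map('strtolower', $tokens);
    // 3. Remove common stopwords.
    $tokens = array_diff($tokens, STOPWORDS);
    // 4. Remove duplicates.
    $tokens = array_unique($tokens);
    // 5. Search tokens in the "database" (here: a plain array lookup).
    return array_values(array_intersect($tokens, VULGAR_WORDS));
}

echo implode(', ', findVulgarWords('You are a DAMN fool, damn it!')), PHP_EOL; // prints "damn"
```

In a real deployment the last step would presumably query the database (possibly through a cache such as memcache) instead of an array, but the order of the steps stays the same.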

Presentation of the solution

  1. BACKEND
    Repository:
    https://github.com/tarnawski/vulgar-detector-api
    Staging:
    http://vulgardetector-api.ttarnawski.usermd.net/status
  2. FRONTEND
    Repository:
    https://github.com/tarnawski/vulgar-detector
    Staging:
    http://vulgardetector.ttarnawski.usermd.net/
  3. WORDPRESS PLUGIN
    Repository:
    https://github.com/tarnawski/vulgar-detector-plugin
    WordPress Plugin Directory:
    https://wordpress.org/plugins/vulgar-detector/

Recognizing negative comments with artificial intelligence

Foreword

Artificial intelligence is not just for C/C++. With PHP, you can implement neural networks and other machine learning techniques in your web applications. To teach myself and gain experience in artificial intelligence, I designed an application that recognizes negative comments using one particular technique, the Naive Bayes algorithm. I based my classification code on the Wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering. The training data comes from http://help.sentiment140.com/for-students/

What is it?

grimedetector is a text classification application with a focus on reuse, customizability and performance. It is particularly useful for detecting negative (or positive) comments, or any other kind of text. The application is based on a Naive Bayes statistical classifier.

Introduction to the Bayes Theorem

The Naive Bayes classifier is a machine learning method used to solve classification problems. Its task is to assign a new case to one of the decision classes; the set of classes must be finite and defined a priori.

Mathematical foundation

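The code calculates the probability that a text is negative, given that it contains a specific word, using the following formula:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S) \cdot \Pr(S)}{\Pr(W \mid S) \cdot \Pr(S) + \Pr(W \mid H) \cdot \Pr(H)}$$

where, taking the word “replica” as an example: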

  • Pr(S|W) is the probability that a comment is negative, knowing that the word “replica” is in it;
  • Pr(S) is the overall probability that any given comment is negative;
  • Pr(W|S) is the probability that the word “replica” appears in negative comments;
  • Pr(H) is the overall probability that any given comment is positive;
  • Pr(W|H) is the probability that the word “replica” appears in positive comments.
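
As a rough illustration, the per-word probability can be computed from the four quantities on the right-hand side with Bayes' theorem, as described in the Wikipedia entry on Bayesian spam filtering. The function name `wordNegativeProbability` and all the numbers below are made up for the example; they are not taken from the project or its training data.

```php
<?php
/**
 * Pr(S|W) computed with Bayes' theorem:
 * Pr(W|S) * Pr(S) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H)).
 */
function wordNegativeProbability(float $ps, float $pws, float $ph, float $pwh): float
{
    return ($pws * $ps) / ($pws * $ps + $pwh * $ph);
}

// Made-up numbers: 40% of comments are negative (Pr(S) = 0.4, Pr(H) = 0.6),
// the word appears in 20% of negative comments and in 5% of positive ones.
$psw = wordNegativeProbability(0.4, 0.2, 0.6, 0.05);
echo round($psw, 3), PHP_EOL; // prints 0.727
```

With these numbers, seeing the word raises the probability that the comment is negative from the prior of 0.4 to roughly 0.73.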

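The code then combines the probabilities of all the unique words in a test comment to decide whether the text is negative, based on the following formula:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where p_i is the probability Pr(S|W_i) calculated for the i-th unique word and N is the number of unique words.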

The result p is typically compared to a given threshold to decide whether the comment is negative or not. If p is lower than the threshold, the comment is considered likely positive; otherwise it is considered likely negative.
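
A minimal sketch of this combining step and the threshold check follows, assuming the combination rule from the Wikipedia entry cited earlier (the product of the per-word probabilities divided by that product plus the product of their complements). The function name `combineProbabilities`, the threshold of 0.5, and the sample probabilities are hypothetical.

```php
<?php
/**
 * Combines per-word probabilities p1..pN into a single score:
 * p = (p1 * ... * pN) / (p1 * ... * pN + (1 - p1) * ... * (1 - pN)).
 */
function combineProbabilities(array $probabilities): float
{
    $product = 1.0;            // p1 * p2 * ... * pN
    $complementProduct = 1.0;  // (1 - p1) * (1 - p2) * ... * (1 - pN)
    foreach ($probabilities as $p) {
        $product *= $p;
        $complementProduct *= (1 - $p);
    }
    return $product / ($product + $complementProduct);
}

const THRESHOLD = 0.5; // hypothetical cut-off; tune it for your own data

// Made-up per-word probabilities for a three-word comment.
$p = combineProbabilities([0.9, 0.8, 0.3]);
echo $p >= THRESHOLD ? 'likely negative' : 'likely positive', PHP_EOL; // prints "likely negative"
```

Here the products are 0.216 and 0.014, so p is about 0.94. The sketch assumes every probability lies strictly between 0 and 1; a value of exactly 0 or 1 would force the result, and one of each would make the denominator zero.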

Implementing a Naive Bayes classifier in PHP

class NaiveBayesClassifier
{
    /** @var WordRepository $wordRepository */
    private $wordRepository;

    public function __construct(WordRepository $wordRepository)
    {
        $this->wordRepository = $wordRepository;
    }

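    /**
     * Combines the per-word probabilities into one score between 0 and 1;
     * the closer to 1, the more likely the text is negative ("grime").
     *
     * @param string[] $words unique tokens of the text
     */
    public function classify($words)
    {
        $probabilityProduct = 1; // p1 * p2 * ... * pN
        $complementProduct = 1;  // (1 - p1) * (1 - p2) * ... * (1 - pN)
        foreach ($words as $word) {
            $probability = $this->wordProbability($word);
            $probabilityProduct *= $probability;
            $complementProduct *= (1 - $probability);
        }
        $grimeProbability = $probabilityProduct / ($probabilityProduct + $complementProduct);
        return round($grimeProbability, 2);
    }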

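    /**
     * Pr(S|W): probability that a text is grime given that it contains $word.
     */
    public function wordProbability($word)
    {
        $ps = $this->probabilityContentIsGrime();
        $ph = $this->probabilityContentIsHam();
        $pws = $this->probabilityWordInGrime($word);
        $pwh = $this->probabilityWordInHam($word);
        $denominator = $pws * $ps + $pwh * $ph;
        if ($denominator == 0) {
            // No evidence either way; return a neutral value instead of dividing by zero.
            return 0.5;
        }
        $psw = ($pws * $ps) / $denominator;
        // Keep the result away from 0 and 1 so that one word cannot force the combined score.
        $psw = $psw == 1 ? 0.99 : $psw;
        $psw = $psw == 0 ? 0.01 : $psw;
        return $psw;
    }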

    public function probabilityContentIsGrime()
    {
        return $this->wordRepository->getGrimeCount() / $this->wordRepository->getWordsCount();
    }

    public function probabilityContentIsHam()
    {
        return $this->wordRepository->getHamCount() / $this->wordRepository->getWordsCount();
    }

    public function probabilityWordInGrime($word)
    {
        /** @var Word $word */
        $word = $this->wordRepository->getWordByName($word);
        if (!$word) {
            return 0.5;
        }
        return $word->getGrimeCount() / $this->wordRepository->getGrimeCount();
    }

    public function probabilityWordInHam($word)
    {
        /** @var Word $word */
        $word = $this->wordRepository->getWordByName($word);
        if (!$word) {
            return 0.5;
        }
        return $word->getHamCount() / $this->wordRepository->getHamCount();
    }
}

Source: https://github.com/tarnawski/grime-detector-api