Author: Franz Malten Buemann

  • Hands-on for Toxicity Classification and minimization of unintended bias for Comments using…

    Hands-on for Toxicity Classification and minimization of unintended bias for Comments using classical machine learning models

In this blog, I will walk through a solution to a toxicity classification problem for text, i.e. a text-based binary classification machine learning problem, for which I will implement several classical machine learning and deep learning models.

For this exercise, I am working on a problem from the Kaggle competition: “Jigsaw Unintended Bias in Toxicity Classification”.

In this problem, along with toxicity classification, we also have to minimize unintended bias (explained briefly in the next section).


    Business problem and Background:

    Background:

This problem was posted by the Conversation AI team (a research initiative) as a Kaggle competition.

This problem's main focus is to identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion.

When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity: the models predicted a high likelihood of toxicity for comments merely containing those identity terms (e.g. “gay”).

    Unintended Bias

The models over-fit to keywords that appear frequently in toxic comments, so when one of those keywords appears in a comment that is actually not toxic, the model's bias towards the keyword still causes it to predict the comment as toxic.

    For example: “I am a gay woman”

    Problem Statement

Build toxicity models that operate fairly across a diverse range of conversations, so that comments mentioning identity terms are not flagged when they are not actually toxic. The main intention of this problem is to detect and reduce unintended bias in model results.

    Constraints

No latency constraint is mentioned in this competition.

    Evaluation Metrics

To measure unintended bias, the evaluation metric is ROC-AUC, computed on three specific subsets of the test set for each identity. You can find more details about these metrics in Conversation AI's recent paper.


    Overall AUC

    This is the ROC-AUC for the full evaluation set.

    Bias AUCs

Here we divide the test data into identity subgroups and calculate the ROC-AUC for each subgroup individually. When we select one subgroup, the rest of the data is referred to as the background data.

    Subgroup AUC

Here we restrict the test data to the comments that mention the selected identity subgroup and calculate the ROC-AUC on that subset alone. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

    BPSN (Background Positive, Subgroup Negative) AUC

Here we select two groups from the test set: toxic data points from the background and non-toxic data points from the subgroup. We then take the union of the two groups and calculate the ROC-AUC. A low value in this metric means the model confuses non-toxic comments that mention the identity with toxic comments that do not, i.e. it likely predicts toxicity too often for comments mentioning the identity.

    BNSP (Background Negative, Subgroup Positive) AUC

Here we select two groups from the test set: non-toxic data points from the background and toxic data points from the subgroup. We then take the union of the two groups and calculate the ROC-AUC. A low value in this metric means the model confuses toxic comments that mention the identity with non-toxic comments that do not, i.e. it likely misses toxicity in comments mentioning the identity.

    Generalized Mean of Bias AUCs

    To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:
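Written out, this is the power mean from the competition description, with N identity subgroups, per-subgroup bias metric values m_s, and p = -5:

$$ M_p(m_s) = \left( \frac{1}{N} \sum_{s=1}^{N} m_s^{\,p} \right)^{1/p} $$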

source: Jigsaw competition evaluation metric

    Final Metric

    We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:
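Written out (the competition uses A = 3 bias submetrics and sets all weights, including w_0, to 0.25):

$$ \text{score} = w_0 \, \text{AUC}_{\text{overall}} + \sum_{a=1}^{A} w_a \, M_p(m_{s,a}) $$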

source: Jigsaw competition evaluation metric

    Exploratory Data Analysis

    Overview of the data

Jigsaw provided a good amount of training data for identifying toxic comments without unintended bias: the training data consists of about 1.8 million rows with 45 features.

The “comment_text” column contains the text of each individual comment.

The “target” column contains the overall toxicity score for the comment; trained models should predict this column for the test data, and comments with target >= 0.5 are considered the positive class (toxic).

    A subset of comments has been labeled with a variety of identity attributes, representing the identities that are mentioned in the comment. Some of the columns corresponding to identity attributes are listed below.

    • male
    • female
    • transgender
    • other_gender
    • heterosexual
    • homosexual_gay_or_lesbian
• christian
• jewish
• muslim
• hindu
• buddhist
    • atheist
    • black
    • white
    • intellectual_or_learning_disability

    Let’s see the distribution of the target column

First, I use the following snippet of code to convert the target column scores into binary labels.
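A minimal sketch of that conversion, assuming the training data has been loaded into a pandas DataFrame called train_df (the variable and column names used below are for illustration only):

```python
import pandas as pd

# Load the competition's training data (path is illustrative)
train_df = pd.read_csv("train.csv")

# target >= 0.5 is treated as toxic (1), everything else as non-toxic (0)
train_df["toxic_label"] = (train_df["target"] >= 0.5).astype(int)
```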

    Now, let’s plot the distribution

    Distribution plot for Toxic and non Toxic comments

We can observe from the plot that around 8% of the data is toxic and 92% is non-toxic.

Now, let's check for unique and repeated comments in the “comment_text” column, and also check for duplicate rows in the dataset.
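A small sketch of those checks, reusing the train_df frame from the sketch above:

```python
# How many comment texts appear exactly once vs. more than once
text_counts = train_df["comment_text"].value_counts()
print("Unique comments:", (text_counts == 1).sum())
print("Comments repeating more than once:", (text_counts > 1).sum())

# Fully duplicated rows in the dataset
print("Duplicate rows:", train_df.duplicated().sum())
```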

From the above snippet, we see there are 1,780,823 comments that are unique and 10,180 comments that repeat more than once.

We can also see that there are no duplicate rows in the entire dataset.

Let's plot box plots of comment lengths for non-toxic and toxic comments

    Box plot for Toxic and non Toxic comments

From the above plot, we can observe that the comment-length distributions for the toxic and non-toxic classes largely overlap, but a few toxic comments are longer than a thousand words, while a few non-toxic comments exceed 1800 words.

    Let’s quickly check the percentile for comment lengths
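One way to compute those percentiles (the word-count column comment_len is an assumption of this sketch):

```python
import numpy as np

# Comment length measured in words
train_df["comment_len"] = train_df["comment_text"].str.split().str.len()

for label, name in [(1, "toxic"), (0, "non-toxic")]:
    lengths = train_df.loc[train_df["toxic_label"] == label, "comment_len"]
    for p in (50, 75, 90, 95, 99, 100):
        print(name, p, np.percentile(lengths, p))
```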

    From the above code snippet, we can see the 90th percentile for comment lengths of toxic labels is 652 and the 100th percentile comment length is 1000 words.

    From the above code snippet, we can see the 90th percentile for comment lengths of non-toxic labels is 755 and the 100th percentile comment length is 1956 words.

We can also print some of the comments that have more than 1300 words.

    Let’s plot Word cloud for Toxic and Non-Toxic comments
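A sketch of the toxic-comments word cloud using the wordcloud package (the non-toxic plot is built the same way with toxic_label == 0):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

toxic_text = " ".join(train_df.loc[train_df["toxic_label"] == 1, "comment_text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(toxic_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```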

    word cloud plot for Toxic comments

In the above plot, we can see that words like trump, stupid, ignorant, people, and idiot are used in toxic comments with high frequency.

    word cloud plot for Non-Toxic comments

In the above plot, we can see that no negative words appear with high frequency in non-toxic comments.

    Basic Feature Extraction

    Let us now construct a few features for our classical models:

• total_length = length of the comment
• capitals = number of capital letters in the comment
• caps_vs_length = (number of capital letters) / (total length of the comment)
• num_exclamation_marks = number of exclamation marks
• num_question_marks = number of question marks
• num_punctuation = number of punctuation characters
• num_symbols = number of symbols (@, #, $, %, ^, &, *, ~)
• num_words = total number of words
• num_unique_words = number of unique words
• words_vs_unique = (number of unique words) / (number of words)
• num_smilies = number of smilies
• word_density = average word length within the comment
• num_stopWords = number of stopwords in the comment
• num_nonStopWords = number of non-stopwords in the comment
• num_nonStopWords_density = (num non-stopwords) / (num stopwords + num non-stopwords)
• num_stopWords_density = (num stopwords) / (num stopwords + num non-stopwords)

    To check the implementation of these features you can check my EDA notebook.
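As an illustration (not the notebook's exact code), a minimal sketch of how a few of these features can be computed with pandas:

```python
import string

def extract_basic_features(df):
    # A handful of the hand-crafted features listed above
    df["total_length"] = df["comment_text"].str.len()
    df["capitals"] = df["comment_text"].apply(lambda s: sum(c.isupper() for c in s))
    df["caps_vs_length"] = df["capitals"] / df["total_length"].clip(lower=1)
    df["num_exclamation_marks"] = df["comment_text"].str.count("!")
    df["num_question_marks"] = df["comment_text"].str.count(r"\?")
    df["num_punctuation"] = df["comment_text"].apply(
        lambda s: sum(c in string.punctuation for c in s))
    df["num_words"] = df["comment_text"].str.split().str.len()
    df["num_unique_words"] = df["comment_text"].apply(lambda s: len(set(s.lower().split())))
    df["words_vs_unique"] = df["num_unique_words"] / df["num_words"].clip(lower=1)
    return df

train_df = extract_basic_features(train_df)
```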

After implementing these features, let's check the correlation of the extracted features with the target and with some of the other features from the dataset.

For this, I first computed the Pearson correlation of the extracted features with some of the features from the dataset and plotted the following seaborn heatmap.
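A sketch of such a heatmap with seaborn; the column subset shown mixes dataset reaction columns (funny, wow, sad) with a few of the extracted features from the sketch above and is purely illustrative:

```python
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["target", "funny", "wow", "sad", "total_length",
        "num_exclamation_marks", "num_words", "caps_vs_length"]
corr = train_df[cols].corr(method="pearson")

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Extracted vs. existing features (Pearson correlation)")
plt.show()
```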

Correlation plot between extracted features and some features in the dataset

From the above plot, we can observe some meaningful correlations between extracted and existing features: for example, a high positive correlation between target and num_exclamation_marks, and a high negative correlation between num_stopWords and funny. There are also many feature pairs with no correlation, for example num_stopWords with wow, and sad with num_smilies.

    Feature selection

I applied some feature selection methods (the Filter Method, Wrapper Method, and Embedded Method) to the extracted features, following this blog.

    Filter Method: In this method, we just filter and select only a subset of relevant features, and filtering is done using the Pearson correlation.

Wrapper Method: In this method, we use a machine learning model and evaluate features based on the performance of the selected feature subset. This is an iterative process, but it is more accurate than the filter method. It can be implemented in two ways.

1. Backward Elimination: First we feed all the features to the model and evaluate its performance, then we eliminate the worst-performing features one by one until performance settles at a good level. It uses the p-value as a performance metric.

2. RFE (Recursive Feature Elimination): In this method, we recursively remove features and build the model on the remaining features. It uses an accuracy metric to rank the features according to their importance.

Embedded Method: This method is commonly implemented with regularization techniques that penalize features whose coefficients fall below a given threshold.

I followed the implementation from this blog and used Lasso regularization. If a feature is irrelevant, Lasso penalizes its coefficient and drives it to 0. Hence the features with coefficient = 0 are removed and the rest are kept.
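A rough sketch of such an embedded selection using Lasso through sklearn's SelectFromModel (the feature list here is a placeholder subset, not the full set of 16 extracted features):

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

extracted_feature_cols = ["total_length", "capitals", "caps_vs_length",
                          "num_exclamation_marks", "num_words"]  # placeholder subset

X = StandardScaler().fit_transform(train_df[extracted_feature_cols])
y = train_df["toxic_label"]

# Lasso drives the coefficients of irrelevant features to 0; keep the non-zero ones
selector = SelectFromModel(LassoCV(cv=4), threshold=1e-5)
selector.fit(X, y)

selected = [c for c, keep in zip(extracted_feature_cols, selector.get_support()) if keep]
print("Selected features:", selected)
```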

To understand these methods and their implementation in more detail, please check this blog.

After implementing all the feature selection methods, I chose the results of the Embedded Method and will include the selected features in training.

    Results of embedded method

The embedded method marked the following features as unimportant:

    1. caps_vs_length

    2. words_vs_unique

    3. num_smilies

    4. num_nonStopWords_density

    5. num_stopWords_density

I plotted some of the selected extracted features in the following plots:

Violin plots for 2 selected features

    Summary of EDA analysis:

1. There are far fewer toxic comments than non-toxic comments: about 8 percent toxic and 92 percent non-toxic.

2. We printed percentile values of comment lengths and observed that the 90th percentile is 652 words for toxic comments and 755 words for non-toxic comments. We also found 7 comments longer than 1300 words, all of them non-toxic.

3. We created some text features and plotted their correlations with the target and identity features. We also plotted the correlations among the extracted features themselves.

4. We applied some feature selection methods and used the results of the Embedded Method, selecting 11 of the 16 extracted features as relevant.

5. We plotted violin and density plots for some of the extracted features against the target labels.

6. We plotted word clouds for toxic and non-toxic comments and observed some words that are frequently used in toxic comments.

    Now let’s do some basic pre-processing on our data

    Checking for NULL values

    Number of null values in features

From the above snippet, we can observe that many of the numerical identity features have a lot of null values, while the comment text feature has none. Like the target column, the identity features take values between 0 and 1.

So, I converted the identity features along with the target column to Boolean features, as described in the competition. Values greater than or equal to 0.5 are marked as True and everything else as False (null values become False).

In the below snippet, I select a few identity features for the model's evaluation and convert them, along with the target column, to Boolean features.
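A sketch of that conversion; the identity columns listed below are the ones highlighted by the competition's benchmark, and the exact subset used in the notebook may differ:

```python
identity_columns = ["male", "female", "homosexual_gay_or_lesbian", "christian",
                    "jewish", "muslim", "black", "white",
                    "psychiatric_or_mental_illness"]

# Values >= 0.5 become True; everything else, including NaN, becomes False
for col in identity_columns + ["target"]:
    train_df[col] = train_df[col].fillna(0) >= 0.5
```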

As the target column is now binary, our data is ready for binary classification models.

    Now I split the data into Train and Test in 80:20 ratio:
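For example, with sklearn's train_test_split:

```python
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(train_df, test_size=0.20, random_state=42)
print(train_data.shape, test_data.shape)
```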

For text pre-processing, I am using the following function, which implements:

    1. Tokenization

    2. Lemmatization

    3. Stop words removal

4. Applying a regex to remove all non-word tokens while keeping the special symbols that are commonly used in comments.

    Function to preprocess comment texts
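A minimal NLTK-based sketch of such a function (not the exact code from the notebook; the set of kept symbols is an assumption):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt"), nltk.download("wordnet"), nltk.download("stopwords")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_comment(text):
    # 4. Keep word characters and a few symbols common in comments, drop the rest
    text = re.sub(r"[^\w\s!?.,@#$%&*']", " ", text.lower())
    tokens = word_tokenize(text)                        # 1. tokenization
    tokens = [lemmatizer.lemmatize(t) for t in tokens   # 2. lemmatization
              if t not in stop_words]                   # 3. stop-word removal
    return " ".join(tokens)

train_data["clean_text"] = train_data["comment_text"].apply(preprocess_comment)
test_data["clean_text"] = test_data["comment_text"].apply(preprocess_comment)
```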

Vectorizing Text with the TfIdf Vectorizer

    As models can’t understand text directly, we have to vectorize our data. So to vectorize our data I selected TfIdf.

    To understand TfIdf and its implementation in python briefly you can check this blog.

So first, we apply TfIdf on our train data with a maximum of 10000 features.
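For example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000)
X_train_text = tfidf.fit_transform(train_data["clean_text"])
X_test_text = tfidf.transform(test_data["clean_text"])
```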

TfIdf gives scores for all tokens based on term frequency and inverse document frequency.
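In its classic form (sklearn applies a smoothed variant of the idf term), the weight of a token t in a document d is:

$$ \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)} $$

where tf(t, d) is the term frequency of t in d, df(t) is the number of documents containing t, and N is the total number of documents.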

    source: TfIdf weightage for a token

The higher the score, the more weight the token carries.

In the following snippet, I collect the tokens along with their TfIdf scores as tuples and store them in a sorted list.

Based on these scores, I chose 150 as a threshold and will keep all tokens with a TfIdf score greater than 150.

    In the following function, I am removing all tokens having a TfIdf score of less than 150.

    Selecting tokens with TfIdf score greater than 150
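A sketch of this filtering step; here a token's score is interpreted as its TfIdf weight summed over all training documents, and the filtering is implemented by re-vectorizing with a reduced vocabulary. Both choices are assumptions of this sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Summed TfIdf weight of each token across the training documents
token_scores = X_train_text.sum(axis=0).A1
scored_tokens = sorted(zip(tfidf.get_feature_names_out(), token_scores),
                       key=lambda x: x[1], reverse=True)

# Keep only tokens whose summed score exceeds the threshold of 150
selected_tokens = [tok for tok, score in scored_tokens if score > 150]

# Re-vectorize with the reduced vocabulary so only the selected tokens remain
tfidf = TfidfVectorizer(vocabulary=selected_tokens)
X_train_text = tfidf.fit_transform(train_data["clean_text"])
X_test_text = tfidf.transform(test_data["clean_text"])
```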

    Standardization of numerical features

To normalize the extracted numerical features, I am using sklearn's StandardScaler.

Having pre-processed the extracted numerical features and the text feature, let's now stack them using scipy's hstack as follows:
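A sketch of the standardization and stacking; num_cols stands in for the selected numerical features:

```python
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

num_cols = ["total_length", "num_exclamation_marks", "num_words"]  # placeholder subset

scaler = StandardScaler()
X_train_num = scaler.fit_transform(train_data[num_cols])
X_test_num = scaler.transform(test_data[num_cols])

# Combine the sparse TfIdf features with the standardized numerical features
X_train = hstack([X_train_text, csr_matrix(X_train_num)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(X_test_num)]).tocsr()
```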

    Now our data is ready so let’s start implementing some classical models on it.

    Classical models:

In this section I will show some of the classical machine learning implementations; to check the implementation of all the classical models, you can see my notebook.

    Logistic Regression:

Logistic regression is one of the most popular classification algorithms. It is used in many binary classification applications such as spam email detection and online transaction fraud detection. Now I will apply logistic regression to the pre-processed stacked data.

    To know more about logistic regression you can check this blog.

In the below snippet you can see I am using sklearn's GridSearchCV to tune the hyperparameters, with k = 4 for cross-validation. Cross-validation is used to evaluate the trained model for different values of the hyperparameters.

I try different values of alpha to get the best score. Based on the GridSearchCV results, alpha = 0.0001 gives the best CV score.

Next, I trained a logistic regression model with the selected hyperparameter values on the training data, then used the trained model to predict probabilities on the test data.
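A condensed sketch of this step; since alpha is the hyperparameter being tuned, the sketch assumes logistic regression is fit via SGDClassifier with log loss:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

params = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}
grid = GridSearchCV(SGDClassifier(loss="log_loss", penalty="l2"),
                    params, cv=4, scoring="roc_auc", n_jobs=-1)
grid.fit(X_train, train_data["target"])
print(grid.best_params_)          # alpha = 0.0001 scored best here

# Refit with the chosen alpha and score the held-out test split
lr_model = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)
lr_model.fit(X_train, train_data["target"])
test_data["lr_prob"] = lr_model.predict_proba(X_test)[:, 1]
```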

Now I pass the data frame containing the identity columns, along with the logistic regression probability scores, to the evaluation metric function.

You can see the complete implementation of the evaluation metric in this Kaggle notebook.
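For reference, a condensed sketch of the metric itself, following the definitions given earlier (the lr_prob column comes from the sketch above; this is not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

POWER = -5.0  # p used by the competition

def subgroup_auc(df, subgroup, label="target", score="lr_prob"):
    subset = df[df[subgroup]]
    return roc_auc_score(subset[label], subset[score])

def bpsn_auc(df, subgroup, label="target", score="lr_prob"):
    # Background positives + subgroup negatives
    subset = df[(df[subgroup] & ~df[label]) | (~df[subgroup] & df[label])]
    return roc_auc_score(subset[label], subset[score])

def bnsp_auc(df, subgroup, label="target", score="lr_prob"):
    # Background negatives + subgroup positives
    subset = df[(df[subgroup] & df[label]) | (~df[subgroup] & ~df[label])]
    return roc_auc_score(subset[label], subset[score])

def power_mean(values, p=POWER):
    return np.power(np.mean(np.power(np.array(values), p)), 1.0 / p)

def final_metric(df, subgroups, label="target", score="lr_prob"):
    overall = roc_auc_score(df[label], df[score])
    bias_aucs = [
        power_mean([subgroup_auc(df, s, label, score) for s in subgroups]),
        power_mean([bpsn_auc(df, s, label, score) for s in subgroups]),
        power_mean([bnsp_auc(df, s, label, score) for s in subgroups]),
    ]
    # Overall AUC and the three bias power means are each weighted 0.25
    return 0.25 * overall + 0.25 * sum(bias_aucs)

print(final_metric(test_data, identity_columns))
```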

LR Classifier scores for all 3 Evaluation parameters

    Evaluation metric score on test data: 0.57022

After preprocessing the text and numerical features of the competition's train data, I preprocessed the submission data in the same way. Using the trained model, I obtained probability scores for the submission data and prepared submission.csv as required by the competition.

    submission file preparation
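A sketch of that preparation, reusing the helpers defined in the earlier sketches (preprocess_comment, extract_basic_features, tfidf, scaler, num_cols, lr_model):

```python
submission_df = pd.read_csv("test.csv")   # the competition's submission data
submission_df["clean_text"] = submission_df["comment_text"].apply(preprocess_comment)
submission_df = extract_basic_features(submission_df)

# Same vectorizer, scaler and stacking as in the training pipeline
X_sub_text = tfidf.transform(submission_df["clean_text"])
X_sub_num = scaler.transform(submission_df[num_cols])
X_submission = hstack([X_sub_text, csr_matrix(X_sub_num)]).tocsr()

submission = pd.DataFrame({
    "id": submission_df["id"],
    "prediction": lr_model.predict_proba(X_submission)[:, 1],
})
submission.to_csv("submission.csv", index=False)
```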

On submitting submission.csv, I got a Kaggle score of 0.57187.

    Kaggle submission score for LR classifier

    Now following the same procedure, I trained all classical models. Please check the following table for the scores of all classification models I implemented.

    Classical models Scores

    So, the XGB classifier gives the best Evaluation metric and test score among all Classical models.

    XGB Classifier scores for all 3 Evaluation parameters
    Kaggle submission Score for XGB classifier

    Please check part 2 of this project’s blog which explains the deep learning implementations and deployment function of this problem. You can check the implementation of EDA and classical machine learning models notebook on my Github repository.

    Future work

Deep learning architectures can be applied to further improve model performance.

    References

    1. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
    2. https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw
    3. Paper provided by Conversation AI team explaining the problem and metric briefly https://arxiv.org/abs/1903.04561
    4. https://www.appliedaicourse.com/
    5. https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b



    Hands-on for Toxicity Classification and minimization of unintended bias for Comments using… was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Alexa Answers Crowdsourcing Arrives in the UK

Amazon has extended Alexa Answers, its crowdsourced question-answering feature, to the United Kingdom. The feature asks the general public…

  • Amazon Alexa Skill Growth Has Slowed Further in 2020

In July of 2019, data showed that Alexa skill growth was slowing in most global markets. It wasn't clear to…

  • Why does your Business need a WhatsApp Chatbot?

With WhatsApp becoming a standard communication platform for personal and professional usage, businesses and new ventures are looking forward to adding it as a significant marketing and business operating platform. Here is a detailed analysis on “why does your business need a WhatsApp chatbot?”

    submitted by /u/botpenguin1

  • Users of KUKI wanted for a study

    Hello all! I am excited🤩 to carry out a study for my PhD on chatbots, and specifically Kuki (Mitsuku). I am interested in participants aged 18-30 (autistic users also welcome) who have been chatting with Kuki for 3 months or more. If you are willing to participate, please contact me here or at my email address: [ax23@kent.ac.uk](mailto:ax23@kent.ac.uk)

    submitted by /u/annaksig

  • Alexa gains support for location-based reminders and routines

Alexa users can now customize the commands included in Alexa's routines. The new feature permitting users to build them out adds to the mix of…

  • Facebook Shares 100-Language Translation Model, First Without English Reliance

Facebook unveiled and open-sourced a language translation model this week that can translate between any two of 100 languages. The M2M-100 was…

  • Case Study | myTEXT

    App with chatbot as reading companion for delinquent teenagers

    Overview

    Time: September to December 2020

    Tasks: User research, experience strategy, information architecture, interaction design

    Tools: AdobeXD, Notion, Miro

    Team:

    • Martin — Product manager
    • Ramona — Product Manager
    • Christine — UX/UI Designer (LinkedIn)

    Objective

    How might we help teenagers finish their reading assignment with the least possible effort, while also igniting the joy of reading in them?

    Context

    In this project I worked for KonTEXT, a reading project for delinquent teenagers. Teens who are sentenced to reading a certain number of pages come to KonTEXT for supervision and guidance. They regularly meet with mentors who reflect with them on the book they read and on their lives. However, many teenagers struggle to finish their reading assignment. Therefore, KonTEXT wants to help the youths by giving them a reading companion on their phone in the shape of an Android app with a chatbot.

    User Interviews

    First, we needed to get to know our users to understand what was keeping them from finishing their reading assignments on time. So we conducted interviews with 6 people who had either recently finished their reading assignment or were still working on it.

    We wanted to find out about the problems they had with reading, as well as strategies they used to master the challenge and things that motivated them.

    That’s what the teenagers we interviewed said.

    Main takeaways from user interviews

    • 4 of 6 participants had no problems getting started with reading
    • 5 of 6 made some kind of plan for reading (by defining a time in their day where they would read or by setting a goal of a certain number of pages per day)
    • 3 of 6 said the motivation was not important, because they were forced to read and just had to accept it


    What helped these youths succeed was the fact that they actively organised their reading and that they accepted they had to do it. None of the youths we interviewed had major problems with reading, some of them even enjoyed reading in their free time before the assignment. We were surprised by that, because our team members who were also mentors for KonTEXT assured us that many of their mentees had grave problems.

    Mentor Interviews

    To understand if the teenagers we interviewed were just outliers, we wanted to talk to some youths who actually struggled with the program. But most of them were unwilling to speak to us, even when we offered a reward. The ones who agreed to meet with us didn’t show up for their appointments. So we decided to interview the mentors at KonTEXT instead.

    From the results we could see that the youths we interviewed had not been representative of our user group. Most teenagers actually disliked reading and were not at all motivated.

    Here’s how the teenagers we interviewed compare to the average.

    We wanted to find out how many of the mentees had problems with reading, what kinds of problems they had and which strategies and tools they used to succeed.

    Some quotes from mentors about the problems the teenagers face

    Main takeaways from mentor interviews

    • Motivation is often named as the biggest problem: Reading is perceived as a punishment for their crimes and so they reject it
    • Many teenagers lack the organisational skills to plan their day and their reading, which is exacerbated by stress
    • Since only 25% of teenagers had big problems with the assignment, we would take into account the other user group as well: People that don’t struggle a lot, but could use some extra motivation and organisation tips

    Personas

    To put a face to our research results, I created personas for the two user groups. I wanted us to keep in mind the needs of users struggling with the reading assignment and those who were (mostly) taking it in stride.

    Our two personas: Elias and Armin

    Main takeaways from personas

    • Motivation: Users need to see their progress and get encouragement to stay motivated.
    • Organization: Users need help with organizing their reading, so they can finish their assignment on time.
    • Strategies: Users need tips to approach reading more strategically, so they can read effectively.

    We understood that the app needed to give personalized tips to be helpful to both user groups. The level of support it offers needs to be adjustable. We decided to manage that in part with the chatbot, that would be able to give personalized advice.

    Navigation and main features

    We then utilized user stories, user flows and journey maps to define the features of the app.

    The main features of the app would be:

• A chatbot that motivates users and gives tips
• Reading tools that provide aid while reading
• Statistics that help track reading progress
• A reading plan to stay organised
• An activities section that allows users to dive deeper into the topics of their book

Site map and key screens of the app

    Reading plan

The initial idea for the reading plan was that it would inform users whether they had read the required number of pages each day. But in the usability test we discovered that all of our testers had a very different mental model of the reading plan. To them, it was supposed to be a calendar that lets you plan your reading. They saw a planning tool instead of the progress-tracking tool we had envisioned.

    So we decided to change the reading plan to match our users’ mental model.

    Main features of the reading plan

    The new reading plan tells users when they have their next appointment with their mentor and how many pages they have yet to read until then. They can also set days in the calendar to be reading days by simply tapping them. The app automatically calculates how many pages they have to read per day to reach their goal.

    Key takeaways

    • Show users how many pages they have to read per day so they can divide their assignment into manageable portions
    • Let users set a reminder for all reading days so they don’t forget to read and then have to read all the pages the day (or night) before their appointment

    Chatbot

    In the beginning, the chatbot mainly had the job of talking with the teenagers about their life and about the book they were reading as a peer. That changed a lot after the usability tests, where we asked our users for feedback on the chatbot.

    Key takeaways from the usability tests

• Users don't want to be asked about personal things by the chatbot
    • Users see the role of the chatbot as a mentor and guide, not as a friend

We redefined the role of the chatbot to be a coach for the teenagers. He is older than our users, so he is someone the teenagers can look up to. The chatbot can give personalized reading tips, encourage and motivate the users, and guide them through the app. The bot can also talk about his own experiences, because he is based on a real person.

    Persona of our chatbot Maximilian

We collaborated with a former criminal to create chatbot interactions based on his personality and experiences. The idea behind that is that the teenagers can relate to these experiences and reflect on their own lives through them. In the future, users will be able to choose between a handful of chatbots when they start using the app and select the one they can best relate to.

    Visual design

    For the visual design of the app, we chose shades of blue to go with the strong brand color red. We made a style guide and component library and also started creating a design system to prepare for the coming development phase.

    A selection of UI elements

    Retrospective

    What worked well:

    • Involving the whole design team in research (e.g. as note takers)
    • Facilitating synthesis workshops for research with the design team

What I would do differently:

    • Involve team members outside of the design team
    • Outline the strategy in the beginning of the process

    What I would do next:

    • Conduct another round of user testing to validate changes
    • Define the MVP and bring in the devs

    Thank you for reading through this case study!

    You can find more of my work on my portfolio website stefaniemue.com.



    Case Study | myTEXT was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.