Grinbox

Grinbox was an email enhancement tool that I created in early 2013 as my junior independent project at Princeton. The fundamental idea was to color code the inbox according to each message's content and its demands of the recipient. The system was composed of three main components. At its core, it used machine learning to distinguish between formal and informal emails, between positive and negative sentiments, and between hand-written and automatically generated emails. On top of this was a server that exposed a web API to retrieve an authorized user's email categorizations, and finally a client extension for Google Chrome that color coded each email in the user's Gmail inbox according to its categorization. The final system is visually summarized by this graphic I created for my independent work poster presentation.

The web interface side of this project was fairly straightforward and a good first experience building a browser extension. For the server, I used web.py to wrap my Python classifiers in a web API, making them accessible to the extension, and to handle OAuth authentication with Gmail so the system could access a user's email. I cranked out those parts in the last few weeks of the semester to make the web experience work.
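To give a flavor of how thin that server layer was, here is a minimal sketch of its shape in web.py - not the original code, and the classifiers module and endpoint name are stand-ins:

```python
import json

import web

# Hypothetical module standing in for the real classifiers; the actual
# server also handled the Gmail OAuth flow, which is omitted here.
from classifiers import classify_message

urls = ('/categorize', 'Categorize')
app = web.application(urls, globals())

class Categorize:
    def GET(self):
        # e.g. GET /categorize?text=...
        params = web.input(text='')
        web.header('Content-Type', 'application/json')
        return json.dumps({'label': classify_message(params.text)})

if __name__ == '__main__':
    app.run()
```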

The real meat of the project was in the iterative attempts to find a meaningful way to differentiate emails. This project was my first foray into trying to build a tool based on machine learning libraries, the underlying concepts of which I had just learned about in my artificial intelligence class.

Originally, I hoped to use unsupervised learning to cluster a user's email messages in a high dimensional topic space. Latent Dirichlet allocation (LDA) identifies topics in a text corpus, each represented as a weighted distribution over words. The original system identified the most important topics within the user's email corpus, then used these topics as the dimensions of the space in which naturally occurring message clusters were identified and labeled. The hope was that this would allow the user's email to be automatically categorized by the user's own topics - perhaps business email, the pickup soccer club, family emails, and Amazon receipts would each receive their own categorization.
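The mechanics looked roughly like the following sketch, written here with gensim for brevity rather than reproducing the original code, and assuming the emails are already tokenized:

```python
from gensim import corpora, models

# Assume `emails` is a list of tokenized messages, e.g.
# [['meeting', 'agenda', 'tomorrow'], ['order', 'shipped', 'tracking'], ...]
dictionary = corpora.Dictionary(emails)
corpus = [dictionary.doc2bow(tokens) for tokens in emails]

# Fit LDA: each topic is a weighted distribution over words.
lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary)

# Project each email into topic space as a sparse vector of
# (topic_id, weight) pairs - the coordinates used for clustering.
topic_vectors = [lda[bow] for bow in corpus]
```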

As I implemented it, it became clear that this approach had several issues. The first was simply the limitations of the clustering algorithms I worked with. I started with k-means, which demands a pre-specified number of clusters. Furthermore, there was no clear way to label the identified clusters in a way useful to the user - it was up to them to work out what each grouping meant. Second, I could not find a happy medium between using enough topics to make important distinctions and using so many that the space became too high-dimensional and the clusters too sparse.
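Concretely, the clustering step amounted to something like this scikit-learn sketch (again an approximation, continuing from the LDA sketch above), with comments marking where the two problems bite:

```python
from gensim.matutils import corpus2dense
from sklearn.cluster import KMeans

# Densify the sparse topic vectors from the LDA sketch above:
# one row per email, one column per topic.
X = corpus2dense(topic_vectors, num_terms=20).T

# Problem one: k-means forces us to choose the number of clusters
# up front, with no principled way to pick it per user.
kmeans = KMeans(n_clusters=6).fit(X)

# Problem two: the output is an anonymous cluster id per email,
# with no human-readable label to show the user.
print(kmeans.labels_)
```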

After deciding to change my approach entirely, I switched to supervised learning, training a collection of classifiers to do binary discrimination on the most important aspects of email. I settled on discriminating between emails auto-generated and written by hand, between positive and negative sentiments, and between formal and informal style. Though binary classification on three dimensions would imply eight options, for approachability I opted to use only four color labels - auto-generated, negative sentiment, formal, and informal.
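The collapse from three binary judgments to four labels is simple enough to show directly; the precedence order below illustrates the idea rather than reproducing the project's exact rule:

```python
def color_label(is_auto_generated, is_negative, is_formal):
    # Auto-generated mail gets its own label regardless of the other
    # two judgments, and negative sentiment trumps style (this
    # precedence is assumed, not taken from the original code).
    if is_auto_generated:
        return 'auto-generated'
    if is_negative:
        return 'negative sentiment'
    return 'formal' if is_formal else 'informal'
```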

Supervised learning necessitates a labeled data set, and as I had no labeled email set, I had to use non-email sources. After some experimentation, I settled on the Yelp academic dataset, training negative labels with 1-star and 2-star reviews and positive labels with 4-star and 5-star reviews. For formality, I trained with IRC chat logs as informal text and Reuters news reports as formal text. To identify personally written email, I found after experimentation that machine learning was unhelpful and that I got the best results by simply checking for various HTML components in the email.
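The HTML check was essentially a handful of substring tests; the exact markers I checked for are lost to time, so the set below is illustrative:

```python
# Markers typical of templated, machine-generated mail; treat this
# particular list as a guess at the original, not a record of it.
HTML_MARKERS = ('<table', '<img', 'style=', 'unsubscribe')

def looks_auto_generated(raw_body):
    body = raw_body.lower()
    return any(marker in body for marker in HTML_MARKERS)
```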

For the two classifiers I was training, I decided to go with Naive Bayes because my interviews at the time with Sift Science had convinced me that despite its simplicity, Naive Bayes was a production-ready algorithm. Furthermore, it allowed me to set a prior on the probability of each label, letting me ensure that negative sentiment was a rare enough occurrence to be a literal red flag for the user.
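In scikit-learn terms (not necessarily the library I used, but the idea carries over), the prior shows up as class_prior - the toy data and the particular 0.1/0.9 split here are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the Yelp-derived training set.
texts = ['great service, loved it', 'fantastic food and staff',
         'terrible, never again', 'awful experience, very rude']
labels = ['positive', 'positive', 'negative', 'negative']

# class_prior follows sorted class order (['negative', 'positive']),
# so 0.1 makes negative rare: it takes strong word evidence to flag
# an email as a red one.
model = make_pipeline(CountVectorizer(),
                      MultinomialNB(class_prior=[0.1, 0.9]))
model.fit(texts, labels)
print(model.predict(['the staff was so rude to us']))
```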

Finally, the project really impressed upon me the extent to which feature selection can make or break the effectiveness of machine learning. As Pedro Domingos puts it in A Few Useful Things to Know about Machine Learning, "At the end of the day, some machine learning projects succeed, and some fail. What makes the difference? Easily the most important factor is the features used." For something like the formality of text, finding signals that allow for probabilistic differentiation between classes is more of an art than a science. I eventually settled on testing for the presence of abbreviations, lax adherence to capitalization standards, emoticons, swear words, and misspellings. For sentiment analysis, I was unsuccessful with a more nuanced feature set and simply used a bag-of-words approach.
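For illustration, a few of those formality features looked something like the sketch below - the word list and regex are reconstructions, not the originals, and the swear-word and misspelling checks are omitted:

```python
import re

EMOTICON = re.compile(r'[:;]-?[)(DPp]')
ABBREVIATIONS = {'lol', 'brb', 'btw', 'imo', 'thx', 'u', 'ur'}

def formality_features(text):
    # Count signals of informal writing: emoticons, chat-style
    # abbreviations, and sentences that start without a capital.
    sentences = [s for s in re.split(r'[.!?]\s+', text) if s]
    return {
        'has_emoticon': bool(EMOTICON.search(text)),
        'abbreviation_count': sum(
            t.lower().strip('.,!?') in ABBREVIATIONS for t in text.split()),
        'lowercase_sentence_starts': sum(
            1 for s in sentences if s[0].islower()),
    }
```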

Grinbox was an experimental project, and between the many different components, I just managed to hack together an interesting solution. In no way is it tested or production-ready code - to be honest, it's a mess. Nonetheless, you can see the source code on GitHub - the classifiers and server here, and the client Chrome extension here. Still, my concept was validated when Google launched their tabbed inbox, which does something very similar, just a few weeks after I concluded the project.