MiniCat

MiniCat is short for Mini Text Categorizer.

The goals of this tool are:

Setup

Virtual Environments

It is recommended, but not required, to use a virtual environment. Installing the dependencies listed under Requirements in a fresh virtual environment lets you run the sample without changing the global Python packages on your system.

There are two options for virtual environments:

Requirements

Python 2.7 is required.

pip install -r requirements.txt

Google Cloud setup

Set up a Google Cloud project and enable the following APIs:

Then create a Google Cloud Storage bucket. This is where all your model- and training-related data will be stored. For more information, check out the tutorials in the documentation pages.

Usage

Labeler

A simple terminal-based tool that allows document labeling for training, as well as label curation.

python main.py label --data_csv_file <filename.csv> \
                     --local_working_dir <MiniCat/data>

Trainer

Use the NL API and ML Engine to train a classifier using the text and labels prepared by the labeler.

python main.py train --local_working_dir <MiniCat/data> \
                     --version <version_number> \
                     --gcs_working_dir <gs://bucket_name/file_path> \
                     --vocab_size <number> \
                     --region <us-central1> \
                     --scale_tier

Quickstart

This tool can be used to classify different types of text data, such as emails, support tickets, movie reviews, and news topics.

Let's consider the case of emails.

Preparing data

Create a working directory emails in your home directory.

As an example, export your emails from Gmail into a mailbox file, then post-process it into the following CSV format.

Create a spreadsheet similar to:

.   file_path            text   labels
1   ~/emails/file1.txt          Important
2   ~/emails/file2.txt          Unimportant
3   ~/emails/file3.txt
4   ~/emails/file4.txt          Important
.
.

In this example, each email's text is stored in a separate file. A few seed labels are provided to partially label the set of emails; the remaining rows are left blank for the labeler.

The spreadsheet can also be in this format:

.   file_path   text                                   labels
1               You just won a prize for $5000 ...     Unimportant
2               Your friends Alice tagged you in ...   Important
3               Call #0000 and get a free Iphone ...
4               Signup today for holiday packages...   Important
.
.

Note: You can also use a mix of both text and file_path rows in the same spreadsheet.

Create the spreadsheet according to your requirements and save it in the working directory emails under the name emails.csv.

Environment Setup

Make sure Python 2.7 is installed. Follow the commands in the Virtual Environments section above. Clone the git repository and, from inside the directory, run:

pip install -r requirements.txt

Create a Google Cloud Platform project and set up billing and credentials. For information on how to do that, see steps 1, 2, 4, 5 and 6 on this page.

Set up the APIs by following the Google Cloud setup section above.

Create a Google Cloud Storage bucket emails and then create a directory under it called working_dir. (Bucket names are globally unique, so you may need to pick a different name and adjust the commands below accordingly.)

Labelling the Data

From the git-repo directory, run the following command:

python main.py label --data_csv_file ~/emails/emails.csv \
                     --local_working_dir ~/emails/

First, the tool will ask you to select a set of target labels:

Automatically detected labels :
Important
Unimportant
Enter a new label or enter 'd' for done :

Then the tool will allow you to label the text:

Id  Label
0   Important
1   Unimportant
Call #0000 and get a free Pixel today. Select between all google phones........

Enter the Label id ('d' for done, 's' to skip) : 1

The labelling workflow continues until you have labelled all the unlabelled text or you type 'd'.

The tool should exit at the end saying a new version 1 was created.

Training a Classifier

From the git-repo directory, run the following command:

python main.py train --local_working_dir ~/emails/  \
                     --version 1 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier

Note: Omit the --scale_tier flag if you do not want to use a GPU for training.

This will start training on the version 1 labels file that was created using the labeler tool. The tool will output a URL which can be used to view the job's progress. Wait for the job to finish and the results to be displayed. A file ~/emails/v1/predictions.csv will contain the predicted labels and prediction confidence for all your data points.

Iterate

At this point, if the results are unsatisfactory, label some more examples. The predictions in ~/emails/v1/predictions.csv can help guide labelling for the new version.

Run the command below to start labelling again:

python main.py label --data_csv_file ~/emails/v1/predictions.csv \
                     --local_working_dir ~/emails/

Note: The labeler is run on the predictions.csv file from version 1.

This leads to the same labelling process. After labelling some more examples, call the trainer module:

python main.py train --local_working_dir ~/emails/  \
                     --version 2 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier

Repeat the same process if the results are still unsatisfactory.

Possible Next Steps

Troubleshooting

A few errors that commonly occur, and their possible solutions:

Disclaimer

This is not an official Google product.