MiniCat is short for Mini Text Categorizer.
The goals of this tool are:
It is recommended, but not required, to use a virtual environment. Installing the dependencies in a fresh virtual environment lets you run the sample without changing the global Python packages on your system.

There are two options for creating the virtual environment.

Using `virtualenv`:

```
virtualenv MiniCat-env
source MiniCat-env/bin/activate
```

Using `conda`:

```
conda create --name MiniCat-env python=2.7
source activate MiniCat-env
```

Python 2.7 is required. Install the dependencies with:

```
pip install -r requirements.txt
```
Set up a Google Cloud project and enable the following APIs:

Then create a Google Cloud Storage bucket. This is where all your model and training-related data will be stored. For more information, check out the tutorials in the documentation pages.
A simple terminal-based tool that allows document labeling for training, as well as label curation.
```
python main.py label --data_csv_file <filename.csv> \
                     --local_working_dir <MiniCat/data>
```
- `data_csv_file`: path to your CSV, which should contain these three column headers:
  - `file_path`: full path of the file from which the text is to be read
  - `text`: text for the data point (only one of `file_path` or `text` is required)
  - `labels`: the class to which the text belongs (can be empty)
- `local_working_dir`: directory where all the CSV versions of your data and the prediction results will be located
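As a concrete illustration of this schema, here is a short Python sketch (Python 3 syntax for readability; the file name `sample.csv` and the example rows are invented for illustration) that writes a CSV the labeler can consume:

```python
import csv

# Illustrative rows: each needs either file_path or text; labels may be empty.
rows = [
    {"file_path": "", "text": "You just won a prize for $5000 ...", "labels": "Unimportant"},
    {"file_path": "", "text": "Your friend Alice tagged you in ...", "labels": "Important"},
    {"file_path": "", "text": "Call #0000 and get a free phone ...", "labels": ""},  # unlabelled
]

with open("sample.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["file_path", "text", "labels"])
    writer.writeheader()
    writer.writerows(rows)
```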
Use the NL API and ML Engine to train a classifier using the text and labels prepared by the labeler.
```
python main.py train --local_working_dir <MiniCat/data> \
                     --version <version_number> \
                     --gcs_working_dir <gs://bucket_name/file_path> \
                     --vocab_size <number> \
                     --region <us-central-1> \
                     --scale_tier
```
- `local_working_dir`: directory where all the CSV version files are located
- `version`: version number of the CSV to be used for training
- `gcs_working_dir`: path to the Google Cloud Storage directory to use for training and for storing the models and dataset (of the form `gs://bucket_name/some_path`)
- `vocab_size`: size of the vocabulary to use for training (default: 20000)
- `region`: region where training should occur; ideally the same region in which your Google Cloud Storage bucket is located (default: `us-central-1`)
- `scale_tier`: pass this flag to train with GPUs; the scale tier will be set to `BASIC_GPU`

This tool can be used to classify different types of text data, such as emails, support tickets, movie reviews, and news topics.
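To make the `vocab_size` flag concrete, here is a simplified sketch of what limiting the vocabulary means: only the most frequent tokens are kept. This is an illustration only, not MiniCat's actual implementation.

```python
from collections import Counter

def build_vocab(texts, vocab_size=20000):
    """Keep only the vocab_size most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [token for token, _ in counts.most_common(vocab_size)]
```

Tokens outside the vocabulary would typically be dropped or mapped to an unknown-word bucket before training.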
Let's consider the case of emails. Create a working directory `emails` in your home directory.
As an example, export your emails from Gmail into a mailbox file, then post-process it into the following CSV format. Create a spreadsheet similar to:
| . | file_path | text | labels |
|---|---|---|---|
| 1 | ~/emails/file1.txt | | Important |
| 2 | ~/emails/file2.txt | | Unimportant |
| 3 | ~/emails/file3.txt | | |
| 4 | ~/emails/file4.txt | | Important |
| ... | ... | ... | ... |
In this example each email's text is in a file. There are some seed labels that can be used to partially label the set of emails.
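The Gmail-export post-processing step mentioned above could look roughly like this sketch (Python 3 syntax; the paths and the helper name `mbox_to_csv` are illustrative, and real exports usually need more careful multipart and encoding handling):

```python
import csv
import mailbox
import os

def mbox_to_csv(mbox_path, out_dir, csv_path):
    """Write each message body to its own text file and reference it
    from the CSV via file_path; labels start out empty."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(csv_path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=["file_path", "text", "labels"])
        writer.writeheader()
        for i, msg in enumerate(mailbox.mbox(mbox_path)):
            body = msg.get_payload()
            if isinstance(body, list):  # multipart: keep the first part only
                body = body[0].get_payload()
            file_path = os.path.join(out_dir, "file%d.txt" % (i + 1))
            with open(file_path, "w") as out:
                out.write(str(body))
            writer.writerow({"file_path": file_path, "text": "", "labels": ""})
```

You would then label the resulting CSV with the labeler tool as described below.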
The spreadsheet can also be in this format:

| . | file_path | text | labels |
|---|---|---|---|
| 1 | | You just won a prize for $5000 ... | Unimportant |
| 2 | | Your friend Alice tagged you in ... | Important |
| 3 | | Call #0000 and get a free iPhone ... | |
| 4 | | Signup today for holiday packages... | Important |
| ... | ... | ... | ... |
Note: you could also use a mix of both `text` and `file_path` in the spreadsheet.

Create the spreadsheet according to your requirements and save it in the working directory `emails` under the name `emails.csv`.
Make sure Python 2.7 is installed. Follow the commands in the Virtual Environments Setup section. Fork the git repository and, from inside the directory, run `pip install -r requirements.txt`.
Create a Google Cloud Platform project and set up billing and credentials. For information on how to do that, see steps 1, 2, 4, 5 and 6 on this page. Set up the APIs by following the setup mentioned above.

Create a Google Cloud Storage bucket `emails` and then create a directory under it called `working_dir`.
From the git-repo directory, run the following command:

```
python main.py label --data_csv_file ~/emails/emails.csv \
                     --local_working_dir ~/emails/
```
First the tool will ask you to select a set of target labels:

```
Automatically detected labels :
Important
Unimportant

Enter a new label or enter 'd' for done :
```

Then the tool will allow you to label the text:

```
Id  Label
0   Important
1   Unimportant

Call #0000 and get a free Pixel today. Select between all google phones........

Enter the Label id ('d' for done, 's' to skip) : 1
```
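The interaction above can be sketched as a simple prompt loop (a toy illustration, not MiniCat's actual code; `prompt_fn` is a hypothetical hook so the loop can be driven non-interactively):

```python
def label_texts(texts, label_names, prompt_fn=input):
    """Show each text and record the chosen label id, until 'd' is entered."""
    assigned = {}
    for i, text in enumerate(texts):
        print(text)
        choice = prompt_fn("Enter the Label id ('d' for done, 's' to skip) : ")
        if choice == "d":   # done labelling
            break
        if choice == "s":   # skip this text
            continue
        assigned[i] = label_names[int(choice)]
    return assigned
```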
The labelling workflow continues until you have labelled all the unlabelled text or you type 'd'. The tool should exit at the end, saying that a new version 1 was created.
From the git-repo directory, run the following command:

```
python main.py train --local_working_dir ~/emails/ \
                     --version 1 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Note: omit the `scale_tier` flag if you do not want to use a GPU for training.
This will start training on the version 1 labels file that was created with the labeler tool. The tool will output a URL at which you can view the job's progress. Wait for the job to finish and the results to be displayed. There should be a file at `~/emails/v1/predictions.csv` containing the predicted labels and prediction confidence for all your data points.
At this point, if the results are unsatisfactory, label some more examples. The predictions in `~/emails/v1/predictions.csv` can help with labelling the new version of labels.
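For instance, a short Python sketch can surface the rows most worth relabelling, i.e. those the model is least sure about. The column names `predicted_label` and `confidence` here are assumptions; check the header of your `predictions.csv` before using anything like this:

```python
import csv

def low_confidence_rows(csv_path, threshold=0.7):
    """Return rows whose prediction confidence falls below threshold.

    Assumes the CSV has a 'confidence' column with values in [0, 1];
    adjust the column name to match the actual file.
    """
    with open(csv_path) as f:
        return [row for row in csv.DictReader(f)
                if float(row["confidence"]) < threshold]
```

Relabelling these uncertain examples first tends to improve the next training version the most.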
Run the command below to start labelling again:

```
python main.py label --data_csv_file ~/emails/v1/predictions.csv \
                     --local_working_dir ~/emails/
```

Note: we call the labeler on the `predictions.csv` file from version 1.
This will lead to the same labelling process. After labelling some more examples, call the trainer module:
```
python main.py train --local_working_dir ~/emails/ \
                     --version 2 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Repeat the same process if the results are still unsatisfactory.
A few errors that might commonly occur, and their possible solutions:

- `google.cloud.exceptions.TooManyRequests`: add a `time.sleep(0.1)` before making the NL API requests in `trainer.py`.
- `The provided GCS paths [] cannot be read by service account $srvacct`: `$srvacct` doesn't have write permissions to the GCS bucket. Run the following command to set the ACL permissions:

```
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/
```
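The `TooManyRequests` workaround can be sketched as a simple throttled loop (illustrative only; `classify_fn` is a stand-in for the NL API call made in `trainer.py`):

```python
import time

def throttled_map(classify_fn, documents, delay=0.1):
    """Apply classify_fn to each document, sleeping between calls so the
    NL API request quota is not exceeded."""
    results = []
    for doc in documents:
        results.append(classify_fn(doc))
        time.sleep(delay)  # back off between successive API requests
    return results
```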
This is not an official Google product.