A natural language generation language, intended for creating training data for intent parsing systems.
Nalgene generates pairs of sentences and grammar trees by a random (or guided) walk through a grammar file.
Tree: a nested list of tokens (an s-expression) generated alongside the sentence, e.g.
( %setDeviceState
( $device.name light )
( $device.state on ) )
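If you want to consume these trees in your own code, they can be read back into nested lists. The following is a minimal sketch, not part of nalgene itself: parse_tree is a hypothetical helper, and it assumes every parenthesis is surrounded by whitespace, as in the trees shown in this document.

```python
def parse_tree(text):
    """Parse a tree such as "( %a ( $b x y ) )" into nested Python lists."""
    stack = [[]]  # stack[0] collects the top-level trees
    for token in text.split():
        if token == "(":
            stack.append([])  # open a new sub-tree
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()  # close the sub-tree and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if len(stack[0]) != 1:
        raise ValueError("expected exactly one tree")
    return stack[0][0]


tree = parse_tree("( %setDeviceState ( $device.name light ) ( $device.state on ) )")
print(tree)
```

The first element of each list is the node token, and the remaining elements are either captured words or sub-trees.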
$ python generate.py [template.nlg] [entry] [--key=value] ...
By default, generation walks through the template tree from the entry node % and chooses phrases and values randomly:
$ python generate.py examples/iot.nlg
> if the temperature in minnesota is equal to 2 then please turn the office light off thanks
( %if
( %condition
( %currentWeather
( $location minnesota ) )
( $operator equal to )
( $number 2 ) )
( %setDeviceState
( $device.name office light )
( $device.state off ) ) )
You can choose an entry point to start generation from:
$ python generate.py examples/iot.nlg getWeather
> tell me what it's like in new york
( %getWeather
( $location new york ) )
You can also supply values from the command line (unspecified values will be randomly chosen):
$ python generate.py examples/iot.nlg getWeather --location tokyo
> what is the weather in tokyo ?
( %getWeather
( $location tokyo ) )
Or from a JSON file:
$ cat command.json
{"entry": "%setDeviceState", "values": {"$device.state": "off", "$device.name": "office light"}}
$ cat command.json | python generate.py examples/iot.nlg
> please turn off the office light
( %setDeviceState
( $device.state off )
( $device.name office light ) )
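A command file like the one above can also be produced programmatically. This sketch only builds the JSON text; make_command is a hypothetical helper, and the "entry" and "values" keys are taken from the command.json example above.

```python
import json


def make_command(entry, values):
    """Serialize an entry point and fixed values in the shape of command.json."""
    return json.dumps({"entry": entry, "values": values})


command = make_command(
    "%setDeviceState",
    {"$device.state": "off", "$device.name": "office light"},
)
print(command)
```

The printed text can be saved to a file and piped to generate.py in the same way as command.json.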
A .nlg nalgene grammar file is a set of sections separated by a blank line. Every section takes this shape:
node_name
    token sequence 1
    token sequence 2
The indented lines under a node are the node's possible token sequences. Each token in a sequence is either a %phrase node, a $value node, a @ref node, a ~synonym node, or a regular word. Each token is added to the output sentence and/or tree during generation, depending on the type.
A standard .nlg file starts with a start phrase %, which is the default entry point for the generator. The generator may also use a specific entry point.
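The section layout described above can be read with a few lines of Python. This is a minimal sketch and not the parser used by generate.py: parse_grammar is a hypothetical helper, and it assumes token sequences are indented with spaces or tabs and that tokens are separated by whitespace.

```python
def parse_grammar(text):
    """Map each node name to its list of token sequences (lists of tokens)."""
    grammar = {}
    node = None
    for line in text.splitlines():
        if not line.strip():
            node = None  # a blank line ends the current section
            continue
        if line[0].isspace():
            if node is None:
                raise ValueError("token sequence outside a section: %r" % line)
            grammar[node].append(line.split())  # indented line: one token sequence
        else:
            node = line.strip()  # unindented line: start of a new section
            grammar.setdefault(node, [])
    return grammar


example = """\
%greeting
    hey there
    hi

%farewell
    goodbye
    bye
"""
grammar = parse_grammar(example)
print(grammar["%greeting"])
```

Node names are stored exactly as written, so any suffix on a node name is kept as part of the dictionary key rather than interpreted.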
A phrase (%phrase) is a general set of token sequences. A phrase is potentially recursive, using tokens which represent other phrases (even itself). Each phrase defines one or more possible sequences.
The regular words in a phrase are ignored in the output tree. This makes them useful for defining higher level grammar for the same intent - for example, for different word orders ("turn on the light" vs "turn the light on").
Using this grammar:
%
    %greeting
    %farewell
    %greeting and %farewell

%greeting
    hey there
    hi

%farewell
    goodbye
    bye
The generator might output:
> hey there and bye
( %
( %greeting )
( %farewell ) )
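Here's how the generator arrived at this specific sentence and tree pair:
1. Start at the start node %, with an empty output sentence "" and tree ( % )
2. Choose one of its token sequences at random, in this case %greeting and %farewell
3. The first token is the phrase token %greeting, so
    - add a sub-tree ( %greeting ) to the parent tree
    - look up the token sequences for %greeting
    - choose one at random, in this case hey there
    - now the output sentence is "hey there" and the parse tree is ( % ( %greeting ) )
4. The next token is the regular word "and", so add it to the output sentence
5. The next token is the phrase token %farewell, so
    - add a sub-tree ( %farewell ) to the parent tree
    - look up the token sequences for %farewell
    - choose one at random, in this case bye
6. The output sentence is now "hey there and bye" and the parse tree is ( % ( %greeting ) ( %farewell ) )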
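The walk just described can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in generate.py: the grammar is written out by hand as a dictionary, walk and format_tree are hypothetical helpers, and only phrase tokens and regular words are handled.

```python
import random

# The greeting/farewell grammar from above, as node -> list of token sequences.
GRAMMAR = {
    "%": [["%greeting"], ["%farewell"], ["%greeting", "and", "%farewell"]],
    "%greeting": [["hey", "there"], ["hi"]],
    "%farewell": [["goodbye"], ["bye"]],
}


def walk(node, grammar, rng):
    """Return (words, tree) for one random expansion of a phrase node."""
    words, tree = [], [node]
    for token in rng.choice(grammar[node]):
        if token.startswith("%"):
            # Phrase token: recurse, then extend both the sentence and the tree.
            sub_words, sub_tree = walk(token, grammar, rng)
            words.extend(sub_words)
            tree.append(sub_tree)
        else:
            # Regular word: goes to the sentence only.
            words.append(token)
    return words, tree


def format_tree(tree):
    """Render nested lists in the parenthesized style used in this document."""
    parts = [format_tree(t) if isinstance(t, list) else t for t in tree]
    return "( " + " ".join(parts) + " )"


words, tree = walk("%", GRAMMAR, random.Random())
print(">", " ".join(words))
print(format_tree(tree))
```

Passing a seeded random.Random instance makes a run repeatable, which can help when debugging a grammar.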
Sometimes you need to capture the specific words in a sentence, for example to capture the location in a sentence like "how is the weather in boston". Values, marked with a dollar sign as $value, are a type of leaf node that captures the regular word tokens in the tree.
%getWeather
    what is the weather in $location
    how is the $location weather

$location
    boston
    san francisco
    tokyo
> what is the weather in san francisco
( %getWeather
( $location san francisco ) )
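The difference between a value and a regular word is that the value's words end up in both the sentence and the tree. A minimal sketch of that behaviour follows. It is not the generate.py implementation: expand is a hypothetical helper that handles a single phrase with value tokens, and the optional values argument is an assumption that mimics supplying a value from the command line.

```python
import random

GRAMMAR = {
    "%getWeather": [
        ["what", "is", "the", "weather", "in", "$location"],
        ["how", "is", "the", "$location", "weather"],
    ],
    "$location": [["boston"], ["san", "francisco"], ["tokyo"]],
}


def expand(node, grammar, rng, values=None):
    """Expand one phrase; value tokens are captured in the tree with their words."""
    values = values or {}
    words, tree = [], [node]
    for token in rng.choice(grammar[node]):
        if token.startswith("$"):
            if token in values:
                chosen = values[token].split()  # value supplied by the caller
            else:
                chosen = rng.choice(grammar[token])  # otherwise pick one at random
            words.extend(chosen)  # the words appear in the sentence...
            tree.append([token] + chosen)  # ...and are captured under the value node
        else:
            words.append(token)  # regular words never reach the tree
    return words, tree


words, tree = expand("%getWeather", GRAMMAR, random.Random(), {"$location": "tokyo"})
print(">", " ".join(words))
print(tree)
```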
TODO: Better name for this
As an alternative to the freeform $value, there is a @ref leaf node which references a specific value without capturing the words beneath it. This allows you to reference a specific entity, e.g. a specific room or device name, with multiple expansions.
%turnOnLight
    turn the %light on

%light
    @office_light
    @living_room_light

@office_light
    office light
    light in the office

@living_room_light
    light in the den
    light in the living room
    living room light
Synonyms, marked ~synonym, are output only on the sentence side, and are useful for supplying word variations.
%good
    ~exclamation this is ~so ~good

~exclamation
    wow
    omg

~so
    so
    very
    extremely

~good
    good
    great
    wonderful
> wow this is extremely great
( %good )
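In code terms, a synonym contributes words to the sentence and nothing to the tree. The sketch below illustrates this for the grammar above. It is an assumption about the behaviour described here, not the generate.py source, and expand is a hypothetical helper that handles only synonyms and regular words.

```python
import random

GRAMMAR = {
    "%good": [["~exclamation", "this", "is", "~so", "~good"]],
    "~exclamation": [["wow"], ["omg"]],
    "~so": [["so"], ["very"], ["extremely"]],
    "~good": [["good"], ["great"], ["wonderful"]],
}


def expand(node, grammar, rng):
    """Expand one phrase whose tokens are synonyms or regular words."""
    words, tree = [], [node]
    for token in rng.choice(grammar[node]):
        if token.startswith("~"):
            words.extend(rng.choice(grammar[token]))  # sentence side only
        else:
            words.append(token)
    return words, tree  # the tree never grows beyond the phrase node


words, tree = expand("%good", GRAMMAR, random.Random())
print(">", " ".join(words))
print(tree)
```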
Tokens with a ? at the end are optional: each is included only 50% of the time.
%findFood
    ~find $price? $food ~near $location
> find me sushi in san francisco
( %
( %findFood
( $food sushi )
( $location san francisco ) ) )
> tell me the cheap fried chicken around tokyo
( %
( %findFood
( $price cheap )
( $food fried chicken )
( $location tokyo ) ) )
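The optional marker amounts to a coin flip per token. A minimal sketch, assuming the decision is made independently for each optional token (keep_tokens is a hypothetical helper, not a function from generate.py):

```python
import random


def keep_tokens(sequence, rng):
    """Keep each token ending in '?' with probability 0.5, stripping the marker."""
    kept = []
    for token in sequence:
        if token.endswith("?"):
            if rng.random() < 0.5:
                kept.append(token[:-1])  # optional token kept, without its '?'
        else:
            kept.append(token)  # non-optional tokens are always kept
    return kept


sequence = "~find $price? $food ~near $location".split()
print(keep_tokens(sequence, random.Random()))
```

The tokens that survive are then expanded as usual, which is why $price appears in only one of the two trees above.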
Tokens with a = at the end are called "passthrough" tokens and will not be included in the output tree, but their children will be. The = is written on the node's own definition line, at the root level of the file, rather than within a token sequence.
%
    ~please? %command

%command=
    %getTime
    %getFact

%getTime
    what time is it
    what is the time

%getFact=
    %getLocationFact
    %getPersonFact
    %getPersonalFact
In this case, whenever the %command token is encountered, whatever its children output will be directly added to the tree (as opposed to prefixed with the %command token), so it will be output as %getTime or %getFact. But in fact %getFact is another passthrough token, so the value of its children will be passed all the way up the tree.
> what is the time
( %
( %getTime ) )
> pretty please what is the population of tokyo
( %
( %getLocationFact
( $location_fact population )
( $location tokyo ) ) )
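One way to picture passthrough is that a node returns a list of sub-trees to its parent: a normal node returns a single sub-tree headed by its own token, while a passthrough node returns its children directly. The sketch below follows that idea. It is an assumption about the mechanism, not the generate.py source. The grammar is a cut-down version of the one above, and the single %getLocationFact sequence is made up for illustration, with no values or synonyms.

```python
import random

GRAMMAR = {
    "%": [["%command"]],
    "%command=": [["%getTime"], ["%getFact"]],
    "%getTime": [["what", "time", "is", "it"], ["what", "is", "the", "time"]],
    "%getFact=": [["%getLocationFact"]],
    # Hypothetical expansion, only so that the example is self-contained.
    "%getLocationFact": [["what", "is", "the", "population", "of", "tokyo"]],
}


def expand(node, grammar, rng):
    """Return (words, subtrees); passthrough nodes hand their children upward."""
    passthrough = (node + "=") in grammar  # defined with a trailing '='?
    key = node + "=" if passthrough else node
    words, children = [], []
    for token in rng.choice(grammar[key]):
        if token.startswith("%"):
            sub_words, sub_trees = expand(token, grammar, rng)
            words.extend(sub_words)
            children.extend(sub_trees)
        else:
            words.append(token)
    if passthrough:
        return words, children  # omit this node, keep its children
    return words, [[node] + children]  # normal node: wrap the children


words, trees = expand("%", GRAMMAR, random.Random())
print(">", " ".join(words))
print(trees[0])
```

Because %command and %getFact are both passthrough here, %getLocationFact ends up directly under % even though it sits two levels below %command in the grammar.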