Embulk output plugin to dump records as Apache Parquet files on S3.

- (`"%03d.%02d."`)
- (`"parquet"`)
- (`"uncompressed"`, `"snappy"`, `"gzip"`, `"lzo"`, `"brotli"`, `"lz4"` or `"zstd"`, default: `"uncompressed"`)
- (`"%Y-%m-%d %H:%M:%S.%6N %z"`)
- (`"UTC"`)
- (`timestamp-millis`, `timestamp-micros`, `timestamp-nanos`, `json`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`) (string, optional)
- (`timestamp-millis`, `timestamp-micros`, `timestamp-nanos`, `json`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`) (string, optional)
- (`"date"`, `"decimal"`, `"int"`, `"json"`, `"time"`, `"timestamp"`) (string, required)
- `"int"` logical type (Allowed bit width values are `8`, `16`, `32`, `64`). (int, default: `64`)
- `"int"` logical type (boolean, default: `true`)
- `"decimal"` logical type (int, default: `0`)
- `"decimal"` logical type (int, default: `0`)
- (`true`)
- `"time"` or `"timestamp"` logical type (Allowed values are `"MILLIS"`, `"MICROS"`, `"NANOS"`)
- (`private`)
- (`134217728` (128MB))
- (`1048576` (1MB))
- (`8388608` (8MB))
- (`true`)
- **auth_method**: name of the mechanism used to authenticate requests (`"basic"`, `"env"`, `"instance"`, `"profile"`, `"properties"`, `"anonymous"`, `"session"`, `"assume_role"`, `"web_identity_token"`, default: `"default"`)
- `"basic"`: uses **access_key_id** and **secret_access_key** to authenticate.
- `"env"`: uses the `AWS_ACCESS_KEY_ID` (or `AWS_ACCESS_KEY`) and `AWS_SECRET_KEY` (or `AWS_SECRET_ACCESS_KEY`) environment variables.
- `"instance"`: uses the EC2 instance profile or the attached ECS task role.
- `"profile"`: uses credentials written in a file. The format of the file is as follows, where `[...]` is the name of a profile.

  ```
  [default]
  aws_access_key_id=YOUR_ACCESS_KEY_ID
  aws_secret_access_key=YOUR_SECRET_ACCESS_KEY

  [profile2]
  ...
  ```

- `"properties"`: uses aws.accessKeyId and aws.secretKey Java system properties.
- `"anonymous"`: uses anonymous access. This auth method can access only public files.
- `"session"`: uses temporary-generated **access_key_id**, **secret_access_key** and **session_token**.
- `"assume_role"`: uses temporary-generated credentials by assuming **role_arn** role.
- `"web_identity_token"`: uses temporary-generated credentials by assuming **role_arn** role with web identity.
- `"default"`: uses AWS SDK's default strategy to look up available credentials from runtime environment. This method behaves like the combination of the following methods.
1. `"env"`
1. `"properties"`
1. `"profile"`
1. `"instance"`

- for **auth_method** `"profile"`. (string, default: given by the `AWS_CREDENTIAL_PROFILES_FILE` environment variable, or `~/.aws/credentials`)
- for **auth_method** `"profile"`. (string, default: `"default"`)
- for **auth_method** `"basic"` or `"session"`. (string, optional)
- for **auth_method** `"basic"` or `"session"`. (string, optional)
- for **auth_method** `"session"`. (string, optional)
- for **auth_method** `"assume_role"` or `"web_identity_token"`. (string, optional)
- for **auth_method** `"assume_role"` or `"web_identity_token"`. (string, optional)
- for **auth_method** `"assume_role"`. (string, optional)
- for **auth_method** `"assume_role"`. (int, optional)
- for **auth_method** `"web_identity_token"`. (string, optional)
- for **auth_method** `"assume_role"`. (string, optional)
- **catalog**: Register a table if this option is specified (optional)
  - **catalog_id**: Glue Data Catalog ID, if you use a catalog other than the account/region default catalog. (string, optional)
  - **database**: The name of the database (string, required)
  - **table**: The name of the table (string, required)
  - **column_options**: key-value pairs where the key is a column name and the value is the options for that column. (string to options map, default: `{}`)
    - **type**: type of the column when this plugin creates new tables (e.g. `string`, `bigint`) (string, default: depends on the input embulk column type or the parquet logical type; see the tables below)

| embulk column type | glue data type |
|---|---|
| long | bigint |
| boolean | boolean |
| double | double |
| string | string |
| timestamp | string |
| json | string |

| parquet converted type | glue data type | note |
|---|---|---|
| timestamp-millis | timestamp | |
| timestamp-micros | long | Glue cannot recognize timestamp-micros. |
| timestamp-nanos | long | Glue cannot recognize timestamp-nanos. |
| int8 | tinyint | |
| int16 | smallint | |
| int32 | int | |
| int64 | bigint | |
| uint8 | smallint | Glue tinyint has a minimum value of -2^7 and a maximum value of 2^7-1. |
| uint16 | int | Glue smallint has a minimum value of -2^15 and a maximum value of 2^15-1. |
| uint32 | bigint | Glue int has a minimum value of -2^31 and a maximum value of 2^31-1. |
| uint64 | ConfigException | Glue bigint supports only 64-bit signed integers. |
| json | string | |

  - **operation_if_exists**: operation to perform if the table already exists. Available operations are `"delete"` and `"skip"` (string, default: `"delete"`)
- (`"https"`)
- (`boolean`, `long`, `double`, `string`, `timestamp`, `json`), and values are configuration with the following parameters (optional)
- (`timestamp-millis`, `timestamp-micros`, `timestamp-nanos`, `json`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`) (string, optional)
- (`timestamp-millis`, `timestamp-micros`, `timestamp-nanos`, `json`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`) (string, optional)
- (`"date"`, `"decimal"`, `"int"`, `"json"`, `"time"`, `"timestamp"`) (string, required)
- `"int"` logical type (Allowed bit width values are `8`, `16`, `32`, `64`). (int, default: `64`)
- `"int"` logical type (boolean, default: `true`)
- `"decimal"` logical type (int, default: `0`)
- `"decimal"` logical type (int, default: `0`)
- (`true`)
- `"time"` or `"timestamp"` logical type (Allowed values are `"MILLIS"`, `"MICROS"`, `"NANOS"`)

```yaml
out:
  type: s3_parquet
  bucket: my-bucket
  path_prefix: path/to/my-obj.
  file_ext: snappy.parquet
  compression_codec: snappy
  default_timezone: Asia/Tokyo
  canned_acl: bucket-owner-full-control
```
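
The authentication and catalog options described above can be combined in one `out:` section. The following is a minimal sketch, not a tested configuration: the bucket, database, table, and column names are hypothetical placeholders, and the nesting of `database`, `table`, `column_options`, and `operation_if_exists` under `catalog` is an assumption based on how those options are listed.

```yaml
out:
  type: s3_parquet
  bucket: my-bucket                  # hypothetical bucket name
  path_prefix: path/to/my-obj.
  compression_codec: snappy
  # "basic" authenticates with the two keys given directly in the config.
  auth_method: basic
  access_key_id: YOUR_ACCESS_KEY_ID
  secret_access_key: YOUR_SECRET_ACCESS_KEY
  # Assumed nesting: the options below belong to the catalog section.
  catalog:
    database: my_database            # hypothetical database name
    table: my_table                  # hypothetical table name
    column_options:
      id: {type: bigint}             # hypothetical column; overrides the default Glue type
    operation_if_exists: skip        # leave an existing table alone instead of the default "delete"
```

With `auth_method: session`, a `session_token` would be supplied alongside the two keys; with the `"default"` method none of the key options should be needed, since credentials are looked up from the runtime environment.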

```shell
$ ./run_s3_local.sh
$ ./example/prepare_s3_bucket.sh
$ ./gradlew gem
$ embulk run example/config.yml -Ibuild/gemContents/lib
```

```shell
$ ./run_s3_local.sh
$ ./gradlew scalatest
```

```shell
$ ./gradlew gem --write-locks  # -t to watch for file changes and rebuild continuously
```

Fix `build.gradle`, then:

```shell
$ ./gradlew gemPush
```