hive-third-functions

Author Build Status Documentation Status Documentation Status Release Stars

Introduction

Some useful custom hive udf functions, especial array and json functions.

Note: hive-third-functions support hive-0.11.0 or higher.

Build

1. install dependency

Now, jdo2-api-2.3-ec.jar not available in the maven central repository, so we have to manually install it into our local maven repository.

wget http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar -O ~/jdo2-api-2.3-ec.jar
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api -Dversion=2.3-ec -Dpackaging=jar -Dfile=~/jdo2-api-2.3-ec.jar

2. mvn package

cd ${project_home}
mvn clean package

If you want to skip unit tests, please run:

cd ${project_home}
mvn clean package -DskipTests

It will generate hive-third-functions-${version}-shaded.jar in target directory.

You can also directly download file from release page.

current latest version is 2.2.1

Maven

Now, I had already release hive-third-functions to maven repositories. To add a dependency on hive-third-functions using Maven, use the following:

<dependency>
  <groupId>com.github.aaronshan</groupId>
  <artifactId>hive-third-functions</artifactId>
  <version>2.2.1</version>
</dependency>

Functions

1. string functions

function description
pinyin(string) -> string convert chinese to pinyin
md5(string) -> string md5 hash
sha256(string) -> string sha256 hash
codepoint(string) -> integer Returns the Unicode code point of the only character of string.
hamming_distance(string1, string2) -> bigint Returns the Hamming distance of string1 and string2.
levenshtein_distance(string1, string2) -> bigint Returns the Levenshtein edit distance of string1 and string2.
normalize(string, form) -> varchar Transforms string with the specified normalization form. form must be be one of the following keywords: Normalize Form Description
strpos(string, substring) -> bigint Returns the starting position of the first instance of substring in string. Positions start with 1. If not found, 0 is returned.
split_to_map(string, entryDelimiter, keyValueDelimiter) -> map<varchar, varchar> Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs. keyValueDelimiter splits each pair into key and value.
split_to_multimap(string, entryDelimiter, keyValueDelimiter) -> map(varchar, array(varchar)) Splits string by entryDelimiter and keyValueDelimiter and returns a map containing an array of values for each unique key. entryDelimiter splits string into key-value pairs. keyValueDelimiter splits each pair into key and value. The values for each key will be in the same order as they appeared in string.

Normalize Form Description

Form Description
NFD Canonical Decomposition
NFC Canonical Decomposition, followed by Canonical Composition
NFKD Compatibility Decomposition
NFKC Compatibility Decomposition, followed by Canonical Composition

2. array functions

function description
array_contains(array<E>, E) -> boolean whether array contains value or not.
array_equals(array<E>, array<E>) -> boolean whether two array equals or not.
array_intersect(array, array) -> array returns the two array's intersection, without duplicates.
array_max(array<E>) -> E returns the maximum value of input array.
array_min(array<E>) -> E returns the minimum value of input array.
array_join(array, delimiter, null_replacement) -> string concatenates the elements of the given array using the delimiter and an optional null_replacement to replace nulls.
array_distinct(array) -> array remove duplicate values from the array.
array_position(array<E>, E) -> long returns the position of the first occurrence of the element in array (or 0 if not found).
array_remove(array<E>, E) -> array remove all elements that equal element from array.
array_reverse(array) -> array reverse the array element.
array_sort(array) -> array sorts and returns the array. The elements of array must be orderable.
array_concat(array, array) -> array concatenates two arrays.
array_value_count(array<E>, E) -> long count array's element number that element value equals given value.
array_slice(array, start, length) -> array subsets array starting from index start (or starting from the end if start is negative) with a length of length.
array_element_at(array<E>, index) -> E returns element of array at given index. If index < 0, element_at accesses elements from the last to the first.
array_shuffle(array) -> array Generate a random permutation of the given array x.
sequence(start, end) -> array Generate a sequence of integers from start to stop.
sequence(start, end, step) -> array Generate a sequence of integers from start to stop, incrementing by step.
sequence(start_date_string, end_data_string, step) -> array Generate a sequence of date string from start to stop, incrementing by step.
array_value_count(array<E>, E) -> long count array's element number that element value equals given value..

3. map functions

function description
map_build(x<K>, y<V>) -> map<K, V> returns a map created using the given key/value arrays.
map_concat(x<K, V>, y<K, V>) -> map<K,V> returns the union of two maps. If a key is found in both x and y, that key’s value in the resulting map comes from y.
map_element_at(map<K, V>, key) -> V returns value for given key, or NULL if the key is not contained in the map.
map_equals(x<K, V>, y<K, V>) -> boolean whether map x equals with map y or not.

4. date functions

function description
day_of_week(date_string | date) -> int day of week,if monday,return 1, sunday return 7, error return null.
day_of_year(date_string | date) -> int day of year. The value ranges from 1 to 366.
zodiac_en(date_string | date) -> string convert date to zodiac
zodiac_cn(date_string | date) -> string convert date to zodiac chinese
type_of_day(date_string | date) -> string for chinese. 获取日期的类型(1: 法定节假日, 2: 正常周末, 3: 正常工作日 4:攒假的工作日),错误返回-1.

5. json functions

function description
json_array_get(json, jsonPath) -> array(varchar) returns the element at the specified index into the json_array. The index is zero-based.
json_array_length(json, jsonPath) -> array(varchar) returns the array length of json (a string containing a JSON array).
json_array_extract(json, jsonPath) -> array(varchar) extract json array by given jsonPath.
json_array_extract_scalar(json, jsonPath) -> array(varchar) like json_array_extract, but returns the result value as a string (as opposed to being encoded as JSON).
json_extract(json, jsonPath) -> array(varchar) extract json by given jsonPath.
json_extract_scalar(json, jsonPath) -> array(varchar) like json_extract, but returns the result value as a string (as opposed to being encoded as JSON).
json_size(json, jsonPath) -> array(varchar) like json_extract, but returns the size of the value. For objects or arrays, the size is the number of members, and the size of a scalar value is zero.

6. bitwise functions

function description
bit_count(x, bits) -> bigint count the number of bits set in x (treated as bits-bit signed integer) in 2’s complement representation
bitwise_and(x, y) -> bigint returns the bitwise AND of x and y in 2’s complement arithmetic.
bitwise_not(x) -> bigint returns the bitwise NOT of x in 2’s complement arithmetic.
bitwise_or(x, y) -> bigint returns the bitwise OR of x and y in 2’s complement arithmetic.
bitwise_xor(x, y) -> bigint returns the bitwise XOR of x and y in 2’s complement arithmetic.

7. china id card functions

function description
id_card_province(string) -> string get user's province
id_card_city(string) -> string get user's city
id_card_area(string) -> string get user's area
id_card_birthday(string) -> string get user's birthday
id_card_gender(string) -> string get user's gender
is_valid_id_card(string) -> boolean determine is valid china id card No.
id_card_info(string) -> json get china id card info. include province, city, area etc.

8. geographic functions

function description
wgs_distance(double lat1, double lng1, double lat2, double lng2) -> double calculate WGS84 coordinate distance, in meters.
gcj_to_bd(double,double) -> json GCJ-02(火星坐标系) convert to BD-09(百度坐标系), 谷歌、高德——>百度
bd_to_gcj(double,double) -> json BD-09(百度坐标系) convert to GCJ-02(火星坐标系), 百度——>谷歌、高德
wgs_to_gcj(double,double) -> json WGS84(地球坐标系) convert to GCJ02(火星坐标系)
gcj_to_wgs(double,double) -> json GCJ02(火星坐标系) convert to GPS84(地球坐标系), output coordinate WGS-84 accuracy within 1 to 2 meters.
gcj_extract_wgs(double,double) -> json GCJ02(火星坐标系) convert to GPS84, output coordinate WGS-84 accuracy within 0.5 meters. but compute cost more time than gcj_to_wgs.

关于互联网地图坐标系的说明见: 当前互联网地图的坐标系现状

9. url functions

function description
url_encode(value) -> string escapes value by encoding it so that it can be safely included in URL query parameter names and values
url_decode(value) -> string unescape the URL encoded value. This function is the inverse of url_encode.

10. math functions

function description
infinity() -> double Returns the constant representing positive infinity.
is_finite(x) -> boolean Determine if x is finite.
is_infinite(x) -> boolean Determine if x is infinite.
is_nan(x) -> boolean Determine if x is not-a-number.
nan() -> double Returns the constant representing not-a-number.
from_base(string, radix) -> bigint Returns the value of string interpreted as a base-radix number.
to_base(x, radix) -> varchar Returns the base-radix representation of x.
cosine_similarity(x, y) -> double Returns the cosine similarity between the sparse vectors x and y
inverse_normal_cdf(mean, sd, p) -> double Compute the inverse of the Normal cdf with given mean and standard deviation (sd) for the cumulative probability (p): P(N < n). The mean must be a real value and the standard deviation must be a real and positive value. The probability p must lie on the interval (0, 1).
normal_cdf(mean, sd, v) -> double Compute the Normal cdf with given mean and standard deviation (sd): P(N < v; mean, sd). The mean and value v must be real values and the standard deviation must be a real and positive value.

11. regexp functions

function description
regexp_like(string, pattern) -> boolean Evaluates the regular expression pattern and determines if it is contained within string.
regexp_extract_all(string, pattern) -> array(varchar) Returns the substring(s) matched by the regular expression pattern in string.
regexp_extract(string, pattern) -> varchar Returns the first substring matched by the regular expression pattern in string.
regexp_replace(string, pattern) -> varchar Removes every instance of the substring matched by the regular expression pattern from string.
regexp_replace(string, pattern, replacement) -> varchar Replaces every instance of the substring matched by the regular expression pattern in string with replacement.

Use

Put these statements into ${HOME}/.hiverc or exec its on hive cli env.

add jar ${jar_location_dir}/hive-third-functions-${version}-shaded.jar
create temporary function array_contains as 'com.github.aaronshan.functions.array.UDFArrayContains';
create temporary function array_equals as 'com.github.aaronshan.functions.array.UDFArrayEquals';
create temporary function array_intersect as 'com.github.aaronshan.functions.array.UDFArrayIntersect';
create temporary function array_max as 'com.github.aaronshan.functions.array.UDFArrayMax';
create temporary function array_min as 'com.github.aaronshan.functions.array.UDFArrayMin';
create temporary function array_join as 'com.github.aaronshan.functions.array.UDFArrayJoin';
create temporary function array_distinct as 'com.github.aaronshan.functions.array.UDFArrayDistinct';
create temporary function array_position as 'com.github.aaronshan.functions.array.UDFArrayPosition';
create temporary function array_remove as 'com.github.aaronshan.functions.array.UDFArrayRemove';
create temporary function array_reverse as 'com.github.aaronshan.functions.array.UDFArrayReverse';
create temporary function array_sort as 'com.github.aaronshan.functions.array.UDFArraySort';
create temporary function array_concat as 'com.github.aaronshan.functions.array.UDFArrayConcat';
create temporary function array_value_count as 'com.github.aaronshan.functions.array.UDFArrayValueCount';
create temporary function array_slice as 'com.github.aaronshan.functions.array.UDFArraySlice';
create temporary function array_element_at as 'com.github.aaronshan.functions.array.UDFArrayElementAt';
create temporary function array_shuffle as 'com.github.aaronshan.functions.array.UDFArrayShuffle';
create temporary function sequence as 'com.github.aaronshan.functions.array.UDFSequence';
create temporary function array_value_count as 'com.github.aaronshan.functions.array.UDFArrayValueCount';
create temporary function bit_count as 'com.github.aaronshan.functions.bitwise.UDFBitCount';
create temporary function bitwise_and as 'com.github.aaronshan.functions.bitwise.UDFBitwiseAnd';
create temporary function bitwise_not as 'com.github.aaronshan.functions.bitwise.UDFBitwiseNot';
create temporary function bitwise_or as 'com.github.aaronshan.functions.bitwise.UDFBitwiseOr';
create temporary function bitwise_xor as 'com.github.aaronshan.functions.bitwise.UDFBitwiseXor';
create temporary function map_build as 'com.github.aaronshan.functions.map.UDFMapBuild';
create temporary function map_concat as 'com.github.aaronshan.functions.map.UDFMapConcat';
create temporary function map_element_at as 'com.github.aaronshan.functions.map.UDFMapElementAt';
create temporary function map_equals as 'com.github.aaronshan.functions.map.UDFMapEquals';
create temporary function day_of_week as 'com.github.aaronshan.functions.date.UDFDayOfWeek';
create temporary function day_of_year as 'com.github.aaronshan.functions.date.UDFDayOfYear';
create temporary function type_of_day as 'com.github.aaronshan.functions.date.UDFTypeOfDay'; 
create temporary function zodiac_cn as 'com.github.aaronshan.functions.date.UDFZodiacSignCn';
create temporary function zodiac_en as 'com.github.aaronshan.functions.date.UDFZodiacSignEn';
create temporary function pinyin as 'com.github.aaronshan.functions.string.UDFChineseToPinYin';
create temporary function md5 as 'com.github.aaronshan.functions.string.UDFMd5';
create temporary function sha256 as 'com.github.aaronshan.functions.string.UDFSha256';
create temporary function codepoint as 'com.github.aaronshan.functions.string.UDFCodePoint';
create temporary function hamming_distance as 'com.github.aaronshan.functions.string.UDFStringHammingDistance';
create temporary function levenshtein_distance as 'com.github.aaronshan.functions.string.UDFStringLevenshteinDistance';
create temporary function normalize as 'com.github.aaronshan.functions.string.UDFStringNormalize';
create temporary function strpos as 'com.github.aaronshan.functions.string.UDFStringPosition';
create temporary function split_to_map as 'com.github.aaronshan.functions.string.UDFStringSplitToMap';
create temporary function split_to_multimap as 'com.github.aaronshan.functions.string.UDFStringSplitToMultimap';
create temporary function json_array_get as 'com.github.aaronshan.functions.json.UDFJsonArrayGet';
create temporary function json_array_length as 'com.github.aaronshan.functions.json.UDFJsonArrayLength';
create temporary function json_array_extract as 'com.github.aaronshan.functions.json.UDFJsonArrayExtract';
create temporary function json_array_extract_scalar as 'com.github.aaronshan.functions.json.UDFJsonArrayExtractScalar';
create temporary function json_extract as 'com.github.aaronshan.functions.json.UDFJsonExtract';
create temporary function json_extract_scalar as 'com.github.aaronshan.functions.json.UDFJsonExtractScalar';
create temporary function json_size as 'com.github.aaronshan.functions.json.UDFJsonSize';
create temporary function id_card_province as 'com.github.aaronshan.functions.card.UDFChinaIdCardProvince';
create temporary function id_card_city as 'com.github.aaronshan.functions.card.UDFChinaIdCardCity';
create temporary function id_card_area as 'com.github.aaronshan.functions.card.UDFChinaIdCardArea';
create temporary function id_card_birthday as 'com.github.aaronshan.functions.card.UDFChinaIdCardBirthday';
create temporary function id_card_gender as 'com.github.aaronshan.functions.card.UDFChinaIdCardGender';
create temporary function is_valid_id_card as 'com.github.aaronshan.functions.card.UDFChinaIdCardValid';
create temporary function id_card_info as 'com.github.aaronshan.functions.card.UDFChinaIdCardInfo';
create temporary function wgs_distance as 'com.github.aaronshan.functions.geo.UDFGeoWgsDistance';
create temporary function gcj_to_bd as 'com.github.aaronshan.functions.geo.UDFGeoGcjToBd';
create temporary function bd_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoBdToGcj';
create temporary function wgs_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoWgsToGcj';
create temporary function gcj_to_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjToWgs';
create temporary function gcj_extract_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjExtractWgs';
create temporary function url_encode as 'com.github.aaronshan.functions.url.UDFUrlEncode';
create temporary function url_decode as 'com.github.aaronshan.functions.url.UDFUrlDecode';
create temporary function infinity as 'com.github.aaronshan.functions.math.UDFMathInfinity';
create temporary function is_finite as 'com.github.aaronshan.functions.math.UDFMathIsFinite';
create temporary function is_infinite as 'com.github.aaronshan.functions.math.UDFMathIsInfinite';
create temporary function nan as 'com.github.aaronshan.functions.math.UDFMathNaN';
create temporary function is_nan as 'com.github.aaronshan.functions.math.UDFMathIsNaN';
create temporary function from_base as 'com.github.aaronshan.functions.math.UDFMathFromBase';
create temporary function to_base as 'com.github.aaronshan.functions.math.UDFMathToBase';
create temporary function cosine_similarity as 'com.github.aaronshan.functions.math.UDFMathCosineSimilarity';
create temporary function normal_cdf as 'com.github.aaronshan.functions.math.UDFMathNormalCdf';
create temporary function inverse_normal_cdf as 'com.github.aaronshan.functions.math.UDFMathInverseNormalCdf';
create temporary function regexp_extract as 'com.github.aaronshan.functions.regexp.UDFRe2JRegexpExtract';
create temporary function regexp_extract_all as 'com.github.aaronshan.functions.regexp.UDFRe2JRegexpExtractAll';
create temporary function regexp_like as 'com.github.aaronshan.functions.regexp.UDFRe2JRegexpLike';
create temporary function regexp_replace as 'com.github.aaronshan.functions.regexp.UDFRe2JRegexpReplace';
create temporary function regexp_split as 'com.github.aaronshan.functions.regexp.UDFRe2JRegexpSplit';

You can use these statements on hive cli env get detail of function.

hive> describe function zodiac_cn;
zodiac_cn(date) - from the input date string or separate month and day arguments, returns the sing of the Zodiac.

or

hive> describe function extended zodiac_cn;
zodiac_cn(date) - from the input date string or separate month and day arguments, returns the sing of the Zodiac.
Example:
 > select zodiac_cn(date_string) from src;
 > select zodiac_cn(month, day) from src;

example

 select pinyin('中国') => zhongguo
 select md5('aaronshan') => 95686bc0483262afe170b550dd4544d1
 select sha256('aaronshan') => d16bb375433ad383169f911afdf45e209eabfcf047ba1faebdd8f6a0b39e0a32
select day_of_week('2016-07-12') => 2
select day_of_year('2016-01-01') => 1
select type_of_day('2016-10-01') => 1
select type_of_day('2016-07-16') => 2
select type_of_day('2016-07-15') => 3
select type_of_day('2016-09-18') => 4
select zodiac_cn('1989-01-08') => 魔羯座
select zodiac_en('1989-01-08') => Capricorn
select array_contains(array(16,12,18,9), 12) => true
select array_equals(array(16,12,18,9), array(16,12,18,9)) => true
select array_intersect(array(16,12,18,9,null), array(14,9,6,18,null)) => [null,9,18]
select array_max(array(16,13,12,13,18,16,9,18)) => 18
select array_min(array(16,12,18,9)) => 9
select array_join(array(16,12,18,9,null), '#','=') => 16#12#18#9#=
select array_distinct(array(16,13,12,13,18,16,9,18)) => [9,12,13,16,18]
select array_position(array(16,13,12,13,18,16,9,18), 13) => 2
select array_remove(array(16,13,12,13,18,16,9,18), 13) => [16,12,18,16,9,18]
select array_reverse(array(16,12,18,9)) => [9,18,12,16]
select array_sort(array(16,13,12,13,18,16,9,18)) => [9,12,13,13,16,16,18,18]
select array_concat(array(16,12,18,9,null), array(14,9,6,18,null)) => [16,12,18,9,null,14,9,6,18,null]
select array_value_count(array(16,13,12,13,18,16,9,18), 13) => 2
select array_slice(array(16,13,12,13,18,16,9,18), -2, 3) => [9,18]
select array_element_at(array(16,13,12,13,18,16,9,18), -1) => 18
select array_shuffle(array(16,12,18,9))
select sequence(1, 5) => [1, 2, 3, 4, 5]
select sequence(5, 1) => [5, 4, 3, 2, 1]
select sequence(1, 9, 4) => [1, 5, 9]
select sequence('2016-04-12 00:00:00', '2016-04-14 00:00:00', 24*3600*1000) => ['2016-04-12 00:00:00', '2016-04-13 00:00:00', '2016-04-14 00:00:00']
select map_build(array('key1','key2'), array(16,12)) => {"key1":16,"key2":12}
select map_concat(map_build(array('key1','key2'), array(16,12)), map_build(array('key1','key3'), array(17,18))) => {"key1":17,"key2":12,"key3":18}
select map_element_at(map_build(array('key1','key2'), array(16,12)), 'key1') => 16
select map_equals(map_build(array('key1','key2'), array(16,12)), map_build(array('key1','key2'), array(16,12))) => true
select id_card_info('110101198901084517') => {"valid":true,"area":"东城区","province":"北京市","gender":"男","city":"北京市"}
select json_array_get("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", 1); => {"a":{"b":"18"}}
select json_array_get('["a", "b", "c"]', 0); => a
select json_array_get('["a", "b", "c"]', 1); => b
select json_array_get('["c", "b", "a"]', -1); => a
select json_array_get('["c", "b", "a"]', -2); => b
select json_array_get('[]', 0); => null
select json_array_get('["a", "b", "c"]', 10); => null
select json_array_get('["c", "b", "a"]', -10); => null
select json_array_length("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]"); => 3
select json_array_extract("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b"); => ["\"13\"","\"18\"","\"12\""]
select json_array_extract_scalar("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b") => ["13","18","12"]
select json_extract("{\"a\":{\"b\":\"12\"}}", "$.a.b"); => "12"
select json_extract_scalar("{\"a\":{\"b\":\"12\"}}", "$.a.b") => 12
select json_extract_scalar('[1, 2, 3]', '$[2]');
select json_extract_scalar(json, '$.store.book[0].author');
select json_size('{"x": {"a": 1, "b": 2}}', '$.x'); => 2
select json_size('{"x": [1, 2, 3]}', '$.x'); => 3
select json_size('{"x": {"a": 1, "b": 2}}', '$.x.a'); => 0
select gcj_to_bd(39.915, 116.404) => {"lng":116.41036949371029,"lat":39.92133699351022}
select bd_to_gcj(39.915, 116.404) => {"lng":116.39762729119315,"lat":39.90865673957631}
select wgs_to_gcj(39.915, 116.404) => {"lng":116.41024449916938,"lat":39.91640428150164}
select gcj_to_wgs(39.915, 116.404) => {"lng":116.39775550083061,"lat":39.91359571849836}
select gcj_extract_wgs(39.915, 116.404) => {"lng":116.39775549316407,"lat":39.913596801757805}
select url_encode('http://shanruifeng.cc/') => http%3A%2F%2Fshanruifeng.cc%2F
select cosine_similarity(map_build(array['a'], array[1.0]), map_build(array['a'], array[2.0])); => 1.0