测试数据库
hive> describe empinfo;
OK
firstname string
lastname string
id int
age int
city string
state string
Time taken: 0.047 seconds, Fetched: 6 row(s)
hive> select * from empinfo;
OK
John Jones 99980 45 Payson Arizona
Mary Jones 99982 25 Payson Arizona
Eric Edwards 88232 32 San Diego California
Mary Ann Edwards 88233 32 Phoenix Arizona
Ginger Howell 98002 42 Cottonwood Arizona
Sebastian Smith 92001 23 Gila Bend Arizona
Gus Gray 22322 35 Bagdad Arizona
Mary Ann May 32326 52 Tucson Arizona
Erica Williams 32327 60 Show Low Arizona
Leroy Brown 32380 22 Pinetop Arizona
Elroy Cleaver 32382 22 Globe Arizona
Time taken: 0.033 seconds, Fetched: 12 row(s)
Hive Macro
hive> CREATE TEMPORARY MACRO nominal_age(age int) age+1;
OK
Time taken: 0.006 seconds
hive> select firstname,lastname,age,nominal_age(age) from empinfo;
OK
John Jones 45 46
Mary Jones 25 26
Eric Edwards 32 33
Mary Ann Edwards 32 33
Ginger Howell 42 43
Sebastian Smith 23 24
Gus Gray 35 36
Mary Ann May 52 53
Erica Williams 60 61
Leroy Brown 22 23
Elroy Cleaver 22 23
Time taken: 0.039 seconds, Fetched: 11 row(s)
hive> DROP TEMPORARY MACRO IF EXISTS nominal_age;
OK
Time taken: 0.005 seconds
Hive UDF
查看当前所安装的Hive支持的UDF
hive> show functions;
查看UDF介绍
选个好理解的concat函数吧。
hive> describe function concat;
OK
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data bin1, bin2, ... binN
Time taken: 0.004 seconds, Fetched: 1 row(s)
hive> describe function extended concat;
OK
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data bin1, bin2, ... binN
Returns NULL if any argument is NULL.
Example:
> SELECT concat('abc', 'def') FROM src LIMIT 1;
'abcdef'
Time taken: 0.006 seconds, Fetched: 5 row(s)
测试内建UDF
concat函数是软柿子么?为啥还拿它掐呢!!
hive> select concat(firstname,' ',lastname),age from empinfo;
OK
John Jones 45
Mary Jones 25
Eric Edwards 32
Mary Ann Edwards 32
Ginger Howell 42
Sebastian Smith 23
Gus Gray 35
Mary Ann May 52
Erica Williams 60
Leroy Brown 22
Elroy Cleaver 22
Time taken: 0.039 seconds, Fetched: 11 row(s)
定制UDF
参考二中的代码挺好,改一改再添个自己的Hello类吧。我的代码、编译和jar命令都做在一个Shell脚本中,方便理解这个流程: mryqu_udf.jar文件内容如下:
hadoop@node50064:~/hivetest$ jar -tf mryqu_udf.jar
META-INF/
META-INF/MANIFEST.MF
com/
com/mryqu/
com/mryqu/hive/
com/mryqu/hive/udf/
com/mryqu/hive/udf/Lower.class
com/mryqu/hive/udf/Hello.class
com/mryqu/hive/udf/Hello.java
com/mryqu/hive/udf/Lower.java
首先测试一下临时函数,临时函数仅能工作在Hive会话范围内,不会将其元数据写入Hive的元数据数据库,因此使用mryqu_udf.jar中类,必须在每个会话都创建临时函数。测试过程如下:
hive> add jar /home/hadoop/hivetest/mryqu_udf.jar;
Added [/home/hadoop/hivetest/mryqu_udf.jar] to class path
Added resources: [/home/hadoop/hivetest/mryqu_udf.jar]
hive> list jars;
/home/hadoop/hivetest/mryqu_udf.jar
hive> create temporary function my_hello as 'com.mryqu.hive.udf.Hello';
OK
Time taken: 0.01 seconds
hive> create temporary function my_lower as 'com.mryqu.hive.udf.Lower';
OK
Time taken: 0.004 seconds
hadoop@node50064:~/hivetest$ hive -e 'show functions' | grep my
OK
Time taken: 1.549 seconds, Fetched: 216 row(s)
hadoop@node50064:~/hivetest$
hive> select firstname, my_hello(firstname), my_lower(firstname) from empinfo;
OK
John Hello John john
Mary Hello Mary mary
Eric Hello Eric eric
Mary Ann Hello Mary Ann mary ann
Ginger Hello Ginger ginger
Sebastian Hello Sebastian sebastian
Gus Hello Gus gus
Mary Ann Hello Mary Ann mary ann
Erica Hello Erica erica
Leroy Hello Leroy leroy
Elroy Hello Elroy elroy
Time taken: 0.037 seconds, Fetched: 11 row(s)
hive> drop temporary function if exists my_hello;
OK
Time taken: 0.005 seconds
hive> drop temporary function if exists my_lower;
OK
Time taken: 0.003 seconds
hive> delete jar /home/hadoop/hivetest/mryqu_udf.jar;
Deleted [/home/hadoop/hivetest/mryqu_udf.jar] from class path
hive> list jars;
hive>
接下来试试持久函数(permanentfunction)。参考三中特意提及Hive如果不是在本地模式下,JAR资源文件也不能是本地URI,可以使用HDFSURI等。 看到区别没,showfunctions返回结果包含我所创建的my_hello和my_lower函数。Hive元数据FUNCS表确实包含my_hello和my_lower的信息: 下面的命令同样可以创建上述创建持久函数:
hive> create function my_hello as 'com.mryqu.hive.udf.Hello' using jar 'hdfs://node50064.mryqu.com:9000/user/hive/mryqu_udf.jar';
hive> create function my_lower as 'com.mryqu.hive.udf.Lower' using jar 'hdfs://node50064.mryqu.com:9000/user/hive/mryqu_udf.jar';
最后删除这些持久函数:
hive> drop function if exists my_hello;
OK
Time taken: 0.013 seconds
hive> drop function if exists my_lower;
OK
Time taken: 0.011 seconds
这里仅对普通Hive UDF进行实践,后继学习将在博文Hive 表UDTF和汇聚UDAF学习 中记录。
参考
Hive Operators and User-Defined Functions (UDFs)
Hive Plugins
Hive Data Definition Language - Create/Drop/Reload Function
Hive Data Definition Language - Create/DropMacro
Apache Hive Customization Tutorial Series