【Hive】Basic Operations on Hive Partitioned Tables and Buckets

User submission · 2022-08-25


Contents

- I. Hive Partitioned Tables
  - 1. Creating a Partitioned Table
  - 2. Inserting Data
  - 3. Viewing Partitions
  - 4. Renaming a Table Partition
  - 5. Dropping a Table Partition
- II. Hive Buckets
  - 1. Creating a Bucketed Table
  - 2. Inserting Data
  - 3. Sampling a Bucketed Table

Prerequisites: a fully distributed Hadoop cluster (one master and two workers is enough), plus working MySQL and Hive installations.

I. Hive Partitioned Tables

1. Creating a Partitioned Table

Create a partitioned table `goods` in Hive with two columns, `goods_id` and `goods_status`, both of type `string`. Partition the table by `cat_id` (also `string`), and use `'\t'` as the field delimiter:

```
hive> create table if not exists goods(goods_id string,goods_status string)
    > partitioned by (cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.424 seconds
```
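When a data file already matches the table's column layout, it can also be loaded straight into a specific partition, with no staging table. A minimal sketch, assuming a hypothetical file `/home/data/hive-data/goods_52052` that contains only tab-separated `goods_id` and `goods_status` columns:

```sql
-- Load a local file directly into one partition of goods.
-- The partition column value is supplied in the clause, not in the file.
load data local inpath '/home/data/hive-data/goods_52052'  -- hypothetical file
into table goods partition (cat_id = '52052');
```

Hive stores each partition as its own HDFS directory (e.g. `.../goods/cat_id=52052/`), so queries that filter on `cat_id` only scan the matching directories.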

Check the structure of table `goods`:

```
hive> desc goods;
OK
goods_id                string
goods_status            string
cat_id                  string

# Partition Information
# col_name              data_type               comment

cat_id                  string
Time taken: 0.226 seconds, Fetched: 8 row(s)
```

With that, the partitioned table is created.


2. Inserting Data

Now insert data into the partitioned table: the data in the local file `/home/data/hive-data/goods` will be loaded into it.

First, create a non-partitioned table `goods1` in Hive to stage the data from the local goods file:

```
hive> create table goods1(goods_id string,goods_status string,cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.084 seconds
```

Load the local data into `goods1`:

```
hive> load data local inpath '/home/data/hive-data/goods' into table goods1;
Loading data to table db.goods1
OK
Time taken: 1.316 seconds
```
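Note that `into table` appends to whatever `goods1` already contains, so re-running the load duplicates rows. To replace the table's contents instead, Hive supports the `overwrite` form:

```sql
-- Replace (rather than append to) the existing contents of goods1.
load data local inpath '/home/data/hive-data/goods' overwrite into table goods1;
```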

Then insert the data from `goods1` into the partitioned table `goods`:

```
hive> insert into table goods partition(cat_id='52052')
    > select goods_id,goods_status from goods1 where cat_id='52052';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306171401_5343b66f-8685-487a-be77-537fd35ccb3a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646556082284_0003, Tracking URL =
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 17:14:52,886 Stage-1 map = 0%,  reduce = 0%
2022-03-06 17:14:59,984 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.7 sec
MapReduce Total cumulative CPU time: 1 seconds 700 msec
Ended Job = job_1646556082284_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://192.168.64.183:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-06_17-14-01_837_747076726384884371-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.7 sec   HDFS Read: 213175 HDFS Write: 378 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
Time taken: 60.848 seconds
```
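The insert above uses a static partition: `cat_id='52052'` is hard-coded, so each partition would need its own insert statement. Hive can also derive the partition value from the query itself (dynamic partitioning); a minimal sketch, assuming otherwise default settings:

```sql
-- Allow partitions to be created dynamically in this session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last select column (cat_id) supplies the partition value,
-- so every distinct cat_id in goods1 gets its own partition.
insert into table goods partition (cat_id)
select goods_id, goods_status, cat_id from goods1;
```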

Check the result:

```
hive> select * from goods limit 10;
OK
1020405 6       52052
1020406 6       52052
1020408 6       52052
1020326 6       52052
1020407 6       52052
1020404 6       52052
1020327 6       52052
1020409 6       52052
1020325 6       52052
1024641 1       52052
Time taken: 0.136 seconds, Fetched: 10 row(s)
```


3. Viewing Partitions

View the partitions of the `goods` table:

```
hive> show partitions goods;
OK
cat_id=52052
Time taken: 0.128 seconds, Fetched: 1 row(s)
```
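Partitions can also be registered explicitly with `alter table ... add partition`, which creates an empty partition directory without running any query. A sketch (the partition value `52053` here is illustrative, not from the walkthrough):

```sql
-- Create an empty partition; data can be loaded into it later.
alter table goods add if not exists partition (cat_id = '52053');
```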


4. Renaming a Table Partition

Rename the `cat_id=52052` partition of table `goods` to `cat_id=52051`, then verify the new partition name:

```
hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.391 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.061 seconds, Fetched: 1 row(s)
```


5. Dropping a Table Partition

Drop the `cat_id=52051` partition from the `goods` table, then confirm it is gone:

```
hive> alter table goods drop if exists partition (cat_id='52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.456 seconds
hive> show partitions goods;
OK
Time taken: 0.059 seconds
```


II. Hive Buckets

1. Creating a Bucketed Table

Set the Hive session property that enforces bucketing on insert:

```
hive> set hive.enforce.bucketing=true;
```

Create a table named `goods_t` with two columns, `goods_id` and `goods_status`, both of type `string`. Partition it by `cat_id` (`string`), cluster it by the `goods_status` column, sort it by the `goods_id` column, and divide it into two buckets:

```
hive> create table goods_t(goods_id string,goods_status string)
    > partitioned by (cat_id string)
    > clustered by (goods_status)
    > sorted by (goods_id)
    > into 2 buckets;
OK
Time taken: 0.08 seconds
```


2. Inserting Data

Insert the data from table `goods1` into `goods_t`:

```
hive> insert overwrite table goods_t partition(cat_id='52063')
    > select goods_id,goods_status from goods1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306173319_d8682df4-91ba-42cc-b3cd-76ced5fdf898
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1646556082284_0004, Tracking URL =
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-06 17:33:45,839 Stage-1 map = 0%,  reduce = 0%
2022-03-06 17:33:51,130 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.31 sec
2022-03-06 17:33:58,420 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 3.31 sec
2022-03-06 17:34:12,914 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.1 sec
MapReduce Total cumulative CPU time: 5 seconds 100 msec
Ended Job = job_1646556082284_0004
Loading data to table db.goods_t partition (cat_id=52063)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.1 sec   HDFS Read: 220296 HDFS Write: 112278 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 100 msec
OK
Time taken: 54.668 seconds
```
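Note that the job ran with two reducers, one per bucket: Hive routes each row to bucket `hash(goods_status) mod 2`, and each bucket is materialized as one file in the partition's HDFS directory (typically named `000000_0` and `000001_0`). This can be checked from the Hive shell; the warehouse path below is an assumption that follows the layout shown in the earlier `goods` insert output:

```sql
-- List the bucket files for the cat_id=52063 partition of goods_t.
dfs -ls /user/hive/warehouse/db.db/goods_t/cat_id=52063;
```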


3. Sampling a Bucketed Table

Sample the table data:

```
hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status) limit 10;
OK
1000016 4       52063
1000018 4       52063
1000096 4       52063
1000218 4       52063
1000219 4       52063
1000223 4       52063
1000310 4       52063
1000368 4       52063
1000371 4       52063
1000380 4       52063
Time taken: 0.131 seconds, Fetched: 10 row(s)
```
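`tablesample(bucket x out of y on col)` splits the rows into `y` groups by hashing `col` and returns group `x`. When `y` matches the table's bucket count and `col` is the clustering column, as above, Hive can serve the sample by reading a single bucket file instead of scanning the whole table. For a rough random sample that does not depend on the bucketing layout, Hive also accepts `rand()` as the sampling expression:

```sql
-- Hash on a random number per row: roughly half the rows,
-- drawn without relying on the table's bucket files.
select * from goods_t tablesample(bucket 1 out of 2 on rand()) limit 10;
```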


