linux怎么查看本机内存大小
288
2022-08-25
【Hive】Hive 分区表、桶的基本操作
文章目录
一、Hive 分区表
1、创建分区表2、插入数据3、查看分区4、修改表分区5、删除表分区
二、Hive 桶
1、创建桶2、插入数据3、抽样桶表
环境准备Hadoop 完全分布式(一主两从即可)MySQL环境、Hive环境
一、Hive 分区表
1、创建分区表
在 Hive 中创建一个分区表 goods,包含 goods_id 和 goods_status 两个字段,字符类型为 string ,分区为 cat_id ,字符类型为 string ,以 ‘\t’ 为分隔符:
hive> create table if not exists goods(goods_id string,goods_status string) > partitioned by (cat_id string) > row format delimited fields terminated by '\t';OKTime taken: 0.424
查看表 goods 的表结构:
hive> desc goods;OKgoods_id string goods_status string cat_id string # Partition Information # col_name data_type comment cat_id string Time taken: 0.226 seconds, Fetched: 8 row(s)
这样一个分区表就建好啦~
返回顶部
2、插入数据
向分区表中插入数据,将本地 /home/data/hive-data 下的 goods 中的数据插入到分区表中。
首先在 Hive 中创建一个非分区表 goods1 ,用于存储本地 goods 文件的数据:
hive> create table goods1(goods_id string,goods_status string,cat_id string) > row format delimited fields terminated by '\t';OKTime taken: 0.084
将本地数据导入 goods1:
hive> load data local inpath '/home/data/hive-data/goods' into table goods1;Loading data to table db.goods1OKTime taken: 1.316
再将表 goods1 中的数据导入到分区表 goods 中:
hive> insert into table goods partition(cat_id='52052') select goods_id,goods_status from goods1 where cat_id='52052';WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.Query ID = root_20220306171401_5343b66f-8685-487a-be77-537fd35ccb3aTotal jobs = 3Launching Job 1 out of 3Number of reduce tasks is set to 0 since there's no reduce operatorStarting Job = job_1646556082284_0003, Tracking URL = Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0003Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 02022-03-06 17:14:52,886 Stage-1 map = 0%, reduce = 0%2022-03-06 17:14:59,984 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 secMapReduce Total cumulative CPU time: 1 seconds 700 msecEnded Job = job_1646556082284_0003Stage-4 is selected by condition resolver.Stage-3 is filtered out by condition resolver.Stage-5 is filtered out by condition resolver.Moving data to directory hdfs://192.168.64.183:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-06_17-14-01_837_747076726384884371-1/-ext-10000Loading data to table db.goods partition (cat_id=52052)MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Cumulative CPU: 1.7 sec HDFS Read: 213175 HDFS Write: 378 SUCCESSTotal MapReduce CPU Time Spent: 1 seconds 700 msecOKTime taken: 60.848
查看结果:
hive> select * from goods limit 10;OK1020405 6 520521020406 6 520521020408 6 520521020326 6 520521020407 6 520521020404 6 520521020327 6 520521020409 6 520521020325 6 520521024641 1 52052`在这里插入代码片`Time taken: 0.136 seconds, Fetched: 10 row(s)
返回顶部
3、查看分区
查看 goods 表中的分区:
hive> show partitions goods;OKcat_id=52052Time taken: 0.128 seconds, Fetched: 1 row(s)
返回顶部
4、修改表分区
修改分区表 goods 中的分区列为 cat_id=52051,并查看修改后的分区名:
hive> alter table goods partition(cat_id=52052 ) rename to partition(cat_id=52051);OKTime taken: 0.391 secondshive> show partitions goods;OKcat_id=52051Time taken: 0.061 seconds, Fetched: 1 row(s)
返回顶部
5、删除表分区
删除 goods 表中的 cat_id 分区:
hive> alter table goods drop if exists partition (cat_id='52051');Dropped the partition cat_id=52051OKTime taken: 0.456 secondshive> show partitions goods;OKTime taken: 0.059
返回顶部
二、Hive 桶
1、创建桶
设置环境变量,开启Hive分桶:
hive> set hive.enforce.bucketing=true;
创建一个名为 goods_t 的表,包含两个字段:goods_id 和 goods_status,字符类型首位 string,按 cat_id string 作为分区,按照 goods_status 列聚类和 goods_id 列排列,划分成两个桶:
hive> create table goods_t(goods_id string,goods_status string) > partitioned by (cat_id string) > clustered by (goods_status) > sorted by (goods_id) > into 2 buckets;OKTime taken: 0.08
返回顶部
2、插入数据
向 goods_t 表中插入 goods1 表中的数据:
hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods1;WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.Query ID = root_20220306173319_d8682df4-91ba-42cc-b3cd-76ced5fdf898Total jobs = 1Launching Job 1 out of 1Number of reduce tasks determined at compile time: 2In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=
返回顶部
3、抽样桶表
查看表数据:
hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status) limit 10;OK1000016 4 520631000018 4 520631000096 4 520631000218 4 520631000219 4 520631000223 4 520631000310 4 520631000368 4 520631000371 4 520631000380 4 52063Time taken: 0.131 seconds, Fetched: 10 row(s)
返回顶部
版权声明:本文内容由网络用户投稿,版权归原作者所有,本站不拥有其著作权,亦不承担相应法律责任。如果您发现本站中有涉嫌抄袭或描述失实的内容,请联系我们jiasou666@gmail.com 处理,核实后本网站将在24小时内删除侵权内容。
发表评论
暂时没有评论,来抢沙发吧~