A Hands-on Guide to Iceberg - trino_on_k8s
Preface
This article walks through standing up an Iceberg data lake as a playground for learning Iceberg's features. Trino with its integrated Iceberg connector is comparatively easy to understand, and there is already a ready-made project on GitHub to borrow from; we only need a few adaptations. The whole project runs on Alibaba Cloud; later I may build demos on Huawei Cloud or Tencent Cloud if needed.
The figure below shows the basic architecture of an Iceberg-based data lake.
Trino is a big-data query engine that achieves low-latency, high-concurrency queries through long-running processes and pure in-memory computation; it is typically used for near-real-time workloads ranging from seconds to a few minutes. There are plenty of introductions online, so I will not repeat them here.
The next figure is roughly the design of the whole demo project:
We use one low-spec ECS instance as the single entry point to the public network.
We request several pay-as-you-go burstable cloud instances from Alibaba Cloud (very cheap, perfect for a demo); two 4-core/8GB machines are enough to start, and we then create the K8S cluster on them.
For simplicity, object storage is MinIO, installed from the binary on our public-facing host (convenient for browsing the stored data through the web console).
Inside the K8S cluster we first install MariaDB, which in the Iceberg setup mainly stores the object-storage path of the current snapshot's metadata.
Then we install the Hive Metastore to manage metadata.
Finally, we install Trino and we are done.
Code repository
gitee.com/nativesailo…
Note: the yaml shown in this article may not be the final version; refer to the repository for the authoritative configs.
The article has two parts: part one builds the environment, and part two uses Trino to store data into the Iceberg data lake.
Environment Setup
Enter Host A
Request a cloud host
Reference: juejin.cn/post/684516…
Prepare one ECS server; we will call it Host A throughout.
Connect to the host
Log in using the pem key.
Install git
sudo yum install git
Download the code from GitHub
Search "trino-on-k8s github" to find the project: github.com/joshuarobin…
Clone the code onto the host:
git clone github.com/joshuarobin…
Read the README
Main deployment flow:
Build Docker image for Hive Metastore
Deploy Hive Metastore: MariaDB (pvc and deployment), init-schemas, Metastore
Deploy Trino services (coordinator, workers, and cli)
Install the MinIO server on Host A
Download MinIO
wget dl.min.io/server/mini…
Add execute permission
chmod +x minio
Set the MinIO access key
export MINIO_ACCESS_KEY=minioadmin
Set the MinIO secret key
export MINIO_SECRET_KEY=minioadmin
Start MinIO in the background
Pin the Console port to 9090:
```shell
nohup ./minio server --console-address :9090 /data/minio/ > /data/minio/minio.log 2>&1 &
```
Start MinIO in the foreground
Again pinning the Console port to 9090:
```
[root@iZ8vbgn00yu75o270lb2azZ code]# ./minio server --console-address :9090 /data/minio/
WARNING: MINIO_ACCESS_KEY and MINIO_SECRET_KEY are deprecated.
         Please use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD
API: http://172.28.166.95:9000  http://127.0.0.1:9000
RootUser: minioadmin
RootPass: minioadmin

Console: http://172.28.166.95:9090 http://127.0.0.1:9090
RootUser: minioadmin
RootPass: minioadmin

Command-line: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc alias set myminio http://172.28.166.95:9000 minioadmin minioadmin

Documentation: https://docs.min.io
```
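The startup log warns that MINIO_ACCESS_KEY and MINIO_SECRET_KEY are deprecated. A minimal sketch of the modern equivalent, assuming a current MinIO build (same minioadmin credentials as above):

```shell
# Replacement for the deprecated variables; the server reads these at startup
export MINIO_ROOT_USER=minioadmin
export MINIO_ROOT_PASSWORD=minioadmin
# Then start the server exactly as before:
#   ./minio server --console-address :9090 /data/minio/
echo "root user set to: $MINIO_ROOT_USER"
```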
Access
Browse the web console at Host A's public IP 39.103.234.55 on port 9090.
Install the MinIO client on Host A
Download mc
wget http://dl.minio.org.cn/client/mc/release/linux-amd64/mc
Add execute permission
chmod +x mc
./mc --help
Add a host alias
./mc config host add minio http://172.28.166.95:9000 minioadmin minioadmin --api s3v4
Browse the storage (if the datalake bucket does not exist yet, create it first with ./mc mb minio/datalake)
./mc ls minio/datalake
Build the K8S cluster
Request two burstable cloud hosts
Reference for creating burstable instances: help.aliyun.com/document_de…
Pay-as-you-go; 2-core/4GB burstable performance instances;
No public IP; use the same security group as Host A;
Use the same key pair as Host A;
Click Create.
Create the K8S cluster
ACK managed edition; standard tier; pay-as-you-go;
No SNAT; choose the slim option for API Server access; no EIP to expose the API Server;
Select the two newly requested ECS instances as worker nodes; choose CentOS 7.9 as the OS;
For add-ons keep only Ingress; use a private-network load balancer of the slim spec (this post does not actually use the cluster's Ingress, so you can skip installing the Ingress add-on);
Click Create Cluster and you are done.
Install the kubectl client on Host A
yum install -y kubectl
Configure cluster access on Host A
vi $HOME/.kube/config, paste the cluster credentials from the ACK console, and save.
```
[root@iZ8vbgn00yu75o270lb2azZ nfs]# kubectl get nodes
NAME                            STATUS   ROLES    AGE   VERSION
cn-zhangjiakou.172.28.166.120   Ready    <none>   27m   v1.18.8-aliyun.1
cn-zhangjiakou.172.28.166.121   Ready    <none>   27m   v1.18.8-aliyun.1
```
Install NFS
Enter the trino-on-k8s directory and look at maria_pvc.yaml:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maria-pv-claim
spec:
  storageClassName: pure-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```
MariaDB depends on a StorageClass for its data volume, so we first define that StorageClass in the cluster.
References:
blog.csdn.net/huwh_/artic…
www.cnblogs.com/allmdzz/p/1…
Create the NFS share on Host A
Install the NFS server (if dnf is unavailable on CentOS 7, use yum install nfs-utils instead):
dnf install nfs-utils
Start nfs-server:
systemctl start nfs-server && systemctl enable nfs-server
Create the shared directory:
mkdir /data/nfs
Configure the shared directory:
vi /etc/exports
Add the export entry; any client may mount it read-write, and no_root_squash keeps client root mapped to root, which the NFS provisioner pod needs:
/data/nfs *(insecure,rw,async,no_root_squash)
Export the configured directories:
exportfs -rv
Verify:
showmount -e
Create the namespace
kubectl create ns datalake
Create the ServiceAccount and bind its permissions
rbac.yaml:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-client-provisioner
  namespace: datalake
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-client-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    namespace: datalake
roleRef:
  kind: ClusterRole
  name: nfs-client-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
  namespace: datalake
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    namespace: datalake
roleRef:
  kind: Role
  name: leader-locking-nfs-client-provisioner
  apiGroup: rbac.authorization.k8s.io
```
Apply rbac.yaml (the datalake namespace already exists from the previous step):
kubectl apply -f rbac.yaml -n datalake
Create the StorageClass
nfs-StorageClass.yaml:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-block
provisioner: qgg-nfs-storage
parameters:
  archiveOnDelete: "false"
```
The provisioner name must match the PROVISIONER_NAME environment variable in the provisioner deployment;
metadata.name in nfs-StorageClass.yaml must match the storageClassName referenced in maria_pvc.yaml.
kubectl apply -f nfs-StorageClass.yaml -n datalake
Create an Alibaba Cloud private image registry
I created a registry under the namespace sailor-datalake; this is straightforward, so details are omitted.
Images
Because the K8S cluster has no internet access for now, we must pull each image, re-tag it, and push it to the private (VPC-internal) registry.
Pull the latest nfs-client-provisioner image on Host A
docker pull quay.io/external_storage/nfs-client-provisioner:latest
Tag the image
docker tag quay.io/external_storage/nfs-client-provisioner:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
Log in to the registry
docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
Note: use the account and address of your own Alibaba Cloud container registry.
Push the image
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
Run docker login on the K8S worker nodes
Configure access to the worker nodes
Configure hosts
Open the file:
vi /etc/hosts
Add each ECS node's internal IP and a name for it:
172.28.166.xxx worker1
172.28.166.yyy worker2
Delete stale host keys (needed after a worker restart):
/root/.ssh/known_hosts
SSH access:
ssh-agent bash
ssh-add -k Happy_123.pem
ssh worker1
or: ssh -i Happy_123.pem worker1
Log in to the registry from the worker nodes
I found that registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest, the image I pushed to my own registry, always failed to pull due to a permission problem, so the nfs-client-provisioner deployment's pod would not start; no other image had this issue.
The workaround is to log in to each worker node, run docker login manually, pull the image onto the node, and set the deployment's imagePullPolicy to IfNotPresent. I will dig into the root cause in another post.
ssh worker1
sudo docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
sudo docker pull registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
Note: use the internal address (registry-vpc).
Add the workers to Host A's security group
Note: alternatively you can request the ECS instances first (directly in Host A's security group) and then create the K8S cluster, selecting those ECS instances during creation. But be very careful not to select Host A itself, or it will be re-imaged and all your data and configuration will be lost!
Create the NFS provisioner
nfs-provisioner.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-client-provisioner
  labels:
    app: nfs-client-provisioner
  namespace: datalake
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-client-provisioner
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner
      containers:
        - name: nfs-client-provisioner
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
          volumeMounts:
            - name: nfs-client-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME
              value: qgg-nfs-storage
            - name: NFS_SERVER
              value: 172.28.166.95
            - name: NFS_PATH
              value: /data/nfs
      volumes:
        - name: nfs-client-root
          nfs:
            server: 172.28.166.95
            path: /data/nfs
```
The NFS mount path is the /data/nfs directory created on Host A earlier;
the NFS server runs on Host A, so both NFS_SERVER and the volume's server must point to Host A's private IP;
PROVISIONER_NAME is the provisioner name; make sure it matches the provisioner in nfs-StorageClass.yaml.
kubectl apply -f nfs-provisioner.yaml -n datalake
Install MariaDB
Apply the PVC
Change the requested storage in the PVC to 5Gi:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maria-pv-claim
spec:
  storageClassName: pure-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```
Apply:
kubectl apply -f maria_pvc.yaml -n datalake
Install Docker on Host A
yum -y install docker
systemctl start docker
systemctl status docker
Images
Because the K8S cluster has no internet access, we again pull the image, re-tag it, and push it to the private (VPC-internal) registry.
Pull the latest MariaDB server image on Host A
docker pull mariadb/server:latest
Tag the image
docker tag mariadb/server:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest
Log in to the registry
docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
Note: use the account and address of your own Alibaba Cloud container registry.
Push the image
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest
安装mariaDB服务
cd /opt/code/trino-on-k8s vi maria_deployment.yaml复制代码
修改yaml
--- apiVersion: v1 kind: Service metadata: name: metastore-db spec: ports: - port: 13306 targetPort: 3306 selector: app: mysql --- apiVersion: apps/v1 kind: Deployment metadata: name: mysql spec: selector: matchLabels: app: mysql strategy: type: Recreate template: metadata: labels: app: mysql spec: containers: - name: mariadb image: "registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest" env: - name: MYSQL_ROOT_PASSWORD value: Happy_123 ports: - containerPort: 3306 name: mysql volumeMounts: - name: mariadb-for-hive mountPath: /var/lib/mysql resources: requests: memory: "1G" cpu: 0.5 volumes: - name: mariadb-for-hive persistentVolumeClaim: claimName: maria-pv-claim复制代码
修改镜像地址;修改mariadb的密码;修改所需资源为很小的值。
kubectl apply -f maria_deployment.yaml -n datalake复制代码
Inspect the metastore_db database
Install the mysql client on Host A
yum install mysql -y
Expose a NodePort on the cluster
vi maria_deployment.yaml
Edit the Service in the yaml to add NodePort 32005:
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: metastore-db
spec:
  type: NodePort
  ports:
    - port: 13306
      targetPort: 3306
      nodePort: 32005
  selector:
    app: mysql
```
Query the database:
```
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# mysql -h worker1 -P 32005 -u root -p
Enter password: Happy_123
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 26
Server version: 10.5.10-MariaDB-1:10.5.10+maria~bionic mariadb.org binary distribution

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore_db       |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> use metastore_db
Database changed
MariaDB [metastore_db]> show tables;
```
mysql -h worker1 -P 32005 -u root -p
mysql -h 172.28.166.120 -P 32005 -u root -p
172.28.166.120 is any node's IP, and 32005 is the NodePort we set for the MariaDB service.
Install the metastore
Run the script to build the image
Edit the build script first:
cd /opt/code/trino-on-k8s/hive_metastore
vi build_image.sh
Set the image repository and tag, and pass the datalake namespace when creating the configmap:
```shell
set -e

REPONAME=registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake
TAG=metastore:v1.0.1

docker build -t $TAG .

# Tag and push to the private docker repository.
docker tag $TAG $REPONAME/$TAG
docker push $REPONAME/$TAG

# Update configmaps
kubectl create configmap metastore-cfg --dry-run --from-file=metastore-site.xml --from-file=core-site.xml -n datalake -o yaml | kubectl apply -f - -n datalake
```
Initialize the schema
vi hive-initschema.yaml
Update the image name, and make sure the password matches the MariaDB root password (Happy_123):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hive-initschema
spec:
  template:
    spec:
      containers:
        - name: hivemeta
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
          command: ["/opt/hive-metastore/bin/schematool"]
          args: ["--verbose", "-initSchema", "-dbType", "mysql",
                 "-userName", "root", "-passWord", "Happy_123",
                 "-url", "jdbc:mysql://metastore-db:13306/metastore_db?createDatabaseIfNotExist=true"]
      restartPolicy: Never
  backoffLimit: 4
```
Apply:
```
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl apply -f hive-initschema.yaml -n datalake
job.batch/hive-initschema created
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME                                      READY   STATUS      RESTARTS   AGE
hive-initschema-9zd75                     0/1     Completed   0          14s
mysql-8466d8694c-mmlqk                    1/1     Running     0          75m
nfs-client-provisioner-75bc5bc89f-xmbw8   1/1     Running     0          37m
```
Edit core-site.xml
The original core-site.xml does not carry enough configuration for S3 storage, so we add some here.
Reference: trino.io/docs/curren…
```xml
<configuration>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.28.166.95:9000</value>
  </property>
  <property>
    <name>hive.s3a.aws-access-key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>hive.s3a.aws-secret-key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.fast.upload</name>
    <value>true</value>
  </property>
</configuration>
```
Set fs.s3a.endpoint to Host A's private IP plus MinIO's port.
Edit metastore-site.xml
```xml
<configuration>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306/metastore_db</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Happy_123</value>
  </property>
  <property>
    <name>metastore.warehouse.dir</name>
    <value>s3a://datalake/warehouse/</value>
  </property>
  <property>
    <name>metastore.thrift.port</name>
    <value>9083</value>
  </property>
</configuration>
```
Update the database password, and set the namespace in the connection URL to datalake.
Run the build script again to update the configmaps.
Install the metastore service
Edit metastore.yaml:
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: metastore
spec:
  ports:
    - port: 9083
  selector:
    app: metastore
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metastore
spec:
  selector:
    matchLabels:
      app: metastore
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: metastore
    spec:
      containers:
        - name: metastore
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          ports:
            - containerPort: 9083
          volumeMounts:
            - name: metastore-cfg-vol
              mountPath: /opt/hive-metastore/conf/metastore-site.xml
              subPath: metastore-site.xml
            - name: metastore-cfg-vol
              mountPath: /opt/hadoop/etc/hadoop/core-site.xml
              subPath: core-site.xml
          command: ["/opt/hive-metastore/bin/start-metastore"]
          args: ["-p", "9083"]
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: metastore-cfg-vol
          configMap:
            name: metastore-cfg
```
Add a Secret config file, my-s3-keys.yaml:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-s3-keys
type: Opaque
data:
  access-key: bWluaW9hZG1pbg==
  secret-key: bWluaW9hZG1pbg==
```
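Kubernetes Secret data values must be base64-encoded; the two strings above are simply minioadmin encoded. They can be regenerated or verified like this:

```shell
# Encode the MinIO credential for the Secret (-n avoids including a trailing newline)
echo -n 'minioadmin' | base64          # prints bWluaW9hZG1pbg==
# Decode to double-check what the Secret will actually hold
echo -n 'bWluaW9hZG1pbg==' | base64 -d # prints minioadmin
```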
Apply:
kubectl apply -f my-s3-keys.yaml -n datalake
kubectl apply -f metastore.yaml -n datalake
Install Trino
Edit the Trino configs
vi trino-cfgs.yaml
Edit the yaml (note that hive.s3.endpoint must include MinIO's port 9000, consistent with core-site.xml):
```yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: trino-configs
data:
  jvm.config: |-
    -server
    -Xmx2G
    -XX:-UseBiasedLocking
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ExitOnOutOfMemoryError
    -XX:+UseGCOverheadLimit
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:ReservedCodeCacheSize=512M
    -Djdk.attach.allowAttachSelf=true
    -Djdk.nio.maxCachedBufferSize=2000000
  config.properties.coordinator: |-
    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery-server.enabled=true
    discovery.uri=http://trino:8080
  config.properties.worker: |-
    coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery.uri=http://trino:8080
  node.properties: |-
    node.environment=test
    spiller-spill-path=/tmp
    max-spill-per-node=4TB
    query-max-spill-per-node=1TB
  hive.properties: |-
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore:9083
    hive.allow-drop-table=true
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=172.28.166.95:9000
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  iceberg.properties: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://metastore:9083
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=172.28.166.95:9000
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  mysql.properties: |-
    connector.name=mysql
    connection-url=jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306
    connection-user=root
    connection-password=Happy_123
```
Apply:
kubectl apply -f trino-cfgs.yaml -n datalake
Edit trino.yaml
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: trino
spec:
  ports:
    - port: 8080
  selector:
    app: trino-coordinator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trino-coordinator
spec:
  selector:
    matchLabels:
      app: trino-coordinator
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: trino-coordinator
    spec:
      containers:
        - name: trino
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
          ports:
            - containerPort: 8080
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          volumeMounts:
            - name: trino-cfg-vol
              mountPath: /etc/trino/jvm.config
              subPath: jvm.config
            - name: trino-cfg-vol
              mountPath: /etc/trino/config.properties
              subPath: config.properties.coordinator
            - name: trino-cfg-vol
              mountPath: /etc/trino/node.properties
              subPath: node.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/hive.properties
              subPath: hive.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/iceberg.properties
              subPath: iceberg.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/mysql.properties
              subPath: mysql.properties
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: trino-cfg-vol
          configMap:
            name: trino-configs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trino-worker
spec:
  serviceName: trino-worker
  replicas: 1
  selector:
    matchLabels:
      app: trino-worker
  template:
    metadata:
      labels:
        app: trino-worker
    spec:
      securityContext:
        fsGroup: 1000
      containers:
        - name: trino
          image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
          ports:
            - containerPort: 8080
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          volumeMounts:
            - name: trino-cfg-vol
              mountPath: /etc/trino/jvm.config
              subPath: jvm.config
            - name: trino-cfg-vol
              mountPath: /etc/trino/config.properties
              subPath: config.properties.worker
            - name: trino-cfg-vol
              mountPath: /etc/trino/node.properties
              subPath: node.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/hive.properties
              subPath: hive.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/iceberg.properties
              subPath: iceberg.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/mysql.properties
              subPath: mysql.properties
            - name: trino-tmp-data
              mountPath: /tmp
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: trino-cfg-vol
          configMap:
            name: trino-configs
  volumeClaimTemplates:
    - metadata:
        name: trino-tmp-data
      spec:
        storageClassName: pure-block
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 40Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trino-cli
spec:
  containers:
    - name: trino-cli
      image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
      command: ["tail", "-f", "/dev/null"]
      imagePullPolicy: Always
  restartPolicy: Always
```
Change the image name; set the trino-tmp-data volume size to 40Gi; set the worker count to 1; and size the worker and coordinator resource requests accordingly.
Apply trino.yaml
```
kubectl apply -f trino.yaml -n datalake
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME                                      READY   STATUS      RESTARTS   AGE
hive-initschema-9zd75                     0/1     Completed   0          71m
metastore-7df56797cc-j7dcd                1/1     Running     0          11m
mysql-8466d8694c-mmlqk                    1/1     Running     0          146m
nfs-client-provisioner-75bc5bc89f-xmbw8   1/1     Running     0          108m
trino-cli                                 1/1     Running     0          64s
trino-coordinator-54f78c7984-zqf8t        1/1     Running     0          64s
trino-worker-0                            0/1     Pending     0          64s
```
The worker pod is Pending because the cluster is out of resources; request one more worker node and add it to the cluster.
Binary client
```shell
wget https://repo1.maven.org/maven2/io/trino/trino-cli/361/trino-cli-361-executable.jar
mv trino-cli-361-executable.jar trino
chmod +x trino
# Connect to Trino; port 8080 matches the port in config.properties
./trino --server localhost:8080 --catalog iceberg
```
Test
kubectl exec -it trino-cli -n datalake -- trino --server trino:8080 --catalog iceberg --schema default
Access the Trino UI
Trino runs inside the cluster, and the cluster has no public network access, so the UI can only be reached through Host A. The plan: install nginx on Host A and point it at the NodePort that Trino exposes.
Install nginx on Host A
Reference: blog.csdn.net/lxh_worldpe…
Open port 32001 in the security group (32001 is Trino's NodePort).
Configure nginx
Listen on port 32001 (any port would do; here it matches Trino's NodePort), and point proxy_pass at any cluster node IP plus Trino's NodePort:
cd /usr/local/nginx/conf
vim nginx.conf
```nginx
server {
    listen       32001;
    server_name  localhost;

    location / {
        proxy_read_timeout 300;
        proxy_connect_timeout 300;
        proxy_redirect off;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Frame-Options SAMEORIGIN;
        proxy_pass http://172.28.166.135:32001;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root /usr/share/nginx/html;
    }
}
```
See also www.cnblogs.com/fps2tao/p/9…
nginx must be reloaded for a config change to take effect:
/usr/local/nginx/sbin/nginx -t        (test whether the edited config is valid)
/usr/local/nginx/sbin/nginx -s reload (reload it)
Web access
http://39.103.234.55:32001/ui/
Try Iceberg with Trino
Create a table
```sql
CREATE TABLE iceberg.default.events (
  level VARCHAR,
  event_time TIMESTAMP(6),
  message VARCHAR,
  call_stack ARRAY(VARCHAR)
)
WITH (partitioning = ARRAY['day(event_time)']);
```
Insert a few rows
```sql
INSERT INTO iceberg.default.events
VALUES ('ERROR', timestamp '2021-04-01 12:00:00.000001', 'Oh noes',
        ARRAY['Exception in thread "main" java.lang.NullPointerException']);
```
```sql
INSERT INTO iceberg.default.events
VALUES
  ('ERROR', timestamp '2021-04-01 12:00:00.000001', 'Oh noes',
   ARRAY['Exception in thread "main" java.lang.NullPointerException']),
  ('ERROR', timestamp '2021-04-02 15:55:55.555555', 'Double oh noes',
   ARRAY['Exception in thread "main" java.lang.NullPointerException']),
  ('WARN', timestamp '2021-04-02 00:00:11.112222', 'Maybeh oh noes?',
   ARRAY['Bad things could be happening??']);
```
Query the data
```sql
SELECT * FROM iceberg.default.events WHERE event_time < timestamp '2021-04-02';
```
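Since each write created a new Iceberg snapshot, we can also look at the table's history. A hedged sketch using the Trino Iceberg connector's metadata tables (the `"table$suffix"` syntax is a connector feature; exact columns may differ across Trino versions, and the IDs in your cluster will differ):

```sql
-- List the snapshots recorded for the table
SELECT snapshot_id, committed_at, operation
FROM iceberg.default."events$snapshots";

-- Inspect the data files behind the current snapshot
SELECT file_path, record_count
FROM iceberg.default."events$files";
```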
Inspect what MariaDB stores
```
MariaDB [(none)]> select * from metastore_db.TABLE_PARAMS;
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
| TBL_ID | PARAM_KEY                  | PARAM_VALUE                                                                                       |
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
|      1 | EXTERNAL                   | TRUE                                                                                              |
|      1 | metadata_location          | s3a://datalake/warehouse/events/metadata/00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json |
|      1 | numFiles                   | 6                                                                                                 |
|      1 | previous_metadata_location | s3a://datalake/warehouse/events/metadata/00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json |
|      1 | table_type                 | iceberg                                                                                           |
|      1 | totalSize                  | 13167                                                                                             |
|      1 | transient_lastDdlTime      | 1633849635                                                                                        |
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)
```
Inspect what MinIO stores
```
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/metadata
[2021-10-10 14:32:25 CST] 1.9KiB 00000-1b2ca292-85c7-435c-b884-5218d74d2605.metadata.json
[2021-10-10 15:02:04 CST] 1.9KiB 00000-21e6307c-8b63-41f6-987c-fbc86de000f9.metadata.json
[2021-10-10 15:07:14 CST] 1.9KiB 00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json
[2021-10-10 13:45:22 CST] 1.9KiB 00000-b88a389c-e3ca-491b-bd9e-e901a01f82fe.metadata.json
[2021-10-10 13:58:33 CST] 1.9KiB 00000-fe989767-841a-4412-a2e3-bc094e148ab1.metadata.json
[2021-10-10 15:09:51 CST] 2.8KiB 00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json
[2021-10-10 15:09:51 CST] 5.9KiB a10aa346-3b55-437b-93a3-20219f7d8ad1-m0.avro
[2021-10-10 15:07:14 CST] 3.3KiB snap-1941033720141809751-1-2b0c70bb-ab8d-4db8-b377-9722e4926e08.avro
[2021-10-10 15:09:51 CST] 3.4KiB snap-4869152224340281922-1-a10aa346-3b55-437b-93a3-20219f7d8ad1.avro
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data
[2021-10-10 23:49:32 CST]     0B event_time_day=2021-04-01/
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data/event_time_day=2021-04-01/
```
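The event_time_day=2021-04-01/ prefix in the data listing is the day(event_time) partition transform materialized on object storage. As a hedged cross-check (again relying on the connector's metadata-table feature, whose column layout may vary by Trino version), the same partition can be inspected from Trino:

```sql
-- Show the partitions Trino sees for the events table
SELECT * FROM iceberg.default."events$partitions";
```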
Stopping the cluster
Stop the ECS instances
1. In no-charge-when-stopped mode, compute resources (vCPU and memory), the fixed public IP, and bandwidth are no longer billed.
2. Still billed: the system disk, data disks, elastic public IP and fixed-bandwidth plans, and paid images.
Stop the load balancer
After stopping everything I noticed the load balancer is still billed, and ECS is billed at only about 25%; that already saves quite a bit.
Starting the cluster again
Start the ECS instances
Start the load balancer
Once both are running, you can see the instances in the cluster come back up.
Author: NativeSailor
Link: https://juejin.cn/post/7025177076952989726