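Problem

The company's Airflow originally ran as a single-machine application based on LocalExecutor. As the business grew, the number of tasks kept increasing, and the machine had to be upgraded again and again. Data warehouse tasks, however, are unusual in that they cluster in the early morning, so a large share of the machine's resources sat idle during the day. For these reasons we redeployed Airflow in containers, in order to keep the Scheduler available and to scale the Worker group up and down automatically.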

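Containerization

Put simply, containerization packages a program together with the dependencies it needs into a box. Whenever the program is needed later, you take the box out and use it, with no reassembly required. It has the following characteristics: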

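Airflow image build:

The Airflow image used by the company consists of two layers. The first layer is a base Airflow image: it resolves Airflow's dependencies and produces a minimal usable Airflow image. The second layer builds on the first and applies company-specific customizations, such as changing authentication and configuring the mail service.

[Figure: the two-layer Airflow image]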
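The two-layer layout described above could be sketched roughly as follows. This is only an illustration under assumptions: the image names, the `python:3.8-slim` base, the `airflow.cfg` file and its target path are all hypothetical and are not taken from the company's actual build. The script writes the two Dockerfiles and prints the build commands rather than running them.

```shell
#!/bin/sh
set -eu

# All image names, tags and file names below are hypothetical.
BASE_IMAGE="example/airflow-base:latest"
COMPANY_IMAGE="example/airflow-company:latest"

work_dir=$(mktemp -d)

# Layer 1: a minimal usable Airflow image (dependencies only).
cat > "$work_dir/Dockerfile.base" <<'EOF'
FROM python:3.8-slim
RUN pip install --no-cache-dir apache-airflow
EOF

# Layer 2: company customizations stacked on top of layer 1.
cat > "$work_dir/Dockerfile.company" <<EOF
FROM $BASE_IMAGE
# Hypothetical customization: ship a company-specific airflow.cfg
# (authentication backend, mail settings, and so on).
COPY airflow.cfg /usr/local/airflow/airflow.cfg
EOF

# Print the build commands instead of running them, so the sketch has no
# side effects; drop the "echo" to build for real.
echo docker build -f "$work_dir/Dockerfile.base" -t "$BASE_IMAGE" "$work_dir"
echo docker build -f "$work_dir/Dockerfile.company" -t "$COMPANY_IMAGE" "$work_dir"
```

Keeping the dependency layer separate means the company layer can be rebuilt quickly when only configuration changes.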

Airflow deployment flow:

[Figure: Airflow deployment flow]

DW project configuration:

# Dockerfile

FROM alpine:3.4

COPY airflow /usr/local/airflow-projects/dw-code

# _initContainers.yaml

- name: dw-code
  image: "registry.git.saybot.net/data-warehouse/dw/{{ .Values.project.env }}:latest"
  imagePullPolicy: {{ .Values.image.pullPolicy }}
  command:
    - /bin/sh
    - '-exc'
  args:
    - 'mv -f /usr/local/airflow-projects/dw-code {{ .Values.dags.dag_path }}/;
      mkdir -p {{ .Values.dags.dag_path }}/dw-code/config ;
      for conf in /tmp/dw-code/config/*; do cat $conf >> {{ .Values.dags.dag_path }}/dw-code/config/`basename $conf`; done;
      '
  volumeMounts:
    - name: dags-data
      mountPath: {{ .Values.dags.dag_path }}
    - name: airflow-secret
      mountPath: /tmp/dw-code/config
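
The `args` of the init container above do three things: move the project code into the DAG volume, create the `config` directory, and append each file mounted from the secret to the config file of the same name. A minimal sketch of that logic as a standalone script follows, so it can be tried outside a Pod. The function name `merge_dw_config`, the temporary directory layout and the sample file contents are all hypothetical.

```shell
#!/bin/sh
set -eu

# merge_dw_config CODE_DIR SECRET_DIR DAG_PATH
# Mirrors the three steps of the init container; the paths are parameters
# here so the logic can run anywhere.
merge_dw_config() {
    code_dir=$1
    secret_dir=$2
    dag_path=$3

    # 1. Move the project code into the shared DAG volume.
    mv -f "$code_dir" "$dag_path/"

    # 2. Make sure the config directory exists.
    target="$dag_path/$(basename "$code_dir")/config"
    mkdir -p "$target"

    # 3. Append each mounted secret file to the file of the same name.
    for conf in "$secret_dir"/*; do
        [ -f "$conf" ] || continue
        cat "$conf" >> "$target/$(basename "$conf")"
    done
}

# Demo in a throwaway directory (all names and values are made up).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/image/dw-code/config" "$demo_root/secret" "$demo_root/dags"
printf 'env=dev\n' > "$demo_root/image/dw-code/config/app.conf"
printf 'password=example\n' > "$demo_root/secret/app.conf"

merge_dw_config "$demo_root/image/dw-code" "$demo_root/secret" "$demo_root/dags"
cat "$demo_root/dags/dw-code/config/app.conf"
```

Because the loop appends with `>>` rather than overwriting, settings shipped in the image are kept and the secret values are added after them.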

# .gitlab-ci.yml

stages:
  - build # build image
  - deploy # trigger airflow deployment

variables:
  # CI_COMMIT_SHA replaces the deprecated CI_BUILD_REF
  IMAGE_PER_BRANCH_COMMIT: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHA

.default: &BUILD
  image: docker:latest
  stage: build
  services:
    - name: docker:dind
      command: ["--registry-mirror", "https://ixceb9no.mirror.aliyuncs.com"]
  variables:
    DOCKER_DRIVER: overlay2
    # CI_COMMIT_REF_NAME replaces the deprecated CI_BUILD_REF_NAME
    IMAGE_PER_BRANCH: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest
  before_script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY
  script:
    - echo $IMAGE_PER_BRANCH_COMMIT
    - docker pull ${IMAGE_PER_BRANCH} || true
    - docker build --pull --cache-from ${IMAGE_PER_BRANCH} -t ${IMAGE_PER_BRANCH_COMMIT} -t ${IMAGE_PER_BRANCH} --build-arg CI_JOB_TOKEN=$CI_JOB_TOKEN .
    - docker push ${IMAGE_PER_BRANCH_COMMIT}
    - docker push ${IMAGE_PER_BRANCH}
  tags:
    - docker
  except:
    - tags

build_dev:
  <<: *BUILD
  only:
    - dev

.deploy: &deploy
  stage: deploy
  image: appropriate/curl:latest
  script:
  - curl --request POST --form "token=$CI_JOB_TOKEN" --form "ref=$AIRFLOW_CI_BRANCH" https://git.saybot.net/api/v4/projects/1711/trigger/pipeline
  tags:
    - docker

deploy_dev:
  <<: *deploy
  variables:
    AIRFLOW_CI_BRANCH: k8s-ci
  only:
    - dev