Motivation

Kubernetes suddenly came to mind, so I took a closer look. The more I looked, the more it seemed like an abstraction of a Linux host, ending up as something close to a framework for a Linux host.

Architecture: take the host apart, then multiply it

In the beginning, a host is a single computer that has:

  1. applications
  2. files (configuration files)
  3. an interface that handles connections (a load balancer or a firewall)

Applications & the virtual host

Both share the resources of the same physical host.


But as applications multiply they start to interfere with one another, so on the same physical host we separate them with chroot and network namespaces. Once separated, they can no longer talk through shared memory or the filesystem, so they communicate with each other over the network instead. This is a container, or a pod.


If every application already talks to the others over the network, does it still need to live on the same physical host? No, so we can add more physical hosts and run applications on different machines. This is a cluster.

From the user's point of view, all of these physical hosts provide the service, so a cluster can also be regarded as a virtual host.

If the same application runs on different hosts and each copy provides the same service, that is scaling. If subtasks run on different hosts to produce a combined result, that is parallelism.


With this many physical hosts, how are pods distributed among them? In Kubernetes that is the job of a Deployment.
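
A minimal command-line sketch of the same idea (the deployment name hello and the nginx image are placeholders picked for illustration):

kubectl create deployment hello --image=nginx
kubectl scale deployment hello --replicas=3   # run three copies, spread across the cluster's nodes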

The interface that handles connections

Originally, on a single host, how do we decide which application a connection should be routed to? By port: HTTP is 80, HTTPS is 443. An application calls bind to attach itself to a specific port.


But now every application is isolated, so each one needs:

  1. an IP (each application sits in a different network namespace)
  2. a port (at layer 4)

Besides bind, we also need iptables, i.e. the firewall, to control where IP packets are sent.
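
A minimal sketch of such a rule (the ports and the namespace address 10.0.0.2 are made up for illustration):

# steer traffic arriving at the host's port 8080 to the application listening on 10.0.0.2:80
iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 10.0.0.2:80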

But all of this is still on a single physical host.


What happens when it becomes a virtual host like a cluster?

Basically a master-worker architecture is used: one node receives the connection, dispatches it to some host in the cluster, and waits for the data to come back.

That is where the master node comes from.


The next question is: how do we consume a service?

At this point the master node knows the service's IP and port.

So to make the service reachable, the options are:

  1. port forwarding
  2. reverse proxy

Port forwarding is a Service of type NodePort; the reverse proxy is ClusterIP (an Ingress can be thought of as a load balancer with extras).

What is the difference between the NodePort and LoadBalancer Service types? Where the port is opened: NodePort opens the port on the node itself, while LoadBalancer first provisions a separate machine and then opens the port on it.
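
A minimal sketch of a NodePort Service (the name and port numbers are placeholders; the selector reuses the app: my-deployment label from the Deployment example further below):

apiVersion: v1
kind: Service
metadata:
  name: hello-nodeport
spec:
  type: NodePort
  selector:
    app: my-deployment
  ports:
  - port: 80          # the Service's own (ClusterIP) port
    targetPort: 3000  # the Pod's port
    nodePort: 30080   # the port opened on every node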

A LoadBalancer can be seen as taking the firewall part of the master node and running it separately.

Why does a Service also get a ClusterIP? So that a Pod can be reached directly through the Service's name: the name is registered in kube-dns, so an nslookup on the Service name returns an IP, just as if a dedicated server had been deployed.
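
A quick way to see this from inside the cluster, using a throwaway busybox Pod (my-service is the Service name from the example below):

kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup my-service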

Both approaches make a service reachable, so compared with a physical host, why does Kubernetes add Ingress on top? First, every application now has its own IP. Second, common services bind their network protocol to a well-known port, like 80 or 443.

So an Ingress can be imagined as DNS, except that it translates a path into an IP & port, while DNS translates a domain name into an IP.

(Notice that on a website a path is mapped to a controller, so the role an Ingress plays is really the same as a website's router.)
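
A minimal sketch of such a path-to-Service mapping (the host, path and Ingress name are placeholders; my-service and port 80 reuse the Service example below):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress
spec:
  rules:
  - host: example.com       # placeholder domain
    http:
      paths:
      - path: /api          # this path...
        pathType: Prefix
        backend:            # ...is routed to this Service and port
          service:
            name: my-service
            port:
              number: 80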

What about a Service without a selector (one that does not route traffic into the cluster)? Such a Service is just bind on a physical host, so it can also direct traffic somewhere else, for example:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
---
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service
subsets:
  - addresses:
      - ip: 192.0.2.42
    ports:
      - port: 9376

Namespaces act like different clusters, which is to say different virtual hosts.
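
A minimal sketch (the namespace name staging is arbitrary):

kubectl create namespace staging
kubectl get pods -n staging   # resources are looked up per namespace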

Deploying a service

The flow is:

  1. a Deployment creates the Pods
  2. a Service binds a port to the selected Pods (a command-line sketch follows this list)
  3. an Ingress binds the path to the port (the Service)
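
A rough command-line sketch of step 2 (hello-deployment and container port 3000 come from the Deployment example below; the Service port 80 is arbitrary):

kubectl expose deployment hello-deployment --port=80 --target-port=3000 --type=NodePort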

Selecting something

Selection is done with labels, for example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-deployment
spec:
  replicas: 3
  selector:
    matchLabels: # here
      app: my-deployment
    matchExpressions:
       - {key: tier, operator: In, values: [cache]}
       - {key: environment, operator: NotIn, values: [dev]}
  template:
    metadata:
      labels: # here (the template labels must satisfy the selector above)
        app: my-deployment
        tier: cache
    spec:
      containers:
      - name: my-pod
        image: zxcvbnius/docker-demo:latest
        ports:
        - containerPort: 3000

To select things from the command line:

kubectl get pods -l environment=production,tier=frontend

kubectl get pods -l 'environment in (production),tier in (frontend)'

kubectl get pods -l 'environment in (production, qa)'

kubectl get pods -l 'environment,environment notin (frontend)'

You can also use go-template to customize what gets displayed:

[root@node root]# kubectl get pods --all-namespaces -o go-template --template='{{range .items}}{{.metadata.uid}}
{{end}}'
0313ffff-f1f4-11e7-9cda-40f2e9b98448
ee49bdcd-f1f2-11e7-9cda-40f2e9b98448
f1e0eb80-f1f2-11e7-9cda-40f2e9b98448

[root@node root]# kubectl get pods --all-namespaces -o go-template --template='{{range .items}}{{printf "|%-20s|%-50s|%-30s|\n" .metadata.namespace .metadata.name .metadata.uid}}{{end}}'
|console             |console-4d2d7eab-1377218307-7lg5v                 |0313ffff-f1f4-11e7-9cda-40f2e9b98448|
|console             |console-4d2d7eab-1377218307-q3bjd                 |ee49bdcd-f1f2-11e7-9cda-40f2e9b98448|
|cxftest             |ipreserve-f15257ec-3788284754-nmp3x               |f1e0eb80-f1f2-11e7-9cda-40f2e9b98448|

Horizontal Pod Autoscaling

A Deployment controls the number of Pods; a HorizontalPodAutoscaler controls the Deployment's replica count, i.e. it scales the number of Pods up and down automatically.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: helloworld-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helloworld-deployment
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 50
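
The same autoscaler can be created imperatively; this should be equivalent to the manifest above:

kubectl autoscale deployment helloworld-deployment --min=2 --max=5 --cpu-percent=50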

DaemonSet

If a pod has to run on every Node, like a monitoring or log agent, use a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Managing Nodes

Disable a node: kubectl drain {node_name} (evicts its Pods and marks it unschedulable). Re-enable a node: kubectl uncordon {node_name}.
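
A typical maintenance round might look like this (node-1 is a placeholder; the extra flags are what drain usually needs on recent kubectl versions when DaemonSets and emptyDir volumes are present):

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data   # evict Pods and mark the node unschedulable
kubectl uncordon node-1                                           # allow scheduling again after maintenance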

Limiting resources

A Pod itself can declare how many resources it requests at minimum and is limited to at most (see the requests/limits in the Pod example below).

Of course limits can also be imposed from the outside:

apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-high
  spec:
    hard:
      cpu: "1000"
      memory: 200Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator : In
        scopeName: PriorityClass
        values: ["high"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-medium
  spec:
    hard:
      cpu: "10"
      memory: 20Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator : In
        scopeName: PriorityClass
        values: ["medium"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-low
  spec:
    hard:
      cpu: "5"
      memory: 10Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator : In
        scopeName: PriorityClass
        values: ["low"]
---
apiVersion: v1
kind: Pod
metadata:
  name: high-priority
spec:
  containers:
  - name: high-priority
    image: ubuntu
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10;done"]
    resources:
      requests:
        memory: "10Gi"
        cpu: "500m"
      limits:
        memory: "10Gi"
        cpu: "500m"
  priorityClassName: high

Besides limits on Pods, a namespace can also be limited:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4

---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    configmaps: "10"
    persistentvolumeclaims: "4"
    pods: "4"
    replicationcontrollers: "20"
    secrets: "10"
    services: "10"
    services.loadbalancers: "2"

Role-based access control

A Role defines the resources that can be operated on within a single namespace; a ClusterRole defines resources that can be operated on in any namespace (cluster-wide).

Resources live under different API groups.

A RoleBinding ties a Role to a user, a group, or a service account.
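
A minimal sketch (the names pod-reader, read-pods and my-app are placeholders): a Role that can read Pods in the default namespace, bound to a service account.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader        # placeholder name
rules:
- apiGroups: [""]         # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods         # placeholder name
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-app            # placeholder service account
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io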

Storage

Before we can use files we need a filesystem; before we can use a filesystem we need to make a partition; before we can make a partition we need a disk.

Mapping that onto Kubernetes:

  1. the filesystem is a Volume
  2. making a partition is a PersistentVolumeClaim
  3. a virtual disk (with special properties) is a StorageClass
  4. the physical disk is a PersistentVolume (Docker's volume)

Here are some examples reposted from others:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-west-2
reclaimPolicy: Delete # [Delete | Retain]
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce # [ReadWriteOnce | ReadOnlyMany | ReadWriteMany]
  resources:
    requests:
      storage: 8Gi
  storageClassName: standard # Ref
---
apiVersion: v1
kind: Pod
metadata:
  name: apiserver
  labels:
    app: apiserver
    tier: backend
spec:
  containers:
  - name: my-pod
    image: zxcvbnius/docker-demo
    ports:
    - containerPort: 3000
    volumeMounts: # see the volumes section below
    - name: my-pvc
      mountPath: "/tmp"
  volumes:
  - name: my-pvc
    persistentVolumeClaim:
      claimName: myclaim # Ref

An example with a PersistentVolume (note that here the PVC does not specify a storageClassName; the PVC takes its storage directly from the PV):

apiVersion: v1
kind: PersistentVolume      # the object kind is PV
metadata:
  name: pv001               # PV name
spec:
  capacity:
    storage: 2Gi            # size
  accessModes:
  - ReadWriteOnce           # access mode
  hostPath:                 # backed by the host's /tmp directory
    path: /tmp
---
apiVersion: v1
kind: Pod                   # a Pod that tries to mount the Volume
metadata:
  name: pvc-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    volumeMounts:           # mount the Volume named volume-pv at /usr/share/nginx/html
      - name: volume-pv
        mountPath: /usr/share/nginx/html
  volumes:
  - name: volume-pv         # declare a Volume object named volume-pv
    persistentVolumeClaim:  # bound to the PVC named pv-claim
      claimName: pv-claim

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pv-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi          # request 1Gi of capacity

Does a filesystem always have to start from a disk? Is there nothing ready-made?

Yes, there is:

  1. files on the physical host (emptyDir, hostPath)
  2. NFS

apiVersion: v1
kind: Pod
metadata:
  name: apiserver
spec:
  containers:
  - name: apiserver
    image: zxcvbnius/docker-demo
    volumeMounts:
    - mountPath: /tmp
      name: tmp-volume
    imagePullPolicy: Always
  volumes:
  - name: tmp-volume
    hostPath:
      path: /tmp
      type: Directory          # hostPath (not nfs) supports the type field
  - name: nfs-volumes
    nfs:
      server: {YOUR_NFS_SERVER_URL}
      path: /
  - name: cache-volume
    emptyDir: {}

Looking at where the data comes from, the sources can be grouped into:

  1. network: NFS, StorageClass & PersistentVolumeClaim
  2. the physical host: emptyDir, hostPath

Then does the virtual host have files (data) of its own? Yes: ConfigMap and Secret.

Since a Secret holds sensitive data, it is also often passed in through environment variables.

Note that both ConfigMap and Secret are hash tables, i.e. values are fetched by key.
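
A minimal sketch of creating the ConfigMap and Secret that the Pod below consumes (the names come from that example; the file content and the credential values here are made up):

kubectl create configmap nginx-conf --from-file=my-nginx.conf
# the literal values below are placeholders
kubectl create secret generic demo-secret-from-yaml --from-literal=username=admin --from-literal=password=changeme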

apiVersion: v1
kind: Pod
metadata:
  name: apiserver
  labels:
    app: webserver
    tier: backend
spec:
  containers:
  - name: nodejs-app
    image: zxcvbnius/docker-demo
    ports:
    - containerPort: 3000
  - name: nginx
    image: nginx:1.13
    ports:
    - containerPort: 80
    volumeMounts:
    - name: nginx-conf-volume
      mountPath: /etc/nginx/conf.d
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: demo-secret-from-yaml
          key: username
    - name: SECRET_PASSWORD
      valueFrom:
        secretKeyRef:
          name: demo-secret-from-yaml
          key: password
  volumes:
  - name: nginx-conf-volume
    configMap:
      name: nginx-conf
      items:
      - key: my-nginx.conf
        path: my-nginx.conf
  - name: secret-volume
    secret:
      secretName: demo-secret-from-yaml

Just running a command

Run once: Job

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

Jobs can also run in parallel; see the official docs.
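
Roughly, parallelism is controlled with the completions and parallelism fields; a sketch based on the Job above (the numbers are arbitrary):

apiVersion: batch/v1
kind: Job
metadata:
  name: pi-parallel
spec:
  completions: 8     # 8 successful Pods in total
  parallelism: 2     # at most 2 Pods running at the same time
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4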

Run repeatedly: CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: alpine
            args:
            - /bin/sh
            - -c
            - echo "Hi, current time is $(date)"
          restartPolicy: OnFailure

Ref

  1. Job
  2. NodePort & LoadBalancer
  3. Official docs (官網)
  4. 30天鐵人賽
  5. go-template