Log Persistence

Allows users to persist job submission logs. By default, the job log is retrieved from the underlying pod; once the pod is deleted, the log is no longer accessible to the user.

Prerequisites

The PrimeHub Store feature must be enabled

Features

  • The job log remains accessible even after the underlying pod is deleted

  • Supports storing logs on S3 or GCS

  • Supports a configurable flush interval and maximum buffer size

  • Supports txt and gzip formats

Configuration

To enable log persistence, set store.enabled and store.logPersistence.enabled to true, as in the example after the table below.

| Path | Description | Default Value |
| --- | --- | --- |
| store.enabled | Whether the PrimeHub store is enabled | false |
| store.logPersistence.enabled | Whether log persistence is enabled | true |
| fluentd.flushAtShutdown | Whether fluentd flushes buffered logs at shutdown | false |
| fluentd.flushInterval | How often buffered logs are flushed to the store | 3600s |
| fluentd.chunkLimitSize | Maximum size of a buffer chunk | "256m" |
| fluentd.storeAs | Format of the stored log (txt or gzip) | txt |
| fluentd.* | The other fluentd settings | |
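
For reference, a minimal sketch of how these settings might look in a Helm values file; the fluentd values simply restate the defaults from the table, and the exact layout of the surrounding file depends on your PrimeHub deployment.

```yaml
# Minimal sketch of the relevant values; surrounding layout depends on your deployment.
store:
  enabled: true                 # enable the PrimeHub store
  logPersistence:
    enabled: true               # enable job log persistence

fluentd:
  flushAtShutdown: false        # defaults restated from the table above
  flushInterval: 3600s
  chunkLimitSize: "256m"
  storeAs: txt
```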

Design

  • Fluentd: The log collector that collects pod logs into the PrimeHub store

  • GraphQL server: The log endpoint that retrieves the log from the PrimeHub store if the pod no longer exists

  • Console: Gets the log from the GraphQL server

Fluentd

Fluentd is based on the fluentd kubernetes daemonset. The behavior is:

  • Get the logs from /var/log/containers

  • Get the pod metadata from the Kubernetes API

  • Filter the logs by label

  • Flush the logs to MinIO via the S3 plugin (see the configuration sketch below)
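
As a rough illustration only, the output side of such a pipeline could look like the fluent-plugin-s3 match section below. This is a hand-written sketch, not the configuration actually rendered by the PrimeHub chart; the tag, endpoint, bucket, and path are placeholders, and a label filter (not shown) would precede this match.

```
# Placeholder match section; a filter keeping only labeled job pods would run before it.
<match kubernetes.**>
  @type s3                                  # fluent-plugin-s3 output
  s3_endpoint http://primehub-minio:9000    # placeholder MinIO endpoint
  s3_bucket primehub                        # placeholder bucket
  path logs/phjob/                          # prefix described later on this page
  store_as text                             # writes .txt objects; maps to fluentd.storeAs
  <buffer>
    flush_interval 3600s                    # maps to fluentd.flushInterval
    chunk_limit_size 256m                   # maps to fluentd.chunkLimitSize
    flush_at_shutdown false                 # maps to fluentd.flushAtShutdown
  </buffer>
</match>
```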

GraphQL

  • Enhance the original log endpoint

  • Add a new query parameter persist=true. If it is set to true, the log is retrieved from the persistent log

Console

  • The log UI first tries to get the log from the pod with persist=false

  • If the response is a 404, it falls back to the persistent log with persist=true (see the sketch below)
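
To make the fallback concrete, here is a minimal TypeScript sketch of the client-side flow; the /logs path and the podName parameter are assumptions for illustration, since only the persist flag is documented here.

```typescript
// Sketch of the console-side fallback described above. The endpoint path and
// parameter names other than `persist` are hypothetical placeholders.
async function fetchJobLog(podName: string): Promise<string> {
  const url = (persist: boolean) =>
    `/logs?podName=${encodeURIComponent(podName)}&persist=${persist}`;

  const live = await fetch(url(false));        // try the running pod first
  if (live.status !== 404) {
    return live.text();                        // pod still exists: live log
  }
  const persisted = await fetch(url(true));    // pod is gone: persisted log
  return persisted.text();
}
```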

Prefix in PrimeHub store

  • The prefix for log persistence is /logs

  • The output of one job is stored under /logs/phjob/<phjob>/<date>/ (e.g. /logs/hub/job-202006030120-gxpavy/2020-06-03/log-*.txt)

Limitation

The default flush interval of fluentd is 1 hour, so the persistent log may lag the live pod log by up to 1 hour. The flush interval can be shortened in the configuration; however, this generates more files in the storage and increases query overhead.
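
For example, an override like the following (a 5-minute interval chosen purely for illustration) shortens the delay at the cost of more, smaller objects in the store:

```yaml
fluentd:
  flushInterval: 300s   # illustrative value; shorter flushes mean fresher logs but more objects
```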
