
Job Monitoring

Monitoring shows how many resources are utilized by a job.

Features

  1. Monitoring: In the job detail page, a Monitoring tab shows resource consumption metrics.

  2. Resources: It monitors three kinds of resources: CPU, memory, and GPU.

Configuration

The feature can be enabled or disabled via a Helm value (it is enabled by default):

jobSubmission:
  monitoring:
    enabled: true
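For example, to toggle the feature at upgrade time with Helm's `--set` flag (the release and chart names `primehub primehub/primehub` here are assumptions; adjust them to your deployment):

```shell
# Disable job monitoring while keeping all other values unchanged
helm upgrade primehub primehub/primehub \
  --reuse-values \
  --set jobSubmission.monitoring.enabled=false
```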

Design

User Journey

Monitor the running job

  1. A user submits a job and enters the job detail page.

  2. In the job page, switching to the Monitoring tab shows the current CPU/GPU/memory metrics over a given timespan. The metrics refresh every 10 seconds.

  3. Clicking a different timespan (15min, 1hr, 3hrs, lifetime) switches the chart to that range.

  4. Once the job completes, the metrics stop updating and the final state is shown.

See the metrics for a completed job

  1. Go to a completed job.

  2. Go to the Monitoring tab.

  3. It shows the latest CPU/GPU/memory metrics, and the different timespans are still available to select.

See a warning when PHFS is not enabled

  1. In the Monitoring tab, the message "feature not enabled, please contact admin" is shown if the underlying prerequisite (PHFS) is not enabled.

Architecture

  1. For every running job, an agent collects the CPU/GPU/memory metrics.

    • The Monitoring page refreshes itself every 10 seconds.

  2. The agent periodically flushes the current report to /phfs/jobArtifacts/<jobname>/.metadata/monitoring.

  3. GraphQL exposes a report endpoint for the job that returns the current monitoring report to the client.

  4. The console periodically queries GraphQL to get the current metrics of the job.
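The console side of the flow above can be sketched as a simple polling loop. The endpoint URL, the `phJob` query, and the `monitoring` field name are assumptions for illustration; the real PrimeHub GraphQL schema may differ.

```python
import json
import time
import urllib.request

# Hypothetical GraphQL endpoint -- replace with your PrimeHub URL.
GRAPHQL_URL = "http://primehub.local/primehub/graphql"


def build_monitoring_query(job_name):
    """Build a GraphQL request body asking for a job's monitoring report.

    The query and field names here are illustrative, not the real schema.
    """
    return {
        "query": """
            query JobMonitoring($where: PhJobWhereUniqueInput!) {
              phJob(where: $where) { monitoring }
            }
        """,
        "variables": {"where": {"id": job_name}},
    }


def poll_monitoring(job_name, interval=10):
    """Poll GraphQL every `interval` seconds, mirroring the console's refresh."""
    while True:
        payload = json.dumps(build_monitoring_query(job_name)).encode()
        req = urllib.request.Request(
            GRAPHQL_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            report = json.load(resp)
        print(report)
        time.sleep(interval)
```

The 10-second interval matches the Monitoring page's refresh rate described above.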

Components

Agent

  • Collect cpu/memory/gpu metrics.

  • Keep the metrics for the different intervals in memory.

  • Flush the monitoring metrics report to a file periodically (by default, a monitoring file in the working directory).
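A minimal sketch of the agent's collect-and-flush loop, assuming a JSON report format (the real agent's on-disk format and metric sources are not specified here):

```python
import json
import os
import time

# Default report location: a `monitoring` file in the working directory.
REPORT_PATH = "monitoring"


def read_metrics():
    """Placeholder metric reader; a real agent would sample cgroups/NVML."""
    load1, _, _ = os.getloadavg()
    return {"cpu": load1, "memory": 0, "gpu": 0, "ts": int(time.time())}


def flush_report(samples, path=REPORT_PATH):
    """Write the accumulated samples to the report file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(samples, f)
    os.replace(tmp, path)  # atomic on POSIX, so readers never see a partial file


def run_agent(interval=10, flushes=3):
    """Collect a sample and flush the report every `interval` seconds."""
    samples = []  # kept in memory, as described above
    for _ in range(flushes):
        samples.append(read_metrics())
        flush_report(samples)
        time.sleep(interval)
```

The atomic rename matters because GraphQL may read the report file while the agent is writing it.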

Controller

  • Inject the agent into the job container via an init container.

  • Run the agent at the start of the job in the right working directory.

  • Terminate (kill) the agent when the job command terminates.

GraphQL

  • Query the report from the store and return it to the client (adds a monitoring field to the artifact resource).

Client

  • Query GraphQL and render the report periodically while the job is running.

  • Query GraphQL and render the final report once the job completes.

Data Format

/phfs/jobArtifacts/<jobname>/.metadata/monitoring

  • There are 4 timespans: 15m, 1h, 3h, and lifetime.

  • Each timespan has its own sampling interval:

    • 15m: 10s → 15 * 60 / 10 = 90 points

    • 1h: 30s → 60 * 60 / 30 = 120 points

    • 3h: 2m → 3 * 60 * 60 / 120 = 90 points

    • Lifetime: 5m → the point count depends on job duration (4 weeks by default; configurable via a Helm value)

      • examples

        • 1 day → 24 * 60 * 60 / 300 = 288 points

        • 4 weeks → 4 * 7 * 24 * 60 * 60 / 300 = 8064 points. If each point uses 1 KB, the bandwidth estimation is 8064 / 1024 / 8 ≈ 0.98 Mb/s.
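The point counts above all follow from the same span-over-interval division; a quick check:

```python
def points(span_seconds, interval_seconds):
    """Number of data points a timespan holds at a given sampling interval."""
    return span_seconds // interval_seconds

# Counts from the data-format table above
assert points(15 * 60, 10) == 90               # 15m at 10s
assert points(60 * 60, 30) == 120              # 1h at 30s
assert points(3 * 60 * 60, 120) == 90          # 3h at 2m
assert points(24 * 60 * 60, 300) == 288        # 1-day lifetime at 5m
assert points(4 * 7 * 24 * 60 * 60, 300) == 8064  # 4-week lifetime at 5m
```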

Reference



GitHub repo of PrimeHub Monitoring Agent