Nexus Distributed System
The Nexus subsystem is the runtime fabric that runs on your cluster. It is responsible for node discovery, job assignment, gradient aggregation, and high availability via leader election.
Nexus is the backbone of distributed orchestration in LoomOS. It ensures reliable, scalable, and fault-tolerant operation of all jobs and workers in the cluster.
Architecture Overview
Key Concepts
- Master Node: Handles cluster coordination, job scheduling, and health monitoring
- Worker Nodes: Execute assigned jobs, report metrics, and handle local aggregation
- Consensus Group: Raft-based leader election and state replication
- Parameter Server: Aggregates gradients and synchronizes model parameters
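To make the Parameter Server role concrete, here is a minimal sketch of synchronous gradient averaging followed by a plain SGD step. It is illustrative only: the function names, the dict-of-lists gradient layout, and the SGD update rule are assumptions, not the Nexus implementation.

```python
from typing import Dict, List

# Assumed layout: parameter name -> flat list of gradient values.
Gradients = Dict[str, List[float]]


def average_gradients(worker_grads: List[Gradients]) -> Gradients:
    """Element-wise mean of the gradients reported by each worker."""
    if not worker_grads:
        raise ValueError("need gradients from at least one worker")
    n = len(worker_grads)
    averaged: Gradients = {}
    for name in worker_grads[0]:
        # zip(*...) walks the same position across all workers.
        columns = zip(*(g[name] for g in worker_grads))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


def apply_update(params: Gradients, grads: Gradients, lr: float) -> Gradients:
    """Plain SGD step: param <- param - lr * grad."""
    return {
        name: [p - lr * g for p, g in zip(values, grads[name])]
        for name, values in params.items()
    }
```

A real parameter server would also handle stragglers, partial updates, and tensor types; the sketch only shows the aggregation step itself.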
Advanced Features
- Multi-master failover with Raft consensus
- Dynamic worker scaling and auto-restart
- Secure gRPC and REST APIs for management
- Real-time metrics and event streaming
Best Practices
- Deploy at least 3 master nodes for high availability
- Tune heartbeat and election timeouts for your network
- Use secure endpoints and rotate credentials regularly
- Monitor the /metrics and /healthz endpoints for cluster health
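The recommendation of at least 3 master nodes follows from Raft's majority rule: the group can make progress only while more than half of its members are reachable. The short sketch below works through that arithmetic; the helper names are illustrative.

```python
def quorum_size(masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """How many masters can fail while a majority is still available."""
    return masters - quorum_size(masters)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} masters: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 masters raises the quorum from 2 to 3 without tolerating any additional failure, which is why odd group sizes are the usual choice.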
Troubleshooting
- Split-brain: Ensure odd number of masters and reliable network
- Worker flapping: Check resource limits and node health
- Job stuck: Inspect master logs and job queue state
- Slow aggregation: Profile network and parameter server throughput
Example: Full cluster startup
Distributed Training Over the Internet (WAN)
LoomOS Nexus supports distributed training across data centers, cloud regions, and even over the public internet. This enables hybrid and multi-site clusters for large-scale, collaborative AI workloads.
WAN Deployment Architecture
- Multi-region clusters: Deploy master and worker nodes in different cloud regions or on-premises sites.
- Global job scheduling: Nexus can assign jobs to the nearest or most cost-effective site.
- Cross-site parameter sync: Parameter server and gradient aggregation support WAN-optimized protocols.
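As a rough illustration of "nearest or most cost-effective" placement, the sketch below scores candidate sites by a weighted sum of round-trip time and price. The `Site` fields, the weights, and the scoring formula are all hypothetical; this section does not describe how the Nexus scheduler actually ranks sites.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Site:
    # All fields are illustrative assumptions.
    name: str
    rtt_ms: float             # measured round-trip time to the site
    cost_per_gpu_hour: float  # price in any consistent unit
    free_workers: int         # workers currently available


def pick_site(sites: Iterable[Site], needed_workers: int,
              latency_weight: float = 1.0, cost_weight: float = 0.0) -> Site:
    """Return the lowest-scoring site that has enough free workers."""
    candidates = [s for s in sites if s.free_workers >= needed_workers]
    if not candidates:
        raise LookupError("no site has enough free workers")
    return min(
        candidates,
        key=lambda s: latency_weight * s.rtt_ms
        + cost_weight * s.cost_per_gpu_hour,
    )
```

With the default weights the choice is purely latency-driven; setting `latency_weight=0` and `cost_weight=1` makes it purely cost-driven.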
Security & Connectivity
- TLS everywhere: All traffic between masters, workers, and parameter servers is encrypted with TLS.
- Mutual authentication: Use client certificates or token-based auth for all nodes.
- NAT traversal: Workers behind NAT/firewall can connect out to public master endpoints using reverse tunnels (e.g., WireGuard, SSH tunnels, or VPN).
- Firewall rules: Restrict inbound ports to only those required (typically 443/8443 for HTTPS/gRPC).
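For the TLS and mutual-authentication points above, the following sketch shows the client side using only Python's standard `ssl` module. The function name and the certificate file paths are assumptions; how Nexus workers are actually configured is not covered here.

```python
import ssl
from typing import Optional


def build_worker_tls_context(ca_file: Optional[str] = None,
                             cert_file: Optional[str] = None,
                             key_file: Optional[str] = None) -> ssl.SSLContext:
    """Client-side TLS context that verifies the master's certificate.

    ca_file:   CA bundle used to verify the master (system default if None).
    cert_file: this worker's client certificate, for mutual TLS.
    key_file:  private key matching cert_file.
    """
    # SERVER_AUTH means "I am a client authenticating a server"; it enables
    # certificate verification and hostname checking by default.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # Presenting a client certificate is what makes the handshake mutual.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The resulting context can be passed to any standard-library client that accepts one, for example `urllib.request.urlopen(url, context=ctx)`.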
Example: Secure WAN Worker
Best Practices for Multi-Site Clusters
- Use a global DNS or service mesh for master endpoint discovery
- Monitor WAN latency and tune heartbeat/election timeouts accordingly
- Prefer regional parameter servers for bandwidth efficiency
- Enable audit logging and anomaly detection for cross-site traffic
- Regularly rotate credentials and update firewall rules
Troubleshooting WAN Deployments
- Connection drops: Check NAT/firewall, VPN tunnels, and TLS cert validity
- High latency: Deploy parameter servers closer to workers, use compression
- Split-brain: Ensure reliable cross-site connectivity and odd number of masters
Master Node — responsibilities
- Worker registration and capability discovery
- Job queue management and placement decisions
- Cluster health, metrics, and scaling events
Worker Node — responsibilities
- Receive assigned tasks and manage execution lifecycle
- Efficient gradient compression, checkpointing, and reporting
- Local aggregation and opportunistic communication optimizations
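This section does not say which gradient compression scheme workers use. As one common example of the idea, the sketch below implements top-k sparsification: only the k largest-magnitude entries are sent, and the receiver treats the rest as zero. The function names are illustrative.

```python
import heapq
from typing import Dict, List


def compress_top_k(grad: List[float], k: int) -> Dict[int, float]:
    """Keep the k entries with the largest absolute value, keyed by index."""
    if k <= 0:
        return {}
    top = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in top}


def decompress(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild a dense gradient, filling dropped entries with zero."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense
```

Top-k is lossy; practical systems usually accumulate the dropped residual locally and add it to the next step's gradient, which this sketch omits.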
Failover & consensus
LoomOS uses Raft-based consensus for leader election and state replication to avoid split-brain and to allow seamless failover. Key considerations:
- Heartbeat intervals and election timeouts must be tuned for your network
- Use multiple master replicas in production for high availability
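In Raft, each node waits a randomized election timeout before starting an election, and that timeout should be well above both the heartbeat interval and the network round-trip time. The sketch below derives a timeout range from those two values. The multiplier of 10 is a rule-of-thumb assumption, and the function names are not LoomOS settings.

```python
import random
from typing import Tuple


def election_timeout_range(heartbeat_ms: float, rtt_ms: float,
                           factor: float = 10.0) -> Tuple[float, float]:
    """Return (low, high) bounds in milliseconds for the election timeout.

    The lower bound is `factor` times the slower of heartbeat and RTT, so a
    few delayed heartbeats do not trigger a spurious election.
    """
    low = factor * max(heartbeat_ms, rtt_ms)
    return low, 2 * low


def sample_election_timeout(heartbeat_ms: float, rtt_ms: float,
                            rng=random) -> float:
    """Pick a random timeout in the range so nodes rarely time out together."""
    low, high = election_timeout_range(heartbeat_ms, rtt_ms)
    return rng.uniform(low, high)
```

On a LAN with a 50 ms heartbeat and 5 ms RTT this gives a 500 to 1000 ms range; across a WAN link with 120 ms RTT the same heartbeat gives 1200 to 2400 ms, which shows why cross-site clusters need longer timeouts.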
LoomCtl — management API
LoomCtl exposes REST endpoints for cluster and job management. Common endpoints:
Endpoint Discovery & Documentation
LoomCtl and the LoomOS API are self-documenting. To discover all available endpoints:
User-Configurable Endpoints
All LoomOS endpoints are user-configurable. You can set the API domain or endpoint URL in:
- CLI: loomos --endpoint https://your-domain.com
- SDK: LoomOSClient(endpoint="https://your-domain.com", ...)
- Environment variable: LOOMOS_API_ENDPOINT=https://your-domain.com
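When the endpoint can come from more than one of these sources, your own tooling needs a consistent resolution order. The sketch below assumes that an explicitly passed value wins over the `LOOMOS_API_ENDPOINT` environment variable. That precedence and the helper name `resolve_endpoint` are assumptions for illustration, not documented LoomOS behavior.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(explicit: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API endpoint, preferring an explicit value over the env var.

    `env` defaults to os.environ; passing a dict makes the helper testable.
    """
    env = os.environ if env is None else env
    endpoint = explicit or env.get("LOOMOS_API_ENDPOINT")
    if not endpoint:
        raise ValueError(
            "no endpoint configured: pass one explicitly or set "
            "LOOMOS_API_ENDPOINT"
        )
    # Normalize so callers can safely append paths such as "/healthz".
    return endpoint.rstrip("/")
```

The resolved value could then be handed to the SDK, for example `LoomOSClient(endpoint=resolve_endpoint(), ...)`.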
Creating and Extending Endpoints
LoomOS supports custom endpoints via plugin modules. To add a new endpoint:
- Implement a Python function and register it with the API router in your plugin.
- Restart the master node to load the new endpoint.
- Document the endpoint in your team’s internal docs or contribute to the main API Reference.
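The plugin API itself is not shown in this section, so the sketch below uses a small stand-in router to illustrate the general "implement a function, register it with a router" pattern. `StandInRouter`, its `route` decorator, and the `/plugins/hello` path are hypothetical and are not the LoomOS API router.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]


class StandInRouter:
    """Hypothetical stand-in used for illustration; NOT the LoomOS router."""

    def __init__(self) -> None:
        self.routes: Dict[Tuple[str, str], Handler] = {}

    def route(self, method: str, path: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for (method, path)."""
        def decorator(func: Handler) -> Handler:
            key = (method.upper(), path)
            if key in self.routes:
                raise ValueError(f"route already registered: {key}")
            self.routes[key] = func
            return func
        return decorator

    def dispatch(self, method: str, path: str, request: dict) -> dict:
        """Look up the handler for (method, path) and call it."""
        return self.routes[(method.upper(), path)](request)


router = StandInRouter()


@router.route("GET", "/plugins/hello")
def hello(request: dict) -> dict:
    """Example endpoint: greets the caller by the name in the request."""
    return {"message": f"hello, {request.get('name', 'world')}"}
```

Rejecting duplicate registrations at load time is a deliberate choice here: a plugin that silently overrides an existing route is much harder to debug than one that fails on startup.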
