How I Built AI Agents to Automate My Software Development Workflow
Every developer knows the frustration of context switching when cross-platform builds fail. What if you could delegate those tedious fixes to AI agents running on dedicated hardware? After implementing this system, I've reclaimed 5-10 hours per week while improving code quality across platforms.
The Problem of Context Switching
Developing cross-platform applications creates an invisible tax on developer productivity. When working on my Rust-based video editor Kiru (which needs to support Windows, Mac and Linux), I constantly faced platform-specific build failures that only appeared after code was committed. My Mac-focused development workflow meant Linux and Windows issues would surface in CI/CD, forcing me to:
- Switch mental context to understand the failure
- Physically move to a different machine to reproduce the issue
- Debug unfamiliar platform-specific code paths
- Verify fixes across all platforms
While necessary, these tasks were low-value work that could consume 5-10 hours a week. The breakthrough came when I realized they were ideal candidates for automation: the problems were well scoped, the fixes were easy to verify (tests pass or fail), and the process followed predictable patterns.
Key Insight: Burnout isn't caused by working too hard, but by doing too many low-value tasks. Delegating these to machines (via AI agents) preserves mental energy for high-value creative work.
Hardware Setup
The foundation of any reliable automation system is appropriate hardware. For this project, I selected two Beelink EQR7 mini PCs (24GB RAM, 1TB SSD) running:
- Windows 11 Pro - For Windows-specific build failures
- NixOS - For Linux-specific issues (using the same configuration as my CI/CD runners)
Dedicated hardware proved essential for three reasons:
- Compilation Speed: Rust projects benefit significantly from multiple cores and fast storage
- Environment Consistency: Matching production CI/CD environments reduces "works on my machine" issues
- Resource Isolation: Prevents agent activities from impacting other critical services
Each machine was configured with the complete toolchain needed to build Kiru - Rust, platform-specific media frameworks (GStreamer, Media Foundation), and the GitHub CLI for repository interactions.
Network Architecture
Connecting on-premise hardware to cloud-based CI/CD required a secure networking solution. I used Tailscale to create a zero-trust private network that connects GitHub Actions runners to my home lab without exposing public ports. MagicDNS gives each node a stable hostname (for example, linux-agent.tailnet.ts.net), which simplified remote management.
The system architecture follows these steps when a build fails:
- GitHub Actions detects a Linux or Windows build failure
- Workflow uses Tailscale to securely notify the appropriate agent node
- Agent clones repository at failing commit
- Diagnostic process begins (combining deterministic checks and LLM analysis)
- Fix is implemented, tested, and submitted via pull request
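The steps above can be read as a linear pipeline that stops at the first stage that fails. Here is a minimal Rust sketch of that idea. The `Stage` enum and `run_pipeline` function are hypothetical names chosen for illustration, not code from the actual agents, and the `run_stage` callback stands in for the real work done at each step.

```rust
/// Stages of the repair pipeline, in the order listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Notified,
    Cloned,
    Diagnosed,
    Fixed,
    PullRequestOpened,
}

impl Stage {
    /// The stage that follows this one, or `None` once the PR is open.
    fn next(self) -> Option<Stage> {
        match self {
            Stage::Notified => Some(Stage::Cloned),
            Stage::Cloned => Some(Stage::Diagnosed),
            Stage::Diagnosed => Some(Stage::Fixed),
            Stage::Fixed => Some(Stage::PullRequestOpened),
            Stage::PullRequestOpened => None,
        }
    }
}

/// Runs every stage in order and stops at the first failure.
/// On success returns the final stage; on failure returns the stage
/// that failed together with the error message from `run_stage`.
fn run_pipeline<F>(mut run_stage: F) -> Result<Stage, (Stage, String)>
where
    F: FnMut(Stage) -> Result<(), String>,
{
    let mut stage = Stage::Notified;
    loop {
        run_stage(stage).map_err(|e| (stage, e))?;
        match stage.next() {
            Some(next) => stage = next,
            None => return Ok(stage),
        }
    }
}
```

Returning the failing stage alongside the error makes it easy to tell a clone problem (infrastructure) apart from a diagnosis problem (something to hand back to a human).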
Tailscale Aperture (their LLM proxy service) adds crucial observability by logging all prompts, responses and tool calls while keeping API keys secure.
Agent Design
The core innovation lies in the hybrid agent architecture that balances determinism with AI flexibility:
| Component | Implementation | Why It Matters |
|---|---|---|
| Repository Setup | Traditional Rust code | 100% reliable cloning/checkout |
| Failure Analysis | Cersei (Rust agent framework) | LLM understands build logs |
| Code Modification | Custom tools + LLM | Combines pattern matching with creative fixes |
| Testing | Deterministic test runner | Clear pass/fail verification |
This architecture achieves an 80% success rate on initial failures by:
- Using Rust's type system to enforce correctness in critical paths
- Providing the LLM with structured tools rather than free-form access
- Implementing automatic rollback when tests continue failing
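To make the "structured tools plus automatic rollback" idea concrete, here is a small Rust sketch under stated assumptions. The `Workspace` trait, `Patch` struct, `attempt_fix` function, and `FakeWorkspace` type are all hypothetical and are not part of Cersei's API. The `propose` closure stands in for the LLM call, and the workspace exposes only three narrow operations.

```rust
/// A candidate change proposed by the model (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Patch {
    description: String,
}

/// The narrow set of operations the agent is allowed to perform.
trait Workspace {
    fn apply(&mut self, patch: &Patch);
    fn rollback(&mut self);
    fn tests_pass(&self) -> bool;
}

/// Tries up to `max_attempts` patches. Any patch whose tests fail is
/// rolled back before the next attempt. Returns the first passing patch,
/// or `None` so the caller can escalate to a human.
fn attempt_fix<W, P>(workspace: &mut W, mut propose: P, max_attempts: usize) -> Option<Patch>
where
    W: Workspace,
    P: FnMut(usize) -> Patch,
{
    for attempt in 0..max_attempts {
        let patch = propose(attempt);
        workspace.apply(&patch);
        if workspace.tests_pass() {
            return Some(patch);
        }
        workspace.rollback();
    }
    None
}

/// In-memory stand-in for a real checkout, used only to exercise the loop.
struct FakeWorkspace {
    applied: Vec<String>,
    passing_patch: String,
}

impl Workspace for FakeWorkspace {
    fn apply(&mut self, patch: &Patch) {
        self.applied.push(patch.description.clone());
    }
    fn rollback(&mut self) {
        self.applied.pop();
    }
    fn tests_pass(&self) -> bool {
        // "Tests pass" only when the most recent patch is the known-good one.
        self.applied.last() == Some(&self.passing_patch)
    }
}
```

Because the loop only sees the trait, the pass/fail decision stays deterministic no matter what the model proposes.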
LLM Integration
Through Tailscale Aperture, the system can leverage multiple LLM providers while maintaining centralized control:
Primary Model: Kimi K2.5 Turbo (via the Fireworks.ai Firepass plan) provides unlimited tokens at consistent quality for most fixes.
Model evaluation revealed important insights:
- Newer Models (K2.6, GLM 5.1): Better at identifying root causes rather than symptoms (15-20% more accurate)
- Cost Tradeoffs: Unlimited K2.5 tokens proved more economical than pay-per-use superior models
- Specialization: Models fine-tuned on Rust codebases performed 30% better on complex type system issues
The Aperture dashboard provides crucial visibility into token usage and costs across all agents - at 12:35 in the video you can see the detailed cost breakdown per fix.
GitHub Actions Integration
The final piece connects GitHub's CI/CD system to the on-premise agents:
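```yaml
name: Agent Notification

on:
  workflow_run:
    workflows: ["Build and Test"]
    types: [completed]

jobs:
  notify_agent:
    runs-on: ubuntu-latest
    # Only react to failed runs of the build workflow.
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    steps:
      - uses: actions/checkout@v4

      # The runner must already be joined to the tailnet for the agent
      # hostnames below to resolve (that step is not shown here).
      # Note: logs_url is an API endpoint and may not contain the runner
      # name, so this platform check may need a different signal.
      - name: Notify Linux Agent
        if: contains(github.event.workflow_run.logs_url, 'ubuntu')
        run: |
          # In a workflow_run event, github.sha points at the default branch,
          # so the failing commit is read from the event payload instead.
          curl -X POST \
            -H "Authorization: Bearer ${{ secrets.TAILSCALE_TOKEN }}" \
            -H "Content-Type: application/json" \
            https://linux-agent.tailnet.ts.net:8080/failure \
            -d '{"repo": "${{ github.repository }}", "commit": "${{ github.event.workflow_run.head_sha }}", "log_url": "${{ github.event.workflow_run.logs_url }}"}'

      - name: Notify Windows Agent
        if: contains(github.event.workflow_run.logs_url, 'windows')
        run: |
          curl -X POST \
            -H "Authorization: Bearer ${{ secrets.TAILSCALE_TOKEN }}" \
            -H "Content-Type: application/json" \
            https://windows-agent.tailnet.ts.net:8080/failure \
            -d '{"repo": "${{ github.repository }}", "commit": "${{ github.event.workflow_run.head_sha }}", "log_url": "${{ github.event.workflow_run.logs_url }}"}'
```

Key security considerations: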
- Tailscale authentication ensures only authorized workflows can trigger agents
- Agents run in isolated VLANs with restricted network access
- GitHub tokens have minimal required permissions
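On the receiving side, it is reasonable for the agent to check the notification before acting on it, even inside a private network. The following Rust sketch shows one way to validate the three payload fields (`repo`, `commit`, `log_url`). The `FailureNotice` struct, the `NoticeError` enum, and the `validate_notice` function are hypothetical names. The checks assume a single allowed repository, a full 40-character commit SHA, and log URLs served from `https://api.github.com/`.

```rust
/// The fields sent in the notification body.
#[derive(Debug, PartialEq)]
struct FailureNotice {
    repo: String,
    commit: String,
    log_url: String,
}

/// Reasons a notification is rejected before any work starts.
#[derive(Debug, PartialEq)]
enum NoticeError {
    UnexpectedRepo,
    MalformedCommit,
    UntrustedLogUrl,
}

/// Accepts a notice only if it targets the allowed repository, carries a
/// full 40-character hexadecimal commit SHA, and points at the GitHub API.
fn validate_notice(notice: &FailureNotice, allowed_repo: &str) -> Result<(), NoticeError> {
    if notice.repo != allowed_repo {
        return Err(NoticeError::UnexpectedRepo);
    }
    let is_full_sha =
        notice.commit.len() == 40 && notice.commit.chars().all(|c| c.is_ascii_hexdigit());
    if !is_full_sha {
        return Err(NoticeError::MalformedCommit);
    }
    if !notice.log_url.starts_with("https://api.github.com/") {
        return Err(NoticeError::UntrustedLogUrl);
    }
    Ok(())
}
```

Rejecting malformed input up front keeps untrusted strings out of later `git` and shell invocations.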
Results and Metrics
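After two months of operation, the system demonstrates compelling results: the agents resolve roughly 80% of cross-platform build failures without my intervention, which frees up 5-10 hours per week.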
Beyond metrics, the qualitative benefits are equally important:
- Reduced Cognitive Load: No more constant pipeline monitoring
- Faster Feedback: Average fix time decreased from 4 hours (manual) to 35 minutes (automated)
- Improved Quality: Agents document their fixes thoroughly, creating institutional knowledge
The system pays for itself in developer productivity within weeks, while the hardware investment provides a foundation for expanding automation to other workflows.
Watch the Full Tutorial
See the complete system in action, including a live demo of the Tailscale Aperture dashboard showing real-time agent activity (jump to 15:20 for the most interesting workflow examples).
Key Takeaways
This project demonstrates how AI agents can solve real productivity drains in software development. By combining dedicated hardware, secure networking, and hybrid AI/deterministic workflows, teams can automate up to 80% of cross-platform debugging tasks. The system pays for itself in weeks while improving both developer experience and code quality.
The approach isn't limited to build failures - similar architectures could automate:
- Code review feedback implementation
- Documentation generation
- Dependency updates
- Test case creation
As LLMs continue improving, the scope of automatable developer tasks will only expand - making early investment in these systems a competitive advantage.
Frequently Asked Questions
The system solves the time-consuming problem of context switching when fixing cross-platform build failures. When developing a Rust application that needs to work across Windows, Mac and Linux, developers often only discover platform-specific issues after they've been committed to the CI/CD pipeline.
This requires switching contexts to diagnose and fix issues on different operating systems, which is a low-value task that can be automated. The AI agents handle this repetitive work, allowing developers to focus on feature development rather than platform compatibility issues.
- Eliminates 5-10 hours per week of context switching
- Reduces cognitive load of monitoring CI/CD pipelines
- Provides consistent documentation of platform-specific fixes
The solution uses two dedicated Beelink EQR7 mini PCs (one running Windows 11 Pro, one running NixOS), each with 24GB RAM and a 1TB SSD. These provide enough power to quickly compile Rust code and run tests.
The machines are connected via Tailscale for secure remote access from GitHub Actions. While less powerful hardware could work, the compilation speed benefits of these specs make them ideal for developer productivity automation.
- Beelink EQR7: $600-$800 per unit
- 24GB RAM handles concurrent compilation jobs
- 1TB SSD provides fast access to source and build artifacts
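The claim that the hardware pays for itself within weeks can be sanity-checked with simple arithmetic. The Rust sketch below uses the unit prices and weekly time savings quoted in this article. The hourly rate of $75 is purely an assumption for illustration, and the calculation ignores LLM subscription and electricity costs. `break_even_weeks` is a hypothetical helper name.

```rust
/// Weeks until the hardware cost is recovered by saved developer time.
/// All inputs are in dollars or hours; `hourly_rate` is an assumed value.
fn break_even_weeks(hardware_cost: f64, hours_saved_per_week: f64, hourly_rate: f64) -> f64 {
    hardware_cost / (hours_saved_per_week * hourly_rate)
}
```

Under those assumptions, two units at $600 with 10 hours saved per week break even in 1.6 weeks (`break_even_weeks(1200.0, 10.0, 75.0)`), while two units at $800 with 5 hours saved per week take about 4.3 weeks (`break_even_weeks(1600.0, 5.0, 75.0)`).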
The system uses a hybrid approach - deterministic steps like cloning repos and checking out commits are handled by traditional code, while the actual bug diagnosis and fixing is handled by a custom Rust agent built using the Cersei crate.
The agent has access to tools for analyzing build logs, modifying code, running tests, and creating pull requests. It follows a structured workflow that combines LLM reasoning with deterministic verification at each step to ensure reliability.
- First analyzes build logs to identify failure patterns
- References platform-specific documentation when available
- Proposes fixes that are automatically tested before submission
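One deterministic check that fits the log-analysis step is extracting compiler error codes before any LLM is involved. rustc prints errors in the form `error[E0308]: mismatched types`, so the codes can be collected with plain string handling. This is a minimal sketch of that idea; `rustc_error_codes` is a hypothetical function name, and it is not necessarily how the agent described here parses logs.

```rust
/// Collects rustc error codes (for example `E0308`) from a build log,
/// in order of first appearance and without duplicates.
fn rustc_error_codes(log: &str) -> Vec<String> {
    let mut codes: Vec<String> = Vec::new();
    for line in log.lines() {
        // Error lines look like: error[E0308]: mismatched types
        if let Some(rest) = line.trim_start().strip_prefix("error[") {
            if let Some(end) = rest.find(']') {
                let code = &rest[..end];
                if !codes.iter().any(|c| c.as_str() == code) {
                    codes.push(code.to_string());
                }
            }
        }
    }
    codes
}
```

The resulting list can be passed to the model as structured context, or used to route well-known error codes to cheaper, pattern-based fixes.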
The system primarily uses Kimi K2.5 Turbo through Fireworks.ai's Firepass plan, which provides unlimited tokens. Testing showed that newer models such as Kimi K2.6 and GLM 5.1 were better at identifying root causes rather than just symptoms.
Model selection involves tradeoffs between cost and capability. For most routine fixes, K2.5 provides sufficient quality at the best economics. For complex issues, the system can be configured to automatically escalate to more capable (but expensive) models when initial fixes fail.
- Kimi K2.5: Best economics for routine fixes
- GLM 5.1: 15-20% better at complex platform-specific issues
- Fine-tuned models: 30% better for language-specific challenges
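The automatic escalation described above can be expressed as a simple mapping from attempt number to model tier. The Rust sketch below is one possible policy, not the configuration actually used. The function name `model_for_attempt` and any model identifier strings passed to it are hypothetical.

```rust
/// Chooses a model for the given zero-based attempt number.
/// `tiers` is ordered from cheapest to most capable; each tier gets
/// `attempts_per_tier` tries before the next one is used. Returns `None`
/// when every tier is exhausted, which is the signal to hand off to a human.
fn model_for_attempt<'a>(
    tiers: &[&'a str],
    attempts_per_tier: usize,
    attempt: usize,
) -> Option<&'a str> {
    if attempts_per_tier == 0 {
        return None;
    }
    tiers.get(attempt / attempts_per_tier).copied()
}
```

With tiers `["kimi-k2.5-turbo", "glm-5.1"]` and two attempts per tier, attempts 0 and 1 use the first model, attempts 2 and 3 use the second, and attempt 4 returns `None`.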
Tailscale provides secure zero-trust networking between the GitHub Actions runners and the on-premise agent nodes. It enables secure communication without exposing home network ports, and features like Tailscale Aperture provide LLM proxy capabilities with logging and model management.
The Tailscale integration solves several critical challenges: secure remote access without VPN complexity, easy DNS naming for agent nodes, and centralized monitoring of all LLM interactions through Aperture.
- MagicDNS provides easy node addressing
- Aperture enables LLM usage monitoring
- WireGuard-based encryption ensures security
In initial deployment, the agents successfully handled approximately 80% of cross-platform build failures without human intervention. The remaining 20% typically required more complex architectural changes or environment setup issues beyond the agents' current capabilities.
Success rates vary by failure type - simple compilation errors see 90%+ resolution, while complex multithreading issues might only see 50-60% automated resolution. The system is designed to gracefully escalate unsolved issues to human developers after multiple attempts.
- 80% overall resolution rate
- 90%+ for simple compilation errors
- 50-60% for complex concurrency issues
Early metrics show the system saves 5-10 hours per week that would otherwise be spent context switching between platforms to diagnose and fix build issues. More importantly, it eliminates the cognitive load of constantly monitoring CI/CD pipelines for failures.
The time savings compound as the system handles more failures - each resolved issue adds to the agent's knowledge base, improving future success rates. Teams also benefit from consistent documentation of platform-specific fixes that would otherwise exist only as tacit knowledge.
- 5-10 hours weekly time savings per developer
- Roughly 7x faster resolution than manual debugging (average fix time of 35 minutes versus 4 hours)
- Reduced context switching fatigue