How We Use Flow PT CustomEntities to Improve Our Code Quality

In the LLM era, the cost of generating sloppy code is rapidly dropping to zero, but generating high-quality code remains difficult and time-consuming. We've found that putting as many quality checks and balances as possible in the way of both the models and the developers delivers the best of both worlds: fast generation that we can refine into the high quality we're after, and vast refactors with minimal regressions.
In this somewhat lengthy post, we'll go over how we maintain a high quality bar throughout our codebase, via Flow PT custom entities and our CI pipeline. After all, if we propose to serve Studios, we need to hold ourselves to the highest standards in the software industry.
Putting many metric thresholds in the way is great, but it opens up a new can of worms: every CI run produces thousands of data points: test results, coverage percentages, file/line/function coverage reports, complexity scores, duplication counts, and more. The question is: how do you make them actionable?
Typical Approach
- currents.dev for E2E flakiness
- codecov.io for coverage
- one more SaaS for every additional metric
More dependencies, more cost, more context switching
What We Did
- Write minimal reporters
- Store in CustomEntities
- Visualize with FlowPilot
Zero extra dependencies, one place for everything
At first it felt like a gamble (more code to maintain). But the reporter code turned out to be minimal:
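As a sketch of what such a reporter can look like (the `/custom-entities` endpoint path and payload shape here are our own assumptions for illustration, not a documented Flow PT API), a Playwright-style reporter fits in a few dozen lines:

```typescript
// Simplified shapes mirroring Playwright's reporter API.
interface TestCase { title: string; location: { file: string } }
interface TestResult { status: string; retry: number; duration: number }

// One flat record per test attempt; the sg_ field names match the
// CustomEntities columns shown in the charts later in this post.
export function toEntity(test: TestCase, result: TestResult) {
  return {
    sg_spec_file: test.location.file,
    sg_test_title: test.title,
    sg_status: result.status,
    sg_retry_count: result.retry,
    sg_duration_ms: result.duration,
  };
}

export class FlowPtReporter {
  private entities: ReturnType<typeof toEntity>[] = [];
  private apiUrl: string;

  constructor(apiUrl: string) {
    this.apiUrl = apiUrl;
  }

  onTestEnd(test: TestCase, result: TestResult) {
    this.entities.push(toEntity(test, result));
  }

  async onEnd() {
    // Batch-insert all results in a single request at the end of the run.
    await fetch(`${this.apiUrl}/custom-entities`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(this.entities),
    });
  }
}
```

The same pattern (collect flat records, post one batch at the end) carries over to the coverage, complexity, and duplication reporters with only the field names changing.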
Having all our metrics in one place, actionable and transparent, has been a huge win. Studios using Flow PT who are reluctant to add dependencies should consider doing the same.
Lines of Code
Our codebase is growing linearly with time, at roughly 1.5k LOC a day. This is expected for a startup, but it exposes us to the universal law that bug count grows with codebase size. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
The chart above is a stark reminder of a truth every software team must confront: tame complexity, or progress grinds to a halt.
Type Coverage
Type coverage here means explicit types rather than `any`. The quest for strict type coverage is often viewed as futile, pedantic, and overly restrictive. With the advent of LLMs, however, every time we replace an `any` with a concrete type, we remove ambiguity for the models. So we settled on 96% type coverage, which feels like a suitably high mark without hitting diminishing returns.
The models know exactly what arguments functions and methods expect, and what they return. Strong, explicit typing removes the vast majority of the guesswork for LLMs.
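As a contrived example of what this buys the models (the names here are illustrative, not from our codebase), compare an untyped helper with its typed version:

```typescript
// Before: `any` forces the model (and the reviewer) to guess the shape.
//   function applyFilter(query: any, filter: any): any { ... }

// After: explicit types spell out the contract at every call site.
export interface TimeFilter { from: Date; to: Date }
export interface ChartQuery { entity: string; filters: TimeFilter[] }

// Appends a filter without mutating the original query.
export function applyFilter(query: ChartQuery, filter: TimeFilter): ChartQuery {
  return { ...query, filters: [...query.filters, filter] };
}
```

With the typed version, a model editing a caller knows exactly which fields exist and that the function returns a fresh `ChartQuery` rather than mutating its input.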
Code Complexity
Cyclomatic complexity measures how many independent paths exist through our code. Higher numbers mean more branches, more edge cases, and more bugs. We track both average and maximum complexity to catch files that are getting out of hand.
Click "Complexity Max" in the legend to overlay the maximum values.
When the max complexity spikes, it's usually a single file that needs attention. We can catch these with the chart below.
Complexity Hit List

This is the refactoring hit list. Beyond cyclomatic complexity, our ESLint config also enforces function length (80 lines), nesting depth
(4 levels), and parameter count (5 max). When any of these spike, we start
here.
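A sketch of that ESLint configuration, using ESLint's built-in rules in flat-config form (the cyclomatic complexity cap shown is illustrative; the other limits are the ones quoted above):

```javascript
// eslint.config.js: the complexity budget, expressed as core ESLint rules.
export default [
  {
    files: ["src/**/*.ts"],
    rules: {
      complexity: ["error", { max: 20 }],      // cyclomatic complexity cap (value illustrative)
      "max-lines-per-function": ["error", 80], // function length
      "max-depth": ["error", 4],               // nesting depth
      "max-params": ["error", 5],              // parameter count
    },
  },
];
```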
These constraints can feel contrived, if not an invitation to be gamed. That's
why a mature refactoring approach matters. We write tests before the refactor
begins, then ask: can complex functions be broken into pure functions without
side effects? Can I/O be separated from business logic? Can code be extracted
into its own module for better reuse and testability?
It's also worth noting that agentic coding tools won't read files above a certain
length, falling back on grep instead. Through the lens of AI-assisted
development, these refactors pay for themselves twice.
| file ⇅ | complexity ⇅ |
|---|---|
| src/routes/settings/+page.server.ts | 78 |
| src/lib/chart/config.ts | 66 |
| src/routes/api/automation-assistant/+server.ts | 63 |
| src/lib/chart/config.ts | 61 |
| src/routes/+layout.server.ts | 59 |
| src/routes/api/automations/[id]/+server.ts | 56 |
| src/routes/api/organizations/[id]/invite/+server.ts | 55 |
| src/lib/components/ChartPreview.svelte | 54 |
| src/lib/stores/app.svelte.ts | 54 |
| src/routes/home-dashboard/+page.server.ts | 49 |
| Total | 595 |
Code Duplication
This is where LLMs struggle the most. Models default to the path of least resistance: rather than finding and reusing an existing utility, they'll write a new one. Rather than importing a shared type, they'll redeclare it inline. The result is subtle (the code works, tests pass) but duplication creeps up across the codebase. Left unchecked, you end up with multiple sources of truth for the same logic, and a bug fix in one place doesn't propagate to the copies.
We use jscpd to detect duplicated blocks and enforce a threshold in CI. When duplication spikes, it's almost always a sign that a refactor is overdue: extracting a shared module or consolidating types.
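A minimal sketch of the CI invocation (the `--min-tokens` value is an assumption; jscpd exits non-zero when the duplication level exceeds `--threshold`, which is what fails the build):

```shell
# Fail CI when duplicated code exceeds 2% of the codebase.
npx jscpd src \
  --min-tokens 50 \
  --threshold 2 \
  --reporters console,json
```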
Test Coverage
Getting a single coverage number for a full-stack app is harder than it sounds. Unit tests cover backend logic, E2E tests exercise the frontend through a browser, and each produces its own coverage report in a different format. We merge them using Istanbul via nyc, which gives us one unified percentage across the entire codebase: lines, branches, functions, and statements.
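A sketch of the merge step, assuming each suite writes an Istanbul-format `coverage-final.json` (the directory names are illustrative): nyc reports over everything it finds in `.nyc_output`, yielding one combined number.

```shell
# Collect each suite's Istanbul JSON output into one directory.
mkdir -p .nyc_output
cp coverage-unit/coverage-final.json .nyc_output/unit.json
cp coverage-e2e/coverage-final.json  .nyc_output/e2e.json

# nyc merges all reports in .nyc_output and emits one unified
# summary: lines, branches, functions, and statements.
npx nyc report --reporter=text-summary --reporter=lcov
```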
The sharp step shifts in the chart below correspond to refactoring sprints, where we targeted the least-covered files (more on that below). Gradual climbs reflect day-to-day test writing as part of normal feature work.
To find the files that most need coverage, we use the table below.
Least Covered Files
This is where coverage stops being an abstract percentage and becomes a to-do
list. The table ranks files by how little coverage they have, so a test sprint
starts by picking the worst offenders from the top. We write tests for those
files, push, and watch the overall coverage number climb in the chart above.
This part is rather easy: it mostly involves asking our agentic setup to write
the tests, followed by a second, more adversarial pass that ensures the tests
aren't being lazy and are truly exercising functionality.
| file ⇅ | coverage (%) ⇅ |
|---|---|
| src/lib/automations/limitEnforcement.server.ts | 19.6 |
| src/lib/subscription/state.svelte.ts | 37.7 |
| src/lib/fpt/session.server.ts | 40.7 |
| src/lib/stores/dashboard.svelte.ts | 43.2 |
| src/lib/claude/chart-builder-agent.ts | 52.9 |
| src/lib/recipe/executor.ts | 55.1 |
| src/lib/services/chartCache.ts | 60 |
| src/lib/services/chartService.ts | 60 |
| src/lib/services/conversationService.ts | 60.7 |
| src/lib/utils/createDefaultDashboard.ts | 63 |
| Total | 492.9 |
Playwright E2E Testing
The metrics above tell you about your code at rest: how complex it is, how much is duplicated, how well it's typed. But none of that matters if the app doesn't actually work. That's where E2E tests come in.

Playwright launches 8 real browsers in parallel, each with a bot that navigates the app like a user would: clicking buttons, filling forms, checking that the right data appears on screen. This lets us verify that core user flows survive even substantial refactors. The downside is that E2E tests are inherently fragile. Network hiccups, server latency, timeouts, and race conditions mean that a passing test today can fail tomorrow with no code change. This flakiness became a real problem for our CI reliability.
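The relevant slice of a Playwright configuration for this setup might look like the following (the retry count and reporter path are assumptions; only the worker count comes from above):

```typescript
// playwright.config.ts: the parallelism and retry settings behind these charts.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  workers: 8,  // eight real browsers in parallel
  retries: 2,  // retried runs are what feed the retry metrics (count illustrative)
  reporter: [
    ["list"],
    ["./flow-pt-reporter.ts"], // hypothetical path to a custom CustomEntities reporter
  ],
});
```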
Test Results Over Time
The first thing we wanted to see was the big picture: are our tests getting more reliable or less? This chart shows passes, failures, and retries over time.
Blended Flakiness Score
This is our CI reliability hit list, and arguably the single most valuable chart
in this entire post. It uses FlowPilot's calculate_field transform to compute a weighted score: 10 points per
failure, 2 points per retry. The AI generated this computed metric from a natural
language prompt.
Having one ranked table that blends failures and retries into a single score makes prioritization trivial: fix the top item, watch CI reliability improve, repeat. Before this chart existed, flaky tests were a vague frustration. Now they're a sorted backlog.
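The scoring itself is trivial to express. Here's a sketch of the weighting and ranking outside of FlowPilot (the function names are ours, invented for illustration):

```typescript
export interface SpecStats { failures: number; retries: number }

// The weighting from the calculate_field transform:
// 10 points per failure, 2 points per retry.
export function flakinessScore({ failures, retries }: SpecStats): number {
  return failures * 10 + retries * 2;
}

// Rank spec files by blended score, worst offender first.
export function hitList(stats: Record<string, SpecStats>): [string, number][] {
  return Object.entries(stats)
    .map(([file, s]): [string, number] => [file, flakinessScore(s)])
    .sort((a, b) => b[1] - a[1]);
}
```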
| sg_spec_file ⇅ | sg_test_title ⇅ | sg_retry_count ⇅ | id Count ⇅ |
|---|---|---|---|
| dashboard-filter | dashboard filter overrides chart filter | 547 | 603 |
| dashboard-discard-changes | discarding widget addition shows confirmation modal | 399 | 415 |
| dashboard-project-scoped | shows Project-scoped indicator only on project-scoped FPT page widget | 420 | 401 |
| pages | creates a page with 3 tabs (dashboard, table, card list) | 386 | 390 |
| edit-mode-navigation-guard | allows navigation without modal when no changes made | 429 | 376 |
| dashboard-crud | creates a dashboard with a chart, then deletes it | 384 | 370 |
| create-chart | creates a chart via Tools mode and saves it | 287 | 380 |
| dashboard-filters | dashboard has enabled filters | 295 | 324 |
| chart-title-sync | cards: updates DB name when visualization title is changed | 340 | 314 |
| no-auto-save | explicit Save button click DOES update chart order | 291 | 320 |
| dashboard-time-filter | time filter correctly filters dashboard data | 326 | 310 |
| chart-title-sync | table: updates DB name when visualization title is changed | 333 | 305 |
| chart-title-sync | bar chart: updates DB name when visualization title is changed | 304 | 285 |
| dashboard-discard-changes | discarding changes properly resets dashboard to saved state | 281 | 274 |
| automation-toggle | shows loading spinner when toggling automation status | 323 | 255 |
| table-pagination | page size selector and load more work correctly | 246 | 249 |
| page-duplicate | creates a page, then duplicates it successfully | 250 | 236 |
| no-auto-save | opening chart for edit and navigating away does NOT update chart | 252 | 230 |
| no-auto-save | switching view types without saving does NOT update chart | 248 | 224 |
| tab-filters | displays and toggles tab filters for table tab | 237 | 221 |
| Total | 6578 | 6482 |
Failures by Spec File
Sometimes the problem isn't a single test; it's an entire spec file. Maybe it's testing a flaky feature, or the test setup is unreliable. This bar chart shows which spec files fail most often.
When one file dominates this chart, it's usually a sign that the feature itself needs attention: either better test isolation or fixing the underlying flakiness.
Filtering by Timeframe
All the charts and tables above become significantly more useful when you can narrow the window. Did flakiness get worse after last Tuesday's deploy? Is coverage trending up over the past month, or just the past week? FlowPilot supports date range filtering on any chart, so we can answer these questions instantly.
What We Learned
Dogfooding our own dashboards for quality tracking taught us a few things:
We approached using Flow PT for dev tooling as an experiment, but very quickly realized it was a huge win. Writing custom reporters for Playwright, Vite, Istanbul, and the rest was trivial, and it completely obviated the headache of adding more third-party tools to our stack.
The most useful chart is, by far, the blended E2E flakiness score. We pick the top culprits from the list, fix them, and the effect is immediately felt as CI reliability climbs.
Moving part of our own dev workflow directly into FlowPilot puts us in the user's seat. That builds user empathy, and a willingness to improve the product for our own use.
You may notice a large step shift in January's data. Using these dashboards, we identified the highest-impact areas and completed a major refactoring sprint. Here's what changed between early and late January:
- Test coverage: 55% → 64%
- Average complexity: 6.1 → 5.2
- TypeScript coverage: 93.6% → 96.8%
- Code duplication: 6.1% → 2.0%
Thanks to the extensive test suite, we experienced only a handful of minor regressions. This gives us tremendous confidence in our ability to move fast for our clients without sacrificing quality.
Build Your Own Charts
FlowPilot's AI assistant can generate charts like these from natural language. Describe what you want to see, and it builds the recipe for you, including computed fields, filters, and visualizations.