How We Use Flow PT CustomEntities to Improve Our Code Quality

FlowPilot dogfooding

In the LLM era, the cost of generating sloppy code is rapidly dropping to zero, but generating high-quality code remains difficult and time-consuming. We've found that putting as many quality checks and balances as possible in front of both the models and the developers delivers the best of both worlds: fast generation that we can refine into the high quality we're after, plus the ability to carry out vast refactors with minimal regressions.

In this somewhat lengthy post, we'll walk through how we maintain a high quality bar throughout our codebase, via Flow PT custom entities and our CI pipeline. After all, if we propose to serve Studios, we need to adhere to the highest standards in the software industry.

[Diagram: LLMs and developers generate code, which passes through quality gates (uniform formatting, strong TypeScript typing, ESLint enforcement, complexity thresholds, duplication thresholds, unit/integration tests, E2E happy-path tests); the resulting metrics flow into Flow PT CustomEntities and FlowPilot dashboards, feeding a refactor loop that yields high quality code with minimal regressions.]

Whilst putting many metric thresholds in the way is great, it opens up a new can of worms. Every CI run produces thousands of data points: test results, coverage percentages, file/line/function coverage reports, complexity scores, duplication counts, and so on. The question is: how do you make them actionable?

Typical Approach

  • currents.dev for E2E flakiness
  • codecov.io for coverage
  • + more SaaS per metric

More dependencies, more cost, more context switching

vs

What We Did

  • Write minimal reporters
  • Store in CustomEntities
  • Visualize with FlowPilot

Zero extra dependencies, one place for everything

At first it felt like a gamble (more code to maintain). But the reporter code turned out to be minimal:

~550 lines (E2E test reporter) + ~750 lines (code metrics reporter) = ~1,300 lines total, replacing 3+ SaaS tools
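To give a feel for what such a reporter boils down to, here's a minimal sketch of the core mapping step, assuming a simplified result shape. The upload to Flow PT and the exact Playwright hookup are omitted and hypothetical; only the sg_* field names come from our actual data.

```typescript
// Sketch: fold raw test results into per-test CustomEntity records.
// The RawResult shape is illustrative, not Playwright's real reporter type.
interface RawResult {
  specFile: string;
  title: string;
  status: "passed" | "failed";
  retries: number;
}

interface TestEntity {
  sg_spec_file: string;
  sg_test_title: string;
  sg_status: "passed" | "failed";
  sg_retry_count: number;
}

function toEntities(results: RawResult[]): TestEntity[] {
  return results.map((r) => ({
    sg_spec_file: r.specFile,
    sg_test_title: r.title,
    sg_status: r.status,
    sg_retry_count: r.retries,
  }));
}
```

The rest of the reporter is plumbing: hooking into the test runner's lifecycle and POSTing the records at the end of the run.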

Having all our metrics in one place, actionable and transparent, has been a huge win. Studios using Flow PT who are reluctant to add dependencies should consider doing the same.

Lines of Code

Our codebase is growing linearly, at roughly 1.5k LOC a day. This is expected for a startup, but it exposes us to the universal law that bug count grows with codebase size [1] [2] [3] [4] [5] [6] [7] [8] [9] [10].

[Chart: total lines of code over time]

The above chart serves as a stark reminder of a reality all software teams have to deal with: tame complexity or progress grinds to a halt.

Type Coverage

Type coverage here means explicit types rather than any. Normally the quest for strict type coverage is viewed as futile, pedantic, and extremely restrictive. With the advent of LLMs, however, every time we replace an any with a concrete type, we remove ambiguity for the models. We settled on 96% type coverage, which feels like a suitably high bar without hitting diminishing returns.

The models know exactly what arguments functions and methods expect, and what they return. Strong, explicit typing removes the vast majority of the guesswork for LLMs.
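As a trivial, hypothetical illustration of the difference (all names here are made up for the example):

```typescript
// Untyped: a model (or a human) must guess what `opts` contains
// and what the function returns.
//   function buildFilter(opts: any): any { ... }

// Typed: the contract is explicit, so generated call sites can be
// checked by the compiler instead of discovered at runtime.
interface FilterOptions {
  field: string;
  from: Date;
  to: Date;
}

function buildFilter(opts: FilterOptions): string {
  return `${opts.field} BETWEEN '${opts.from.toISOString()}' AND '${opts.to.toISOString()}'`;
}
```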

[Chart: type coverage percentage over time]

Code Complexity

Cyclomatic complexity measures how many independent paths exist through our code. Higher numbers mean more branches, more edge cases, and more bugs. We track both average and maximum complexity to catch files that are getting out of hand.

[Chart: average cyclomatic complexity over time]

Click "Complexity Max" in the legend to overlay the maximum values.

When the max complexity spikes, it's usually a single file that needs attention. We catch these with the hit list below.

Complexity Hit List


This is the refactoring hit list. Beyond cyclomatic complexity, our ESLint config also enforces function length (80 lines), nesting depth (4 levels), and parameter count (5 max). When any of these spike, we start here.
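These thresholds map onto standard core ESLint rules. A flat-config sketch (the complexity ceiling shown is illustrative since the post doesn't quote our exact number; the other values are the ones stated above, and the rest of the config is omitted):

```javascript
// eslint.config.js (sketch) — core ESLint rules backing the thresholds above.
export default [
  {
    rules: {
      complexity: ["error", 20],               // cyclomatic complexity ceiling (illustrative value)
      "max-lines-per-function": ["error", 80], // function length
      "max-depth": ["error", 4],               // nesting depth
      "max-params": ["error", 5],              // parameter count
    },
  },
];
```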

These constraints can feel contrived, if not an invitation to be gamed. That's why a mature refactoring approach matters. We write tests before the refactor begins, then ask: can complex functions be broken into pure functions without side effects? Can I/O be separated from business logic? Can code be extracted into its own module for better reuse and testability?
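As a condensed, hypothetical example of that pattern: the branching becomes a pure function with no side effects, and the I/O shrinks to a thin wrapper at the edge. All names here are illustrative, not taken from our codebase.

```typescript
interface Usage {
  used: number;
  limit: number;
}

// Pure: no I/O, every branch reachable from plain inputs,
// so unit tests can cover it exhaustively.
function enforcementAction(u: Usage): "allow" | "warn" | "block" {
  if (u.used >= u.limit) return "block";
  if (u.used >= u.limit * 0.8) return "warn";
  return "allow";
}

// Thin I/O wrapper: the only part that touches the outside world,
// injected as a function so tests can stub it trivially.
async function checkLimit(fetchUsage: () => Promise<Usage>): Promise<string> {
  const usage = await fetchUsage();
  return enforcementAction(usage);
}
```

The pure function carries all the cyclomatic complexity, but each path is now one assertion away from being tested.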

It's also worth noting that agentic coding tools won't read files above a certain length, falling back on grep instead. Through the lens of AI-assisted development, these refactors pay for themselves twice.

file                                                   complexity
src/routes/settings/+page.server.ts                            78
src/lib/chart/config.ts                                        66
src/routes/api/automation-assistant/+server.ts                 63
src/lib/chart/config.ts                                        61
src/routes/+layout.server.ts                                   59
src/routes/api/automations/[id]/+server.ts                     56
src/routes/api/organizations/[id]/invite/+server.ts            55
src/lib/components/ChartPreview.svelte                         54
src/lib/stores/app.svelte.ts                                   54
src/routes/home-dashboard/+page.server.ts                      49
Total                                                         595

Code Duplication

This is where LLMs struggle the most. Models default to the path of least resistance: rather than finding and reusing an existing utility, they'll write a new one. Rather than importing a shared type, they'll redeclare it inline. The result is subtle (the code works, tests pass) but duplication creeps up across the codebase. Left unchecked, you end up with multiple sources of truth for the same logic, and a bug fix in one place doesn't propagate to the copies.

We use jscpd to detect duplicated blocks and enforce a threshold in CI. When duplication spikes, it's almost always a sign that a refactor is overdue: extracting a shared module or consolidating types.
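A sketch of the corresponding CI step (the flag values are illustrative, not our exact thresholds):

```shell
# Fail the build if duplicated code exceeds 3% of the scanned source.
# --min-tokens controls how small a block can be before it counts as a clone.
npx jscpd src --threshold 3 --min-tokens 50 --reporters console
```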

[Chart: code duplication percentage over time]

Test Coverage

Getting a single coverage number for a full-stack app is harder than it sounds. Unit tests cover backend logic, E2E tests exercise the frontend through a browser, and each produces its own coverage report in a different format. We merge them using Istanbul via nyc, which gives us one unified percentage across the entire codebase: lines, branches, functions, and statements.
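The merge step itself is small. A sketch of the kind of commands involved, assuming the unit and E2E runs each emit an Istanbul coverage-final.json (all paths here are illustrative):

```shell
# Collect the per-runner Istanbul JSON files into one directory...
mkdir -p .nyc_output
cp coverage/unit/coverage-final.json .nyc_output/unit.json
cp coverage/e2e/coverage-final.json .nyc_output/e2e.json

# ...merge them, then report one unified percentage across the codebase.
npx nyc merge .nyc_output coverage/merged.json
npx nyc report --temp-dir .nyc_output --reporter=text-summary
```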

The sharp step shifts in the chart below correspond to refactoring sprints, where we targeted the least-covered files (more on that below). Gradual climbs reflect day-to-day test writing as part of normal feature work.

[Chart: merged test coverage over time]

To find files that need covering, we use the table below.

Least Covered Files

This is where coverage stops being an abstract percentage and becomes a to-do list. The table ranks files by how little coverage they have, so a test sprint starts by picking the worst offenders from the top. We write tests for those files, push, and watch the overall coverage number climb in the chart above.

This part is rather easy: it mostly involves asking our agentic setup to write tests, followed by a more adversarial pass that ensures the tests aren't being lazy and are truly exercising functionality.

file                                             coverage (%)
src/lib/automations/limitEnforcement.server.ts           19.6
src/lib/subscription/state.svelte.ts                     37.7
src/lib/fpt/session.server.ts                            40.7
src/lib/stores/dashboard.svelte.ts                       43.2
src/lib/claude/chart-builder-agent.ts                    52.9
src/lib/recipe/executor.ts                               55.1
src/lib/services/chartCache.ts                           60.0
src/lib/services/chartService.ts                         60.0
src/lib/services/conversationService.ts                  60.7
src/lib/utils/createDefaultDashboard.ts                  63.0
Total                                                   492.9

Playwright E2E Testing

The metrics above tell you about your code at rest: how complex it is, how much is duplicated, how well it's typed. But none of that matters if the app doesn't actually work. That's where E2E tests come in.


Playwright launches 8 real browsers in parallel, each with a bot that navigates the app like a user would: clicking buttons, filling forms, checking that the right data appears on screen. This lets us verify that core user flows survive even substantial refactors. The downside is that E2E tests are inherently fragile. Network hiccups, server latency, timeouts, and race conditions mean that a passing test today can fail tomorrow with no code change. This flakiness became a real problem for our CI reliability.
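A minimal playwright.config.ts sketch of that setup (the retry count is illustrative, and the rest of our config is omitted):

```typescript
// playwright.config.ts (sketch) — 8 parallel workers, with retries enabled
// so that flaky failures surface as "retries" rather than hard CI failures.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  workers: 8, // 8 real browsers in parallel, as described above
  retries: 2, // a failing test is re-run before failing the build (illustrative value)
});
```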

Test Results Over Time

The first thing we wanted to see was the big picture: are our tests getting more reliable or less? This chart shows passes, failures, and retries over time.

[Chart: E2E test passes, failures, and retries over time]

Blended Flakiness Score

This is our CI reliability hit list, and arguably the single most valuable chart in this entire post. It uses FlowPilot's calculate_field transform to compute a weighted score: 10 points per failure, 2 points per retry. The AI generated this computed metric from a natural language prompt.

Having one ranked table that blends failures and retries into a single score makes prioritization trivial: fix the top item, watch CI reliability improve, repeat. Before this chart existed, flaky tests were a vague frustration. Now they're a sorted backlog.
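The arithmetic behind the score is simple. Here's a TypeScript sketch that mirrors what the calculate_field transform computes; the recipe itself lives in FlowPilot, not in our code, and the type names below are illustrative.

```typescript
// Blended flakiness: a failure hurts 5x more than a retry (10 vs 2 points).
interface TestStats {
  test: string;
  failures: number;
  retries: number;
}

function flakinessScore(s: TestStats): number {
  return s.failures * 10 + s.retries * 2;
}

// Rank tests so the worst offender is always at the top of the hit list.
function rankByFlakiness(stats: TestStats[]): TestStats[] {
  return [...stats].sort((a, b) => flakinessScore(b) - flakinessScore(a));
}
```

The weighting encodes a judgment call: a retry still passed eventually, so it's annoying, but an outright failure blocks the pipeline.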

sg_spec_file                 sg_test_title                                                           sg_retry_count  id Count
dashboard-filter             dashboard filter overrides chart filter                                             547       603
dashboard-discard-changes    discarding widget addition shows confirmation modal                                 399       415
dashboard-project-scoped     shows Project-scoped indicator only on project-scoped FPT page widget               420       401
pages                        creates a page with 3 tabs (dashboard, table, card list)                            386       390
edit-mode-navigation-guard   allows navigation without modal when no changes made                                429       376
dashboard-crud               creates a dashboard with a chart, then deletes it                                   384       370
create-chart                 creates a chart via Tools mode and saves it                                         287       380
dashboard-filters            dashboard has enabled filters                                                       295       324
chart-title-sync             cards: updates DB name when visualization title is changed                          340       314
no-auto-save                 explicit Save button click DOES update chart order                                  291       320
dashboard-time-filter        time filter correctly filters dashboard data                                        326       310
chart-title-sync             table: updates DB name when visualization title is changed                          333       305
chart-title-sync             bar chart: updates DB name when visualization title is changed                      304       285
dashboard-discard-changes    discarding changes properly resets dashboard to saved state                         281       274
automation-toggle            shows loading spinner when toggling automation status                               323       255
table-pagination             page size selector and load more work correctly                                     246       249
page-duplicate               creates a page, then duplicates it successfully                                     250       236
no-auto-save                 opening chart for edit and navigating away does NOT update chart                    252       230
no-auto-save                 switching view types without saving does NOT update chart                           248       224
tab-filters                  displays and toggles tab filters for table tab                                      237       221
Total                                                                                                          6578      6482

Failures by Spec File

Sometimes the problem isn't a single test; it's an entire spec file. Maybe it's testing a flaky feature, or the test setup is unreliable. This bar chart shows which spec files fail most often.

[Chart: E2E failures by spec file]

When one file dominates this chart, it's usually a sign that the feature itself needs attention: either better test isolation or fixing the underlying flakiness.

Filtering by Timeframe

All the charts and tables above become significantly more useful when you can narrow the window. Did flakiness get worse after last Tuesday's deploy? Is coverage trending up over the past month, or just the past week? FlowPilot supports date range filtering on any chart, so we can answer these questions instantly.

What We Learned

Dogfooding our own dashboards for quality tracking taught us a few things:

Using Flow PT for DX tooling pays off

We approached using Flow PT for dev tooling as an experiment at first, but very quickly realized it was a huge win. Writing custom reporters for Playwright, Vite, Istanbul, etc. was trivial, and completely obviated the headaches of adding additional third-party tools to our stack.

Actionable charts lead to immediate improvements

The most useful chart is by far the blended E2E failures one. We simply pick the top culprits from the list, fix them, and the positive effects are felt immediately as our CI reliability climbs.

Dogfooding Improves the Product

Inserting some of our own dev workflow directly into FlowPilot puts us in the user's seat. This creates user empathy, and a willingness to improve the product for our own use.

January Refactor Results

You may notice a large step shift in January's data. Using these dashboards, we identified the highest-impact areas and completed a major refactoring sprint. Here's what changed between early and late January:

  • Test coverage: 55% → 64%
  • Average complexity: 6.1 → 5.2
  • TypeScript coverage: 93.6% → 96.8%
  • Code duplication: 6.1% → 2.0%

Thanks to the extensive test suite, we experienced only a handful of minor regressions. This gives us tremendous confidence in our ability to move fast for our clients without sacrificing quality.

Build Your Own Charts

FlowPilot's AI assistant can generate charts like these from natural language. Describe what you want to see, and it builds the recipe for you, including computed fields, filters, and visualizations.