How We Use Flow PT CustomEntities to Improve Our Code Quality

FlowPilot dogfooding

In the LLM era, the cost of generating sloppy code is rapidly dropping to zero, but generating high-quality code remains difficult and time-consuming. We've found that putting as many quality checks and balances as possible in front of both the models and the developers delivers the best of both worlds: fast generation that we can refine into the high quality we're after, plus the ability to carry out vast refactors with minimal regressions.

In this somewhat lengthy post, we'll walk through how we maintain a high quality bar throughout our codebase, via Flow PT custom entities and our CI pipeline. After all, if we propose to serve Studios, we need to adhere to the highest standards in the software industry.

[Diagram: LLMs and developers generate code, which passes through quality gates (uniform formatting, strong TypeScript typing, ESLint enforcement, complexity thresholds, duplication thresholds, unit/integration tests, E2E happy-path tests); the resulting metrics flow into Flow PT CustomEntities and FlowPilot dashboards, feeding a refactor loop that yields high quality code with minimal regressions.]

Whilst putting many metric thresholds in the way is great, it opens up a new can of worms. Every CI run produces thousands of data points: test results, coverage percentages, file/line/function coverage reports, complexity scores, duplication counts, and so on. The question is: how do you make them actionable?

Typical Approach

  • currents.dev for E2E flakiness
  • codecov.io for coverage
  • + more SaaS per metric

More dependencies, more cost, more context switching

vs

What We Did

  • Write minimal reporters
  • Store in CustomEntities
  • Visualize with FlowPilot

Zero extra dependencies, one place for everything

At first it felt like a gamble (more code to maintain). But the reporter code turned out to be minimal:

~550 lines (E2E test reporter) + ~750 lines (code metrics reporter) = ~1,300 lines total, replacing 3+ SaaS tools
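To give a feel for what such a reporter boils down to, here's a minimal sketch of the core mapping step, assuming a simplified result shape. The upload to Flow PT and the exact Playwright hookup are omitted and hypothetical; only the sg_* field names come from our actual data.

```typescript
// Sketch: fold raw test results into per-test CustomEntity records.
// The RawResult shape is illustrative, not Playwright's real reporter type.
interface RawResult {
  specFile: string;
  title: string;
  status: "passed" | "failed";
  retries: number;
}

interface TestEntity {
  sg_spec_file: string;
  sg_test_title: string;
  sg_status: "passed" | "failed";
  sg_retry_count: number;
}

function toEntities(results: RawResult[]): TestEntity[] {
  return results.map((r) => ({
    sg_spec_file: r.specFile,
    sg_test_title: r.title,
    sg_status: r.status,
    sg_retry_count: r.retries,
  }));
}
```

The rest of the reporter is plumbing: hooking into the test runner's lifecycle and POSTing the records at the end of the run.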

Having all our metrics in one place, actionable and transparent, has been a huge win. Studios using Flow PT who are reluctant to add dependencies should consider doing the same.

Lines of Code

Our codebase is growing linearly, at roughly 1.5k LOC a day. This is expected for a startup, but it exposes us to the universal law that bug count grows with codebase size [1] [2] [3] [4] [5] [6] [7] [8] [9] [10].

[Chart: total lines of code over time]

The above chart serves as a stark reminder of a reality all software teams have to deal with: tame complexity or progress grinds to a halt.

Type Coverage

Type coverage here means explicit types rather than any. Normally the quest for strict type coverage is viewed as futile, pedantic, and extremely restrictive. With the advent of LLMs, however, every time we replace an any with a concrete type, we remove ambiguity for the models. We settled on 96% type coverage, which feels like a suitably high bar without hitting diminishing returns.

The models know exactly what arguments functions and methods expect, and what they return. Strong, explicit typing removes the vast majority of the guesswork for LLMs.
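As a trivial, hypothetical illustration of the difference (all names here are made up for the example):

```typescript
// Untyped: a model (or a human) must guess what `opts` contains
// and what the function returns.
//   function buildFilter(opts: any): any { ... }

// Typed: the contract is explicit, so generated call sites can be
// checked by the compiler instead of discovered at runtime.
interface FilterOptions {
  field: string;
  from: Date;
  to: Date;
}

function buildFilter(opts: FilterOptions): string {
  return `${opts.field} BETWEEN '${opts.from.toISOString()}' AND '${opts.to.toISOString()}'`;
}
```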

[Chart: type coverage percentage over time]

Code Complexity

Cyclomatic complexity measures how many independent paths exist through our code. Higher numbers mean more branches, more edge cases, and more bugs. We track both average and maximum complexity to catch files that are getting out of hand.

[Chart: average cyclomatic complexity over time]

Click "Complexity Max" in the legend to overlay the maximum values.

When the max complexity spikes, it's usually a single file that needs attention. We catch these with the hit list below.

Complexity Hit List


This is the refactoring hit list. Beyond cyclomatic complexity, our ESLint config also enforces function length (80 lines), nesting depth (4 levels), and parameter count (5 max). When any of these spike, we start here.
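These thresholds map onto standard core ESLint rules. A flat-config sketch (the complexity ceiling shown is illustrative since the post doesn't quote our exact number; the other values are the ones stated above, and the rest of the config is omitted):

```javascript
// eslint.config.js (sketch) — core ESLint rules backing the thresholds above.
export default [
  {
    rules: {
      complexity: ["error", 20],               // cyclomatic complexity ceiling (illustrative value)
      "max-lines-per-function": ["error", 80], // function length
      "max-depth": ["error", 4],               // nesting depth
      "max-params": ["error", 5],              // parameter count
    },
  },
];
```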

These constraints can feel contrived, if not an invitation to be gamed. That's why a mature refactoring approach matters. We write tests before the refactor begins, then ask: can complex functions be broken into pure functions without side effects? Can I/O be separated from business logic? Can code be extracted into its own module for better reuse and testability?
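As a condensed, hypothetical example of that pattern: the branching becomes a pure function with no side effects, and the I/O shrinks to a thin wrapper at the edge. All names here are illustrative, not taken from our codebase.

```typescript
interface Usage {
  used: number;
  limit: number;
}

// Pure: no I/O, every branch reachable from plain inputs,
// so unit tests can cover it exhaustively.
function enforcementAction(u: Usage): "allow" | "warn" | "block" {
  if (u.used >= u.limit) return "block";
  if (u.used >= u.limit * 0.8) return "warn";
  return "allow";
}

// Thin I/O wrapper: the only part that touches the outside world,
// injected as a function so tests can stub it trivially.
async function checkLimit(fetchUsage: () => Promise<Usage>): Promise<string> {
  const usage = await fetchUsage();
  return enforcementAction(usage);
}
```

The pure function carries all the cyclomatic complexity, but each path is now one assertion away from being tested.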

It's also worth noting that agentic coding tools won't read files above a certain length, falling back on grep instead. Through the lens of AI-assisted development, these refactors pay for themselves twice.

file                                                   complexity
src/routes/settings/+page.server.ts                            78
src/lib/chart/config.ts                                        66
src/routes/api/automation-assistant/+server.ts                 63
src/lib/chart/config.ts                                        61
src/routes/+layout.server.ts                                   59
src/routes/api/automations/[id]/+server.ts                     56
src/routes/api/organizations/[id]/invite/+server.ts            55
src/lib/components/ChartPreview.svelte                         54
src/lib/stores/app.svelte.ts                                   54
src/routes/home-dashboard/+page.server.ts                      49
Total                                                         595

Code Duplication

This is where LLMs struggle the most. Models default to the path of least resistance: rather than finding and reusing an existing utility, they'll write a new one. Rather than importing a shared type, they'll redeclare it inline. The result is subtle (the code works, tests pass) but duplication creeps up across the codebase. Left unchecked, you end up with multiple sources of truth for the same logic, and a bug fix in one place doesn't propagate to the copies.

We use jscpd to detect duplicated blocks and enforce a threshold in CI. When duplication spikes, it's almost always a sign that a refactor is overdue: extracting a shared module or consolidating types.
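A sketch of the corresponding CI step (the flag values are illustrative, not our exact thresholds):

```shell
# Fail the build if duplicated code exceeds 3% of the scanned source.
# --min-tokens controls how small a block can be before it counts as a clone.
npx jscpd src --threshold 3 --min-tokens 50 --reporters console
```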

[Chart: code duplication percentage over time]

Test Coverage

Getting a single coverage number for a full-stack app is harder than it sounds. Unit tests cover backend logic, E2E tests exercise the frontend through a browser, and each produces its own coverage report in a different format. We merge them using Istanbul via nyc, which gives us one unified percentage across the entire codebase: lines, branches, functions, and statements.
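The merge step itself is small. A sketch of the kind of commands involved, assuming the unit and E2E runs each emit an Istanbul coverage-final.json (all paths here are illustrative):

```shell
# Collect the per-runner Istanbul JSON files into one directory...
mkdir -p .nyc_output
cp coverage/unit/coverage-final.json .nyc_output/unit.json
cp coverage/e2e/coverage-final.json .nyc_output/e2e.json

# ...merge them, then report one unified percentage across the codebase.
npx nyc merge .nyc_output coverage/merged.json
npx nyc report --temp-dir .nyc_output --reporter=text-summary
```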

The sharp step shifts in the chart below correspond to refactoring sprints, where we targeted the least-covered files (more on that below). Gradual climbs reflect day-to-day test writing as part of normal feature work.

[Chart: merged test coverage over time]

To find files that need covering, we use the table below.

Least Covered Files

This is where coverage stops being an abstract percentage and becomes a to-do list. The table ranks files by how little coverage they have, so a test sprint starts by picking the worst offenders from the top. We write tests for those files, push, and watch the overall coverage number climb in the chart above.

This part is rather easy: it mostly involves asking our agentic setup to write tests, followed by a more adversarial pass that ensures the tests aren't being lazy and are truly exercising functionality.

file                                             coverage (%)
src/lib/automations/limitEnforcement.server.ts           19.6
src/lib/subscription/state.svelte.ts                     37.7
src/lib/fpt/session.server.ts                            40.7
src/lib/stores/dashboard.svelte.ts                       43.2
src/lib/claude/chart-builder-agent.ts                    52.9
src/lib/recipe/executor.ts                               55.1
src/lib/services/chartCache.ts                           60.0
src/lib/services/chartService.ts                         60.0
src/lib/services/conversationService.ts                  60.7
src/lib/utils/createDefaultDashboard.ts                  63.0
Total                                                   492.9

Playwright E2E Testing

The metrics above tell you about your code at rest: how complex it is, how much is duplicated, how well it's typed. But none of that matters if the app doesn't actually work. That's where E2E tests come in.


Playwright launches 8 real browsers in parallel, each with a bot that navigates the app like a user would: clicking buttons, filling forms, checking that the right data appears on screen. This lets us verify that core user flows survive even substantial refactors. The downside is that E2E tests are inherently fragile. Network hiccups, server latency, timeouts, and race conditions mean that a passing test today can fail tomorrow with no code change. This flakiness became a real problem for our CI reliability.
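A minimal playwright.config.ts sketch of that setup (the retry count is illustrative, and the rest of our config is omitted):

```typescript
// playwright.config.ts (sketch) — 8 parallel workers, with retries enabled
// so that flaky failures surface as "retries" rather than hard CI failures.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  workers: 8, // 8 real browsers in parallel, as described above
  retries: 2, // a failing test is re-run before failing the build (illustrative value)
});
```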

Test Results Over Time

The first thing we wanted to see was the big picture: are our tests getting more reliable or less? This chart shows passes, failures, and retries over time.

[Chart: E2E test passes, failures, and retries over time]

Blended Flakiness Score

This is our CI reliability hit list, and arguably the single most valuable chart in this entire post. It uses FlowPilot's calculate_field transform to compute a weighted score: 10 points per failure, 2 points per retry. The AI generated this computed metric from a natural language prompt.

Having one ranked table that blends failures and retries into a single score makes prioritization trivial: fix the top item, watch CI reliability improve, repeat. Before this chart existed, flaky tests were a vague frustration. Now they're a sorted backlog.
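The arithmetic behind the score is simple. Here's a TypeScript sketch that mirrors what the calculate_field transform computes; the recipe itself lives in FlowPilot, not in our code, and the type names below are illustrative.

```typescript
// Blended flakiness: a failure hurts 5x more than a retry (10 vs 2 points).
interface TestStats {
  test: string;
  failures: number;
  retries: number;
}

function flakinessScore(s: TestStats): number {
  return s.failures * 10 + s.retries * 2;
}

// Rank tests so the worst offender is always at the top of the hit list.
function rankByFlakiness(stats: TestStats[]): TestStats[] {
  return [...stats].sort((a, b) => flakinessScore(b) - flakinessScore(a));
}
```

The weighting encodes a judgment call: a retry still passed eventually, so it's annoying, but an outright failure blocks the pipeline.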

sg_spec_file                 sg_test_title                                                           sg_retry_count  id Count
dashboard-filter             dashboard filter overrides chart filter                                             547       603
dashboard-discard-changes    discarding widget addition shows confirmation modal                                 399       415
dashboard-project-scoped     shows Project-scoped indicator only on project-scoped FPT page widget               420       401
pages                        creates a page with 3 tabs (dashboard, table, card list)                            386       390
edit-mode-navigation-guard   allows navigation without modal when no changes made                                429       376
dashboard-crud               creates a dashboard with a chart, then deletes it                                   384       370
create-chart                 creates a chart via Tools mode and saves it                                         287       380
dashboard-filters            dashboard has enabled filters                                                       295       324
chart-title-sync             cards: updates DB name when visualization title is changed                          340       314
no-auto-save                 explicit Save button click DOES update chart order                                  291       320
dashboard-time-filter        time filter correctly filters dashboard data                                        326       310
chart-title-sync             table: updates DB name when visualization title is changed                          333       305
chart-title-sync             bar chart: updates DB name when visualization title is changed                      304       285
dashboard-discard-changes    discarding changes properly resets dashboard to saved state                         281       274
automation-toggle            shows loading spinner when toggling automation status                               323       255
table-pagination             page size selector and load more work correctly                                     246       249
page-duplicate               creates a page, then duplicates it successfully                                     250       236
no-auto-save                 opening chart for edit and navigating away does NOT update chart                    252       230
no-auto-save                 switching view types without saving does NOT update chart                           248       224
tab-filters                  displays and toggles tab filters for table tab                                      237       221
Total                                                                                                          6578      6482

Failures by Spec File

Sometimes the problem isn't a single test; it's an entire spec file. Maybe it's testing a flaky feature, or the test setup is unreliable. This bar chart shows which spec files fail most often.

[Chart: E2E failures by spec file]

When one file dominates this chart, it's usually a sign that the feature itself needs attention: either better test isolation or fixing the underlying flakiness.

Filtering by Timeframe

All the charts and tables above become significantly more useful when you can narrow the window. Did flakiness get worse after last Tuesday's deploy? Is coverage trending up over the past month, or just the past week? FlowPilot supports date range filtering on any chart, so we can answer these questions instantly.

What We Learned

Dogfooding our own dashboards for quality tracking taught us a few things:

Using Flow PT for DX tooling pays off

We approached using Flow PT for dev tooling as an experiment at first, but very quickly realized it was a huge win. Writing custom reporters for Playwright, Vite, Istanbul, etc. was trivial, and completely obviated the headaches of adding additional third-party tools to our stack.

Actionable charts lead to immediate improvements

The most useful chart is by far the blended E2E failures one. We simply pick the top culprits from the list, fix them, and the positive effects are felt immediately as our CI reliability climbs.

Dogfooding Improves the Product

Inserting some of our own dev workflow directly into FlowPilot puts us in the user's seat. This creates user empathy, and a willingness to improve the product for our own use.

January Refactor Results

You may notice a large step shift in January's data. Using these dashboards, we identified the highest-impact areas and completed a major refactoring sprint. Here's what changed between early and late January:

  • Test coverage: 55% → 64%
  • Average complexity: 6.1 → 5.2
  • TypeScript coverage: 93.6% → 96.8%
  • Code duplication: 6.1% → 2.0%

Thanks to the extensive test suite, we experienced only a handful of minor regressions. This gives us tremendous confidence in our ability to move fast for our clients without sacrificing quality.

Build Your Own Charts

FlowPilot's AI assistant can generate charts like these from natural language. Describe what you want to see, and it builds the recipe for you, including computed fields, filters, and visualizations.