I Backtested 96 SPY Put Credit Spread Strategies Over 7 Years - Here's What Actually Worked | FlashAlpha

Seven years of 1-minute SPY chains. 96 grid cells. Honest fills, signal-gated sizing, drawdown breaker. Same cell ran +5,400% with a stop loss and -100% without. Multi-DTE pooling somehow underperformed single-DTE. SL=200% turned out worse than no stop at all. The best signal in 16k trades was a one-line flag that beat my 3-layer composite. The Calmar-leader cell is in the grid - I'm telling you the shape, not the coordinates.

Tomasz Dobrowolski Quant Engineer
Apr 28, 2026
31 min read
Options Backtesting VRP PutSpreads SPY Quant

This started because I wanted to know how much of the apparent edge in selling SPY put credit spreads survives once you stop pretending you fill at the mid. Spoiler: less than the YouTube guys claim, but more than zero.

What's in this thing: 1-minute SPY option chains from 2019 to 2026, MM-style limit fills (no mid-fill nonsense), a VRP signal that strictly cannot see the future, half-Kelly sizing, and a 30% drawdown breaker that just halts the whole run when things go bad. 96 grid cells, 16,024 trades validated, and one finding that absolutely dominates everything else.

  • -100%: short DTE, no stop loss
  • +5,439%: same cell, with stop loss
  • -9 pp: CAGR cost of multi-DTE flexibility
96 cells · 7.16 years · 16,024 trades
Limit fills · VRP-gated sizing
Half-Kelly · 30% drawdown breaker

1. Mid-fill backtests are lying to you

First version of the engine filled at the mid. Numbers looked great. Then I switched to "post a limit at combo_ask + $0.04 and wait for someone to cross it" - which is what an actual desk does on options - and CAGR dropped 30-60% across the grid. Several cells flipped from positive to negative. The strategy didn't change. Just the simulator's assumption about who fills whom.

Real fill stats from one of the runs:

| Run | Orders posted | Filled | Fill rate | Avg wait | Avg edge captured |
| --- | --- | --- | --- | --- | --- |
| 45Δ 30 DTE PT 50% No stop | 479 | 112 | 23.4% | 12.4 min | -$0.04 to -$0.07 |
| 45Δ 30 DTE PT 50% SL 100% | 633 | 155 | 24.5% | 12.2 min | -$0.04 to -$0.07 |
| 45Δ 30 DTE PT 50% SL 200% | 383 | 64 | 16.7% | 13.3 min | -$0.04 to -$0.07 |

You fill roughly 17-25% of the orders you post, after a wait of about 12-13 minutes, and on average you fill worse than mid by 4-7 cents per contract. The MM doesn't make money at the fill - the MM makes money over the hold, when theta beats realized vol. Any backtest filling at mid is silently gifting your strategy that 4-7 cents on every trade.

Mid-fill fantasy
  • 100% fill rate, instant
  • Always at the mid
  • Never rejected
  • Inflates CAGR by 30-60%
What actually happens
  • ~17-25% fill rate, ~12 min wait
  • Filled 4-7 cents worse than mid
  • Stale-quote rejections common
  • Random tiebreak when two limits cross same bar
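The difference between the two columns above can be sketched as a resting-limit fill check. This is a minimal illustration, not the engine from this article: the `Bar` fields, the 30-minute timeout, and the 5-second staleness threshold are all hypothetical, and it assumes we are selling the spread, so the order fills when the quoted combo bid reaches our limit credit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Bar:
    minute: int          # minutes since the order was posted
    combo_bid: float     # credit a buyer of the spread is bidding
    quote_age_s: float   # seconds since the quote last updated


def simulate_limit_fill(
    bars: Sequence[Bar],
    limit_credit: float,
    max_wait_min: int = 30,        # hypothetical timeout
    max_quote_age_s: float = 5.0,  # hypothetical staleness threshold
) -> Optional[Bar]:
    """Return the first bar that fills a resting sell limit, or None."""
    for bar in bars:
        if bar.minute > max_wait_min:
            break                  # order expires unfilled
        if bar.quote_age_s > max_quote_age_s:
            continue               # stale quote: not real liquidity
        if bar.combo_bid >= limit_credit:
            return bar             # someone crossed our limit
    return None


bars = [
    Bar(minute=1, combo_bid=0.90, quote_age_s=1.0),
    Bar(minute=5, combo_bid=1.02, quote_age_s=40.0),  # would fill, but stale
    Bar(minute=12, combo_bid=1.01, quote_age_s=0.5),
]
fill = simulate_limit_fill(bars, limit_credit=1.00)
print(fill.minute if fill else "no fill")  # prints 12
```

A mid-fill simulator would have booked this trade at minute 0 at a better price. Here the order waits 12 minutes, skips a stale quote, and would go unfilled entirely with a shorter timeout.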

2. The stop loss IS the strategy at short DTE

This is the finding. By a lot. Same cell, same fills, same period. Toggle the stop loss:

| Configuration | Trades | Return | CAGR | MaxDD | Calmar | What happened |
| --- | --- | --- | --- | --- | --- | --- |
| 10Δ 7 DTE PT 50% No stop | -- | -100% | wipeout | broke circuit | -- | Single tail event = account zero |
| 10Δ 7 DTE PT 50% SL 100% | 460 | +5,439% | +66.0% | 30.1% | 2.19 | Same thing, just add the stop |
| 10Δ 7 DTE PT 75% SL 100% | 360 | +2,947% | +53.9% | 31.1% | 1.73 | Wider PT, slower cycle |
| 10Δ 7 DTE PT 25% SL 100% | 649 | +1,752% | +44.6% | 30.2% | 1.48 | Tighter PT, more execution churn |

Yeah. +5,439% with a stop, -100% without. Same cell. The stop loss isn't a fine-tuning parameter, it's the entire strategy at short DTE. If you're running 7 DTE credit spreads with no defined-loss exit, you are one bad Monday from a margin call.
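For readers unsure what "PT 50%" and "SL 100%" mean mechanically, here is a minimal sketch of the exit check. It assumes both thresholds are expressed as a fraction of the credit collected, so SL 100% closes the trade once the open loss equals the credit. The function and parameter names are hypothetical, not taken from the engine.

```python
from typing import Optional


def exit_reason(
    credit: float,
    cost_to_close: float,
    profit_take: float = 0.50,
    stop_loss: Optional[float] = 1.00,
) -> Optional[str]:
    """Return why the spread should be closed now, or None to keep holding.

    credit:        premium collected at entry
    cost_to_close: current debit needed to buy the spread back
    profit_take:   close once this fraction of the credit is locked in
    stop_loss:     close once the loss reaches this fraction of the credit;
                   None means no stop at all
    """
    pnl = credit - cost_to_close
    if pnl >= profit_take * credit:
        return "profit_take"
    if stop_loss is not None and -pnl >= stop_loss * credit:
        return "stop_loss"
    return None


print(exit_reason(1.00, 0.50))                  # profit_take
print(exit_reason(1.00, 2.00))                  # stop_loss
print(exit_reason(1.00, 2.00, stop_loss=None))  # None: position rides to expiry
```

With `stop_loss=None`, nothing caps the loss short of the full spread width, which is the wipeout path in the first table row.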


3. There's a sweet spot in the middle of the grid (and I'm not posting it)

Short DTE is feast-or-famine. Long DTE bleeds. Somewhere in between, at one specific delta / DTE / PT combination, the equity curve flattens out into something that looks genuinely tradeable - low double-digit drawdown, Sharpe well above 1, Calmar above 2. I'm not going to publish the exact cell, because (a) it took 96 sweeps to find it, (b) it's the only thing in this whole project worth keeping private, and (c) the recipe is useless to you anyway without the same historical chain data the engine ran against.

What I will say is what the sweet spot is not:

  • It's not the highest-CAGR cell. The headline +5,439% from the last section has 30% MaxDD - that's a press release, not a strategy.
  • It's not the highest-Sharpe cell either. The Sharpe leader in my grid sits at 2.48 with 39.7% MaxDD. Sharpe-rich, Calmar-poor, will trip every breaker you set. Skip.
  • It's not the shortest DTE (gamma trap) and it's not the longest (vega bleed).
  • It's not the highest delta - delta moves vol-of-equity, not alpha (more on that below).

That eliminates roughly 90 of the 96 cells. The remaining handful is what you're looking for, and which one wins inside that handful depends entirely on the bias hygiene of your backtester (see Section 7) and on having minute-resolution chains going back far enough to count COVID, the 2022 grind-down, and a few smaller vol spikes as separate observations rather than averaged-away noise.

Surprise finding: Delta is a vol dial, not an alpha dial

I expected higher delta to mean more edge - more credit per contract, more theta to harvest. Instead, raising delta from 10 to 30 to 45 raised CAGR and raised MaxDD in roughly equal proportion. Calmar stayed approximately flat. Higher delta just gives you a louder version of the same equity curve. The strategy alpha lives in the SL/PT/DTE choice. Delta is just how big you want the swings.
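Calmar here is CAGR divided by maximum drawdown, which is why scaling both by the same factor leaves it unchanged. A minimal sketch of the two metrics follows. The helper names are my own, and the 66.0% / 30.1% inputs are the SL 100% row from Section 2.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar(cagr: float, max_dd: float) -> float:
    """Calmar ratio: annualized return per unit of maximum drawdown."""
    return cagr / max_dd


print(max_drawdown([100, 120, 90, 130]))  # 0.25 (120 -> 90)
print(round(calmar(0.660, 0.301), 2))     # 2.19

# Doubling both CAGR and MaxDD (a "louder" curve) leaves Calmar where it was.
print(round(calmar(2 * 0.660, 2 * 0.301), 2))  # 2.19
```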

If you want to find your own version of this cell: the engine is mostly bookkeeping - the actual constraint is having minute-resolution SPY option chains going back to 2018 with surface-consistent IVs and walk-forward signal data. That's what FlashAlpha's Alpha tier historical API is for. Without those inputs, you're sweeping a 96-cell grid on noise.

4. SL=200% is worse than no stop loss (yes, really)

I expected SL=200% to be the sensible compromise. Looser than 100% so I don't get noise-stopped, tighter than no-SL so a tail doesn't kill me. The data laughed at me:

| Same spread, three stop-loss settings | Trades | Return | CAGR | Sharpe | MaxDD |
| --- | --- | --- | --- | --- | --- |
| 45Δ 30 DTE PT 50% No stop | 112 | +30.6% | +3.8% | +0.23 | 30.1% |
| 45Δ 30 DTE PT 50% SL 100% | 155 | +28.0% | +3.5% | +0.22 | 29.3% |
| 45Δ 30 DTE PT 50% SL 200% ← trap | 64 | -16.5% | -2.5% | -0.17 | 31.2% |

SL=200% lost money. While both no-stop and SL=100% made money on the same data.

The reason makes sense once you stare at the trade paths: by the time the loss has grown to 200% of credit, you're deep ITM and gamma is doing the marking, not theta. You stop out at a terrible price, on a terrible day, after letting the position breathe past the point where it could have recovered. SL=100% stops you before gamma takes over. No SL at least lets the position fully expire - sometimes you get bailed out by a recovery. SL=200% has the worst path of both worlds.

Lesson: tighter beats looser. Either run SL=100% or no stop. The middle is the trap.


5. I built a fancy signal. A one-liner beat it.

I built a 3-layer composite - Premium / Danger / Stabilization scores, z-scored macro inputs, continuous Kelly multiplier, the whole nine yards. Then I ran t-tests across all 16,024 trades to see which features actually predicted P&L.

The single strongest predictor in the entire feature set wasn't my composite. It was a one-line boolean flag on a free macro series - the kind of feature you can compute in three lines of pandas. t-stat over 8, on 5,000+ trades. Bonferroni-correct it across the entire feature space and it still wins comfortably.

My fancy composite added something on top, but the signal-to-noise is mostly in the simple flag.
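The flag test described above can be sketched with nothing but the standard library. This is an illustration on synthetic P&L, not the trade tape or the flag itself. It uses a Welch t-statistic, a normal approximation for the p-value (reasonable at thousands of trades), and a Bonferroni threshold of alpha divided by the number of features tested. The feature count of 50 is made up.

```python
import random
from statistics import NormalDist, mean, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t-statistic for the difference in means of two samples."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value using a normal approximation (large samples only)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def passes_bonferroni(t: float, n_features: int, alpha: float = 0.05) -> bool:
    """True if the feature survives correction for testing n_features at once."""
    return two_sided_p_normal(t) < alpha / n_features


# Synthetic per-trade P&L, split by a hypothetical boolean flag.
rng = random.Random(0)
pnl_flag_on = [rng.gauss(0.30, 1.0) for _ in range(3000)]
pnl_flag_off = [rng.gauss(0.00, 1.0) for _ in range(3000)]

t = welch_t(pnl_flag_on, pnl_flag_off)
print(round(t, 1), passes_bonferroni(t, n_features=50))
```

Running the same function over every candidate feature and sorting by absolute t-stat is the whole "which feature actually matters" exercise.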

The lesson, not the rule

If your "edge" is a 47-feature gradient-boosted model, check what happens if you replace it with the single most economically obvious flag. Often that flag does 80% of the work and your model is overfitting the residual. The win-rate gap I found between the simple-flag-on days and simple-flag-off days was bigger than every parameter sweep I ran on the strategy itself. Most "signal engineering" is just expensive ways to discover one boring flag.

I'm not posting the specific flag. Same logic as the cell recipe - any reader with a few years of trade tape and a t-test could derive an equivalent on their own. The actual constraint isn't the regression, it's generating the trade tape in the first place - which means a working fill model running against minute-resolution historical chains. Without those inputs you have nothing to regress against.

One thing I will say, because it's surprising and not actionable on its own: the worst environment for selling puts is when the market is paying the most. Top-quintile VRP days have a 66% win rate vs the 74% baseline. By the time premium is that rich, something is actually wrong.
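A quintile comparison like this one is a few lines once you have a trade tape. The sketch below assumes each trade is a `(vrp_at_entry, pnl)` pair and that a win is any trade with positive P&L. Both are my own simplifications. It needs at least five trades.

```python
from typing import List, Sequence, Tuple


def win_rate_by_quintile(trades: Sequence[Tuple[float, float]]) -> List[float]:
    """Win rate per VRP quintile, ordered from lowest VRP to highest."""
    ordered = sorted(trades, key=lambda trade: trade[0])
    n = len(ordered)
    rates = []
    for q in range(5):
        bucket = ordered[q * n // 5:(q + 1) * n // 5]
        wins = sum(1 for _, pnl in bucket if pnl > 0)
        rates.append(wins / len(bucket))
    return rates


# Toy tape: the two highest-VRP trades are the losers.
toy = [(float(i), 1.0 if i < 8 else -1.0) for i in range(10)]
print(win_rate_by_quintile(toy))  # [1.0, 1.0, 1.0, 1.0, 0.0]
```

This buckets after the fact, which is fine for diagnosis. A live signal would need quintile boundaries computed only from data available at entry.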

And one general principle that came out of the interaction tests: don't sell into rising fear with an inverted term curve. Wait for the curve to normalize or for the fear to start fading. Either is fine. Both being wrong is the worst regime in the data.
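As a rule, that principle is a two-condition gate. The sketch below is one possible reading, not the engine's definition: it assumes "fear" is spot VIX, "rising" means higher than the previous close, and "inverted" means VIX above VIX3M.

```python
def allow_entry(vix_today: float, vix_prev: float, vix3m_today: float) -> bool:
    """Block new short-put entries only when fear is rising AND the curve is inverted."""
    fear_rising = vix_today > vix_prev        # assumption: one-day change in VIX
    curve_inverted = vix_today > vix3m_today  # assumption: VIX above VIX3M
    return not (fear_rising and curve_inverted)


print(allow_entry(vix_today=30.0, vix_prev=25.0, vix3m_today=27.0))  # False
print(allow_entry(vix_today=24.0, vix_prev=25.0, vix3m_today=22.0))  # True: fear fading
print(allow_entry(vix_today=20.0, vix_prev=18.0, vix3m_today=22.0))  # True: curve normal
```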


6. Multi-DTE pooling - tried it, lost 9pp of CAGR

Seemed obvious: instead of committing to one tenor, rank candidates across 30/45/60-DTE chains every entry, pick the best EV-per-dollar-at-risk, let the term structure tell me which tenor is most attractive that day. Built it. Ran it.

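| Tenor selection | Trades | Return | CAGR | Sharpe | MaxDD |
| --- | --- | --- | --- | --- | --- |
| 45Δ Pool 30/45/60 DTE No stop | 112 | +30.6% | +3.8% | +0.23 | 30.1% |
| 45Δ Pool 30/45/60 DTE SL 100% | 155 | +28.0% | +3.5% | +0.22 | 29.3% |
| 45Δ Focused 30 DTE No stop | 82 | +70% | +12.5% | +0.72 | 27.5% |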

Pooling underperformed focused by ~9pp of CAGR. The change that was supposed to fix the strategy actually hurt it.

  • ~68%: fills landed at 30-DTE - the alpha-generating bucket
  • ~24%: fills landed at 45-DTE - flat-to-negative
  • ~8%: fills landed at 60-DTE - too few to matter
  • -9 pp: CAGR cost of pooling vs focused

The ranker correctly preferred 30-DTE most of the time. When it picked 45-DTE, it was specifically because the 30-DTE chain looked worse than usual that bar - which means the 45-DTE bucket got adversely selected. More flexibility gave the optimizer more ways to be wrong, not more ways to be right.
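The pooled ranker in this section can be sketched as follows. The `Candidate` fields and the binary win/lose payoff (keep the credit or lose the full width minus credit) are simplifying assumptions of mine. The real ranker's EV model is not described in this article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    dte: int
    credit: float   # credit received
    width: float    # distance between the two strikes
    p_win: float    # estimated probability of keeping the credit


def ev_per_dollar_at_risk(c: Candidate) -> float:
    """Expected value divided by max loss, under a binary win/lose payoff."""
    max_loss = c.width - c.credit
    ev = c.p_win * c.credit - (1.0 - c.p_win) * max_loss
    return ev / max_loss


def pick_pooled(cands: Sequence[Candidate]) -> Candidate:
    """Pooling: best candidate across every tenor on offer."""
    return max(cands, key=ev_per_dollar_at_risk)


def pick_focused(cands: Sequence[Candidate], dte: int) -> Optional[Candidate]:
    """Focused: best candidate at one fixed tenor, or None if there is none."""
    matches = [c for c in cands if c.dte == dte]
    return max(matches, key=ev_per_dollar_at_risk) if matches else None


cands = [
    Candidate(dte=30, credit=1.00, width=5.0, p_win=0.80),  # 30-DTE looks weak this bar
    Candidate(dte=45, credit=1.30, width=5.0, p_win=0.80),
    Candidate(dte=60, credit=1.50, width=5.0, p_win=0.74),
]
print(pick_pooled(cands).dte)       # 45
print(pick_focused(cands, 30).dte)  # 30
```

The example shows the adverse-selection mechanism: the pooled picker only leaves 30-DTE on bars where the 30-DTE chain looks unusually poor, so its 45-DTE trades are concentrated in exactly those moments.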

Counterintuitive but real

I went into this thinking "more degrees of freedom = better outcomes." Wrong. More degrees of freedom = the optimizer can also fail in more ways. Unless you have a specific signal that the alternative tenor is better, just pick one DTE and stick with it.


7. The bug log (a.k.a. don't trust your own backtest)

Every one of these distorted the headline numbers, and most of them flattered the strategy. Most were caught only after a code review, which is humbling. Here they are roughly in order of how badly they would have lied:

| Bug | What it did | Damage |
| --- | --- | --- |
| Mid-fill assumption | Filled at the bid-ask average always | Flipped multiple losing strategies positive |
| Look-ahead in signal | Day-D signal used end-of-D data at 10:05 AM entry | Inflated CAGR ~5-10pp on most cells |
| Stale-quote acceptance | "Fills" at quotes that were no longer real liquidity | ~30% of fills had negative edge captured |
| EV-sorted tiebreak | Higher-EV candidate "filled" first when two crossed same bar | Subtle but real per-trade lift |
| Warmup sizing bug | Full Kelly applied before signal had any history | Cratered 2018 results |
| Validation walk-back mismatch | Validator used exact-date lookup, engine walked back 7 days | Bogus regression stats on weekend dates |
| Walk-the-limit (proposed, rejected) | Drop the limit a penny each minute if unfilled | Caught before merge - introduces adverse selection |
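Two rows in that table, the look-ahead and the walk-back mismatch, come down to how a signal is looked up for a given entry date. A minimal sketch of a point-in-time lookup is below. The function name is hypothetical, and the 7-day window mirrors the walk-back mentioned in the table. Sharing one such function between engine and validator is one way to keep them from drifting apart.

```python
from bisect import bisect_left
from datetime import date, timedelta
from typing import Optional, Sequence


def signal_as_of(
    dates: Sequence[date],       # signal dates, sorted ascending
    values: Sequence[float],     # end-of-day signal value for each date
    entry_day: date,
    max_walk_back_days: int = 7,
) -> Optional[float]:
    """Latest signal dated strictly before entry_day, within the walk-back window."""
    i = bisect_left(dates, entry_day) - 1  # last index with dates[i] < entry_day
    if i < 0:
        return None                        # no history yet
    if entry_day - dates[i] > timedelta(days=max_walk_back_days):
        return None                        # too stale to trust
    return values[i]


dates = [date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 5)]  # Fri, Mon, Tue
values = [0.1, 0.2, 0.3]

print(signal_as_of(dates, values, date(2024, 3, 4)))  # 0.1: Monday entry sees Friday's close
print(signal_as_of(dates, values, date(2024, 3, 3)))  # 0.1: Sunday walks back to Friday
print(signal_as_of(dates, values, date(2024, 3, 1)))  # None: nothing before the first date
```

The strict "before" is the point: a 10:05 AM entry on day D cannot know day D's end-of-day value.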

The archive_v3_grid/ directory has the same engine on the same data with all the bias bugs intact. Every cell in there returned -1% to -8% CAGR. So when the post-fix numbers turned positive, that wasn't simulator noise - that was what the bias was masking.

If you're building your own backtest and the numbers look great on the first run, you have a bug. Find it.


8. The sizing layer that nobody talks about

Sharpe and Calmar tables get all the attention. The thing that actually keeps you in the game is the sizing rule. Mine, after a lot of trial and error:

Unlevered preset (default)
  • kelly_default 0.05, kelly_max 0.25
  • vrp_on_mult 1.0
  • Honest reference - signal alpha without leverage
Leveraged preset (stress test)
  • kelly_default 0.15, kelly_max 0.50
  • vrp_on_mult 2.5
  • If a config blows up here but works unlevered, the strategy is fine and the leverage is wrong
  • 1.0: hard cap on kelly_f - leveraged math can multiply to 1.25 without it. The cap is "drawdown vs ruin."
  • 30%: drawdown circuit breaker - peak-to-trough. Halts the run rather than letting it grind to ruin.
  • 0.0: warmup multiplier - skip days before the signal has enough history. Cratered my 2018 numbers when this was 1.0.
  • 50%: absolute bankroll floor - secondary backstop in case the 30% rule somehow doesn't fire.
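Put together, the guard rails above look roughly like this. It is a sketch under assumptions: the article does not spell out how kelly_default, kelly_max and vrp_on_mult combine, only that the leveraged product can reach 1.25 (0.50 × 2.5), so the code simply multiplies a base fraction by a signal multiplier. All function and parameter names are hypothetical.

```python
from typing import Sequence


def kelly_fraction(
    kelly_base: float,
    signal_mult: float,
    warmed_up: bool,
    cap: float = 1.0,
) -> float:
    """Fraction of bankroll to risk: zero during warmup, otherwise capped."""
    if not warmed_up:
        return 0.0  # warmup multiplier of 0.0: no trades before the signal has history
    return min(kelly_base * signal_mult, cap)


def breaker_tripped(
    equity: Sequence[float],
    max_dd: float = 0.30,
    floor: float = 0.50,
) -> bool:
    """True if the current drawdown from peak or the absolute floor is breached."""
    peak = max(equity)
    current = equity[-1]
    drawdown = (peak - current) / peak
    below_floor = current <= floor * equity[0]  # floor relative to starting bankroll
    return drawdown >= max_dd or below_floor


print(kelly_fraction(0.50, 2.5, warmed_up=True))   # 1.0 (capped from 1.25)
print(kelly_fraction(0.05, 1.0, warmed_up=True))   # 0.05
print(kelly_fraction(0.15, 2.5, warmed_up=False))  # 0.0
print(breaker_tripped([100, 130, 95]))             # False: 26.9% off the peak
print(breaker_tripped([100, 130, 90]))             # True: 30.8% off the peak
```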

Half-Kelly is conservative on purpose. Mean-variance Kelly assumes Gaussian returns, which short-vol returns absolutely are not - they're skewed left with fat tails. The "true" Kelly under fat tails is below the mean-variance Kelly. So half-Kelly isn't lazy, it's roughly correct.
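For reference, the mean-variance Kelly fraction is the mean return divided by the return variance, and half-Kelly is half of that. A minimal sketch on made-up per-trade returns (the numbers are illustrative only):

```python
from statistics import mean, variance
from typing import Sequence


def mean_variance_kelly(returns: Sequence[float]) -> float:
    """Gaussian-approximation Kelly fraction: mean / variance of returns."""
    return mean(returns) / variance(returns)


def half_kelly(returns: Sequence[float]) -> float:
    """Half of the mean-variance Kelly fraction."""
    return 0.5 * mean_variance_kelly(returns)


# Made-up per-trade returns with one larger loss, as short-vol tends to produce.
sample = [0.02, 0.01, -0.03, 0.02, 0.01]
print(mean_variance_kelly(sample), half_kelly(sample))
```

The raw number can come out well above 1.0 on a small, calm sample, which is one more reason the hard cap and the halving both exist.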


The configuration I'd actually run

I'm not posting the parameter table. The cell that came out of this study is the only piece I'm keeping for myself. What I'll tell you is the shape of how to find your own:

| Parameter | Range to sweep | What to optimize for |
| --- | --- | --- |
| Strategy | SPY put credit spread, limit fills (no mid-fill) | Honesty > flattering numbers |
| Delta | 0.10 - 0.45 | Equity-vol tolerance, not alpha |
| DTE | 7, 14, 30, 45, 60 (skip the extremes after the first sweep) | Calmar, not CAGR |
| Profit take | 25%, 50%, 75% of credit | Cycle speed vs gamma giveback |
| Stop loss | Only 100% or none. Never 200%. | Survival |
| Sizing | Half-Kelly × signal multiplier, capped at 1.0 | Fat-tail-aware, not mean-variance |
| Breaker | 30% peak-to-trough; 50% absolute floor | Hard stop on the strategy itself |
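The sweep itself is a Cartesian product over the ranges in the table. The sketch below uses the three deltas discussed in this article and the listed DTE, profit-take and stop-loss values. That gives 90 cells, not the 96 of the original study, whose exact axes are not published here. The `calmar_leader` helper and the results layout are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# (delta, dte, profit_take, stop_loss); stop_loss None means no stop.
Cell = Tuple[float, int, float, Optional[float]]

DELTAS = (0.10, 0.30, 0.45)
DTES = (7, 14, 30, 45, 60)
PROFIT_TAKES = (0.25, 0.50, 0.75)
STOP_LOSSES = (1.00, None)


def build_grid() -> List[Cell]:
    """Every combination of the four swept parameters."""
    return list(product(DELTAS, DTES, PROFIT_TAKES, STOP_LOSSES))


def calmar_leader(results: Dict[Cell, Tuple[float, float]]) -> Cell:
    """results maps cell -> (cagr, max_dd); returns the highest-Calmar cell."""
    return max(results, key=lambda cell: results[cell][0] / results[cell][1])


print(len(build_grid()))  # 90

# Two rows from the Section 2 table, as (CAGR, MaxDD).
results = {
    (0.10, 7, 0.50, 1.00): (0.660, 0.301),
    (0.10, 7, 0.75, 1.00): (0.539, 0.311),
}
print(calmar_leader(results))  # (0.1, 7, 0.5, 1.0)
```

Cells that hit the breaker or wiped out should be dropped before ranking, since a zero or undefined drawdown makes the ratio meaningless.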

Run that 96-cell sweep on a clean engine (Section 7 has the bug list - check yours against it first), and one specific intersection of (delta, DTE, PT) will jump out as the Calmar leader. That's the cell. I'm not saying which one mine was.

The thing that nobody tells you about that sweep: you cannot run it without minute-resolution SPY chains going back to 2018. Daily settlement data won't capture the intraday MM fill behavior that turns out to drive 30-60% of the apparent edge. End-of-day VIX won't give you the falling-VIX flag that turns out to be your strongest signal. Without 1-minute chains plus walk-forward signal data, you're sweeping a 96-cell grid on noise.


TL;DR

  1. Mid-fill backtests overstate CAGR by 30-60%. Build a real fill model or don't bother.
  2. Stop loss at 100% of credit collected is the entire game at short DTE. +5,400% with it, -100% without.
  3. SL=200% is worse than no stop. Pick tight or pick none, never the middle.
  4. There's a sweet spot in the grid where Calmar > 2. Mid-DTE, mid-PT, low delta. Exact cell stays with me - find your own with a clean sweep on the same data.
  5. The strongest signal in 16k trades was a one-line boolean flag. Beat my 3-layer composite. The lesson: replace your model with the obvious flag and check how much you actually lose.
  6. The market pays the most right before it bites. Top-quintile VRP days had a 66% win rate vs the 74% baseline.
  7. More flexibility cost me 9pp of CAGR. Multi-DTE pooling adversely selected the bad tenors.
  8. Higher delta moves equity vol, not alpha. 10Δ, 30Δ, 45Δ all had similar Calmar - just louder swings.

If you're running short-vol on SPY without a stop loss, please reconsider. If you're running it with SL=200%, definitely reconsider.

The data you actually need to find your own cell
Minute-resolution SPY chains. Walk-forward signals. Since 2018.
The reason I can run a 96-cell sweep with honest fills isn't the engine - the engine is bookkeeping. It's the historical data feed. Alpha tier ships the same minute-bar SPY option chains, surface-consistent IVs, walk-forward VRP percentiles, and macro signals (VIX/VIX3M/VVIX/HY OAS) that I used to produce every number in this article.
  • Resolution: 1-minute bars
  • History: 2018 → today
  • Bias hygiene: walk-forward by default
  • Signals: VRP, term, GEX
See Alpha tier → Find your own Calmar-2 cell. Don't trust mine.
