I Backtested 96 SPY Put Credit Spread Strategies Over 7 Years - Here's What Actually Worked

Seven years of 1-minute SPY chains. 96 grid cells. Honest fills, signal-gated sizing, drawdown breaker. Same cell ran +5,400% with a stop loss and -100% without. Multi-DTE pooling somehow underperformed single-DTE. SL=200% turned out worse than no stop at all. The best signal in 16k trades was a one-line flag that beat my 3-layer composite. The Calmar-leader cell is in the grid - I'm telling you the shape, not the coordinates.

Tomasz Dobrowolski Quant Engineer

Apr 28, 2026

Options Backtesting VRP PutSpreads SPY Quant

This started because I wanted to know how much of the apparent edge in selling SPY put credit spreads survives once you stop pretending you fill at the mid. Spoiler: less than the YouTube guys claim, but more than zero.

What's in this thing: 1-minute SPY option chains from 2019 to 2026, MM-style limit fills (no mid-fill nonsense), a VRP signal that strictly cannot see the future, half-Kelly sizing, and a 30% drawdown breaker that just halts the whole run when things go bad. 96 grid cells, 16,024 trades validated, and one finding that absolutely dominates everything else.

-100%: short DTE, no stop loss
+5,439%: same cell, with stop loss
-9 pp: CAGR cost of multi-DTE flexibility

96 cells · 7.16 years · 16,024 trades
Limit fills · VRP-gated sizing
Half-Kelly · 30% drawdown breaker

1. Mid-fill backtests are lying to you

First version of the engine filled at the mid. Numbers looked great. Then I switched to "post a limit at combo_ask + $0.04 and wait for someone to cross it" - which is what an actual desk does on options - and CAGR dropped 30-60% across the grid. Several cells flipped from positive to negative. The strategy didn't change. Just the simulator's assumption about who fills whom.

Real fill stats from one of the runs:

Run	Orders posted	Filled	Fill rate	Avg wait	Avg edge captured
45Δ 30 DTE PT 50% No stop	479	112	23.4%	12.4 min	-$0.04 to -$0.07
45Δ 30 DTE PT 50% SL 100%	633	155	24.5%	12.2 min	-$0.04 to -$0.07
45Δ 30 DTE PT 50% SL 200%	383	64	16.7%	13.3 min	-$0.04 to -$0.07

You fill roughly 17-25% of the orders you post, after a wait of about 12-13 minutes, and on average you fill worse than mid by 4-7 cents per contract. The MM doesn't make money at the fill - the MM makes money over the hold, when theta beats realized vol. Any backtest filling at mid is silently gifting your strategy that 4-7 cents on every trade.

Mid-fill fantasy

100% fill rate, instant
Always at the mid
Never rejected
Inflates CAGR by 30-60%

What actually happens

~17-25% fill rate, ~12-13 min wait
Filled 4-7 cents worse than mid
Stale-quote rejections common
Random tiebreak when two limits cross same bar
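
To make the fill rule above concrete, here is a minimal sketch of a resting-limit simulator. It is not the engine from this study: the `Bar` and `Fill` types, the 30-minute expiry and the 5-second staleness threshold are all assumptions chosen for illustration. The only ideas taken from the text are that an order fills when a live quote crosses the limit, that stale quotes are rejected, and that the resulting fill lands worse than mid.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Bar:
    minute: int          # minutes elapsed since the order was posted
    combo_bid: float     # best bid for the spread (credit a buyer will pay)
    combo_ask: float     # best offer for the spread
    quote_age_s: float   # seconds since this quote last updated


@dataclass(frozen=True)
class Fill:
    minute: int
    credit: float
    edge_vs_mid: float   # negative means filled worse than mid


def simulate_limit_sell(bars: Sequence[Bar], limit_credit: float,
                        max_wait_min: int = 30,          # assumed expiry
                        max_quote_age_s: float = 5.0     # assumed staleness cutoff
                        ) -> Optional[Fill]:
    """Rest a sell limit and fill only when a fresh bid reaches it."""
    for bar in bars:
        if bar.minute > max_wait_min:
            break                      # order expires unfilled
        if bar.quote_age_s > max_quote_age_s:
            continue                   # stale quote: not real liquidity
        if bar.combo_bid >= limit_credit:
            mid = (bar.combo_bid + bar.combo_ask) / 2
            return Fill(bar.minute, limit_credit, limit_credit - mid)
    return None


# Toy bars: no cross, then a stale quote that must be ignored, then a real cross.
bars = [Bar(1, 0.90, 1.00, 1.0), Bar(2, 0.96, 1.04, 60.0), Bar(3, 0.95, 1.05, 1.0)]
fill = simulate_limit_sell(bars, limit_credit=0.95)
```

On these toy bars the order fills on the third minute at about 5 cents below mid, and the stale second bar is skipped even though its bid would have crossed. A mid-fill simulator would have booked the same trade instantly at the mid.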

2. The stop loss IS the strategy at short DTE

This is the finding. By a lot. Same cell, same fills, same period. Toggle the stop loss:

Configuration	Trades	Return	CAGR	MaxDD	Calmar	What happened
10Δ 7 DTE PT 50% No stop	--	-100%	wipeout	broke circuit	--	Single tail event = account zero
10Δ 7 DTE PT 50% SL 100%	460	+5,439%	+66.0%	30.1%	2.19	Same thing, just add the stop
10Δ 7 DTE PT 75% SL 100%	360	+2,947%	+53.9%	31.1%	1.73	Wider PT, slower cycle
10Δ 7 DTE PT 25% SL 100%	649	+1,752%	+44.6%	30.2%	1.48	Tighter PT, more execution churn

Yeah. +5,439% with a stop, -100% without. Same cell. The stop loss isn't a fine-tuning parameter, it's the entire strategy at short DTE. If you're running 7DTE credit spreads with no defined-loss exit you are one bad Monday from a margin call.
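
The exit rules being toggled here are simple enough to write down. A minimal sketch, assuming profit take and stop loss are both expressed as fractions of the credit collected (which is how the PT and SL labels read); the function names and the $5-wide example spread are illustrative, not from the engine.

```python
from typing import Optional


def exit_reason(credit: float, debit_to_close: float,
                profit_take: float = 0.50,
                stop_loss: Optional[float] = 1.00) -> Optional[str]:
    """Return why a short credit spread should be closed now, or None to hold."""
    pnl = credit - debit_to_close              # per-share P&L if closed now
    if pnl >= profit_take * credit:
        return "profit_take"
    if stop_loss is not None and -pnl >= stop_loss * credit:
        return "stop_loss"                     # loss reached SL x credit
    return None


def max_loss_multiple(width: float, credit: float) -> float:
    """Worst-case loss at expiry, as a multiple of the credit collected."""
    return (width - credit) / credit


# Hypothetical spread: $5 wide, sold for $0.50.
# With SL=100% the loss is cut near 1x credit; with no stop it can reach 9x.
print(exit_reason(0.50, 1.00))                  # stop_loss
print(exit_reason(0.50, 1.00, stop_loss=None))  # None
print(max_loss_multiple(5.00, 0.50))            # 9.0
```

The asymmetry is the point: a low-delta spread collects a small credit against a much larger width, so one unstopped tail loss can cancel many winners.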

3. There's a sweet spot in the middle of the grid (and I'm not posting it)

Short DTE is feast-or-famine. Long DTE bleeds. Somewhere in between, at one specific delta / DTE / PT combination, the equity curve flattens out into something that looks genuinely tradeable - low double-digit drawdown, Sharpe well above 1, Calmar above 2. I'm not going to publish the exact cell, because (a) it took 96 sweeps to find it, (b) it's the only thing in this whole project worth keeping private, and (c) the recipe is useless to you anyway without the same historical chain data the engine ran against.

What I will say is what the sweet spot is not:

It's not the highest-CAGR cell. The headline +5,439% from the last section has 30% MaxDD - that's a press release, not a strategy.
It's not the highest-Sharpe cell either. The Sharpe leader in my grid sits at 2.48 with 39.7% MaxDD. Sharpe-rich, Calmar-poor, will trip every breaker you set. Skip.
It's not the shortest DTE (gamma trap) and it's not the longest (vega bleed).
It's not the highest delta - delta moves vol-of-equity, not alpha (more on that below).

That eliminates roughly 90 of the 96 cells. The remaining handful is what you're looking for, and which one wins inside that handful depends entirely on the bias hygiene of your backtester (see Section 7) and on having minute-resolution chains going back far enough to count COVID, the 2022 grind-down, and a few smaller vol spikes as separate observations rather than averaged-away noise.

Surprise finding: Delta is a vol dial, not an alpha dial

I expected higher delta to mean more edge - more credit per contract, more theta to harvest. Instead, raising delta from 10 to 30 to 45 raised CAGR and raised MaxDD in roughly equal proportion. Calmar stayed approximately flat. Higher delta just gives you a louder version of the same equity curve. The strategy alpha lives in the SL/PT/DTE choice. Delta is just how big you want the swings.
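
For anyone checking their own grid, these are the standard definitions behind the CAGR, MaxDD and Calmar columns. A small sketch; the equity curve at the bottom is toy data, not output from the study.

```python
from typing import Sequence


def cagr(equity: Sequence[float], years: float) -> float:
    """Compound annual growth rate from the first to the last equity value."""
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar(equity: Sequence[float], years: float) -> float:
    """CAGR divided by max drawdown."""
    mdd = max_drawdown(equity)
    return cagr(equity, years) / mdd if mdd > 0 else float("inf")


toy_curve = [100.0, 120.0, 90.0, 130.0]   # toy data
print(max_drawdown(toy_curve))            # 0.25
```

The tables above are consistent with this definition: for the 10Δ 7 DTE SL 100% row, 66.0% CAGR over 30.1% MaxDD gives a Calmar of about 2.19. If raising delta scales CAGR and MaxDD by roughly the same factor, the ratio barely moves, which is what "vol dial" means here.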

If you want to find your own version of this cell: the engine is mostly bookkeeping - the actual constraint is having minute-resolution SPY option chains going back to 2018 with surface-consistent IVs and walk-forward signal data. That's what FlashAlpha's Alpha tier historical API is for. Without those inputs, you're sweeping a 96-cell grid on noise.

4. SL=200% is worse than no stop loss (yes, really)

I expected SL=200% to be the sensible compromise. Looser than 100% so I don't get noise-stopped, tighter than no-SL so a tail doesn't kill me. The data laughed at me:

Same spread, three stop-loss settings	Trades	Return	CAGR	Sharpe	MaxDD
45Δ 30 DTE PT 50% No stop	112	+30.6%	+3.8%	+0.23	30.1%
45Δ 30 DTE PT 50% SL 100%	155	+28.0%	+3.5%	+0.22	29.3%
45Δ 30 DTE PT 50% SL 200% ← trap	64	-16.5%	-2.5%	-0.17	31.2%

SL=200% lost money. While both no-stop and SL=100% made money on the same data.

The reason makes sense once you stare at the trade paths: by the time the loss has grown to 200% of credit, you're deep ITM and gamma is doing the marking, not theta. You stop out at a terrible price, on a terrible day, after letting the position breathe past the point where it could have recovered. SL=100% stops you before gamma takes over. No SL at least lets the position run to expiry - sometimes you get bailed out by a recovery. SL=200% gets the worst of both worlds.

Lesson: tighter beats looser. Either run SL=100% or no stop. The middle is the trap.

5. I built a fancy signal. A one-liner beat it.

I built a 3-layer composite - Premium / Danger / Stabilization scores, z-scored macro inputs, continuous Kelly multiplier, the whole nine yards. Then I ran t-tests across all 16,024 trades to see which features actually predicted P&L.

The single strongest predictor in the entire feature set wasn't my composite. It was a one-line boolean flag on a free macro series - the kind of feature you can compute in three lines of pandas. t-stat over 8, on 5,000+ trades. Bonferroni-correct it across the entire feature space and it still wins comfortably.

My fancy composite added something on top, but the signal-to-noise is mostly in the simple flag.
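
The test itself is nothing exotic. A stdlib-only sketch of a two-sample Welch t-statistic on trade P&L split by a boolean flag, with a Bonferroni adjustment. Assumptions: the p-value uses a normal approximation, which is reasonable only for samples in the thousands like the ones described here, and the P&L lists and the feature count of 50 are made up.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence


def welch_t(flag_on: Sequence[float], flag_off: Sequence[float]) -> float:
    """Welch t-statistic for the difference in mean P&L between two groups."""
    se2 = variance(flag_on) / len(flag_on) + variance(flag_off) / len(flag_off)
    return (mean(flag_on) - mean(flag_off)) / math.sqrt(se2)


def bonferroni_p(t_stat: float, n_tests: int) -> float:
    """Two-sided p-value (normal approximation), scaled by the number of tests."""
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return min(1.0, p * n_tests)


# Toy P&L per trade, split by a hypothetical boolean feature.
pnl_flag_on = [2.0, 3.0, 4.0, 5.0, 6.0]
pnl_flag_off = [1.0, 2.0, 3.0, 4.0, 5.0]
t = welch_t(pnl_flag_on, pnl_flag_off)

# How a t-stat of 8 fares if it was one of 50 features tested (count assumed).
p_adj = bonferroni_p(8.0, n_tests=50)
```

The toy split gives a t-statistic of 1.0, nowhere near significant. A t-stat of 8 stays far below any usual threshold even after multiplying its p-value by 50, which is the sense in which a flag "still wins" after correction.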

The lesson, not the rule

If your "edge" is a 47-feature gradient-boosted model, check what happens if you replace it with the single most economically obvious flag. Often that flag does 80% of the work and your model is overfitting the residual. The win-rate gap I found between the simple-flag-on days and simple-flag-off days was bigger than every parameter sweep I ran on the strategy itself. Most "signal engineering" is just expensive ways to discover one boring flag.

I'm not posting the specific flag. Same logic as the cell recipe - any reader with a few years of trade tape and a t-test could derive an equivalent on their own. The actual constraint isn't the regression, it's generating the trade tape in the first place - which means a working fill model running against minute-resolution historical chains. Without those inputs you have nothing to regress against.

One thing I will say, because it's surprising and not actionable on its own: the worst environment for selling puts is when the market is paying the most. Top-quintile VRP days have a 66% win rate vs a 74% baseline. By the time premium is that rich, something is actually wrong.

And one general principle that came out of the interaction tests: don't sell into rising fear with an inverted term curve. Wait for the curve to normalize or for the fear to start fading. Either is fine. Both being wrong is the worst regime in the data.
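
As a gate, that principle is a single boolean. A sketch only: the text doesn't say which series measure "fear" or the term curve, so `fear_change`, `front_iv` and `back_iv` are hypothetical inputs (for example, the day-over-day change in a vol index, and near-dated versus longer-dated implied vol).

```python
def allow_entry(fear_change: float, front_iv: float, back_iv: float) -> bool:
    """Block new short-put entries only when both conditions are bad at once."""
    rising_fear = fear_change > 0          # fear measure still climbing
    inverted_curve = front_iv > back_iv    # near-dated vol above longer-dated
    return not (rising_fear and inverted_curve)


print(allow_entry(+1.5, 0.28, 0.24))   # False: rising fear and inverted curve
print(allow_entry(-0.8, 0.28, 0.24))   # True: curve inverted but fear fading
print(allow_entry(+1.5, 0.20, 0.24))   # True: fear rising but curve normal
```

Note the gate is deliberately an AND, not an OR: either condition alone still allows the trade.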

6. Multi-DTE pooling - tried it, lost 9pp of CAGR

Seemed obvious: instead of committing to one tenor, rank candidates across 30/45/60-DTE chains every entry, pick the best EV-per-dollar-at-risk, let the term structure tell me which tenor is most attractive that day. Built it. Ran it.

Tenor selection	Trades	Return	CAGR	Sharpe	MaxDD
45Δ Pool 30/45/60 DTE No stop	112	+30.6%	+3.8%	+0.23	30.1%
45Δ Pool 30/45/60 DTE SL 100%	155	+28.0%	+3.5%	+0.22	29.3%
45Δ Focused 30 DTE No stop	82	+70%	+12.5%	+0.72	27.5%

Pooling underperformed focused by ~9pp of CAGR (+3.8% vs +12.5% on the no-stop runs). The change that was supposed to fix the strategy ended up hurting it.

~68% of fills landed at 30-DTE - the alpha-generating bucket
~24% of fills landed at 45-DTE - flat-to-negative
~8% of fills landed at 60-DTE - too few to matter
-9 pp: CAGR cost of pooling vs focused

The ranker correctly preferred 30-DTE most of the time. When it picked 45-DTE, it was specifically because the 30-DTE chain looked worse than usual that bar - which means the 45-DTE bucket got adversely selected. More flexibility gave the optimizer more ways to be wrong, not more ways to be right.
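
Here is a toy version of that ranker, to show how the adverse selection arises. Everything in it is assumed: the `Candidate` fields, the win probabilities, and the EV formula (win the credit with probability `p_win`, otherwise lose width minus credit). The pooled engine's real scoring is not published.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    dte: int
    credit: float    # credit received per share
    width: float     # distance between strikes
    p_win: float     # estimated probability of keeping the credit


def ev_per_dollar_at_risk(c: Candidate) -> float:
    """Expected value divided by max loss, under a two-outcome payoff model."""
    max_loss = c.width - c.credit
    ev = c.p_win * c.credit - (1.0 - c.p_win) * max_loss
    return ev / max_loss


def pick_tenor(candidates: Sequence[Candidate]) -> Candidate:
    return max(candidates, key=ev_per_dollar_at_risk)


# Normal bar: the 30-DTE candidate ranks first.
normal = [Candidate(30, 1.00, 5.0, 0.85), Candidate(45, 1.20, 5.0, 0.80),
          Candidate(60, 1.50, 5.0, 0.72)]
# Bar where the 30-DTE chain looks worse than usual: 45-DTE wins by default.
weak_30 = [Candidate(30, 0.80, 5.0, 0.85), Candidate(45, 1.20, 5.0, 0.80),
           Candidate(60, 1.50, 5.0, 0.72)]

print(pick_tenor(normal).dte)    # 30
print(pick_tenor(weak_30).dte)   # 45
```

In the second case nothing about the 45-DTE candidate improved. It was chosen only because the 30-DTE quote deteriorated, so the 45-DTE bucket ends up filled with trades taken on bars where conditions were already worse than usual.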

Counterintuitive but real

I went into this thinking "more degrees of freedom = better outcomes." Wrong. More degrees of freedom = the optimizer can also fail in more ways. Unless you have a specific signal that the alternative tenor is better, just pick one DTE and stick with it.

7. The bug log (a.k.a. don't trust your own backtest)

Every one of these would have inflated the headline numbers. Most were caught only after a code review, which is humbling. Here they are roughly in order of how badly they would have lied:

Bug	What it did	Damage
Mid-fill assumption	Filled at the bid-ask average always	Flipped multiple losing strategies positive
Look-ahead in signal	Day-D signal used end-of-D data at 10:05 AM entry	Inflated CAGR ~5-10pp on most cells
Stale-quote acceptance	"Fills" at quotes that were no longer real liquidity	~30% of fills had negative edge captured
EV-sorted tiebreak	Higher-EV candidate "filled" first when two crossed same bar	Subtle but real per-trade lift
Warmup sizing bug	Full Kelly applied before signal had any history	Cratered 2018 results
Validation walk-back mismatch	Validator used exact-date lookup, engine walked back 7 days	Bogus regression stats on weekend dates
Walk-the-limit (proposed, rejected)	Drop the limit a penny each minute if unfilled	Caught before merge - introduces adverse selection

The archive_v3_grid/ directory has the same engine on the same data with all the bias bugs intact. Every cell in there returned -1% to -8% CAGR. So when the post-fix numbers turned positive, that wasn't simulator noise - that was what the bias was masking.

If you're building your own backtest and the numbers look great on the first run, you have a bug. Find it.
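
Two of the bugs above (look-ahead in the signal, and the validator/engine walk-back mismatch) go away if both code paths share one as-of lookup. A minimal sketch, assuming a date-sorted daily signal series; the function name and the toy values are mine, and the 7-day bound mirrors the walk-back mentioned in the table.

```python
from bisect import bisect_left
from datetime import date
from typing import Optional, Sequence


def signal_as_of(signal_dates: Sequence[date], signal_values: Sequence[float],
                 entry_day: date, max_walk_back_days: int = 7) -> Optional[float]:
    """Latest signal dated strictly before entry_day, within the walk-back bound.

    Strictly-before matters: a day-D value built from end-of-day data is not
    known at a 10:05 AM entry on day D.
    """
    i = bisect_left(signal_dates, entry_day) - 1   # last index with date < entry_day
    if i < 0:
        return None                                # no history yet
    if (entry_day - signal_dates[i]).days > max_walk_back_days:
        return None                                # nearest value is too old
    return signal_values[i]


dates = [date(2024, 1, 5), date(2024, 1, 8)]       # a Friday and the next Monday
values = [0.10, 0.20]                              # toy signal values

print(signal_as_of(dates, values, date(2024, 1, 8)))   # 0.1 (Friday's value, not Monday's)
print(signal_as_of(dates, values, date(2024, 1, 6)))   # 0.1 (Saturday walks back to Friday)
print(signal_as_of(dates, values, date(2024, 1, 5)))   # None (nothing strictly earlier)
```

Calling the same function from the engine and the validator means weekend dates resolve identically in both, instead of one walking back and the other doing an exact-date lookup.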

8. The sizing layer that nobody talks about

Sharpe and Calmar tables get all the attention. The thing that actually keeps you in the game is the sizing rule. Mine, after a lot of trial and error:

Unlevered preset (default)

kelly_default 0.05, kelly_max 0.25
vrp_on_mult 1.0
Honest reference - signal alpha without leverage

Leveraged preset (stress test)

kelly_default 0.15, kelly_max 0.50
vrp_on_mult 2.5
If a config blows up here but works unlevered, the strategy is fine and the leverage is wrong

Hard cap on kelly_f: 1.0. Without it, the leveraged preset can multiply out to 1.25 (kelly_max 0.50 × vrp_on_mult 2.5). The cap is the difference between a drawdown and ruin.

Drawdown circuit breaker: 30% peak-to-trough. Halts the run rather than letting it grind to ruin.

Warmup multiplier: 0.0. Skip days before the signal has enough history. Cratered my 2018 numbers when this was 1.0.

Absolute bankroll floor: 50%. Secondary backstop in case the 30% rule somehow doesn't fire.

Half-Kelly is conservative on purpose. Mean-variance Kelly assumes Gaussian returns, which short-vol returns absolutely are not - they're skewed left with fat tails. The "true" Kelly under fat tails is below the mean-variance Kelly. So half-Kelly isn't lazy, it's roughly correct.
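
Put together, the sizing layer looks roughly like this. Treat it as a sketch of how the pieces could compose, not the engine's code: the order of operations (clamp to kelly_max, multiply by the signal multiplier, then apply the hard cap) and the reading of the 50% floor as a fraction of starting bankroll are my assumptions.

```python
from typing import Sequence


def kelly_fraction(base_kelly: float, kelly_max: float, signal_mult: float,
                   warmed_up: bool, hard_cap: float = 1.0) -> float:
    """Fraction of bankroll to risk on the next trade."""
    if not warmed_up:
        return 0.0                                  # warmup multiplier of 0.0
    f = min(base_kelly, kelly_max) * signal_mult    # e.g. 0.50 * 2.5 = 1.25
    return min(f, hard_cap)                         # never risk more than the cap


def breaker_tripped(equity: Sequence[float], max_dd: float = 0.30,
                    floor: float = 0.50) -> bool:
    """True if the peak-to-trough or absolute-floor rule says halt."""
    peak = max(equity)
    current = equity[-1]
    drawdown = (peak - current) / peak
    below_floor = current <= floor * equity[0]      # floor vs starting bankroll (assumed)
    return drawdown >= max_dd or below_floor


print(kelly_fraction(0.90, 0.50, 2.5, warmed_up=True))    # 1.0 (capped from 1.25)
print(kelly_fraction(0.05, 0.25, 1.0, warmed_up=False))   # 0.0 (still warming up)
print(breaker_tripped([100.0, 150.0, 104.0]))             # True (about 30.7% off the peak)
```

The breaker is checked against the running peak, so a strategy that has tripled can still be halted while well above its starting bankroll. The floor only matters if the drawdown rule is somehow bypassed.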

The configuration I'd actually run

I'm not posting the parameter table. The cell that came out of this study is the only piece I'm keeping for myself. What I'll tell you is the shape of how to find your own:

Parameter	Range to sweep	What to optimize for
Strategy	SPY put credit spread, limit fills (no mid-fill)	Honesty > flattering numbers
Delta	0.10 - 0.45	Equity-vol tolerance, not alpha
DTE	7, 14, 30, 45, 60 (skip the extremes after the first sweep)	Calmar, not CAGR
Profit take	25%, 50%, 75% of credit	Cycle speed vs gamma giveback
Stop loss	Only 100% or none. Never 200%.	Survival
Sizing	Half-Kelly × signal multiplier, capped at 1.0	Fat-tail-aware, not mean-variance
Breaker	30% peak-to-trough; 50% absolute floor	Hard stop on the strategy itself

Run that 96-cell sweep on a clean engine (Section 7 has the bug list - check yours against it first), and one specific intersection of (delta, DTE, PT) will jump out as the Calmar leader. That's the cell.
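
The sweep itself is a nested loop and a max. A sketch using axis values read off the table above, with three delta points picked from the stated range. This illustrative grid has 90 cells, not the 96 in the study, whose exact axes aren't published; `run_cell` stands in for your own backtest and is assumed to return (CAGR, MaxDD) as fractions.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

Cell = Tuple[float, int, float, Optional[float]]   # (delta, dte, profit_take, stop_loss)

DELTAS = (0.10, 0.30, 0.45)          # assumed points inside the 0.10 - 0.45 range
DTES = (7, 14, 30, 45, 60)
PROFIT_TAKES = (0.25, 0.50, 0.75)    # fraction of credit
STOP_LOSSES = (None, 1.00)           # none, or 100% of credit


def grid() -> List[Cell]:
    return list(product(DELTAS, DTES, PROFIT_TAKES, STOP_LOSSES))


def calmar_leader(run_cell: Callable[..., Tuple[float, float]]
                  ) -> Tuple[Optional[Cell], float]:
    """Run every cell and return the one with the highest CAGR / MaxDD."""
    best_cell, best_calmar = None, float("-inf")
    for cell in grid():
        cagr, max_dd = run_cell(*cell)
        if max_dd <= 0:
            continue                 # no drawdown recorded: Calmar undefined, skip
        score = cagr / max_dd
        if score > best_calmar:
            best_cell, best_calmar = cell, score
    return best_cell, best_calmar
```

Ranking on Calmar rather than CAGR is what keeps the 30%-drawdown headline cells from winning the sweep.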

The thing that nobody tells you about that sweep: you cannot run it without minute-resolution SPY chains going back to 2018. Daily settlement data won't capture the intraday MM fill behavior that turns out to drive 30-60% of the apparent edge. End-of-day VIX won't give you the falling-VIX flag that turns out to be your strongest signal. Without 1-minute chains plus walk-forward signal data, you're sweeping a 96-cell grid on noise.

TL;DR

Mid-fill backtests overstate CAGR by 30-60%. Build a real fill model or don't bother.
Stop loss at 100% of credit collected is the entire game at short DTE. +5,400% with it, -100% without.
SL=200% is worse than no stop. Pick tight or pick none, never the middle.
There's a sweet spot in the grid where Calmar > 2. Mid-DTE, mid-PT, low delta. Exact cell stays with me - find your own with a clean sweep on the same data.
The strongest signal in 16k trades was a one-line boolean flag. Beat my 3-layer composite. The lesson: replace your model with the obvious flag and check how much you actually lose.
The market pays the most right before it bites. Top-quintile VRP days had a 66% win rate vs a 74% baseline.
More flexibility cost me 9pp of CAGR. Multi-DTE pooling adversely selected the bad tenors.
Higher delta moves equity vol, not alpha. 10Δ, 30Δ, 45Δ all had similar Calmar - just louder swings.

If you're running short-vol on SPY without a stop loss, please reconsider. If you're running it with SL=200%, definitely reconsider.
