Auto-generate Reddit reply templates: a self-tuning distribution, not a prompt list

The honest answer to "how do I auto-generate Reddit reply templates" is that the template list is the easy part. The picker is the part that makes or breaks it. Encode 7 named tones, score each by avg_upvotes from your own past replies, then pick the next tone from a sharpened weighted distribution with a 5% floor and a 50% cap. Let the model invent new tones inline that auto-register as candidates, and graduate them only when real-world samples confirm they work.

Matthew Diakonov · 8 min read

Direct answer (verified 2026-05-02)

Don't hard-code a list of reply prompts. Do this instead, end to end:

  1. Define 7-10 distinct tones (e.g. critic, storyteller, pattern_recognizer, contrarian, data_point_drop) as a dict, each with a description, an example utterance, and a per-platform best_in hint list.
  2. Compute the next reply's tone from your own historical avg_upvotes per tone, sharpened by an exponent of 2.0, with a 5% floor per tone and a 50% cap on the leader. This is a Thompson-style picker, not argmax.
  3. Allow the model to invent a new tone inline. Auto-register it as a candidate that only gets the 5% floor until a nightly job graduates it.
  4. Gate every reply with a two-lane grounding rule so the model cannot present a fabricated specific as a personal first-hand claim.

Authoritative source for the constants and the picker code: github.com/m13v/social-autoposter, specifically scripts/engagement_styles.py.

Why a static template list rots

Every off-the-shelf Reddit reply tool ships with a single prompt that says something like "write a helpful, on-topic comment in a natural Reddit voice." You paste a thread URL, you get a draft, you ship it. After the first 30 replies, your output starts to feel exactly the same. That is not a model problem. It is a taxonomy problem. A single prompt collapses to a single voice because it has no way to ask "what kind of reply does this thread want?"

The fix is to split the work in two. First, decide on a tone. Second, write inside that tone. The tone is the template; the writing is the model. Most published reply prompts conflate the two and inherit the average of their training data across all Reddit comment shapes. The result reads like a person who is trying to sound like Reddit instead of a person who is on Reddit.

The static prompt vs. the distribution-driven picker

One prompt. The model rephrases the post and offers a helpful comment. Every reply ends up sounding like Reddit's median voice. After a week the same hedges and openers appear in 80% of your output. Performance per subreddit is invisible because the prompt is the same everywhere. Adding a new style means editing the prompt; the old replies do not retrain the picker because there is no picker.

  • One voice in, one voice out
  • No way to A/B test tones against your own subreddit data
  • No way to decay tones that stop working
  • No way to safely add new tones without a deploy

The seven baseline templates

Below is the full taxonomy from scripts/engagement_styles.py in the open-source S4L repo. The names matter; pick names you can hand to a model without further explanation. Each template ships with a one-line description, a worked example utterance, a per-platform best_in subreddit list, and a note about when it fails. Treat them as the smallest set that covers the shapes Reddit actually rewards. You can rename them, but you cannot collapse them, because each one selects for a different conversation move.

critic

Point out what is missing, flawed, or naive. Reframe the problem. Authority comes from a non-obvious insight, not from credentials. Best in r/Entrepreneur, r/smallbusiness, r/startups.

storyteller

Narrative-driven, with a mandatory failure or surprise lede. Forced through a two-lane grounding rule so a fabricated specific never leaks in as a personal claim. Best in r/startups, niche advice subs.

pattern_recognizer

Name the phenomenon. 'I have seen this play out dozens of times across X.' Authority through observation. Best in r/ExperiencedDevs, r/programming, r/webdev.

curious_probe

One follow-up question on the most interesting detail, prefixed with 'curious because we ran into something similar'. ONE question only, never a list. Disabled by policy on Reddit, kept on for niche subs by override.

contrarian

Take a clear opposing position backed by experience. 'Everyone recommends X. I have done X for Y years and it is wrong.' Empty hot takes get destroyed.

data_point_drop

Share one specific, believable metric. '$12k in a month', not 'a lot of money'. Numbers must be believable rather than impressive. No links.

snarky_oneliner

Short, sharp, validates a shared frustration. One sentence max. NEVER in small/serious subs. NEVER on professional networks.

The full source for each template (description, example, best_in, note) is in the STYLES dict at lines 31-133 of scripts/engagement_styles.py.
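For orientation, here is a sketch of one entry's shape, assuming the four fields named above; the example utterance and failure note are written for this article, not copied from the repo:

```python
# Illustrative shape of a STYLES entry. The authoritative dict is in
# scripts/engagement_styles.py; only the field names are load-bearing.
STYLES = {
    "critic": {
        "description": "Point out what is missing, flawed, or naive. "
                       "Reframe the problem.",
        # Hypothetical utterance, written for illustration:
        "example": "Everyone here is debating pricing. Your real problem "
                   "is that users churn before they ever activate.",
        "best_in": {"reddit": ["r/Entrepreneur", "r/smallbusiness",
                               "r/startups"]},
        "note": "Fails when the critique is the obvious one the thread "
                "has already made.",
    },
    # ...six more entries with the same fields
}
```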

The four constants that turn it from a list into a picker

Once the templates exist, you need a function that picks one. A pure argmax over historical avg_upvotes is the obvious wrong answer; it locks onto whichever tone got lucky first and never explores. A pure uniform random is the other wrong answer; it ignores everything you have learned. The right shape is a sharpened weighted distribution with floor and cap. These four constants control it.

  • MIN_SAMPLE_SIZE = 5 (trust threshold)
  • WEIGHT_EXPONENT = 2.0 (sharpens winners)
  • STYLE_FLOOR_PCT = 5% (every style keeps testing)
  • STYLE_CAP_PCT = 50% (winner can't starve the rest)

MIN_SAMPLE_SIZE = 5. Below five posts on a tone, you have no signal. The picker treats untrusted tones as "explore" and gives them floor weight only, regardless of their noisy average.

WEIGHT_EXPONENT = 2.0. Raw weight per tone is avg_upvotes ** 2.0. The exponent sharpens the distribution so a tone that scores 2x the average gets 4x the weight, not 2x. Set this to 1.0 if you want a softer landing; set it higher if you want a steeper one.

STYLE_FLOOR_PCT = 5.0. Every non-disabled tone gets at least 5% of picks. This is the exploration budget. A tone with no samples yet still gets tested; a tone that has fallen out of favor still gets a chance to come back.

STYLE_CAP_PCT = 50.0. The leading tone cannot exceed 50% of picks. Overflow redistributes pro-rata to the other tones based on their current weight. This is what stops a single "winning" tone from collapsing everything else to the floor and freezing the distribution.
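Assembled, the picker fits in a few dozen lines. The sketch below reimplements the behavior described above from the four constants; it is not the repo's compute_target_distribution() verbatim, and the exact ordering of floor, cap, and renormalization there may differ:

```python
import random

MIN_SAMPLE_SIZE = 5     # trust threshold
WEIGHT_EXPONENT = 2.0   # sharpens winners
STYLE_FLOOR_PCT = 5.0   # exploration budget per tone
STYLE_CAP_PCT = 50.0    # leader cannot starve the rest

def compute_target_distribution(stats):
    """stats maps tone -> (sample_count, avg_upvotes); returns tone -> target %."""
    if not stats:
        return {}
    raw = {}
    for tone, (n, avg) in stats.items():
        # Untrusted tones carry no performance weight; the floor below
        # is the only traffic they get.
        raw[tone] = max(avg, 0.0) ** WEIGHT_EXPONENT if n >= MIN_SAMPLE_SIZE else 0.0
    total = sum(raw.values())
    pct = {t: 100.0 * w / total if total else 100.0 / len(raw)
           for t, w in raw.items()}
    # Floor: every non-disabled tone keeps at least 5% of picks.
    pct = {t: max(p, STYLE_FLOOR_PCT) for t, p in pct.items()}
    # Cap: clip the leader at 50%, redistribute overflow pro-rata.
    leader = max(pct, key=pct.get)
    overflow = pct[leader] - STYLE_CAP_PCT
    if overflow > 0:
        pct[leader] = STYLE_CAP_PCT
        rest = sum(p for t, p in pct.items() if t != leader)
        for t in pct:
            if t != leader and rest:
                pct[t] += overflow * pct[t] / rest
    total = sum(pct.values())  # renormalize back to 100
    return {t: 100.0 * p / total for t, p in pct.items()}

def pick_tone(stats):
    dist = compute_target_distribution(stats)
    return random.choices(list(dist), weights=list(dist.values()))[0]
```

With the full seven-tone baseline and one runaway winner, the surplus above 50% spreads across the other six in proportion to their weights, and a tone with only three samples stays near the 5% floor no matter how flashy its noisy average looks.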

Letting the model invent new templates inline

The harder design question is what to do when none of your seven baseline tones fit a thread. The wrong answer is to force the closest match. The slightly less wrong answer is to ask a human to add a new template. The actually-good answer is to let the model emit a new tone inline and have your orchestrator register it as a candidate.

Concretely: the model returns a JSON object with engagement_style: "my_new_style" AND a new_style block carrying description, example, why_existing_didnt_fit, and an optional note. The orchestrator validates the block, atomically writes it to scripts/engagement_styles_extra.json behind a flock so concurrent agents do not lose each other's entries, and tags it status: "candidate". From that moment on the new tone shows up in every prompt next to the baseline seven, but it only gets the 5% floor weight regardless of how often it "wins" in subsequent reply decisions. A separate nightly job, the promoter, graduates it to status: "active" only after at least N samples confirm a non-degenerate median.
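A minimal sketch of that registration step, assuming the model's reply arrives as a parsed dict and running on a POSIX flock; the field names follow this article, and the repo's actual validate_or_register() may differ in detail:

```python
import fcntl, json, os, tempfile

SIDECAR = "scripts/engagement_styles_extra.json"
REQUIRED = ("description", "example", "why_existing_didnt_fit")

def validate_or_register(reply: dict) -> str:
    """Register an unknown engagement_style as a candidate.
    Returns 'registered' or 'rejected'."""
    name = reply.get("engagement_style")
    block = reply.get("new_style") or {}
    if not name or any(not block.get(f) for f in REQUIRED):
        return "rejected"
    # Serialize writers on a dedicated lock file so concurrent agents
    # do not lose each other's entries.
    with open(SIDECAR + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            with open(SIDECAR) as fh:
                styles = json.load(fh)
        except FileNotFoundError:
            styles = {}
        styles.setdefault(name, {**block, "status": "candidate"})
        # Write to a temp file, then rename: the sidecar is never
        # observed half-written.
        tmp = tempfile.NamedTemporaryFile(
            "w", dir=os.path.dirname(SIDECAR) or ".", delete=False)
        json.dump(styles, tmp, indent=2)
        tmp.close()
        os.replace(tmp.name, SIDECAR)
    return "registered"
```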

Candidate registration to active promotion

  1. Model picks a name: engagement_style: "my_new_style".
  2. The new_style block parses: description, example, why_existing_didnt_fit.
  3. Atomic write to the sidecar: engagement_styles_extra.json, behind a flock.
  4. Active in the next prompt: floor weight only until promoted.
  5. Nightly promoter graduates it once N >= 5 with a positive median (sketched below).
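A hypothetical version of the promoter's core check, run against a SQLite posts table here for brevity (the repo's promote_engagement_styles.py targets its own schema):

```python
import json, sqlite3, statistics
from datetime import date

def promote_candidates(db_path="engagement.db",
                       sidecar="scripts/engagement_styles_extra.json",
                       min_samples=5):
    """Graduate sidecar candidates that have accumulated at least
    min_samples real posts with a non-degenerate (positive) median."""
    with open(sidecar) as fh:
        styles = json.load(fh)
    con = sqlite3.connect(db_path)
    for name, entry in styles.items():
        if entry.get("status") != "candidate":
            continue
        upvotes = [row[0] for row in con.execute(
            "SELECT upvotes FROM posts "
            "WHERE engagement_style = ? AND upvotes IS NOT NULL", (name,))]
        if len(upvotes) >= min_samples and statistics.median(upvotes) > 0:
            entry["status"] = "active"
            entry["promoted_at"] = date.today().isoformat()
    # The same flock discipline as registration applies to this write.
    with open(sidecar, "w") as fh:
        json.dump(styles, fh, indent=2)
```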

Three real candidates that graduated last week

The candidate-promotion pipeline is not theoretical. The S4L sidecar at scripts/engagement_styles_extra.json currently holds three model-invented styles, each promoted to active in the past week. The dates and sample counts are taken verbatim from the file as of 2026-05-02.

agree_and_extend

Validate OP's framing then add a non-obvious second-order point. Sits between pattern_recognizer (which names the phenomenon) and storyteller (which leads with failure). Promoted from candidate to active on 2026-04-29 after n=5 posts on Reddit with median upvotes at par with the platform baseline.

share_experience

Lived experience without storyteller's mandatory failure-lead. 'Here is what we found' rather than 'here is where we got burned'. Promoted on 2026-04-30 across Reddit (n=4), Twitter (n=5), GitHub (n=6).

technical_detail

Pure technical specificity, no narrative. Drops a precise mechanism, regulation, or system constraint. Closer to data_point_drop but qualitative. Promoted on 2026-04-29.

Each entry preserves the first-use post URL, the inventing model, the original why_existing_didnt_fit rationale, and the promoted_at timestamp. That history is what lets you audit later why a tone exists at all.

Three model-invented templates promoted to active in seven days, each one solving a thread shape the seven baselines could not cover cleanly (engagement_styles_extra.json, 2026-05-02).

The two-lane grounding rule

The single biggest failure mode of an auto-generated reply is a fabricated personal anecdote. The storyteller and share_experience tones are the most vulnerable. The model writes "i ran 22 cameras across three properties for 8 months" on a thread about home security and gets downvoted within the hour by anyone who recognizes that the specifics are made up. The fix is structural, not stylistic. Every template runs under the same grounding rule, baked into the prompt and outranking every other piece of style guidance.

Lane 1, disclosed story. Open the comment with a phrase that signals illustration: "hypothetically", "imagine someone running this", "say a friend tried", "scenario:". Once that frame is set, the model has full creative license on the specifics. The reader can tell from the first phrase that this is a worked example, not testimony.

Lane 2, no fabrication. Stay first-person, but every concrete detail (number, duration, headcount, place name, course name, named tool, named person) must appear verbatim in your project's config. If a specific is not in config, drop it, generalize it, or pattern-frame it ("the typical failure mode is..."). Pattern-framing counts as observation, not autobiography.

The two lanes are mutually exclusive. The model cannot mix them. A specific presented in Lane 1 voice (after a hedge) is fine; the same specific presented in Lane 2 voice (first-person) without a config anchor is not. This is the line that separates an auto-generator that survives a year on Reddit from one that gets flagged in a month.
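Part of the rule can be enforced mechanically before posting. The checker below is a deliberately crude sketch: it audits only numbers, not place names or tool names, and both the hedge list and the config_specifics set are illustrative assumptions:

```python
import re

# Opening phrases that put a draft in Lane 1 (disclosed story).
LANE1_HEDGES = ("hypothetically", "imagine", "say a friend", "scenario:")

def grounding_lane(draft: str, config_specifics: set) -> str:
    """'lane1' if the draft opens with a disclosure hedge, 'lane2' if
    every number it asserts is anchored in config, else 'reject'."""
    if draft.strip().lower().startswith(LANE1_HEDGES):
        return "lane1"
    # Pull concrete numerics: "$12k", "22", "8 months" -> "8", ...
    numbers = re.findall(r"\$?\d[\d,.]*k?", draft)
    if all(n in config_specifics for n in numbers):
        return "lane2"
    return "reject"

# "i ran 22 cameras for 8 months" with no config anchor -> "reject".
```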

Wiring it into your own loop

Treat the picker as a pure function over your existing engagement DB. You do not need a new service. The four moving pieces are:

  1. STYLES dict: the hardcoded baseline taxonomy in a single Python file.
  2. posts table: engagement_style + upvotes + platform + status.
  3. compute_target_distribution(): avg_upvotes ^ 2.0, floor, cap, normalize.
  4. validate_or_register(): accepts inline new_style blocks as candidates.
Once these four pieces exist, every reply generation call follows the same pattern: query the picker for the platform, render the prompt with the per-tone target%, the last 10 picks, and the under-represented hint. Pass the prompt to the model. Validate the returned style against the universe. If unknown, look for a new_style block; register or reject. Post the reply. Log it back to the posts table with the chosen engagement_style. The picker's next call uses the new data.
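Glued together, one reply cycle looks roughly like this. get_styles_prompt() is the repo's entry point and pick_tone() and validate_or_register() are the sketches above, while generate(), post_reply(), and the db handle are hypothetical stand-ins for your own model call, posting client, and engagement DB:

```python
def reply_cycle(thread, platform, db, known_styles):
    # 1. Picker renders the prompt: per-tone target %, the last 10
    #    picks on this platform, and the under-represented hint.
    prompt = get_styles_prompt(platform, context=thread)
    reply = generate(prompt)  # hypothetical model call returning a dict
    style = reply["engagement_style"]
    # 2. Validate the returned style against the known universe;
    #    unknown styles either register as candidates or get replaced.
    if style not in known_styles and validate_or_register(reply) == "rejected":
        style = pick_tone(db.style_stats(platform))
    # 3. Human reviews, then post and log. The next picker call
    #    scores this row like any other.
    post_reply(thread, reply["text"])      # hypothetical posting client
    db.log_post(platform=platform, engagement_style=style,
                upvotes=0, status="posted")
```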

What this is not

This is not a recipe for spamming Reddit. The picker assumes you are the human in the loop on every reply, that your DM volume stays in the single digits per session, and that you log honest engagement back to your DB. The picker only works if the upvotes you log are the ones the comment actually earned. If you fake them, the weights drift toward whatever you faked and the picker degrades into something worse than the argmax you were trying to avoid.

It is also not a content scheduler. A scheduler answers "when do I post?" A reply-template generator answers "what shape of reply does this thread want?" Those are orthogonal problems and conflating them is why most off-the-shelf social tools do both jobs poorly. Build the picker on top of your existing scheduler if you have one; do not try to fold the scheduler into the picker.


Frequently asked questions

Why not just have ChatGPT generate one reply per Reddit post and ship it?

That is what every off-the-shelf 'paste-link-get-comment' tool does, and it is why their output sounds the same after the third reply. A single prompt collapses to one voice. The pattern that survives is to encode several distinct tones (critic, storyteller, pattern_recognizer, contrarian, data_point_drop, and so on), pick which tone a given thread deserves, and only THEN ask the model to write inside that tone. The picker is the part you have to build. The single-prompt tools skip it.

How does the picker decide which template to use?

It computes a target distribution from your own historical posts. For each tone with at least MIN_SAMPLE_SIZE = 5 samples, it takes avg_upvotes ** WEIGHT_EXPONENT (2.0) as the raw weight, normalizes those into a percentage, applies a STYLE_FLOOR_PCT of 5% so every tone keeps getting tested, then caps the top tone at STYLE_CAP_PCT = 50% so a single winner cannot starve the rest. The exact code path lives in compute_target_distribution() in scripts/engagement_styles.py. The picker also reads the last 10 picks on the platform and tags any tone that is over- or under-represented vs target, so the model leans toward under-used tones unless another fits the thread better.

Why a floor of 5% and a cap of 50%? Why not optimize purely for the highest avg_upvotes?

Because optimizing purely for the historical winner kills exploration. Once one tone gets a streak of high upvotes, a pure argmax picker would lock onto it forever and you would never discover that another tone is better suited for a sub you started posting in last month. The 5% floor guarantees every non-disabled tone gets at least 1-in-20 picks so noise eventually averages out. The 50% cap prevents the runaway winner from collapsing the distribution. These two numbers turn the picker from a greedy optimizer into a multi-armed bandit with a reasonable exploration rate.

What stops the model from inventing a tone called 'super_great' and using it for everything?

Two gates. First, the new_style block has required fields (description, example, why_existing_didnt_fit); registration fails on any missing field and returns rejected to the orchestrator. Second, status starts as candidate, not active. Candidates only get the 5% floor weight regardless of how many times they 'win', so even if the model insists on its new tone in every reply, the picker still sends 95% of traffic to the established baseline. A nightly promote_engagement_styles.py job graduates a candidate to active only after it accumulates real samples with a non-degenerate median. Until then it lives at the floor.

What is the two-lane grounding rule and why does it matter for templates?

It is the rule that prevents your auto-generated replies from making up personal anecdotes. Lane 1 is DISCLOSED STORY: the comment opens with a hedge ('hypothetically', 'imagine someone running this', 'scenario:'), and once that frame is set the model can use any specifics it wants. Lane 2 is NO FABRICATION: the comment stays first-person but every concrete detail (number, duration, headcount, place name, course name) must appear verbatim in your project's config. If a specific is not in config, the model must drop it, generalize it, or pattern-frame it ('the typical failure mode is...'). The rule outranks the 'specificity is the #1 authenticity signal' rule wherever they conflict. Without it, storyteller will happily write 'i ran 22 cameras for 8 months' on a thread about home security, which Reddit catches and downvotes within an hour.

Where do platform-specific overrides live?

PLATFORM_POLICY in scripts/engagement_styles.py. Per platform you set a never list of disallowed tones (snarky_oneliner is permanently off on professional networks; curious_probe is off on Reddit because question-form replies underperform there) and a note string that gets prepended to the prompt ('Brevity wins. 1-2 sentences max.'). The picker excludes never tones from the candidate pool entirely. Adding a platform is one dict entry plus a per-platform avg_upvotes query.
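The implied shape, with illustrative entries (which platform carries the brevity note is a guess here; the authoritative dict is in scripts/engagement_styles.py):

```python
PLATFORM_POLICY = {
    "reddit": {
        "never": ["curious_probe"],    # question-form replies underperform
        "note": "",
    },
    "linkedin": {
        "never": ["snarky_oneliner"],  # never on professional networks
        "note": "Brevity wins. 1-2 sentences max.",
    },
}
```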

Does the picker care about which subreddit the reply is going to?

Each style declares best_in.reddit as a list of subreddit hints (critic is best in r/Entrepreneur and r/startups; pattern_recognizer is best in r/ExperiencedDevs and r/programming). Those hints surface in the prompt next to the target distribution so the model can lean toward a tone the subreddit's culture rewards even if it is not the global winner. The hints are advisory; the model can override when the specific thread breaks the subreddit's average mood.

Can I run this without writing my own picker?

Yes. The S4L repo ships the picker, the candidate registry, the nightly promoter, and the per-platform avg_upvotes query against a Postgres `posts` table. Drop in your own engagement DB or use the bundled SQLite/Postgres schema, then call get_styles_prompt(platform, context) before every reply generation. The only thing you have to wire yourself is logging each posted reply back into the same posts table with engagement_style and upvotes filled in, otherwise the picker has nothing to score.
