Show HN: I built "AI Wattpad" to eval LLMs on fiction

(narrator.sh)

28 points | by jauws 23 hours ago

20 comments

  • pshirshov 3 hours ago
    > eval

    I'll do that with a language model, too busy writing poems with Claude.

    > has found better patterns for maintaining consistency across chapters

    Yeah, I've found one! Write your fiction with your own hands!

    Thank you.

    Consistency, my ass. They can't even write a paragraph of believable emotions on their own.

  • babblingfish 21 hours ago
    > The surge of AI, large language models, and generated art begs fascinating questions. The industry’s progress so far is enough to force us to explore what art is and why we make it. Brandon Sanderson explores the rise of AI art, the importance of the artistic process, and why he rebels against this new technological and artistic frontier.

    What It Means To Be Human | Art in the AI Era

    https://www.youtube.com/watch?v=mb3uK-_QkOo

    • babblingfish 20 hours ago
      Do watch the video as it makes a compelling argument against this exact kind of thing. From a product design perspective, you're asking people to read a bunch of slop and organize it into slop piles. What's the point of that? Honestly it seems like a huge waste of everyone's time.
      • jauws 20 hours ago
        I think there's interesting work to be built on this data beyond just generating and sorting slop. I didn't build this because I enjoy having people read bad fiction. I built it because existing benchmarks for creative writing are genuinely bad and often measure the wrong things. The goal isn't to ask users to read low-quality output for its own sake. It's to collect real reader-side signal for a category where automated evaluation has repeatedly failed.

        More broadly, crowdsourced data where human inputs are fundamentally diverse lets us study problems that static benchmarks can't touch. The recent "Artificial Hivemind" paper (Jiang et al., NeurIPS 2025 Best Paper) showed that LLMs exhibit striking mode collapse on open-ended tasks, both within models and across model families, and that current reward models are poorly calibrated to diverse human preferences. Fiction at scale is exactly the kind of data you need to diagnose and measure this. You can see where models converge on the same tropes, whether "creative" behavior actually persists or collapses into the same patterns, and how novelty degrades over time. That signal matters well beyond fiction, including domains like scientific research where convergence versus originality really matters.
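One rough way to put a number on "models converge on the same tropes" is phrase overlap between outputs written for the same prompt. The sketch below is only an illustration and is not the method from the cited paper or from narrator.sh: the function names, the choice of word trigrams, and the sample stories are all assumptions.

```python
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_overlap(stories: list[str], n: int = 3) -> float:
    """Average n-gram Jaccard similarity over all pairs of stories.

    Higher values mean the outputs share more phrasing.
    """
    grams = [ngrams(s, n) for s in stories]
    pairs = list(combinations(grams, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs from three different models for the same prompt.
stories = [
    "the rain fell softly on the empty street",
    "the rain fell softly on the quiet town",
    "a dragon slept beneath the mountain of glass",
]
print(round(mean_pairwise_overlap(stories), 3))
```

Surface n-gram overlap misses paraphrased tropes, so a real analysis would likely need embeddings or human labels on top of something like this.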

  • bccdee 21 hours ago
    I took a look at the "top-rated" story.

    1. UI is terrible. Paragraphs are extremely far apart, and most paragraphs are 1 short sentence (e.g. "I glare."). On mobile, I can only see a few words at a time, and desktop's not much better.

    2. Story is so bad that it's not even amusing.

    • jauws 21 hours ago
      Thanks for letting me know - the UI issues are definitely on me (fixing asap). Feel free to generate a story or two - right now there aren't enough annotations to make "top-rated" a valid moniker.
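A minimal sketch of why a handful of annotations cannot support a "top-rated" label, using a Bayesian-style shrunk average. This is a generic technique, not something the site is known to use; `shrunk_rating`, the prior mean of 3.0, and the prior weight of 10 are assumptions chosen for illustration.

```python
def shrunk_rating(
    ratings: list[float],
    prior_mean: float = 3.0,
    prior_weight: float = 10.0,
) -> float:
    """Pull a story's mean rating toward `prior_mean`.

    `prior_weight` behaves like that many pseudo-ratings at the prior mean,
    so stories with few real ratings stay close to the prior.
    """
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


# A single 5-star rating barely moves the score...
one_vote = shrunk_rating([5.0])
# ...while fifty ratings averaging 4.5 dominate the prior.
many_votes = shrunk_rating([4.5] * 50)
print(one_vote < many_votes)
```

Under these assumptions a story with one perfect rating ranks below a story with many good ratings, which is usually the ordering a reader would expect.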
  • drusepth 20 hours ago
    Hard to find the signal in the noise and know what stories I should even read to get a sense of baseline quality; partially because that's just a hard problem inherent to floods of any content, but also because the recommendation system seems to lack enough data (and also might be weighting the wrong things, e.g. the rank #1 story is also the lowest-rated...).

    A very cool idea in theory and something very hard to pull off, but I think in order to get the data you need on how readable each story is you'll need to work on presentation and recommendation so those don't distract from what you're actually testing.

    • jauws 19 hours ago
      Thanks for the feedback - looking at the rest of the comments, I definitely agree it seems to be a common theme. Will do better to fix those issues so there's less noise.
  • verelo 20 hours ago
    Did you skip Anthropic models? I honestly can't take this seriously if you're not looking at all the leading providers but you did look at some obscure ones.
    • jauws 20 hours ago
      There are 151 models there right now (including all the latest Anthropic models), and assignment is randomized. There just aren't enough annotations yet for the Anthropic models to surface.
    • th0ma5 20 hours ago
      [dead]
  • permenant 16 hours ago
    It would be interesting to consider composite systems where human brainstorming feeds AI writing, as well as vice versa, to see what kind of engagement with AI writing people like the most. At least in my case, I find plot writing good fun, and actual writing slightly less good fun.
    • jauws 15 hours ago
      Definitely on the to-do list! Right now, there's something called fork (inspired by GitHub fork), which lets you remix the story with a given input. It might be cool for you to mess around with.
  • JoshPurtell 15 hours ago
    This is super cool! Have you tried GEPA?
    • jauws 15 hours ago
      Thanks Josh! I tried GEPA previously back when it was still 1-shot generation. It actually ended up working really well for some models and horribly for others, so I decided to scrap it for a more generic prompt instead to make the benchmark a bit more rigorous.
  • linolevan 21 hours ago
    Quick feedback: Website is basically unusable on mobile
    • jauws 21 hours ago
      Ah shoot - thanks for letting me know. I'm still a noob on frontend so still learning as I go.
  • BoorishBears 19 hours ago
    I have a lot of engagement data on LLMs from running a creative-writing-oriented consumer AI app and spending a lot of time on quality improvements and post-training.

    Do you have a contact email?

    • jauws 19 hours ago
      Would love to chat! Here's my email: team@narrator.sh
  • dehugger 20 hours ago
    [flagged]
    • jauws 19 hours ago
      If you have specific objections, I’m open to hearing them.
  • mp_mn 20 hours ago
    [flagged]
    • jauws 19 hours ago
      Thanks for the feedback. What would you need to see to change your mind?
      • mp_mn 19 hours ago
        There's more quality fiction out there than you or I will ever have time to read. I don't see a purpose in flooding the world with more mediocre to unreadable fiction.
        • jauws 19 hours ago
          Realistically, I don't think anyone will be spending hours here instead of reading real fiction anytime soon (I personally wouldn't). There's just so much nuanced complexity when it comes to creative writing as a domain (long-form outputs, creativity, etc.) that coming up with better annotation methods has massive applications in other research, like in scientific discovery. "AI Wattpad" just happens to be a convenient form factor for crowdsourcing from an HCI perspective. I hope you give it a chance.
          • mp_mn 18 hours ago
            OK, so you already recognize these stories aren't something that people are going to spend time sorting through. How could you possibly then get any usable preference data out of this?
            • jauws 18 hours ago
              If you look at similar live benchmarks like LMArena or Design Arena, there's an extremely large number of unique annotators, with a low number of annotations per person - which is normal. However, since this platform is designed to generate fiction catered to individual interests, my hypothesis is that it'll be an added boost of novelty that will help aggregate enough usable data over time.
              • mp_mn 18 hours ago
                I tried reading the two top rated stories. They're both unreadable gunk. Why would I (or anyone else) go for another? Why would I tell anyone I know to spend their time reading this?
      • empath75 19 hours ago
        I am not going to argue this on the basis that LLMs suck at fiction, because even if it's true, it's not really that relevant. The problem is that what LLMs are good at is producing mediocre fiction particular to the tastes of the individual reading it. What people will keep reading is fiction that an LLM is writing because they personally asked it to write it.

        I don't want to read fiction generated from someone else's ideas. I want to read LLM fiction generated from my weird quirks and personal taste.

  • rbtprograms 20 hours ago
    [flagged]
    • jauws 19 hours ago
      Happy to engage if you have concrete criticisms.
      • rbtprograms 19 hours ago
        have you read any of the generated stories? if you can honestly tell me this is not complete drivel (even worse, wildly generic and poorly written) then i will consider giving real feedback but i would find that hard to believe.
        • jauws 18 hours ago
          I hope it's clear that the stories aren't being generated one-shot. I'm sure there are flaws that I haven't perfectly accounted for in the agent loop, but because we randomize the models for each of the brainstorming -> writing -> memory parts, bad intermediate outputs will affect the final output as well. That's why, unless we have above-average models across all 3 stages, it might be worse than what you're used to. It's a trade-off to get more granular results. Hope you can give it a chance.
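The per-stage randomization described above can be sketched roughly as follows. This is an assumption-laden illustration, not the site's actual code: the `Assignment` class, the placeholder model names, the simulated ratings, and the idea of averaging reader ratings per (stage, model) pair are all hypothetical.

```python
import random
from dataclasses import dataclass

STAGES = ("brainstorming", "writing", "memory")


@dataclass(frozen=True)
class Assignment:
    """Which (hypothetical) model handled each stage of one story."""
    brainstorming: str
    writing: str
    memory: str


def assign_models(models: list[str], rng: random.Random) -> Assignment:
    """Pick a model independently and uniformly at random for each stage."""
    return Assignment(*(rng.choice(models) for _ in STAGES))


def credit_by_stage(
    rated: list[tuple[Assignment, float]],
) -> dict[tuple[str, str], float]:
    """Average reader rating for every (stage, model) pair that appeared."""
    totals: dict[tuple[str, str], list[float]] = {}
    for assignment, rating in rated:
        for stage in STAGES:
            key = (stage, getattr(assignment, stage))
            totals.setdefault(key, []).append(rating)
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}


rng = random.Random(0)  # seeded only so the sketch is reproducible
models = ["model-a", "model-b", "model-c"]  # placeholder names
# Ratings are simulated here; real ones would come from readers.
rated = [(assign_models(models, rng), rng.uniform(1, 5)) for _ in range(200)]
scores = credit_by_stage(rated)
```

Because each story mixes three independently chosen models, one weak stage drags down the rating credited to the other two, which is the trade-off mentioned above: finer-grained attribution in exchange for noisier individual stories.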
          • rbtprograms 17 hours ago
            so you haven't read any of them, very nice product you got there. i stand by my original statement.