My website gets more attacks than human visitors

I run a small self-hosted website on a Raspberry Pi 4B at home. A few weeks ago I started wondering: who actually visits a website in 2026? Not just humans. Everything. So I built a public observability dashboard on top of GoAccess that separates traffic into four categories: human visitors, search engine crawlers, AI retrieval agents, and automated attacks. The numbers from the last 17 days surprised me:

- 4,523 human visits
- 6,409 automated attack attempts
- Thousands of crawler requests from search engines and AI systems
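For scale, here is a quick back-of-the-envelope sketch that turns those totals into daily rates. It uses only the two counts and the 17-day window quoted above; the variable names are just illustrative.

```python
# Totals quoted in the post for a 17-day window.
DAYS = 17
human_visits = 4523
attack_attempts = 6409

attacks_per_day = attack_attempts / DAYS
humans_per_day = human_visits / DAYS
attack_to_human_ratio = attack_attempts / human_visits

print(
    f"{attacks_per_day:.0f} attacks/day, "
    f"{humans_per_day:.0f} human visits/day, "
    f"ratio {attack_to_human_ratio:.2f}"
)
# prints: 377 attacks/day, 266 human visits/day, ratio 1.42
```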

The attacks aren't sophisticated. They're mostly automated scanners probing for .env files, WordPress admin panels, and cloud credentials, hitting every public IP on the internet regardless of what's actually running there.

What I found more interesting was the AI agent behavior. AI retrieval agents (GPTBot, ClaudeBot, PerplexityBot, Amazonbot) behave differently from traditional search crawlers. They hit semantic files aggressively (llms.txt, sitemap.xml, JSON-LD structured data) and seem to index the knowledge graph structure of a site rather than individual pages. Within hours of publishing new content, multiple AI crawlers had already visited, apparently triggered by the sitemap update rather than any external link.

A few observations I didn't expect:

- Combined machine traffic consistently exceeds human traffic
- AI agents discovered new content faster than Google did
- The semantic structure exposed by the site seems almost as important as the content itself
- Even a Pi on a residential ISP receives constant automated scans (about 377 attempts/day on average, from the 6,409 over 17 days)
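The four-way split described above could be approximated from the request path and user agent alone. The following is a minimal sketch under that assumption, not the dashboard's actual rules: the `classify` function, the substring lists, and the path pattern are all hypothetical, built only from the bot names and probe targets mentioned in this post plus the well-known Googlebot, bingbot, and YandexBot user-agent tokens.

```python
import re

# Hypothetical substring lists; the post names these AI bots but not its real rules.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Amazonbot")
SEARCH_CRAWLERS = ("Googlebot", "bingbot", "YandexBot")

# Assumed probe targets, based on the post: .env files, WordPress admin
# pages, and cloud credential files.
ATTACK_PATH = re.compile(
    r"(^|/)\.env|/wp-admin|/wp-login\.php|\.aws/credentials",
    re.IGNORECASE,
)


def classify(path: str, user_agent: str) -> str:
    """Return 'attack', 'ai_agent', 'search_crawler', or 'human'."""
    # The path check runs first, so a probe counts as an attack
    # no matter which user agent it claims.
    if ATTACK_PATH.search(path):
        return "attack"
    ua = user_agent.lower()
    if any(name.lower() in ua for name in AI_AGENTS):
        return "ai_agent"
    if any(name.lower() in ua for name in SEARCH_CRAWLERS):
        return "search_crawler"
    # Fallback: anything unrecognized is counted as human.
    return "human"


print(classify("/.env", "curl/8.0"))
print(classify("/llms.txt", "Mozilla/5.0 (compatible; GPTBot/1.2)"))
```

User agents are trivially spoofed, so a sketch like this will overcount humans (unknown bots fall through to the default) and can be fooled by anything impersonating a crawler. A real setup would likely also verify crawler IP ranges or reverse DNS.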

I made the dashboard public because I think the machine side of the web is underobserved. The modern web feels less like "users visiting pages" and more like a parallel ecosystem of crawlers, AI agents, and automated systems running continuously alongside human visitors.

Two questions for HN:

1. Are others tracking AI agents separately from traditional search crawlers?
2. Has anyone else noticed AI retrieval systems indexing semantic structure (JSON-LD, llms.txt) faster than they index page content?

5 points | by tommy2970 1 day ago

4 comments

  • lemonademan 9 hours ago
    To answer the first question, yes! Web operators and security professionals actively track and categorize AI agents separately from traditional search crawlers because they serve fundamentally different purposes and impact site resources in distinct ways.

    I built a database website a few months ago and submitted it to Google, Bing, and Yandex. Two months later, according to my Cloudflare dashboard, I have 1.5 million unique visitors monthly. I found that human visitors only accounted for about 10% of the total, followed by search engine crawl bots and then AI crawl bots. I also discovered that AI bots (like GPTBot, ClaudeBot, or PerplexityBot) scraped heavily and quickly, without adhering to traditional crawl limits or checking robots.txt closely, which resulted in high server load.

    That should also answer the second question: in my experience, AI retrieval systems index semantic structure faster than they index page content. You have to understand that AI bots don't just index your website like regular crawl bots, which mostly index your content, schema, and so on. AI bots go deeper by trying to understand your website structure, as this also helps in training other AI models.

  • Yahyaaa 10 hours ago
    I wonder if "traffic" is becoming the wrong metric. A human visit has always implied someone consuming the content. An AI crawler may never generate a pageview from a human, yet it can still become the mechanism by which someone discovers your work later through an assistant.

    In that world, machine visits aren't necessarily noise, they're another distribution channel.

  • mmarian 15 hours ago
    I thought about tracking all this stuff for my personal server, then realized that I wouldn't bother doing anything with the knowledge anyway.
  • anenefan 22 hours ago
    I'm curious whether you're just tracking the browser user agent, fingerprinting, or using some other method. For instance, if someone used a tool to spider your site, would that be classed as an attack?