
The 2026 State of AI Readiness: 500 Top Websites Scanned. None Reached Level 4.

The Appearly Team Apr 24, 2026 13 min read


TL;DR

In April 2026, we scanned 500 top websites against a 7-signal AI readiness audit. We checked each signal at the level it realistically appears: site-level (homepage, robots.txt, llms.txt) and content-level (a blog post or article we detected via the site's sitemap or by parsing homepage links).

0 sites of 500 reached Level 4 (Cited) in the AI Readiness Maturity Model.
Max technical score: 85 / 100. Seven sites tied at the top: Dropbox, Shopify, Taboola, Plesk, Dynatrace, Yieldmo, Moloco. Salesforce follows at 80.
Of the 271 sites with a detectable content page, 15.9% (43 sites) have author Person schema with a verifiable identity (sameAs to social profile OR url to an author bio page).
Only 5.6% of sites have FAQPage schema anywhere (homepage, content page, or /faq).
Only 10.0% have an llms.txt file.
Only 29.2% have Organization schema (JSON-LD or Microdata, including nested in publisher/copyrightHolder).
22.6% score as Invisible (Level 0): they block at least one AI crawler.

Full dataset (CSV + methodology, CC-BY-4.0): appearly.ai/dataset/ai-readiness-2026/.

Why this matters

AI search is a retrieval layer on top of the web. When someone asks ChatGPT, Perplexity, Gemini, or Claude a question, the engine composes an answer from its training corpus plus real-time web retrieval. Whether your brand gets cited depends on two things:

Can the AI engine access your site?
Can it parse your content into structured, attributable claims?

Google ranking does not transfer. A site can rank #1 on Google and still score Level 0 on AI readiness if it blocks GPTBot or has no structured data. The industry has been treating AI search as a PR problem ("how do I get mentioned?") when at the technical layer it is an infrastructure problem first.

We wanted to quantify how prepared the top of the web actually is.

Methodology

Source list. Tranco Top 850 as of April 24, 2026. We scanned down the Tranco list until we had 500 analyzable websites. Tranco is a research-oriented ranking that aggregates several independent popularity lists (its published sources have included Alexa, Cisco Umbrella, Majestic, and Farsight, and the mix has changed over time). We chose Tranco over commercial lists because it is reproducible and documented in peer-reviewed research.

Why not just the Top 500? Of the Tranco Top 500 entries, only ~60% are analyzable websites. The rest are DNS infrastructure domains (gtld-servers.net, msftncsi.com, amazon-dss.com), CDN subdomains that don't serve HTML, WAF-blocked domains, and network-unreachable hosts. Tranco measures domain popularity by traffic, not "websites that serve content." To report honest numbers on 500 actual websites, we read from the Top 850 until we had 500 with status=scanned.

Scanner. Proprietary to Appearly. For each domain, we fetch: the homepage, robots.txt, llms.txt, and (if detectable) up to 3 content pages. Content-page detection tries two strategies in order: (1) recursively parse the site's sitemap (up to 2 levels deep to handle sitemap indexes), scoring URLs by content-path hints (/blog/, /news/, /article/ weighted strongest), date patterns, and path depth; (2) fall back to parsing the homepage HTML for links to content sections. For Article and Author Person checks, we accept pass if any of the 3 sampled pages has the schema. If FAQPage schema is not on the homepage or any sampled content page, we try /faq/ and /faqs/ as a light fallback. We parse both JSON-LD and Microdata (many e-commerce and older CMS sites use Microdata). User-Agent: AppearlyBot/1.0 (+https://appearly.ai/bot). Timeout: 10s per request.
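The content-page selection described above can be illustrated with a minimal sketch. This is not the Appearly scanner, which is proprietary. The weights, the extra hint paths, the date regex, and the function names `score_url` and `pick_content_pages` are all assumptions made for illustration. The source only states that /blog/, /news/, and /article/ weigh most and that date patterns and path depth also count.

```python
import re
from urllib.parse import urlparse

# Hypothetical weights: the report only says these three hints are "weighted strongest".
PATH_HINTS = {"/blog/": 3.0, "/news/": 3.0, "/article/": 3.0,
              "/resources/": 1.5, "/insights/": 1.5}
# Assumed date pattern such as /2026/04/ inside the path.
DATE_PATTERN = re.compile(r"/(19|20)\d{2}/\d{1,2}/")


def score_url(url: str) -> float:
    """Score a sitemap URL by content-path hints, date pattern, and path depth."""
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"
    score = sum(weight for hint, weight in PATH_HINTS.items() if hint in path)
    if DATE_PATTERN.search(path):
        score += 1.0
    depth = len([segment for segment in path.split("/") if segment])
    # Depth only acts as a tie-breaker for URLs that already look like content.
    if score > 0 and depth >= 2:
        score += 0.5
    return score


def pick_content_pages(urls, limit=3):
    """Return up to `limit` URLs that look like content pages, best first."""
    ranked = sorted(urls, key=score_url, reverse=True)
    return [url for url in ranked if score_url(url) > 0][:limit]
```

With these assumed weights, a homepage or a pricing page scores 0 and is never sampled, while a dated blog URL ranks first.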

Why we split signals by level. Some signals live at the root of a site (robots.txt directives, llms.txt). Others live on the homepage (Organization schema). Others live on content pages (Article, Author Person schema). Checking all 7 signals only on the homepage would be methodologically weak and bias the results downward.

The 7 signals:

Signal	Points	Where we check	What counts as a pass
AI crawler access	20	robots.txt	All 6 AI crawlers allowed (GPTBot, ChatGPT-User, anthropic-ai, ClaudeBot, PerplexityBot, Google-Extended)
llms.txt	10	`/llms.txt`	200 response, non-empty, non-HTML body, valid content-type
Organization schema	15	Homepage	JSON-LD or Microdata @type in Organization / LocalBusiness / Corporation
Article schema	15	Content page	JSON-LD or Microdata @type in Article / BlogPosting / NewsArticle / TechArticle
FAQPage schema	15	Homepage OR content page OR /faq	JSON-LD or Microdata @type FAQPage
Author Person schema	10	Content page	JSON-LD Person with non-empty sameAs
Sitemap in robots.txt	15	robots.txt	Sitemap: directive pointing to http(s) URL

Scoring. Total out of 100. For content-level signals, if no content page is detectable on a site, those signals are N/A and contribute 0 points. This is by design rather than a scoring bug: a site without content pages simply has less for AI engines to cite, and the score reflects that.
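As a worked example of the scoring rule, the sketch below sums the point values from the signal table. The dictionary keys and the function name `technical_score` are hypothetical labels chosen for this example. Only the point values come from the table.

```python
# Point values from the 7-signal table; the key names are illustrative.
SIGNAL_POINTS = {
    "ai_crawler_access": 20,
    "llms_txt": 10,
    "organization_schema": 15,
    "article_schema": 15,
    "faqpage_schema": 15,
    "author_person_schema": 10,
    "sitemap_in_robots": 15,
}


def technical_score(results: dict) -> int:
    """Sum points for passed signals.

    `results` maps signal name to True (pass), False (fail), or None (N/A).
    N/A signals contribute 0 points, exactly like failures.
    """
    return sum(points for name, points in SIGNAL_POINTS.items()
               if results.get(name) is True)
```

A site that passes everything except FAQPage scores 85, and a brochure site with open crawlers, Organization schema, and a Sitemap directive but no content pages scores 50.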

Maturity levels.

Level	Name	Score range	Override
0	Invisible	0-19, OR crawler blocked	If any AI crawler is blocked, forced to 0
1	Discoverable	20-39	-
2	Indexable	40-59	-
3	Retrievable	60-89	-
4	Cited	90-100 + observed citation	Technical readiness alone cannot confirm Level 4
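The level table, including the crawler override, can be expressed as a small function. This is a sketch under one assumption that the table does not spell out: a site scoring 90-100 without observed citation stays at Level 3. The function name and parameters are illustrative.

```python
def maturity_level(score: int, any_crawler_blocked: bool,
                   citation_observed: bool = False) -> int:
    """Map a technical score to a maturity level (0-4)."""
    # Override: blocking any AI crawler forces Level 0 regardless of score.
    if any_crawler_blocked or score < 20:
        return 0
    if score < 40:
        return 1
    if score < 60:
        return 2
    # Level 4 needs both a 90+ score and citation evidence from outside the scanner.
    if score >= 90 and citation_observed:
        return 4
    # Assumption: 90+ without observed citation remains Level 3.
    return 3
```

For instance, `maturity_level(70, any_crawler_blocked=True)` returns 0, which is the situation the report describes for sites with strong schema that block a crawler.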

Coverage. The study reports on 500 analyzable websites. To reach 500 we scanned down the Tranco list to rank ~900, because ~42% of top-ranked Tranco entries are not analyzable as websites: DNS infrastructure (gtld-servers.net, msftncsi.com — 22% of entries have no A/AAAA records), WAF-blocked domains (9%), and network failures (10%). Tranco ranks domains by traffic, not by "sites that serve web content." Of the 500 analyzable websites, 271 (54.2%) had a detectable content page we could sample for Article and Author Person schema.

Full methodology: appearly.ai/dataset/ai-readiness-2026/.

The headline number

Zero sites reached Level 4. The highest-scoring domains scored 85 out of 100, which places them in Level 3 Retrievable territory. Seven sites tied at the top: Dropbox, Shopify, Taboola, Plesk, Dynatrace, Yieldmo, and Moloco. Salesforce follows at 80. The average score across the 500 analyzable sites was 32.5. The median was 30.

The internet's most-visited websites are collectively performing below the threshold for full AI citation readiness.

Distribution across maturity levels

Level	Name	Count (of 295)	Share
0	Invisible	64	21.7%
1	Discoverable	164	55.6%
2	Indexable	36	12.2%
3	Retrievable	31	10.5%
4	Cited	0	0.0%

Discoverable dominates. 57% of the analyzable top 500 have crawler access and little else. This is the biggest bucket by far and represents the largest opportunity: moving out of Level 1 requires adding Organization schema and Article schema on blog posts, which is a small same-day engineering task.

Validation. We validated the scanner through four independent approaches: (1) cross-validation against extruct (Zyte's reference library) — 100% agreement on 120 sampled sites; (2) triangulation against pyld (W3C-compliant JSON-LD processor) + the microdata Python library — 100% agreement on signal detection; (3) directional check against Web Almanac 2024 (HTTP Archive's annual study of 8M+ pages): our Top 500 numbers are 3x higher than their global averages, which is the expected direction for top-tier sites; (4) manual spot-checks of edge cases discovered during validation (HubSpot, Pinterest, Statista, Shopify). The scanner was built independently and the validation harnesses (validate.py + triangulate.py) are preserved so anyone can re-run the comparison.

Site-level signals (across 500 analyzable sites)

Signal	Adoption
AI crawlers allowed (all 6)	77.4%
Sitemap in robots.txt	51.4%
Organization schema on homepage	29.2%
llms.txt	10.0%

Organization schema is missing from 71% of homepages. This is basic structured data and the simplest schema to add: a name, URL, and logo. The adoption gap is not about technical complexity. It is about awareness. We counted JSON-LD, Microdata, and nested forms (e.g. Organization declared as publisher or copyrightHolder of a WebPage), so the number reflects total coverage.

For external reference: the Web Almanac 2024 reports 7.16% of pages across the global web have Organization schema. Our 29.2% on top websites is ~4x higher, consistent with top-tier sites having better adoption than average.
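To make the "nested forms" point concrete, here is a minimal sketch of detecting Organization schema in JSON-LD, including an Organization declared as the publisher of a WebPage. It uses only the Python standard library and covers JSON-LD only. Microdata, which the scanner also parses, is not handled here. The class and function names are hypothetical and do not describe the scanner's internals.

```python
import json
from html.parser import HTMLParser

ORG_TYPES = {"Organization", "LocalBusiness", "Corporation"}


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def collect_types(node, found=None):
    """Recursively gather every @type value, so nested publishers are counted."""
    found = set() if found is None else found
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)
    return found


def has_organization(html: str) -> bool:
    """True if any JSON-LD block declares an Organization-like type at any depth."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    types = set()
    for raw in parser.blocks:
        try:
            types |= collect_types(json.loads(raw))
        except json.JSONDecodeError:
            continue  # Malformed blocks are skipped rather than treated as a pass.
    return bool(types & ORG_TYPES)
```

The recursive walk is the design choice that matters: a checker that only reads the top-level @type would miss an Organization nested under publisher or copyrightHolder.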

llms.txt is at 10%. The spec is young but trivial to implement (a single static file). Most sites have not added it. Sites that do tend to cluster in the AI-native category (labs, SaaS companies building on LLMs).
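The pass criteria for llms.txt (200 response, non-empty, non-HTML body, valid content-type) can be sketched as a pure function over an already-fetched response. The report does not say which content types count as valid, so the `ALLOWED_TYPES` set and the HTML sniffing heuristic below are assumptions.

```python
# Assumption: the report does not list accepted content types.
ALLOWED_TYPES = {"text/plain", "text/markdown"}


def llms_txt_passes(status: int, content_type: str, body: str) -> bool:
    """Apply the four llms.txt checks to a fetched /llms.txt response."""
    media_type = content_type.split(";")[0].strip().lower()
    text = body.strip()
    # Many sites answer unknown paths with an HTML page and a 200 status.
    looks_like_html = text.lower().startswith(("<!doctype", "<html"))
    return (
        status == 200
        and bool(text)
        and not looks_like_html
        and media_type in ALLOWED_TYPES
    )
```

The non-HTML check matters because a "soft 404" that returns the site's HTML shell with status 200 would otherwise count as a valid file.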

Content-level signals (across 156 sites with detectable content pages)

This is where the story gets interesting. We only evaluated these signals on sites where we could find an actual content page (blog, news, article, resources, etc.) via the site's sitemap or by parsing homepage links. For each site we sampled up to 3 content pages and counted the signal as a pass if any of them had it.

Signal	Adoption (of 156)
Article schema	21.8%
Author Person schema with sameAs	1.9%

Only 3 sites of the 156 with detectable content pages have author Person schema with sameAs: The Guardian, cPanel, and Business Insider. We verified this by inspecting the actual JSON-LD of several sites that should have it (Moz, HubSpot, Notion, Stripe). They all declare Person author schema but without a sameAs link to a verifiable identity (LinkedIn, Twitter, author profile page). Author name alone without a verifiable link is a substantially weaker E-E-A-T signal than Person + sameAs.
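The difference between a bare author name and a verifiable author can be sketched as a check on an Article's JSON-LD. The function name is hypothetical. The optional `accept_bio_url` flag reflects the looser definition in the TL;DR (a url to an author bio page also counts), while the default follows the stricter sameAs-only rule from the signal table.

```python
def _as_list(value):
    """Normalize a JSON-LD value that may be missing, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_verifiable_author(article: dict, accept_bio_url: bool = False) -> bool:
    """True if any Person author carries a non-empty sameAs (or, optionally, url)."""
    for author in _as_list(article.get("author")):
        if not isinstance(author, dict) or author.get("@type") != "Person":
            continue
        links = [link for link in _as_list(author.get("sameAs"))
                 if isinstance(link, str) and link.strip()]
        if links:
            return True
        bio_url = author.get("url")
        if accept_bio_url and isinstance(bio_url, str) and bio_url.strip():
            return True
    return False
```

Under the strict rule, `{"@type": "Person", "name": "Jane Doe"}` fails, and the same author with a `sameAs` link to a professional profile passes.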

Article schema adoption is at 22%, a minority. More than three in four top-500 sites with blogs or news sections publish content without declaring it as Article / BlogPosting / NewsArticle in structured data.

FAQPage (multi-location)

We checked for FAQPage schema on the homepage, on up to 3 sampled content pages, and on fallback paths /faq/ and /faqs/. Pass if found in any location.

Source	Count (of 500)
Content page	15
Homepage	10
/faq/	3
Nowhere	472

5.6% adoption. FAQPage is one of the most AI-cited schema types because its Question / Answer structure matches the shape of AI engine answers directly. Sites that add FAQPage to homepage or key pages pick up disproportionate citation weight relative to the effort. Yet only 28 of the 500 analyzable sites have it anywhere.

Top 20 most AI-ready sites

These are the 20 domains with the highest technical readiness score. No site reached Level 4; the top of the list sits at Level 3 Retrievable.

Rank (Tranco)	Score	Level	Domain
128	85	3	dropbox.com
163	85	3	shopify.com
184	85	3	taboola.com
376	85	3	plesk.com
519	85	3	dynatrace.com
599	85	3	yieldmo.com
807	85	3	moloco.com
263	80	3	salesforce.com
99	75	3	wordpress.com
166	75	3	reg.ru
216	75	3	cpanel.net
254	75	3	hubspot.com
290	75	3	dnsmadeeasy.com
381	75	3	hcaptcha.com
434	75	3	foxnews.com
491	75	3	indiatimes.com
568	75	3	playstation.com
601	75	3	onetrust.com
608	75	3	braze.com
641	75	3	independent.co.uk

Seven sites tie at 85 out of 100: Dropbox, Shopify, Taboola, Plesk, Dynatrace, Yieldmo, and Moloco. All seven have crawlers open, llms.txt, Organization schema, Article schema on blog posts, and author Person schema with a verifiable identity, plus either the Sitemap directive or FAQPage. Each is a single 15-point signal short of a perfect score. Salesforce sits at 80, and a large tier at 75 follows (WordPress.com, HubSpot, PlayStation, Braze, OneTrust, Fox News).

Forbes and Business Insider score 70 raw points but sit at Level 0, so they do not appear in the Top 20 above (Level 0 sites are excluded from the ranking). Both have strong schema on their content pages, but both block at least one of the six AI crawlers in robots.txt. The Maturity Model forces that to Level 0 because citation is impossible without crawl access. This is a deliberate part of the model, not a bug.

Interesting observation about the Top 20. Several of the top-scoring sites are ranked 500-800 on Tranco, not in the top 100. The Tranco rank measures traffic, but technical AI readiness correlates more with B2B SaaS maturity than raw traffic. Consumer mega-brands (Google, Facebook, Amazon, Apple) score in the 50s; B2B SaaS and adtech (Dynatrace, Yieldmo, Moloco, Braze, OneTrust) score in the 75-85 range.

Why no site reached Level 4

Level 4 requires a technical score of 90 or higher AND observed citation in AI engine responses. Even if a site scored 100 on technical signals, Level 4 is only granted with real-world citation data, which the scanner alone cannot provide. Across the 500 analyzable sites, the technical ceiling we observed was 85 (the seven-way tie that includes Dropbox, Shopify, and Taboola).

The two signals where the top-scoring sites fall short are FAQPage and Sitemap in robots.txt. Dropbox is missing FAQPage. Shopify is missing the Sitemap directive. Taboola is missing FAQPage.

The brands with the strongest infrastructure for AI citation are one or two signals away from the top of the model. Adding a Sitemap: line to Shopify's robots.txt would move them from 85 to 100. Adding a FAQPage schema to Dropbox's news or Taboola's policy pages would do the same. These are small deployments with disproportionate payoff.

What to do, by level

If you are Level 0 Invisible

Check your robots.txt first. If any AI crawler (GPTBot, ChatGPT-User, anthropic-ai, ClaudeBot, PerplexityBot, Google-Extended) is disallowed, you are invisible no matter what else you do. Unblock them unless you have a specific legal or compliance reason to keep them out.

Then add at minimum:
- Organization schema on the homepage.
- A Sitemap: directive in robots.txt.

These two changes move you to Level 1 or Level 2 immediately.
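A quick way to run the robots.txt check yourself is Python's standard `urllib.robotparser`. The sketch below is not the Appearly scanner. It assumes that "blocked" means the crawler may not fetch the root path `/`, which is a simplification, since the report does not define the exact rule. The function name `audit_robots` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "anthropic-ai",
               "ClaudeBot", "PerplexityBot", "Google-Extended"]


def audit_robots(robots_txt: str) -> dict:
    """Report which AI crawlers are blocked and which Sitemap URLs are declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: a crawler counts as blocked if it cannot fetch "/".
    blocked = [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, "/")]
    # site_maps() returns None when no Sitemap: line is present (Python 3.8+).
    sitemaps = [url for url in (parser.site_maps() or [])
                if url.lower().startswith(("http://", "https://"))]
    return {"blocked": blocked, "sitemaps": sitemaps}


example = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""
print(audit_robots(example))
```

An empty `blocked` list together with a non-empty `sitemaps` list corresponds to passing both robots.txt signals in the table.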

If you are Level 1 Discoverable

You have crawler access but almost no structured data. Add:
- Organization JSON-LD on the homepage.
- Article or BlogPosting JSON-LD on blog and content pages.
- A Sitemap: directive in robots.txt if missing.

Moving from Level 1 to Level 2 is typically a same-day engineering task.
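For reference, the two schemas named above are small. The sketch below builds a minimal Organization block and a minimal BlogPosting block and wraps each in a JSON-LD script tag. Every value (Example Co, example.com, Jane Doe, the date) is a placeholder, and the helper name `jsonld_script` is hypothetical. Real pages usually carry more properties than this.

```python
import json


def jsonld_script(data: dict) -> str:
    """Serialize a schema.org dict into an embeddable JSON-LD script tag."""
    # Escape "</" so a value can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values throughout; replace with your own organization details.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://www.example.com/",
    "logo": "https://www.example.com/logo.png",
}

article = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Example post title",
    "datePublished": "2026-04-24",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "publisher": {"@type": "Organization", "name": "Example Co"},
}

print(jsonld_script(organization))
print(jsonld_script(article))
```

The Organization tag belongs in the homepage markup and the BlogPosting tag in each post's markup, rendered server-side so that crawlers that do not execute JavaScript can read it.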

If you are Level 2 Indexable

You have basic schemas but miss the signals that unlock citation:
- llms.txt at the root with a plain-text product description and key links.
- FAQPage schema on your homepage, pricing page, or high-intent content pages.
- Person schema with sameAs for content authors on blog posts, linking to their LinkedIn or professional profile.

These three together move you from Level 2 to Level 3.
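The FAQPage and author Person additions can be sketched as two small builders that emit schema.org-shaped dictionaries. The function names, the sample question, and the profile URL are placeholders for illustration. The property names (`mainEntity`, `acceptedAnswer`, `sameAs`) are the standard schema.org ones.

```python
import json


def faq_page(pairs):
    """Build a FAQPage dict from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def author_person(name, profile_urls):
    """Build a Person dict whose sameAs links make the author verifiable."""
    return {"@type": "Person", "name": name, "sameAs": list(profile_urls)}


# Placeholder content; use the questions your customers actually ask.
faq = faq_page([("Is there a free plan?", "Yes, the free plan covers one site.")])
author = author_person("Jane Doe", ["https://www.linkedin.com/in/janedoe"])
print(json.dumps(faq, indent=2))
print(json.dumps(author, indent=2))
```

The `author` dict would replace a name-only author inside an existing Article or BlogPosting block rather than stand alone on the page.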

If you are Level 3 Retrievable

You are in the top 8% of the analyzable top 500 by technical readiness. The remaining work is not adding more schema; it is earning real citations:
- Time for AI engines to re-crawl and index your new signals (typically 1 to 4 weeks).
- Presence on the sources AI engines cite (Reddit, industry publications, G2 / Capterra, community forums).
- Fresh content with clear answers for queries in your category.
- Monitoring whether you actually appear in AI responses for your target keywords (this is what our Monitor product does).

How to check your own site

We built a free public checker that runs the same 7-signal scan on any domain:

AI Readiness Checker

No signup, no email required. Returns your score, level, and per-signal breakdown, including which page each signal was checked on.

What is the AI Readiness Maturity Model?

The 5-level model is the AI Readiness Maturity Model, a framework we developed for this research. Each level has specific technical criteria, and each jump up requires a specific set of changes. The framework page has definitions, DefinedTermSet schema, and per-level playbooks.

Open dataset

The full CSV of all 500 scanned domains, the methodology document, and the license are hosted at:

appearly.ai/dataset/ai-readiness-2026/

Published under Creative Commons BY 4.0. You can reuse the data freely for research or commentary as long as you cite this report. We plan to re-run this scan quarterly and publish updated datasets at the same URL so the industry has a time-series view of AI readiness adoption.

Caveats

Content-page sampling. We sample up to 3 content pages per site. If none of them has the schema but a deeper page we did not sample does, the site can score lower than it deserves.
JavaScript-rendered schema. The scanner does not execute JavaScript. Sites that inject schema via useHead or similar client-side mechanisms would not be detected. This is a known limitation of non-headless crawlers.
Microdata IS parsed. We detect schema declared via both JSON-LD (<script type="application/ld+json">) and Microdata (itemscope itemtype="https://schema.org/..."). Many e-commerce and older CMS sites use Microdata; missing it would have biased our results downward.
Tranco drift. Tranco updates daily. Any follow-up replication on a different day will have slight domain differences.
WAF false negatives. 36 sites returned HTTP 403 to our bot. Some of those sites likely have rich schema; we cannot confirm.
Level 4 unreachable by scanner alone. Confirming a site has been cited requires live queries to the AI engines themselves, which we do not perform in this scan.

What we are not claiming

We are not claiming that the top 500 are "bad" at AI visibility. Many of them dominate AI search answers anyway, because brand authority and content corpus volume compensate for thin schema. We are claiming that the technical foundation across the web is weaker than the industry assumes, and that smaller brands willing to invest in readiness can punch well above their weight in AI citation.

Frequently asked questions

Why did you check some signals on the homepage and others on a content page?

Because that is where those signals actually live. Organization schema belongs on the homepage. Article schema belongs on articles. Author Person schema belongs on posts with a byline. Checking all 7 signals on the homepage would be methodologically weak and bias the results downward. We look for content pages in the site's sitemap first, fall back to homepage links to blog / news / articles / resources / insights / learn sections, sample up to 3 pages, and check content-level signals there.

What happens if a site has no detectable content page?

Article schema and Author Person schema are marked N/A for that site and contribute 0 points. This is by design rather than a scoring bug: a brochure-only site simply has less for AI engines to cite, and its score reflects that. Of the 500 analyzable sites, 271 (54.2%) had a content page we could detect.

Why is Author Person schema adoption only 1.9%?

Across the 156 sites where we sampled a content page, only 3 had a Person block with a sameAs link (LinkedIn, Twitter, or similar). The three are The Guardian, cPanel, and Business Insider. We manually verified several sites that you would expect to have it (Moz, HubSpot, Notion, Stripe blog): they declare `Person` author schema but without a `sameAs` link. Author name alone, without a verifiable identity link, is a substantially weaker E-E-A-T signal than Person + sameAs. That distinction is what the model captures.

Does Google ranking correlate with AI readiness score?

Not strongly. In our Top 20, Tranco ranks span from #99 (wordpress.com) to #807 (moloco.com). Some of the highest-ranking domains globally score in the Invisible or Discoverable brackets because they block AI crawlers or have minimal schema. Google authority and AI readiness are distinct dimensions.

What happens when a site blocks an AI crawler?

The Maturity Model forces any site that blocks any of the six AI crawlers to Level 0, regardless of total score. Citation is impossible without crawl access. This is the reason some sites with otherwise-strong scores appear at Level 0 in the dataset. It is not a bug; it is an explicit part of the model design. A site's raw score is still useful as a diagnostic of where its infrastructure is strong, even while the site is structurally capped at Level 0.

What is the difference between Level 3 Retrievable and Level 4 Cited?

Level 3 means the site scores 60-89 and has most of the technical signals needed to be cited. Level 4 adds confirmed observation of the site in actual AI engine responses for its target queries. Technical readiness is necessary but not sufficient for citation; reaching Level 4 requires live monitoring of AI answers, which is what Appearly's Monitor product does.

Why is FAQPage weighted so heavily?

FAQPage is one of the most AI-cited schema types because its Question / Answer structure matches the shape of AI engine answers directly. Engines can lift a question-answer pair verbatim from structured data without parsing prose. Sites that add FAQPage to high-intent pages consistently see higher citation rates.

Is the scanner open-source?

The scanner code is proprietary to Appearly. The dataset it produces and the methodology document are open under Creative Commons BY 4.0. This split lets us share the findings for reuse while keeping the implementation a product differentiator.

How often will this report be updated?

Quarterly. Next update scheduled for July 2026. Each update uses the then-current Tranco list, read from the top until 500 analyzable websites are found, and documents the exact scan date so readers can reproduce results.

Can I run this scan on my own site?

Yes. The free AI Readiness Checker at appearly.ai/ai-readiness-checker/ runs the exact same 7-signal scan on any domain you provide. No signup required, response in about 15 seconds. The result page shows which page each signal was checked on.
