# Methodology
This document describes exactly how the 2026 State of AI Readiness scan was conducted.
## Source list
- **List:** Tranco Top 500.
- **Date of fetch:** April 23, 2026.
- **Source URL:** https://tranco-list.eu/top-1m.csv.zip (the current daily list).
- **Why Tranco:** Academic ranking combining Alexa, Cisco Umbrella, Majestic, and Farsight. Peer-reviewed, reproducible, free.
We took the first 500 rows of the daily list.
## Scanner
- **HTTP client:** `httpx` (Python).
- **User-Agent:** `AppearlyBot/1.0 (+https://appearly.ai/bot)` — identifiable and static. Sites that wish to exclude us can block that UA.
- **Requests per domain:** Up to ~12 in the worst case: `GET /` (homepage), `GET /robots.txt`, `GET /llms.txt`; up to 6 requests for sitemap-based content-page discovery (sitemap root → sub-sitemap → content page); up to 5 more for homepage-link fallback discovery if the sitemap path fails; plus up to 2 fallback requests for FAQPage detection (`/faq/`, `/faqs/`). In practice most scans use 5-8 requests.
- **Timeout:** 10 seconds per request.
- **Follow redirects:** Yes, max 5 hops.
- **Parallelism:** 15 domains concurrent.
- **SSRF protection:** DNS resolution + IP range check. Private, loopback, reserved, or link-local IPs are rejected as `blocked`. Domains with no A/AAAA records are marked `dns_failed`.
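The SSRF guard described above can be sketched with the standard library alone. The function names (`is_public_ip`, `classify_domain`) and return values are illustrative, not the scanner's actual API:

```python
import ipaddress
import socket

def is_public_ip(ip: str) -> bool:
    """True only for globally routable addresses; rejects private,
    loopback, reserved, and link-local ranges in one check."""
    return ipaddress.ip_address(ip).is_global

def classify_domain(domain: str) -> str:
    """Return 'ok', 'dns_failed' (no A/AAAA records),
    or 'blocked' (resolves only to non-public IPs)."""
    try:
        infos = socket.getaddrinfo(domain, None)
    except socket.gaierror:
        return "dns_failed"
    ips = {info[4][0] for info in infos}
    return "ok" if any(is_public_ip(ip) for ip in ips) else "blocked"
```

`ipaddress.ip_address(...).is_global` covers the private/loopback/reserved/link-local cases in a single property, for both IPv4 and IPv6.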
## The 7 signals, and where we check each
We check each signal at the level it realistically appears. Checking all 7 on the homepage would be methodologically weak.
| # | Signal | Points | Where we check | Pass criterion |
|---|---|---|---|---|
| 1 | AI crawler access | 20 | robots.txt | All 6 AI crawlers allowed at root (GPTBot, ChatGPT-User, anthropic-ai, ClaudeBot, PerplexityBot, Google-Extended). Any explicit `Disallow: /` for one of these (without an `Allow: /` override) fails. |
| 2 | llms.txt | 10 | `/llms.txt` | HTTP 200, non-empty body, not HTML, size under 500KB. If a `Content-Type` header is present, it must be plain text, markdown, or octet-stream. |
| 3 | Organization schema | 15 | Homepage | JSON-LD `@type` in `{Organization, LocalBusiness, Corporation, OnlineBusiness, NGO}`. |
| 4 | Article schema | 15 | A content page detected on the site | JSON-LD `@type` in `{Article, BlogPosting, NewsArticle, TechArticle, Report}`. |
| 5 | FAQPage schema | 15 | Homepage OR content page OR `/faq/` OR `/faqs/` | JSON-LD `@type == "FAQPage"`. |
| 6 | Author Person schema | 10 | A content page detected on the site | JSON-LD `Person` (top-level or nested as `author` of an Article block) with a non-empty `sameAs` field. |
| 7 | Sitemap in robots.txt | 15 | robots.txt | At least one `Sitemap:` directive pointing to an http(s) URL. |
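Signals 1 and 7 can both be checked against the raw robots.txt body. The sketch below uses the stdlib `urllib.robotparser`, whose Allow/Disallow matching is close to, but not guaranteed identical to, the production scanner's rules:

```python
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "anthropic-ai",
               "ClaudeBot", "PerplexityBot", "Google-Extended"]

def ai_crawlers_allowed(robots_txt: str, base: str = "https://example.com/") -> bool:
    """Signal 1: all six AI crawlers must be allowed at the root."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return all(rp.can_fetch(ua, base) for ua in AI_CRAWLERS)

def has_sitemap(robots_txt: str) -> bool:
    """Signal 7: at least one Sitemap: directive pointing at an http(s) URL."""
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            url = line.split(":", 1)[1].strip().lower()
            if url.startswith(("http://", "https://")):
                return True
    return False
```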
### Content-page discovery
For signals 4 and 6, we need a content page. We attempt detection with two strategies, in order:
**Strategy A: Sitemap-based (primary).**
1. Parse `Sitemap:` URLs from the site's robots.txt (fall back to `/sitemap.xml` if none declared).
2. Fetch the sitemap. If it is an index (all entries are `.xml`), recurse into the sub-sitemap whose name best matches content keywords (e.g. `news.xml` before `image.xml`).
3. Extract `<loc>` URLs from the chosen sitemap.
4. Filter to article-like URLs: same host, not homepage, not legal/privacy/terms/support/help paths, not feeds or sitemaps.
5. Score each candidate: +100 if path contains `/blog/`, `/news/`, `/article/`, `/post/`, etc.; +80 if path has a date pattern like `/2024/01/`; +5 per path segment depth; +20 for a long slug with dashes.
6. Try up to 3 top-scoring candidates. First one that returns HTML wins.
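The candidate scoring in step 5 can be sketched as follows. The weights come from the text; the keyword subset and regexes are illustrative, not the scanner's exact patterns:

```python
import re

CONTENT_SEGMENTS = ("/blog/", "/news/", "/article/", "/post/")  # illustrative subset
DATE_RE = re.compile(r"/20\d{2}/\d{2}/")                         # e.g. /2024/01/
SLUG_RE = re.compile(r"/[a-z0-9]+(?:-[a-z0-9]+){2,}/?$")         # long dashed slug

def score_candidate(path: str) -> int:
    """Score one sitemap <loc> path; higher means more article-like."""
    score = 0
    if any(seg in path for seg in CONTENT_SEGMENTS):
        score += 100                                   # content-section path
    if DATE_RE.search(path):
        score += 80                                    # date pattern
    score += 5 * len([s for s in path.split("/") if s])  # +5 per path segment
    if SLUG_RE.search(path):
        score += 20                                    # long slug with dashes
    return score
```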
**Strategy B: Homepage-link parsing (fallback).**
1. Parse the homepage HTML for `<a href>` links matching content-section patterns (`/blog/`, `/news/`, `/articles/`, `/resources/`, `/insights/`, `/learn/`, `/stories/`, `/case-studies/`, `/guides/`, `/posts/`, etc.).
2. Priority 1: a link that looks like a specific post (slug with 3+ chars after the content section).
3. Priority 2: a hub link (`/blog/`, `/news/`). Fetch the hub, find post links within, take the first.
4. Priority 3: try common hub paths directly (`/blog/`, `/news/`, `/insights/`, `/articles/`).
If neither strategy yields a usable content page within the fetch budget, we mark Article and Author Person as **N/A** for that site. They contribute 0 points. The site is not classified as "failing" those signals; it simply has no content page we could detect.
In the April 2026 scan, 154 of 291 analyzable sites (52.9%) had a detectable content page via one of these strategies.
### JSON-LD extraction
We parse JSON-LD blocks with this regex (case-insensitive, dot-matches-newline):
```
<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>
```
Each matched block is parsed as JSON. We handle:
- Single objects with `@type`.
- Top-level arrays of objects.
- `@graph` wrappers (unwrapped before inspection; top-level fields preserved).
- `@type` as string OR as array of strings.
- `@type` values with `http://schema.org/` or `https://schema.org/` prefix (normalized to short form).
HTML comments inside JSON-LD blocks are stripped before parsing.
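The extraction and normalization rules above can be sketched as follows, using the regex from this section; the helper names are illustrative:

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

def extract_nodes(html: str) -> list[dict]:
    """Flatten JSON-LD blocks: single objects, top-level arrays,
    and @graph wrappers (wrapper kept, graph members appended)."""
    nodes = []
    for block in JSONLD_RE.findall(html):
        block = re.sub(r"<!--.*?-->", "", block, flags=re.DOTALL)  # strip HTML comments
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                nodes.append(item)
                nodes.extend(n for n in item.get("@graph", []) if isinstance(n, dict))
    return nodes

def normalize_types(node: dict) -> set[str]:
    """@type as string or array; schema.org URL prefixes stripped to short form."""
    raw = node.get("@type", [])
    if isinstance(raw, str):
        raw = [raw]
    return {t.rsplit("/", 1)[-1] for t in raw if isinstance(t, str)}
```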
## The AI Readiness Maturity Model (5 levels)
| Level | Name | Score range | Override |
|---|---|---|---|
| 0 | Invisible | 0-19 | OR: any AI crawler blocked |
| 1 | Discoverable | 20-39 | - |
| 2 | Indexable | 40-59 | - |
| 3 | Retrievable | 60-89 | - |
| 4 | Cited | 90-100 + observed citation | The scanner alone cannot confirm Level 4 |
**Override rationale.** If the AI crawler signal fails (any of the 6 AI crawlers is blocked), the level is forced to 0 regardless of total score. No citation is possible without crawl access.
**Level 4 caveat.** A site with a perfect technical score (90+) still requires observed citation in AI engine responses to be placed at Level 4. Citation observation requires live queries to each engine, which the scanner does not do. In the Top 500 scan, the maximum technical score was 75 (Stripe, Dropbox), so no site approached the Level 4 technical threshold.
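The level assignment, including both the crawler-block override and the Level 4 citation requirement, can be sketched as a single function; the names are illustrative:

```python
LEVELS = [(0, "Invisible"), (20, "Discoverable"), (40, "Indexable"),
          (60, "Retrievable"), (90, "Cited")]

def assign_level(total_score: int, ai_crawler_allowed: bool,
                 citation_observed: bool = False) -> tuple[int, str]:
    """Map score to level. Any blocked AI crawler forces Level 0;
    Level 4 additionally requires an observed citation, else caps at 3."""
    if not ai_crawler_allowed:
        return 0, "Invisible"
    level = max(i for i, (threshold, _) in enumerate(LEVELS)
                if total_score >= threshold)
    if level == 4 and not citation_observed:
        level = 3
    return level, LEVELS[level][1]
```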
## Status codes
The `status` column describes the scan outcome, not the site's content quality.
| Status | Meaning |
|---|---|
| `scanned` | All fetches completed; scoring applied. |
| `partial` | Reserved for edge cases where 1-2 checks could not complete but scoring was still produced. Currently unused. |
| `blocked` | Homepage returned HTTP 401 or 403 to AppearlyBot (WAF block). Schema may exist but we cannot verify. |
| `dns_failed` | Domain has no A or AAAA records (DNS infrastructure, not a website). |
| `failed` | Network error (timeout, connection refused, 5xx). |
For aggregate statistics, we use only rows where `status IN ('scanned', 'partial')`.
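The aggregate filter can be sketched against the CSV with the stdlib; the function name is illustrative:

```python
import csv
import io

ANALYZABLE = {"scanned", "partial"}

def analyzable_rows(csv_text: str) -> list[dict]:
    """Keep only rows whose scan completed, per the status table above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["status"] in ANALYZABLE]
```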
## CSV columns
| Column | Type | Notes |
|---|---|---|
| `rank` | int | Position in Tranco Top 500. |
| `domain` | string | Normalized domain (lowercase, no protocol). |
| `status` | enum | See table above. |
| `total_score` | int | Sum of signal points, 0-100. |
| `level` | int | Maturity level 0-4. |
| `level_name` | string | Human-readable level. |
| `ai_crawler_allowed` | `1`/`0` | Site-level. |
| `has_llms_txt` | `1`/`0` | Site-level. |
| `has_org_schema` | `1`/`0` | Homepage. |
| `has_article_schema` | `1`/`0`/`` | Content page. Empty = N/A (no content page detected). |
| `article_checked_on` | string | `content_page` or `none`. |
| `has_faqpage_schema` | `1`/`0` | Multi-location. |
| `faqpage_checked_on` | string | `homepage`, `content_page`, `/faq/`, `/faqs/`, or `homepage+content+faq_paths` (when failed). |
| `has_author_schema` | `1`/`0`/`` | Content page. Empty = N/A. |
| `author_checked_on` | string | `content_page` or `none`. |
| `has_sitemap_in_robots` | `1`/`0` | robots.txt. |
| `content_page_url` | string | The specific content page URL we sampled (if any). |
| `scan_duration_ms` | int | Total HTTP time for the scan. |
| `final_url` | string | Final URL after redirects (homepage). |
| `error` | string | Error message for failed / blocked / dns_failed rows. |
## Validation
The scanner was validated through multiple independent approaches before the dataset was published:
**1. Cross-validation against `extruct`.** We ran our scanner and [extruct](https://pypi.org/project/extruct/) (Zyte's reference library for structured-data extraction, widely used in the scraping / SEO industry) on 120 randomly sampled sites across 2 random seeds. Both parsers agreed on 100% of 466 per-signal comparisons for Organization, Article, FAQPage, and Person schema detection.
**2. Triangulation against `pyld` + `microdata` library.** We added a second independent validation using [pyld](https://pypi.org/project/PyLD/) (a W3C-compliant JSON-LD expansion library used in academic contexts) and the [microdata](https://pypi.org/project/microdata/) library (an independent Microdata parser, separate from extruct's internal mf2py). Across multiple random seeds, our scanner agreed with the union of `pyld + microdata` detection on 100% of signal comparisons.
**3. Cross-check against Web Almanac 2024.** The Web Almanac (published annually by HTTP Archive) reports structured-data adoption across 8M+ crawled pages on the global web. Their 2024 data reports 7.16% of pages with Organization schema, 1.40% with BlogPosting, and 2.9% with GPTBot rules in robots.txt. Our Top 500 results (22% Organization, 21% Article on content pages, 78% AI crawler access) are directionally consistent: top sites show meaningfully higher adoption than the global average, as expected.
**4. Manual spot-checks.** For each edge case where our scanner differed from a reference tool, we inspected the raw HTML byte-by-byte to determine the correct answer. Discovered bugs included: (a) Person schema nested inside `ListItem.author` (Pinterest), (b) Organization nested inside `publisher` / `copyrightHolder` properties, (c) `@graph` nested inside a list-wrapped JSON-LD block (Statista pattern), and (d) Person with `url` pointing to an author bio page instead of external `sameAs` (HubSpot pattern). Each was confirmed against a reference parser before the fix was shipped.
The validation harness (`ai_readiness/validate.py` and `ai_readiness/triangulate.py`) is preserved in the codebase for future re-runs.
## Caveats
- **Content-page sampling.** We pick one content page per site. If the sampled page is a landing or listing without Article schema, the site may score lower than it actually deserves. Sites with schema on deeper pages we did not sample would be missed. This is a known limitation of the sampling approach.
- **JavaScript-rendered schema.** The scanner does not execute JavaScript. Sites that inject schema via React `useHead` or similar client-side mechanisms would not be detected. This is a known limitation of non-headless crawlers.
- **Microdata not parsed.** Only JSON-LD is parsed. Sites using inline Microdata (`itemscope` / `itemtype`) are not scored on those signals. JSON-LD is by far the dominant format in 2026, so this affects few sites.
- **User-Agent blocking.** Some sites block unknown bots at the WAF. We treat those as `blocked` and exclude them from per-signal statistics.
- **Tranco drift.** The exact composition of the top 500 shifts daily. Any replication on a different day will see slight domain differences.
- **Technical ≠ observed citation.** A high score does not guarantee AI engines cite the site. Citation requires brand authority, content corpus, and competitive density, in addition to technical readiness.
## Replication
To build an equivalent scanner:
1. Download the Tranco daily list: https://tranco-list.eu/top-1m.csv.zip
2. For each domain, issue:
- `GET https://{domain}/` with redirect following, timeout 10s
- `GET https://{domain}/robots.txt`
- `GET https://{domain}/llms.txt`
3. Parse the homepage HTML for content-section links. Follow priority 1, 2, 3 in order until a content page is fetched (or budget exhausted).
4. If no FAQPage found on homepage or content page, try `/faq/` and `/faqs/`.
5. Apply the 7 checks against the appropriate pages.
6. Sum points. Apply the crawler-block override. Emit level.
## Contact
Data questions: open an issue on this repo.
Report discussion: [[email protected]](mailto:[email protected]).