Methodology

How we build a cohort. The same way, every time.

Yardstick is an independent software research and consulting agency. We publish guides cohort by cohort on the protocol below.

2-min walkthrough / cohort-agnostic

How Yardstick scores any cohort

The three principles, in detail

The video above covers the protocol at a glance. The sections below add the auditable detail: what we explicitly do not do, what is auditable versus editorial, and the in-scope / out-of-scope lines for vendor right-of-reply.

  1. 01

    Same evidence stack, every vendor in the cohort.

    What we explicitly do not do. We do not pay for paid tiers. At cohort scale, paid-tier seat costs are cost-prohibitive. We do not run held-out outcome experiments (e.g., a 50-prospect cold-email campaign for the AI sales cohort, or a held-out call-center deflection test for the AI call-center cohort). Those require infrastructure and legal posture we are not set up to maintain. For every dimension where we cannot defensibly produce held-out outcome data, the cohort's rubric replaces it with an observable leading indicator that traces back to vendor pricing, feature pages, and free-tier hands-on. The replacement, and the reason for it, is published with each cohort's rubric.

    Every claim is labelled. Every cell on every tear-sheet carries one of three labels:

    • MEASURED. Observed first-hand on the free tier or trial seat, or graded by us against a sample prospect (e.g., personalization output quality, setup-time stopwatch on the free tier, UI heuristics from the trial seat).
    • ESTIMATED. Derived by us from the vendor's own pricing page and feature limits (e.g., cost-per-seat efficiency).
    • CITED: sourced from a vendor-published benchmark (cited as such, with its provenance) or a third-party review (G2, Capterra, customer case study, with the source linked).

    Auditable vs. editorial. The evidence-intake template, the rubric weights, and the per-tear-sheet evidence log (every claim, its label, and the URL or screenshot it traces to) are auditable. They live in the published Yardstick Report appendix. The "best for" call at the end of each tear-sheet is editorial judgement, sourced from the scored data, and labelled as such.

  2. 02

    Score on observable outcomes, with weights we publish.

    How dimensions are picked per cohort. A dimension qualifies for a cohort's rubric only if (a) it is observable from the cohort's evidence stack: vendor docs, pricing, integrations, free-tier or trial-seat hands-on, third-party reviews, and (b) it materially affects buyer outcome in that category. Every cohort's rubric carries one shared dimension at 25%: ease of data integration & accuracy. The frontier requirement that determines whether the AI tool can actually be trusted with the buyer's data and trusted to produce accurate output once it has that data. Around that anchor, each cohort's other dimensions are category-specific. The AI sales cohort scores personalization, email reputation, cost-per-seat efficiency, UI heuristics, and setup time. The AI call-center cohort scores a different set: speech-to-text accuracy, real-time agent assist, voicebot containment, compliance posture, telephony economics. Because those are the buyer-decisive dimensions in that category.

    What we deliberately do not score, in any cohort. Anything that depends on held-out outcome data we are not set up to collect, or anything not observable from public information plus free-tier hands-on. Each cohort's rubric documents what it leaves out and which observable leading indicator it scores in place.

    Auditable vs. editorial. Every raw input is in the appendix with its evidence label. The weights are public. The rubric anchors that turn raw observations into a 0-4 dimension score are published. What is editorial: the choice of which dimensions to include and at what weight. We invite arguments on that choice. See the section at the bottom of this page.

  3. 03

    Publish. Vendors check facts. Affiliate links disclosed.

    What's in scope. Factual errors only. If we listed a HubSpot integration as native and it's actually via Zapier, we fix it. If we listed a price as $400 a seat and it's $450 a seat, we fix it. If we cited a vendor benchmark and the vendor has a more recent published figure, we update the citation.

    What's out of scope. Rankings. Rubric weights. Editorial language. The decision to include or exclude the vendor. The decision to mark a dimension as out-of-scope on the rubric.

    Affiliate disclosure. Where the guide or this site links to a vendor's product, that link may earn Yardstick a commission. Disclosed on every page where the link appears. The commission rate is not a factor in the score, the ranking, or the cohort.

The evaluation protocol. What each vendor walks through

Every vendor in the cohort goes through the same five-step protocol. The protocol is fixed before any vendor is loaded into the workbook, and it does not change once evaluation opens. Each step writes to the per-vendor evidence log that ships in the Yardstick Report appendix.

  1. A

    Public-information sweep

    Vendor docs, pricing page, feature pages, integrations page, security/trust page, recent blog posts, recent funding news, and the vendor's own published benchmarks are pulled into the workbook. Every URL is captured. Every claim sourced from this step is labelled CITED on the tear-sheet.

  2. B

    Free-tier or trial-seat hands-on

    Where the vendor offers a free tier or a self-serve trial seat, we sign up under standard terms and exercise the product against the same sample prospect every other vendor sees. Setup time is stopwatched; UI heuristics graded; email reputation and integration screens screenshotted; data accuracy sampled where the tier exposes enrichment. Outputs from this step are labelled MEASURED. Where no free tier or trial seat exists, we say so on the tear-sheet and fall back to step C for the affected dimensions.

  3. C

    Third-party evidence

    G2, Capterra, public customer case studies, recent video walkthroughs, and practitioner discussion (LinkedIn, Reddit, Pavilion). Used to triangulate setup time, UI heuristics, and integration depth on dimensions we cannot exercise on a free tier. Outputs from this step are labelled CITED with the source URL on the tear-sheet.

  4. D

    Sample-prospect output grading

    A single sample prospect, full ICP profile, public LinkedIn, public company page, is loaded into each tool that can produce outbound copy. Two graders score the output blind against the personalization rubric. Output and grade live in the appendix. Outputs from this step are labelled MEASURED.

  5. E

    Cost-per-seat estimation

    From the vendor's published pricing and feature limits, a $/month-per-output-unit figure is calculated under documented assumptions. The formula and assumptions are printed in the appendix. Outputs from this step are labelled ESTIMATED. Where no list pricing is published, the dimension is scored "not disclosed" and the tear-sheet says so.

What the protocol deliberately does not include. We do not pay for paid-tier seats. We do not run held-out outcome experiments: those require infrastructure we are not set up to maintain, so for every dimension that would otherwise depend on held-out outcome data, the cohort's rubric replaces it with an observable leading indicator and explains the substitution.

Cohorts we publish

Each cohort's rubric is published on its own page. The protocol on this page is the same across all of them.

Vendor right-of-reply

A factual claim should survive contact with the vendor's own team. A ranking is editorial and is not subject to appeal. The right-of-reply protocol enforces both.

How it works. Seven calendar days before a guide is published, every vendor in the cohort receives their scored tear-sheet by email. They have those seven days to flag factual errors. If they reply with corrections, we verify the correction against our raw data and amend the tear-sheet if the correction is upheld. If they don't reply, the tear-sheet publishes as-is.

In scope. Pricing tier listed incorrectly. Feature misquoted. Integration listed as native that's actually via a third party (or vice versa). Customer count off. Founding year off. ICP mis-stated.

Out of scope. The score. The ranking. The rubric. The weights. The decision to mark a dimension as out-of-scope on the rubric. The editorial language in the tear-sheet. The decision to include or exclude the vendor from the cohort.

The full right-of-reply email template is in the brand kit and reproduced in every Yardstick Report appendix.

Conflicts of interest

Every revenue line that touches a vendor is disclosed here, on every relevant page, and in the Yardstick Report footer.

Affiliate links. Where the guide or this site links to a vendor's product, that link may earn Yardstick a commission. The presence or rate of an affiliate program does not affect the score, the ranking, or whether the vendor is included in the cohort. A vendor without an affiliate program is treated identically to one with the largest commission in the category.

Paid placements. None. We do not accept sponsored content, paid placements, vendor research contracts, or "thought leadership" trades.

Vendor advisory roles. None. Yardstick staff do not hold paid advisory positions, formal or informal, at any vendor in any cohort we cover.

Equity. Yardstick holds no equity in any vendor in any cohort we cover. If that ever changes, the holding is disclosed in the affected tear-sheet, in the appendix, and in the footer of every page where the vendor appears.

Argue with our methodology

Find a flaw. Tell us.

Every methodology is wrong somewhere. The way it stops being wrong is by surviving public critique from people who know more than we do about a corner of it. If you spot a flaw, a weight that's miscalibrated, a dimension we should add or drop, an inclusion criterion that's screening out the wrong vendors, a baseline number we should update, write to us.

hello@yardstickresearch.app

Email a critique

Or read more about who we are / take the audit

Take the free 4-minute readiness audit.

Get your score, peer benchmarks, and three tailored vendor recommendations. No email required to see your results.