How to test whether WordPress email actually reaches the inbox

A WordPress site sends a password reset. The mailer plugin’s log shows the relay accepted the message with a 250 OK. The DMARC aggregate report for the next day shows the message passed alignment. The user calls support and says the email never arrived. Where did it go?

This is the layer below DMARC. The relay log proves the relay accepted the message; the DMARC report proves the receiving server accepted the authentication. Neither proves the receiving server put the message in the inbox. Receivers run their own filters after authentication, and a message that passed every prior test can still land in spam, land in the “Promotions” tab, or be silently dropped. The operator who has done the SMTP and DMARC work and is still seeing complaints is looking at this layer: inbox placement.

What follows covers each tool, the methodology that uses them, and the limits each runs into. This is not a deliverability tutorial; if SPF, DKIM, and DMARC are not yet configured, start with DNS setup for WordPress email and come back when authentication is in place.

The three layers

WordPress email lives in three test layers that are easy to conflate and necessary to separate.

Delivery. The relay’s log. Did the receiving server accept the message at the SMTP transaction? A 250 OK response (see how to read an SMTP session log for the full transaction) confirms the relay handed the message off and the receiver acknowledged it. Delivery is the first checkbox; it proves the message left WordPress and reached the receiver’s MTA.

Authentication. The DMARC aggregate report. Did SPF and DKIM align with the From-header domain, and did the receiver’s evaluation of those alignments pass? The DMARC parsing guide covers how to read these reports. Authentication is the second checkbox; it proves the receiver trusted the message’s sender identity.

Inbox placement. The receiver’s content filter. Did the receiver put the message in the user’s inbox, or in spam, or in a non-primary tab, or drop it silently? This layer is not visible in the relay log or the DMARC report. The receiver does not tell the sender what happened past the authentication check.

The first question most operators reach for is “did the email get delivered?” The operator looks at the relay log, sees 250 OK, and tells the user the email was sent. That answer is technically true and operationally insufficient. Delivered to the receiver’s MTA is not the same as delivered to the user’s inbox. The rest of this guide is the methodology for measuring the third layer.

mail-tester: the fastest first test

mail-tester is the fastest available diagnostic. The service generates a unique recipient address, the operator sends one email to that address, and the service returns a score from 0 to 10 with a breakdown of every factor it checked. The breakdown covers content patterns (spam-flagged words, link density, image-to-text ratio), header configuration (DKIM signature presence, SPF record alignment, DMARC policy), blacklist checks across major DNSBLs, and message-structure issues (HTML validity, encoding, multipart/alternative handling).

The value: in under a minute the operator gets a structured list of every factor that could plausibly cause filtering, with each factor scored and explained. A WooCommerce order-confirmation email that scores 6/10 because the body contains the word “free” twice, the SPF record is missing the relay’s IP, and the HTML version has an unclosed tag is a far more actionable diagnostic than “users say they didn’t get it.”

The limits: mail-tester is one mailbox running one set of filtering rules. Its score is not the same as Gmail’s placement decision or Outlook’s placement decision. A 10/10 mail-tester score does not guarantee a Gmail inbox landing; a 7/10 score does not guarantee a spam folder. The value is the structured breakdown, not the headline number. Operators who chase the score itself end up optimising for mail-tester rather than for their actual audience’s mailboxes.

Use mail-tester first because it is fast and free. Use it to rule out content-level and header-level problems before paying for or building anything more elaborate. Re-test after every change to the body template, every DKIM-key rotation by the sending provider, and every change to the From or Reply-To address. The cost of a re-run is the time it takes to copy the test address and click send.
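
A re-test is easier to repeat when the test message is built by a script. The sketch below is a minimal example using only Python's standard library, assuming an SMTP relay that accepts STARTTLS and password login; the host, port, credentials, addresses, and both function names are placeholders invented for illustration. Sending the real WordPress template through the site's own mailer remains the more faithful test, since that is the message receivers actually see.

```python
import smtplib
import uuid
from email.message import EmailMessage


def build_test_message(from_addr, to_addr, marker=None):
    """Build a multipart/alternative test message with a unique subject marker."""
    marker = marker or f"placement-test-{uuid.uuid4().hex[:8]}"
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = f"Order confirmation {marker}"
    # Plain-text part first, then the HTML alternative.
    msg.set_content("Thanks for your order. This is a placement test.")
    msg.add_alternative(
        "<html><body><p>Thanks for your order. This is a placement test.</p></body></html>",
        subtype="html",
    )
    return msg, marker


def send_via_relay(msg, host, port, user, password):
    """Send through an SMTP relay using STARTTLS (an assumption; adjust for your relay)."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```

The returned marker is worth keeping: the same string can later be used to search seed mailboxes for the message.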

Gmail’s Promotions and Updates tabs are a category of misclassification mail-tester does not detect. A message that lands in Promotions is not in spam (the user can find it) and not in the inbox (the user typically does not look). Tab placement is mostly a function of body content (image-heavy templates, marketing-style HTML), the sender’s reputation history (consistent transactional senders rarely get tabbed), and Gmail-side classifier updates, which can shift behaviour from one day to the next. There is no public test tool for tab placement; the workaround is to send transactional and marketing mail from different subdomains so Gmail can build separate reputation profiles for each.

Gmail Postmaster Tools and Microsoft SNDS

For domains with established sending volume (typically thousands of messages per month to Gmail or Outlook respectively) the receivers publish aggregate sender-reputation data that the operator can read directly.

Gmail Postmaster Tools requires DNS-verifying the sending domain and waiting until the domain accumulates enough volume that Gmail will share data (the threshold is not published; in practice the only reliable check is whether the dashboard starts showing data). Once active, the dashboard shows domain reputation, IP reputation, authentication pass rates broken down by SPF/DKIM/DMARC, spam rate as reported by user “report spam” clicks, and delivery errors. The data lags by one day and covers Gmail / Workspace only.

Microsoft SNDS (Smart Network Data Services) covers the equivalent for Microsoft’s consumer mailboxes (Outlook.com and Hotmail.com); Microsoft 365 business tenants are filtered separately and are not reflected in SNDS data. Access requires registering the sending IPs (not the domain; Microsoft is IP-centric where Google is domain-centric); once registered, the dashboard shows IP reputation, complaint rate, and trap-hit rate. JMRP (Junk Mail Reporting Program) is a complementary feedback loop that emails the operator each time a Microsoft user reports a message from the registered IP as spam.

Both tools are free, vendor-direct, and the closest thing to “what does Gmail / Microsoft actually think of my sending.” The limits are real:

  • Aggregate-only. Neither tool tells the operator about a specific message that failed to land. They give multi-day reputation trends and per-day complaint rates, not per-message placement.
  • Volume-gated. Both tools require enough sending volume to produce statistically meaningful data. A WordPress site sending a few hundred transactional messages per month gets thin or empty dashboards.
  • Coverage-limited. Gmail Postmaster covers Gmail; SNDS covers Microsoft. Yahoo, AOL, Comcast, regional providers, and corporate Exchange tenants are not covered by either. Yahoo Sender Hub exists but is less mature than the Google and Microsoft offerings.

Use Postmaster Tools and SNDS if the domain’s sending volume justifies them. They are the right tool for “what does the receiver think of my reputation”; they are the wrong tool for “did this specific message land.”

GlockApps and equivalent inbox-placement panels

The paid inbox-placement category is built on seed lists: panels of real mailbox addresses across Gmail, Outlook, Yahoo, AOL, Comcast, and dozens of smaller providers. The operator sends one test message to the panel; the service reports per-provider inbox-vs-spam-vs-missing placement.

GlockApps is the WordPress audience’s default. Pricing is published on the GlockApps site and varies by test allowance, retention, and feature tier; expect monthly costs in the low double digits for a small-site allowance and several times that for the tiers with deliverability monitoring, automated test scheduling, and bounce-handling diagnostics. Litmus and Email on Acid offer similar panels at higher price points; their feature sets target marketing-email teams with budgets and integrations the WordPress operator usually does not have. Inbox Insight and similar smaller services occupy the lower end of the price range with smaller panels.

What the paid panels measure that nothing else does: actual placement across a representative panel of real provider mailboxes for a specific test message. If the operator wants to know “if I send the WooCommerce order confirmation to Gmail, Outlook, and Yahoo right now, where does it land?”, this is the only category of tool that answers it.

The limits, named honestly:

  • The panel is not your audience. The seed addresses are known to filtering systems. Major providers have anti-fraud heuristics that treat known-seed addresses differently from real-user addresses, sometimes more favourably (treated as test traffic) and sometimes less (treated as suspect because they receive bulk-pattern mail). The score is a directional measurement of your sending posture, not a direct prediction of what your actual users will see.
  • Snapshot, not trend. A single GlockApps run measures one message at one time against one filter state. Provider filters change daily; a clean panel today does not guarantee a clean panel next week. The value comes from running the same test repeatedly and watching the trend, not from interpreting one score.
  • Cost. The entry tier is affordable in absolute terms but is also a recurring expense most WordPress operators running a single small site are reluctant to take on for a problem that may resolve with a content tweak. Use the panel runs strategically: when something has changed (new relay, new From address, new send volume), and as a quarterly sanity check.
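
The snapshot-versus-trend point can be made concrete with a small sketch. It assumes each panel run has been recorded by hand as a mapping of provider to "inbox", "spam", or "missing"; that data shape and both function names are illustrative assumptions, not a GlockApps export format.

```python
def inbox_rate(run):
    """Fraction of seed addresses in one run that landed in the inbox."""
    if not run:
        raise ValueError("run has no results")
    return sum(1 for v in run.values() if v == "inbox") / len(run)


def trend(runs, window=3):
    """Moving average of inbox rate over the last `window` runs at each point."""
    rates = [inbox_rate(r) for r in runs]
    smoothed = []
    for i in range(len(rates)):
        chunk = rates[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A single low value in the raw rates matters less than a smoothed series that keeps falling across several runs.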

nanoPost’s recommendation: for a one-time diagnostic, the GlockApps free trial is the right entry point. For ongoing monitoring on a site with deliverability stakes (a paid product, a membership site, a WooCommerce store at scale), the paid tier earns its keep. For a five-page brochure site sending occasional contact-form notifications, the cost is hard to justify.

Self-hosted seed lists

The no-vendor-lock alternative is to build the panel yourself: create real mailbox accounts at the major providers, send a test message to all of them, and check each manually. Free, but operationally expensive.

The minimum useful panel covers Gmail (gmail.com plus a Workspace tenant if accessible), Outlook (outlook.com plus a Microsoft 365 tenant if accessible), Yahoo, AOL, Comcast, and ProtonMail or Tutanota for the privacy-mailbox category. Eight to ten addresses, all real, all checked manually or via IMAP.

A minimal Python script to automate the IMAP check:

import imaplib

# Placeholder credentials: substitute real seed-panel accounts.
ACCOUNTS = [
    ("gmail.com", "imap.gmail.com", "seed-address@gmail.com", "app-password-here"),
    ("outlook.com", "outlook.office365.com", "seed-address@outlook.com", "password-here"),
    # ... one tuple per panel address
]

# Folders to probe, in order. Spam-folder names vary by provider.
FOLDERS = ("INBOX", "[Gmail]/Spam", "Junk")

def check_placement(host, user, password, subject_marker):
    """Return 'inbox', 'spam', or 'missing' for a message with this subject."""
    m = imaplib.IMAP4_SSL(host)
    try:
        m.login(user, password)
        for folder in FOLDERS:
            try:
                # select() reports a nonexistent folder as a non-OK status
                typ, _ = m.select(f'"{folder}"', readonly=True)
                if typ != "OK":
                    continue
                typ, data = m.search(None, f'SUBJECT "{subject_marker}"')
            except imaplib.IMAP4.error:
                continue
            if typ == "OK" and data[0]:
                return "inbox" if folder == "INBOX" else "spam"
        return "missing"
    finally:
        # Always close the session, including on early return or login failure
        m.logout()

# Run after sending a test message with a unique subject marker
for provider, host, user, pw in ACCOUNTS:
    print(provider, check_placement(host, user, pw, "test-2026-06-12-1430"))

This is illustrative, not production code: it does not handle OAuth (most Gmail and Outlook accounts now require it), does not handle providers with non-IMAP retrieval, does not handle the case where the message is delayed past the check window, and does not enumerate every provider’s spam-folder name (Yahoo uses Bulk Mail; Comcast uses Junk E-mail; ProtonMail’s bridge uses Spam; Microsoft 365 sometimes uses Junk Email). A real implementation handles all of these. The script exists in this piece to indicate the shape of the work; an operator building this should expect to spend a half-day on the first version.
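
One of those gaps, the per-provider spam-folder names, can be sketched as a lookup table. The folder names below are the ones mentioned in this section; servers can rename or localise them, so a fuller implementation would confirm them against the server's own folder listing. The `SPAM_FOLDERS` table, the provider keys, and both helper names are hypothetical.

```python
# Spam-folder names as cited in this guide; treat them as starting guesses.
SPAM_FOLDERS = {
    "gmail.com": ["[Gmail]/Spam"],
    "outlook.com": ["Junk"],
    "yahoo.com": ["Bulk Mail"],
    "comcast.net": ["Junk E-mail"],
    "protonmail.com": ["Spam"],  # via the ProtonMail bridge
    "microsoft365": ["Junk Email", "Junk"],
}
# Tried in order when the provider is not in the table.
FALLBACK = ["Junk", "Spam", "Bulk Mail", "Junk E-mail", "Junk Email"]


def candidate_folders(provider):
    """Return INBOX plus the spam folders to probe for this provider."""
    return ["INBOX"] + SPAM_FOLDERS.get(provider, FALLBACK)


def quote_mailbox(name):
    """Quote a mailbox name for an IMAP command, escaping backslash and quote."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

Quoting matters because several of these names contain spaces, which an unquoted IMAP SELECT would split into separate arguments.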

When self-hosted seed lists are worth doing: when the paid panels’ results contradict actual user reports (some real users say the message lands in spam, GlockApps says inbox), or when the audience is concentrated on providers the paid panels do not cover well (a corporate Exchange-heavy customer base, a regional provider, an audience using Fastmail or HEY or another smaller mailbox service). For most WordPress operators, the paid panel covers the case at less operational cost.

The diagnostic methodology

The tools above are the inputs. The methodology is the sequence that turns them into a diagnosis.

Step 1: rule out content-level red flags. Run a mail-tester send first. If the score is below 8/10, the breakdown will name the specific issue (DKIM not signing, link in body to a blacklisted domain, image-to-text ratio out of balance). Fix the flagged issues before doing anything more elaborate. Most inbox-placement problems trace to a content or header configuration issue that mail-tester catches in 30 seconds.

Step 2: check reputation, if the volume justifies it. If the domain has the sending volume to populate Gmail Postmaster Tools and Microsoft SNDS, read the reputation trends. Falling domain reputation, rising user-complaint rate, or authentication-pass-rate dropping appreciably below 100% are all signals that point at structural issues (a misconfigured relay, a compromised account sending phishing as your domain, a marketing list that is generating “report spam” clicks). Gmail’s bulk sender guidelines treat a spam-complaint rate of 0.3% as a hard ceiling and recommend staying below 0.1%; a domain trending toward that ceiling needs intervention before placement testing is meaningful. Address those structural issues before measuring placement; placement will not improve while reputation is degrading.
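
The complaint-rate check in Step 2 is simple arithmetic. A minimal sketch, assuming the operator has complaint and delivered counts for a period (for example, tallied from feedback-loop reports or read off a dashboard); the 0.3% limit is the figure cited above, while the 0.1% warning level is a parameter default chosen here and the function name is hypothetical.

```python
def complaint_status(complaints, delivered, warn=0.001, limit=0.003):
    """Classify a spam-complaint rate against a warning level and a hard limit."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    rate = complaints / delivered
    if rate >= limit:
        return rate, "over limit"
    if rate >= warn:
        return rate, "warning"
    return rate, "ok"
```

For a low-volume site the result is noisy: one complaint against 200 delivered messages is already 0.5%, so the status is only meaningful over a period with enough mail.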

Step 3: measure baseline placement. Run one GlockApps panel test, or send to a self-hosted seed list. Record the per-provider placement (inbox / spam / missing). This is the baseline against which everything else is measured. Without a baseline, the operator has no way to tell whether a change made things better or worse.
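
Recording the baseline in a file makes later comparisons possible. The sketch below assumes results are keyed by provider with the three placement values used in this guide; the record layout, the file format, and the function names are illustrative choices, not a format any of the tools above produce.

```python
import json
from datetime import datetime, timezone

ALLOWED = {"inbox", "spam", "missing"}


def make_baseline(results, note=""):
    """Wrap per-provider placement results with a UTC timestamp and a note."""
    bad = {p: v for p, v in results.items() if v not in ALLOWED}
    if bad:
        raise ValueError(f"unexpected placement values: {bad}")
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
        "results": dict(results),
    }


def save_baseline(baseline, path):
    """Write the baseline to a JSON file so later re-tests can be compared to it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(baseline, fh, indent=2, sort_keys=True)
```

The note field is a good place to record the configuration in force (relay, From address, template version), since a baseline is only valid for the setup it was measured against.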

Step 4: change one factor, re-test. If placement is worse than the operator wants, change exactly one factor and re-test. Examples of single-factor changes: switch the sending IP (move from a shared IP to a dedicated IP, or vice versa), switch the From address (test whether placement differs between two candidate From addresses), strip a body element (remove the image header, remove the unsubscribe link, simplify the HTML structure), or change the sending time (test whether evening sends are filtered differently from morning sends). One change, one re-test. Multiple changes at once destroy the ability to attribute the result.
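
The comparison in Step 4 reduces to a per-provider diff between two runs. A minimal sketch, assuming both runs are mappings of provider to placement; the function name and the "not tested" label for a provider absent from one run are assumptions made for illustration.

```python
def compare_runs(baseline, retest):
    """Return {provider: (before, after)} for every provider whose placement changed."""
    changes = {}
    for provider in sorted(set(baseline) | set(retest)):
        before = baseline.get(provider, "not tested")
        after = retest.get(provider, "not tested")
        if before != after:
            changes[provider] = (before, after)
    return changes
```

An empty result means the single factor that was changed made no measurable difference on this panel, which is itself a useful finding before moving to the next factor.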

Step 5: re-test after every infrastructure change. Adding a new mailer plugin, switching SMTP providers, changing the DMARC policy from p=none to p=quarantine, moving to a new dedicated IP: all of these change the inputs to the filter. The baseline measurement is not portable across changes. Re-establish it.

The single-factor-at-a-time discipline is the difference between “we tried X and Y and it seemed to help” and a real diagnosis. Most operators skip the discipline because changing one thing at a time is slow and the user is complaining now. The fast answer is usually wrong; the methodical answer is usually right. For a WordPress site whose email matters (commerce, memberships, account recovery), the methodical answer is worth the extra week.

What inbox-placement testing cannot tell you

The tools and the methodology give the operator a measurement. They do not give the operator the underlying model of why the filter decided what it decided. The providers do not publish that model. Gmail’s filter weights, Microsoft’s reputation algorithm, and Yahoo’s heuristics are all proprietary, evolve continuously, and are not disclosed. The operator can measure outputs and infer inputs; the operator cannot read the source code.

This means three honest limits on what placement testing can deliver:

One bad day is not a trend. A single placement test that shows spam-folder landings does not necessarily mean the sender’s reputation has tanked. Providers re-evaluate continuously; a routine spam-filter update can move some senders into a lower-trust bucket for a day or two before re-stabilising. Operators who panic-react to a single test result waste effort. The signal is the trend across repeated tests, not the value of one test.

Test results do not generalise across audiences. A panel run that shows clean Gmail placement does not guarantee clean Gmail placement for a real user. Real users have read-rate signals, complaint history, and engagement patterns that train the filter for their specific mailbox. The seed panel can only measure the sender-side posture, not the recipient-side context. Two users with the same email service can receive the same message in different folders.

The measurement is point-in-time. Inbox placement is a feedback loop: today’s results train tomorrow’s reputation. A week of clean placement builds reputation; a week of spam landings degrades it. The operator who tests once and assumes the result holds for months is operating on stale data. For sites where placement matters, build the test into a routine: monthly at minimum, weekly during any active configuration change.
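
The cadence above can be written down as a small rule. This sketch treats "monthly" as 30 days and "weekly" as 7, which is an approximation chosen here; the function name is hypothetical.

```python
from datetime import date, timedelta


def next_test_due(last_test, config_change_active):
    """Next placement test date: weekly during active changes, else about monthly."""
    interval = timedelta(days=7) if config_change_active else timedelta(days=30)
    return last_test + interval
```

In practice the date would feed a calendar reminder or a scheduled job rather than be computed by hand.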

For the protocol layer this guide builds on, see the first-principles email reference. For the layer below this (what to do if the relay itself is failing), see the SMTP session log reference and Swaks for SMTP debugging. For the authentication layer that has to be solid before inbox-placement work is worth doing, see DNS setup for WordPress email and parsing DMARC aggregate reports.