Module 11
Chapter 11 · 2.5 h · 8 quiz items · pass at 80%
This module is central to IQCB Domain V (QEEG), 21% of the exam. The z-score is the unit a QEEG report trades in, and the candidate must understand both how the normative comparison is built and why uncorrected multiple comparisons produce spurious deviations. The quiz confirms the learner can interpret a z-score, choose a database appropriately, and apply correction for multiple comparisons.
A clean recording transformed into metrics is still just numbers from one person at one moment. Is 20 μV of frontal alpha high or low? Is 8.5 Hz alpha fast or slow? Without a comparison population, you cannot say. Normative databases supply that population, and the z-score is the unit in which the comparison is expressed. This chapter covers where the databases come from, how the z-score is built, the false-positive problem that the multiple-comparison structure of QEEG creates, the corrections that address it, how databases differ from one another, how norms change across the lifespan, and how medication breaks the comparison. Databases sit at the center of Domain V because nearly every interpretive error in clinical QEEG traces back to misreading a z-score, so the exam leans on this material hard.
The governing principle: a database is a tool, not an oracle. It tells you how unusual a measurement is relative to a reference sample. It does not tell you whether the measurement matters, whether it reflects pathology, or whether the person in front of you belongs to the population the sample represents. Hold that distinction and most of this chapter follows.
Brain activity varies enormously between healthy individuals. One person's resting alpha amplitude can be several times another's, both entirely normal. A single recording, however clean, is a measurement without a yardstick.
The single-case problem is concrete. A subject shows 8 Hz alpha. Without a reference, you cannot classify it. With an age-matched database, the classification is immediate and opposite depending on age: in an 8-year-old, 8 Hz alpha is age-appropriate and normal; in a 30-year-old, 8 Hz alpha is slowed (roughly two standard deviations below the expected 10 to 11 Hz peak). Same measurement, opposite reading, and the only thing that resolved it was the normative comparison. Every QEEG metric carries this property: the number means nothing until it is placed against a distribution of people like the client.
Databases also let you compare unlike metrics on a common scale. Power, frequency, coherence, and asymmetry live in different units. The z-score converts each to standard-deviation units, so you can ask "which of this client's findings is most deviant" across metrics that otherwise share no scale. That standardization is the database's second service, after age-appropriate context.
What goes into a normative database determines what "normal" means for everyone compared against it. Six construction choices matter most, and each is a place the exam can probe.
Sample size. Larger is better, and it matters most at the edges. A few hundred subjects can characterize the stable middle of the adult range adequately, but age bands at the extremes (infancy, advanced age) thin out fast, and a sparsely sampled band produces unstable mean and standard-deviation estimates. When a band has few subjects, the z-scores computed against it are themselves uncertain. Good databases report their per-band counts. Thin bands warrant interpretive caution.
Age-matching and age bands. Because EEG changes continuously across the lifespan, norms must be age-specific. Databases handle this either by binning subjects into age groups (and comparing the client to the matching bin) or by age-regression modeling (Section 11.3), which fits a smooth function of age and compares the client to the model's prediction for their exact age. Bin width matters: a one-year-old differs markedly from a three-year-old, so young children need narrow bands, while stable adult years tolerate wider ones. A broad band ("20 to 40 years") means the client is compared against a mean that includes people quite different in age. If your client is 23 and the band runs to 40, the band mean is not centered on them.
Exclusion criteria. The sample is screened to represent "normal," but screening is imperfect. Ideal exclusion removes diagnosed neurological and psychiatric conditions, head injury, and substance use. In practice, "screened" means "no diagnosed condition," not "verified healthy," so normative samples contain undiagnosed ADHD, subclinical anxiety, sleep problems, and the like, which shift what counts as normal. Looser exclusion produces a more permissive norm; stricter exclusion produces a cleaner but less representative one.
Recording conditions. The database was recorded with a specific protocol: condition (eyes closed, eyes open), duration, sampling rate, filters, and reference. Your recording must match it, because the metrics depend on all of these. A database built on five minutes of eyes-closed data with a linked-ears reference cannot validly score a two-minute eyes-open recording referenced to Cz. Mismatched protocol means the z-score is computed against the wrong distribution, and the deviation it reports is an artifact of the mismatch.
Reference electrode. Because EEG is reference-dependent, the reference used to build the database is part of its identity. Linked ears, average reference, and others each produce different voltages and therefore different norms. The database states its reference. Your re-referencing (Chapter 10) must produce data in that same reference frame. The exam treats reference-matching as non-negotiable, and the report should document the reference used.
Artifact standards. Database developers artifacted their normative recordings to their own criteria, and those criteria set the database's baseline. If the developers cleaned aggressively, the norm represents unusually clean data, and a moderately clean clinical recording can read as "abnormally high" in muscle-adjacent bands simply because it carries more residual artifact than the norm. Match the artifact rigor of the database as best you can, and recognize that you rarely know the developers' exact standards in detail.
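The protocol-matching requirement described above (recording conditions and reference) can be sketched as a check run before scoring. This is a minimal illustration, not any vendor's API: the `Protocol` class, its field names, and the example values (including the 256 Hz sampling rate) are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Protocol:
    """Hypothetical description of how a recording or a database was acquired."""
    condition: str         # e.g. "eyes_closed" or "eyes_open"
    reference: str         # e.g. "linked_ears", "average", "Cz"
    duration_min: float    # recording length in minutes
    sampling_rate_hz: int  # assumed value, for illustration only


def protocol_mismatches(recording: Protocol, database: Protocol) -> list[str]:
    """Return the names of the protocol fields that differ."""
    return [
        f.name
        for f in fields(Protocol)
        if getattr(recording, f.name) != getattr(database, f.name)
    ]


# The mismatch described in the text: a five-minute eyes-closed,
# linked-ears database versus a two-minute eyes-open recording referenced to Cz.
database = Protocol("eyes_closed", "linked_ears", 5.0, 256)
recording = Protocol("eyes_open", "Cz", 2.0, 256)

mismatches = protocol_mismatches(recording, database)
print(mismatches)  # ['condition', 'reference', 'duration_min']
```

Exact equality on duration is a simplification. In practice the question is whether the recording supplies enough clean data under the same condition, so a real check would likely use a tolerance or a minimum length.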
Several normative databases are in clinical use. They converge on broad resting-state spectral findings and diverge on high beta, connectivity, task states, and the lifespan edges. The exam expects familiarity with the major names and their distinguishing features. The Field Guide carries the full comparison, and Appendix E of this book tabulates them side by side.
NeuroGuide (Thatcher). The most established and most cited normative database in clinical QEEG, developed by Robert Thatcher. It covers the lifespan from infancy through old age with narrow overlapping age bands (one-year bands for children, wider bands for adults), primarily uses a linked-ears reference with re-referencing available, and provides absolute and relative power, peak frequencies, asymmetry, coherence, phase, and LORETA source norms. It is FDA 510(k) cleared as normalizing QEEG software, which gives it regulatory standing in contexts that require a cleared tool. Its sample is predominantly North American and modest in some age bands, and it was collected over an extended period during which methods evolved. NeuroGuide is the common reference point against which other databases are compared.
BrainDX. Part of the New York University lineage descending from the foundational work of E. Roy John and Leslie Prichep at the NYU Brain Research Laboratories, commercialized successively as Neurometrics, then NxLink (which held the FDA 510(k) clearance, K974748, issued 1998), and most recently as BrainDX. Its distinguishing feature is source-space normative comparison: it emphasizes LORETA-normed metrics in addition to surface measures, and it carries the John/Prichep discriminant-function tradition. Its core normative sample was collected decades ago with stringent neuropsychological screening. The discriminant functions carry the etiologic limitation discussed in Section 11.4. The IQCB blueprint lists BrainDX among the databases a candidate should recognize.
Neurofield. A database used in the neurofeedback and QEEG community, sensitive to recording protocol like all normative systems. The IQCB blueprint includes it among the databases to know. Treat protocol-matching with particular care, and consult its current documentation for sample composition and recording conditions, which evolve. [citation needed: Neurofield database sample-size and construction specifications]
EureKa (Key Institute). Associated with the Key Institute for Brain-Mind Research (the LORETA developers' institutional home), drawing on a European (Swiss) sample with its own recording conditions. As with any database built on a specific national sample and protocol, its norms are sound within that population and require attention to demographic and protocol match when applied elsewhere. The IQCB blueprint names it among recognized databases. [citation needed: EureKa database sample and construction specifications]
Pearson (NCS Pearson). A normative database associated with standardized clinical acquisition through NCS Pearson, a major test publisher, emphasizing standardized administration. Listed in the IQCB blueprint among databases a candidate should know. Consult current product documentation for sample and protocol specifications. [citation needed: Pearson/NCS Pearson QEEG normative database specifications]
The clinically useful generalization, supported by cross-database concordance studies in the Field Guide source: for resting-state spectral power in roughly the 1 to 20 Hz range, the major databases agree closely (correlations above 0.9 between systems are documented). A frontal theta excess flagged in one will almost certainly flag in another. Divergence concentrates in high beta and EMG-adjacent frequencies (where artifact handling differs), in task-state and ERP metrics (which only some databases carry), and at the lifespan edges (infancy and advanced age, where coverage varies). The practical recommendation: pick one primary database, learn its characteristics, and use a second for cross-validation on complex cases, trusting findings that survive both.
Binning subjects into age groups is the simplest way to age-match, but it has a cost: the client is compared to a group mean that may not be centered on their exact age, and bin boundaries create discontinuities. Age-regression modeling replaces bins with a smooth function of age, so the client is compared to the model's prediction for their precise age. The statistical sophistication of this modeling has advanced through three generations, and the exam asks why the newer methods improve on the older.
Polynomial regression (first generation). The classic approach, used in the traditional NeuroGuide implementation, models each EEG feature as a polynomial function of age (cubic or quartic). It is simple and interpretable and works well through the middle of the age range. Its weakness is that a polynomial is a fixed global shape: it cannot independently follow the rapid nonlinear changes of early childhood and the slow drift of adulthood without overfitting one or underfitting the other, and the curve becomes unstable at the youngest and oldest ages (endpoint instability), inflating z-scores exactly where samples are thinnest. Thatcher's implementation mitigates this by using overlapping age bins rather than pure polynomial fitting.
Generalized additive models (GAMs, second generation). GAMs replace the fixed polynomial with smooth, data-adaptive spline functions whose flexibility is penalized and selected by cross-validation. This lets the model follow developmental growth spurts without the analyst pre-specifying their timing, and it removes the binning discontinuities. Modern sex-stratified databases use GAM spline smoothing to model age-dependent power separately by sex, which both follows the developmental curve and tightens the normative distribution.
GAMLSS (third generation). Generalized Additive Models for Location, Scale, and Shape extend GAMs by modeling not just the mean as a function of age but also the variance, skewness, and kurtosis. This matters because the shape of the distribution, not only its center, changes with age: pediatric power distributions are right-skewed during growth spurts, and aging produces heavy-tailed distributions as trajectories diverge. By modeling distributional shape at every age, GAMLSS produces z-scores that are more accurate at the extremes of the age range, where clinical decisions are most consequential and where a Gaussian assumption fails most. As an illustration, a child flagged at z = +2.5 under a Gaussian assumption might fall nearer z = +2.0 under a GAMLSS model that accounts for the wider, skewed distribution expected at that age, shifting the finding from "clearly abnormal" to "monitor and retest." This connects directly to the fat-tails problem in Section 11.5: GAMLSS is one principled answer to it.
The z-score is the database's currency. Its definition is simple. Its interpretation is where practitioners go wrong.
Computation. The z-score is the observed value minus the database mean, divided by the database standard deviation, for that metric at that site for that age (and sex, when stratified):
z = (observed value − mean) / SD
It expresses the measurement in standard-deviation units. The mean and SD come from the normative sample (or the age-regression model's prediction and residual spread at the client's age).
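As a minimal sketch, the formula can be written as a single function. The norm values below (mean 10.5 Hz, SD 1.25 Hz for adult alpha peak frequency) are hypothetical numbers chosen to reproduce the 8 Hz example from earlier in the chapter. They are not taken from any real database.

```python
def z_score(observed: float, norm_mean: float, norm_sd: float) -> float:
    """Express an observed value in standard-deviation units of the norm."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (observed - norm_mean) / norm_sd


# Hypothetical adult norm for alpha peak frequency (illustrative values only).
adult_mean_hz = 10.5
adult_sd_hz = 1.25

z_adult = z_score(8.0, adult_mean_hz, adult_sd_hz)
print(z_adult)  # -2.0
```

The same 8 Hz observation scored against a pediatric mean near 8 Hz would give a z-score near zero, which is the age-matching point in numerical form.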
The normative assumption. The z-score's interpretation as a percentile rests on the metric being approximately normally distributed after transformation. Under a true normal distribution, z = +1 is the 84th percentile, z = +2 the 98th, z = −2 the 2nd, and so on. This is why metrics are transformed before z-scoring (Chapter 10): absolute power is log-transformed (it is lognormal raw), relative power needs a logit transform (it is bounded 0 to 1), coherence needs a Fisher z-transform (also bounded, with ceiling effects), and phase requires circular statistics entirely (its mean and SD are undefined in ordinary Euclidean terms). If the wrong transform (or no transform) is applied, the z-score is computed on a distribution that violates its own assumption, and the tails misbehave. When a report shows a coherence z-score of +3.0, the right question is whether a Fisher z-transform was applied first. If not, the number is built on bounded data scored with unbounded statistics.
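The transforms named above, and the percentile reading of a z-score, can be sketched with the Python standard library. The function names are hypothetical, and the choice of base-10 logarithm is an assumption (the text says only "log-transformed"). Phase is omitted because it needs circular statistics rather than a pointwise transform.

```python
import math
from statistics import NormalDist


def log_power(absolute_power: float) -> float:
    """Log-transform absolute power (base 10 assumed here); input must be positive."""
    return math.log10(absolute_power)


def logit(relative_power: float) -> float:
    """Map a proportion strictly between 0 and 1 onto the whole real line."""
    return math.log(relative_power / (1.0 - relative_power))


def fisher_z(value: float) -> float:
    """Fisher z-transform (inverse hyperbolic tangent); input strictly inside (-1, 1)."""
    return math.atanh(value)


def gaussian_percentile(z: float) -> float:
    """Percentile implied by a z-score IF the transformed metric is truly normal."""
    return 100.0 * NormalDist().cdf(z)


for z in (1.0, 2.0, -2.0):
    print(z, round(gaussian_percentile(z), 1))  # 84.1, 97.7, 2.3
```

The percentile function is only as trustworthy as the normality of the transformed metric. Applied to an untransformed bounded value, it still returns a number, which is exactly the failure the paragraph warns about.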
Statistical versus clinical significance. This distinction is the most-tested idea in the chapter and the most-confused in practice. A z-score beyond roughly ±1.96 is "statistically significant" at p < 0.05, meaning the measurement falls in the outer 5 percent of the reference distribution. That is all it means. It does not mean dysfunction, does not mean the pattern causes the client's symptoms, and does not mean treatment is indicated. Two examples make the gap concrete. Alpha frequency of 10.8 Hz might score z = +2.1, statistically elevated but clinically trivial (half a hertz above expected, meaningless). Frontal theta at z = +1.8 falls just short of the p < 0.05 threshold, yet combined with attention symptoms and performance deficits it is clinically relevant. Statistical thresholds are guidelines for where to look, not verdicts. Clinical significance requires evidence that the pattern interferes with what the client needs to do, which comes from symptoms, performance testing, and history, not from the z-score alone.
QEEG runs not one comparison but hundreds, and that structure manufactures false positives. This is the statistical heart of Domain V.
The arithmetic. A standard 19-electrode montage, scored across multiple frequency bands (delta, theta, alpha, beta, often sub-bands), on multiple metrics (absolute power, relative power, coherence between electrode pairs, asymmetry, ratios), in two conditions (eyes open and eyes closed), easily produces 500 to 2,000 individual statistical comparisons in a single analysis. At a p < 0.05 threshold, about 5 percent of comparisons exceed the threshold by chance alone in a person with no genuine abnormality. A thousand comparisons yield about 50 "significant" findings by chance. So when a report flags 15 to 20 metrics as deviant, several of them are likely statistical noise.
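The arithmetic above can be reproduced in a few lines. The particular breakdown of comparisons (four bands, two per-site metrics, all electrode pairs for coherence, eight homologous pairs for asymmetry) is an assumption made for illustration. Real platforms count differently.

```python
from math import comb


def count_comparisons(
    n_channels: int = 19,
    n_bands: int = 4,             # delta, theta, alpha, beta (no sub-bands)
    n_site_metrics: int = 2,      # absolute and relative power
    n_homologous_pairs: int = 8,  # left/right pairs used for asymmetry
    n_conditions: int = 2,        # eyes open and eyes closed
) -> int:
    """Count comparisons under one hypothetical breakdown of a QEEG report."""
    per_site = n_channels * n_bands * n_site_metrics
    coherence = comb(n_channels, 2) * n_bands  # every electrode pair, every band
    asymmetry = n_homologous_pairs * n_bands
    return n_conditions * (per_site + coherence + asymmetry)


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of chance exceedances; holds even when tests are correlated."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


total = count_comparisons()
print(total)  # 1736
print(expected_false_positives(1000))
print(prob_at_least_one(19))
```

The expected count does not depend on independence, but the "at least one" probability does. Because QEEG metrics are correlated, the second function is an idealization rather than a prediction.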
Why it matters clinically. A flagged z-score on a colorful map reads, visually, like a problem. But in a montage producing hundreds of comparisons, a single isolated electrode at z = +3.0 with nothing else abnormal is more likely a false positive or an artifact than a real finding. The multiple-comparison structure means that isolated extreme values are exactly what chance produces. The error the exam targets is treating each flagged metric as an independent discovery rather than recognizing that the report is a haystack in which a few needles appear by chance.
The fat-tails compounding. The false-positive rate is worse than the nominal 5 percent because the Gaussian assumption underlying the z-score does not hold as well as the field assumed. EEG power distributions have heavier-than-Gaussian tails even after log transformation. The practical consequence is that a database claiming 5 percent out-of-range at |z| > 2 flags 6 to 9 percent of neurotypical subjects as abnormal, and at the |z| > 3 threshold, where you expect about 1 in 370, the rate is closer to 1 in 30. A z of +3.5 feels extreme and is rare under a perfect Gaussian, but in a fat-tailed distribution it is far less rare than the percentile table suggests. This does not invalidate z-scores. It sharpens the rule that patterns beat points. Weight |z| > 2.5 more heavily than |z| > 2.0, because the gap between those thresholds is where fat tails inflate false positives most.
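The qualitative effect can be seen in a toy simulation. The sketch below compares the Gaussian tail rate with the rate observed when z-scores are computed on a heavy-tailed sample. The Student-t distribution with 5 degrees of freedom is an arbitrary stand-in chosen for illustration. It is not a model of EEG power and is not calibrated to the rates quoted above.

```python
import random
from statistics import NormalDist


def gaussian_tail_rate(threshold: float) -> float:
    """Two-sided probability that |z| exceeds the threshold under a perfect Gaussian."""
    return 2.0 * (1.0 - NormalDist().cdf(threshold))


def student_t_sample(df: int, rng: random.Random) -> float:
    """Draw one Student-t variate as normal / sqrt(chi-square / df)."""
    chi_square = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return rng.gauss(0.0, 1.0) / (chi_square / df) ** 0.5


def simulated_tail_rate(threshold: float, df: int = 5,
                        n: int = 100_000, seed: int = 0) -> float:
    """Fraction of heavy-tailed values with |z| above threshold after z-scoring."""
    rng = random.Random(seed)
    values = [student_t_sample(df, rng) for _ in range(n)]
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(abs((v - mean) / sd) > threshold for v in values) / n


gauss_rate = gaussian_tail_rate(3.0)    # about 0.0027, i.e. roughly 1 in 370
heavy_rate = simulated_tail_rate(3.0)   # noticeably larger for the toy distribution
print(gauss_rate, heavy_rate)
```

The z-scoring step uses the sample's own mean and SD, so the excess at |z| > 3 comes from tail shape alone, not from a mis-estimated spread.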
If hundreds of comparisons inflate false positives, the statistical answer is to correct the threshold. Two corrections appear on the exam, and they differ in what they control and how conservative they are.
Bonferroni correction. The simplest and most conservative. It controls the family-wise error rate (the probability of even one false positive across the whole set of comparisons) by dividing the threshold by the number of comparisons: with 1,000 tests, the per-test threshold becomes 0.05 / 1,000 = 0.00005. This crushes false positives, but at a steep cost in false negatives: with so many comparisons, the corrected threshold is so stringent that genuine moderate deviations fail to reach it. Bonferroni does not require the comparisons to be independent, but it corrects as if every one of them were a separate test. QEEG metrics are positively correlated (adjacent electrodes and overlapping bands move together), so the effective number of independent comparisons is smaller than the nominal count and the correction overshoots, making it doubly conservative. It is rarely used in routine clinical QEEG for these reasons, though the exam expects you to know what it does and why it is conservative.
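To make the stringency concrete, the sketch below converts the Bonferroni per-test threshold into the two-sided z-score a finding would need to reach, assuming a Gaussian reference distribution. The function names are hypothetical.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def two_sided_z_threshold(alpha: float) -> float:
    """|z| needed for two-sided significance at alpha under a Gaussian."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


per_test = bonferroni_alpha(0.05, 1000)
uncorrected_z = two_sided_z_threshold(0.05)     # about 1.96
corrected_z = two_sided_z_threshold(per_test)   # a little above 4
print(per_test, round(uncorrected_z, 2), corrected_z)
```

With 1,000 comparisons the required deviation moves from about 2 standard deviations to a little over 4, which is why moderate but genuine deviations do not survive the correction.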
False Discovery Rate (FDR). The Benjamini-Hochberg procedure controls a different and more clinically sensible quantity: the proportion of false positives among the findings declared significant, rather than the probability of any false positive at all. It ranks the p-values and applies a graduated threshold that tolerates a controlled fraction of false discoveries (commonly 5 percent of the significant set). FDR is less conservative than Bonferroni, retains more real findings, and is more appropriate for the exploratory, many-comparison structure of QEEG. Some software platforms implement it. The exam-level contrast: Bonferroni controls the chance of any false positive and is the more conservative; FDR controls the rate of false positives among significant results and is the better fit for QEEG.
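The ranking step can be sketched in pure Python. This is a minimal implementation of the standard Benjamini-Hochberg step-up rule alongside Bonferroni for contrast. The ten p-values are made up for illustration, and this does not represent how any particular QEEG platform implements correction.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Flag findings as significant while controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        # Graduated threshold: the k-th smallest p-value is compared to (k / m) * q.
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    significant = [False] * m
    for idx in order[:largest_k]:  # everything up to the largest passing rank
        significant[idx] = True
    return significant


def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag findings as significant while controlling the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values from ten comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

fdr_flags = benjamini_hochberg(p_values)
bonf_flags = bonferroni(p_values)
print(sum(fdr_flags), sum(bonf_flags))  # 2 1
```

On this example FDR retains two findings where Bonferroni retains one, and six of the seven comparisons that were nominally significant at p < 0.05 are dropped by both.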
The clinical practice that substitutes for both. Formal correction is one tool. Convergent-pattern reasoning is the one practitioners actually rely on. A finding earns confidence when multiple adjacent electrodes show consistent deviation that makes physiological sense, when multiple metrics converge (elevated theta power and slowed peak frequency and reduced coherence all telling the same story), when the pattern appears in both eyes-closed and eyes-open conditions, and when it correlates with symptoms. A single electrode with an extreme z-score and no corroboration is discounted regardless of how many standard deviations it spans. Pattern over point is the clinical form of multiple-comparison control, the same rigor enforced by hand rather than by formula.
Run the same recording through two databases and the z-scores differ, sometimes enough to change the call. Frontal theta might score z = +2.3 (flagged) in one database and z = +1.2 (normal range) in another. The exam expects you to know why, and that neither answer is "wrong."
Sources of divergence. Two categories. First, sample differences: the databases drew from different populations with different means, standard deviations, and artifact-handling standards, so the distribution the client is scored against differs. Second, method differences: age-stratification approach (bins versus regression), frequency-band definitions (where exactly theta ends and alpha begins), filtering, transforms, and reference scheme all differ between platforms. A divergence is not a contradiction. It reflects two different reference contexts.
Reconciling disagreement. Three workable strategies. Use one primary database consistently and learn its characteristics, which suits routine clinical work. Cross-validate on complex cases and trust findings that survive both databases while discounting single-database findings, which suits research-grade or high-stakes interpretation (and forensic work, Chapter 18). Or, when databases seem to mislead, step outside normative comparison entirely toward intra-individual change and qualitative pattern recognition, treating the database as one information source rather than the arbiter. The forensic relevance is direct: an adversarial expert can run your recording through a different database to manufacture a different number, so a forensic finding should be one that holds across databases.
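The cross-validation strategy (trust what survives both databases) reduces to a simple filter. The sketch below is illustrative only: the metric labels, the z-values other than the frontal theta pair quoted above, and the |z| ≥ 2.0 cutoff are assumptions.

```python
def surviving_findings(
    z_primary: dict[str, float],
    z_secondary: dict[str, float],
    threshold: float = 2.0,
) -> list[str]:
    """Metrics flagged in BOTH databases with the same direction of deviation."""
    survivors = []
    for metric in sorted(z_primary.keys() & z_secondary.keys()):
        a, b = z_primary[metric], z_secondary[metric]
        same_direction = (a > 0) == (b > 0)
        if abs(a) >= threshold and abs(b) >= threshold and same_direction:
            survivors.append(metric)
    return survivors


# Hypothetical z-scores for one recording scored against two databases.
database_a = {"Fz_theta_abs": 2.3, "O1_alpha_abs": -2.6, "C3_beta_abs": 2.1}
database_b = {"Fz_theta_abs": 1.2, "O1_alpha_abs": -2.4, "C3_beta_abs": -0.4}

robust = surviving_findings(database_a, database_b)
print(robust)  # ['O1_alpha_abs']
```

A finding dropped by this filter is not thereby shown to be false. It is a single-database finding and should be weighted accordingly.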
EEG changes across the entire lifespan, and norms must track those changes or they will flag normal development and normal aging as abnormal. The full developmental and aging trajectory belongs to the baseline-across-time companion volume. What the exam tests, and what matters for database comparison, is how the norm itself moves and where it is least stable.
Infants and toddlers. The maturing brain moves from discontinuous to continuous activity in the neonatal period, and the posterior dominant rhythm emerges and accelerates through the first years. Slow activity (delta, theta) dominates early and recedes with age. Norms here require narrow age bands because change is rapid, and sample sizes are thin, so z-scores at these ages are the least stable in any database. Frontal slowing that would alarm in an adult is expected under roughly age six.
School age. The posterior dominant rhythm follows a normative trajectory toward the adult alpha frequency, and theta persists at levels that would be abnormal in adults but are age-appropriate here. This is the band where the 8 Hz alpha example from Section 11.1 lives: normal in a child, slowed in an adult. Age-matching is mandatory and the bands must stay reasonably narrow.
Adolescence. Cortical pruning and myelination proceed, alpha continues to mature and stabilize, and beta shifts. The QEEG correlates of this maturation make adolescent recordings developmentally unstable: a 13-year-old's map should be treated as provisional, because puberty shifts alpha and beta, and apparent session-to-session change reflects maturation as often as anything clinical. Repeat assessment after puberty is prudent before committing to an extended interpretation.
Young adult. The stable plateau, and the reference center of the database. Bands are settled, and the norms are most reliable here because samples are largest and change is slowest. This is the age range against which the database is best calibrated.
Middle and older adult. A gradual trajectory of alpha slowing and other age-related shifts begins. The norm must distinguish normal aging from early neurodegeneration, which is hard, because the trajectories overlap and the geriatric medication burden confounds the recording. Narrow age bands become important again after about 60, sample sizes thin once more, and z-scores at advanced ages are correspondingly less stable. The fat-tails and distributional-shape problems of Section 11.5 are most acute at both age extremes, which is exactly where GAMLSS-style modeling (Section 11.4) earns its keep.
The single operational rule across the lifespan: age-appropriate norms are non-negotiable, and the bands must be narrow where change is fast (early childhood, advanced age) and may be wider where it is slow (stable adulthood). Apply an adult norm to a child, or a young-adult norm to an 80-year-old, and the database will manufacture deviations that are nothing but development and aging.
Normative databases are built, by design, from medication-free subjects. The moment your client takes a drug that affects EEG, you are comparing a medicated brain to an unmedicated norm, and the mismatch generates deviations that are pharmacology, not pathology. This is one of the most common interpretive errors in clinical QEEG, and the exam tests whether you catch it.
The mechanism of the error. A drug that raises beta, lowers theta, or slows the background shifts the client's metrics away from the unmedicated norm. The database, knowing nothing about the medication, scores that shift as a deviation. A client on a benzodiazepine shows elevated beta (the drug's signature). Compared to unmedicated norms, this flags as "excessive beta," but it is the expected medication effect, not an anxiety phenotype. A client successfully treated for ADHD with a stimulant shows reduced frontal theta (the drug's effect); their z-scores may now read normal, raising the question of whether "normal on medication" is the same as inherently normal, and what the map would show off the drug. In both directions, the database reports the medication as if it were the brain.
Drug classes that most disrupt the comparison. The full drug-by-drug treatment is Chapter 14 and the companion medication volume; the database-relevant headlines: benzodiazepines produce fast beta spindles and slow activity at once, the single most diagnostically disruptive class; stimulants suppress the attention-domain slow activity in responders, masking phenotypes; sedating antipsychotics produce diffuse slowing that mimics encephalopathic or hypoarousal patterns; and anticonvulsants variably suppress fast activity. Each makes the medicated brain a poor match for the unmedicated norm.
What to do about it. The minimum is documentation: record every current medication with dose and timing before the recording, because you cannot correct for what you did not capture. For trait phenotyping, an off-medication baseline (subject to safe and appropriate washout, by drug class, coordinated with the prescriber) is preferable, because it removes the confound at the source. When an off-medication recording is not feasible, the finding must be reported with the confound named explicitly: "elevated frontal theta observed, but the recording occurred on quetiapine, which independently produces frontal slow activity, so phenotype confidence is reduced" is a defensible statement. "Elevated frontal theta detected" alone is not. The database comparison is only as valid as the match between the client's medication state and the norm's medication-free assumption.
The z-score is a single number carrying a chain of assumptions: that the recording was clean, that the montage and reference matched the database, that the right transform made the metric approximately Gaussian, that the client's age (and sex) was modeled well, that the client belongs to the population the sample represents, and that no medication is masquerading as a brain pattern. When the chain holds, the z-score is a precise statement of how unusual a measurement is. When any link breaks, the z-score is precise and wrong.
The exam rewards the candidate who treats z-scores as evidence to be weighed rather than verdicts to be read off. Statistical deviation is where to look; clinical significance is whether it matters; the multiple-comparison structure means isolated extremes are usually noise; the corrections (Bonferroni, FDR) and the convergent-pattern habit both exist to separate signal from haystack; databases disagree for principled reasons; norms must be age-appropriate across the lifespan; and medication breaks the comparison unless documented and accounted for. Carry the numbers, and these cautions, into the next chapter, where the maps and reports turn z-scores into clinical interpretation.