Steelman analysis
Generated 2026-04-19T16:10:04.325744Z
Target intervention
Scale funding for interpretability and alignment research.
Operator tension
The operator's stated ideal is AI pointed at suffering reduction, with interpretability as infrastructure for every downstream bet --- but the camp_global_health and camp_animal_welfare cases-against cut at the operator's own 80K/EA overlay with its own knife. If you take GiveWell-tier cost-effectiveness seriously, the marginal billion to interpretability has a prior distribution over maybe-mattering, while the marginal billion to bednets or alt-protein has a measured DALY denominator and a numerator that includes 80 billion land animals the operator's frame already counts. The operator bets on substrates over surfaces --- but interpretability is only substrate if you assume the frontier capability trajectory is the binding constraint on future suffering reduction, and that is the exact accelerationist prior the 80K overlay is supposed to discipline.

The sharper discomfort: the camp_environmentalists case-against reads interpretability funding as the legitimacy layer that underwrites the datacenter build the operator is invested in through AI-disruption equities. The poker-brain process-over-outcome frame gets uncomfortable when 'process' means 'bet on the substrate' and the substrate bet happens to coincide with the operator's portfolio.
Both sides cite
- AI capability is accelerating along compute, data, and algorithmic axes.
- Amortized hardware and energy cost of flagship training runs has grown ~2.4x annually; GPT-4-class runs cost on the order of $40M-$80M (2023) and the next generation crossed $100M.
- Mental and neurological disorders are the leading cause of years-lived-with-disability (YLD) globally, accounting for roughly 15-16% of total YLDs; depression and anxiety dominate that burden.
- Age-standardized DALY rates vary more than 3x across regions; the highest burden is concentrated in Sub-Saharan Africa (driven by communicable disease and neonatal conditions) and the lowest in high-income East Asia.
- Training compute for frontier AI models has grown roughly 4-5x per year from 2010 through 2024, corresponding to a doubling time of about 5-6 months.
- The US currently leads China in frontier AI by roughly 6-18 months.
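The cited annual growth factor and doubling time are the same exponent in different units; a minimal arithmetic sketch of the conversion, assuming nothing beyond the cited 4-5x/year range (the function name is illustrative):

```python
import math

def doubling_time_months(annual_growth_factor: float) -> float:
    """Months for a quantity growing by `annual_growth_factor` per year to double."""
    return 12.0 / math.log2(annual_growth_factor)

# 4x/year implies 6.0 months; 5x/year implies ~5.2 months,
# matching the cited "doubling time of about 5-6 months".
for factor in (4.0, 5.0):
    print(f"{factor:.0f}x/year -> doubles every {doubling_time_months(factor):.1f} months")
```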
Case FOR
Capability is compounding at 4-5x/year on compute and halving every 8 months on algorithmic efficiency while the US lead is only 6-18 months --- interpretability is the single variable that decides whether the systems we are definitely building are steerable. Hundreds of millions to low billions annually is rounding error against $100M+ training runs, and alignment funding rides on near-zero friction across grid, public, and capex because it is labor and GPUs, not fabs and substations. If the lead-seeking strategy is coherent at all, it is only coherent paired with maximal interpretability spend; otherwise we ship a frontier we cannot read.
- Training compute for frontier AI models has grown roughly 4-5x per year from 2010 through 2024, corresponding to a doubling time of about 5-6 months.
- Amortized hardware and energy cost of flagship training runs has grown ~2.4x annually; GPT-4-class runs cost on the order of $40M-$80M (2023) and the next generation crossed $100M.
- Algorithmic progress roughly halves the compute required to reach a fixed langu…
- The US currently leads China in frontier AI by roughly 6-18 months.
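If the compute-growth and algorithmic-efficiency trends cited above compound independently, their exponential growth rates add, so their doubling times combine harmonically. A sketch under that independence assumption; the 5.5- and 8-month inputs are midpoints of the figures cited in the argument, not sourced constants:

```python
def combined_doubling_time(t_compute_months: float, t_algo_months: float) -> float:
    """Doubling time of effective compute when physical compute and
    algorithmic efficiency each double independently every t months.
    Exponential growth rates (1/t) add, so doubling times add harmonically."""
    return 1.0 / (1.0 / t_compute_months + 1.0 / t_algo_months)

# ~5.5-month compute doubling plus ~8-month efficiency doubling
# -> effective capability doubles roughly every 3.3 months
print(round(combined_doubling_time(5.5, 8.0), 2))
```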
A pause is not on the table politically --- compute doubles every 5-6 months and algorithmic efficiency doubles every 8 --- so the second-best world is one where interpretability funding scales fast enough to generate the auditable evidence that would make a pause legible if conditions warrant. Alignment research is the only intervention whose output (mechanistic understanding, evals, deception detection) directly shrinks the policy option space of 'build blind.' Low friction, high leverage on the exact bottleneck we care about.
Deployment-legibility obligations are unenforceable without a science of model inspection. Interpretability research is the upstream public good that makes audit, red-teaming, and contestability technically possible --- without it, 'AI governance' is paperwork over black boxes. Public funding of interpretability is the infrastructure a regulatory regime sits on top of; private labs will underfund it because the externality is public trust, not quarterly revenue.
Interpretability is the substrate bet. Every downstream suffering-reduction deployment --- drug discovery, mental-health triage, alt-protein --- requires that we can trust and steer the models doing the work. Self-hosting and sovereign-individual control of AI are impossible if nobody can read a weight; open-weights futures only have teeth when the weights are interpretable. Low friction across all five layers, direct action on the root-cause bottleneck. This is protocol-level, not app-level.
- AI capability is accelerating along compute, data, and algorithmic axes.
- Training compute for frontier AI models has grown roughly 4-5x per year from 2010 through 2024, corresponding to a doubling time of about 5-6 months.
- Amortized hardware and energy cost of flagship training runs has grown ~2.4x annually; GPT-4-class runs cost on the order of $40M-$80M (2023) and the next generation crossed $100M.
- Algorithmic progress roughly halves the compute required to reach a fixed langu…
AI deployed inside LMIC health pipelines (triage, diagnostic support, drug-discovery acceleration) only reduces suffering if clinicians, ministries, and WHO can trust the outputs. Interpretability is the prerequisite for adoption in high-stakes low-resource contexts where error modes are opaque and recourse is thin. Funding alignment research is funding the conditions under which AI-for-health can actually be deployed at scale without iatrogenic harm.
Mechanistic interpretability is foundational research with the same compounding-returns profile as basic biomedical science --- its products (features, circuits, steering tools) propagate across every downstream application for decades. Public funding is appropriate precisely because private labs will only fund the slice tied to near-term product safety. This is NIH-model investment in the tooling layer for every future AI-driven biomedical advance.
Workers cannot contest deployments they cannot inspect. Interpretability research is the precondition for meaningful labor oversight of AI in the workplace --- without it, 'human in the loop' is theater. Funding alignment expands the surface area on which workers, unions, and employee-resistance coalitions can make substantive claims about which deployments are acceptable.
Open weights without interpretability tools is a library of untranslated books. Alignment research --- especially the mechanistic-interp and evals portions --- is dual-use in the good sense: its outputs transfer to any open model and make distributed capability actually governable by its recipients. Public funding for interpretability breaks the closed labs' epistemic monopoly on 'we alone understand what we ship.'
Human moral standing requires that the instruments humans build remain instruments --- legible, accountable, under human judgment. Interpretability research is the technical expression of the theological claim that creatures must not worship what they cannot see into. Public investment here preserves the creator/creature distinction at the technical layer.
Case AGAINST
A billion dollars a year routed to interpretability is a billion dollars not routed to compute, fabs, or grid --- the actual bottlenecks. Alignment research has low leverage on the friction layers that matter and a dubious track record of producing deployable safety at the frontier. Worse, the field is a regulatory on-ramp: every interpretability result becomes an audit requirement that slows deployment. The brake disguised as a tool.
Alignment tractability comes from scaled deployment, not from interpretability papers. RLHF, red-team-at-scale, and real-world misuse telemetry have produced more safety than a decade of mech-interp. Shifting hundreds of millions into interpretability funds an academic subfield that is decoupled from the feedback loops where alignment actually gets solved, and it strengthens the coalition that wants to gate releases.
Interpretability research legitimates continued frontier scaling by providing the safety veneer that justifies the next datacenter build. The harm_water 0.9 and harm_extraction 0.9 scores on this intervention are naïve --- alignment funding does not sit outside the compute pipeline, it underwrites it. Every interpretability result that clears a deployment is downstream of fresh aquifer draw and fresh mine-site harm in the DRC and Inner Mongolia. Funding the safety case is funding the build.
- Microsoft and Google's self-reported 2023 water consumption rose roughly 20% ye…
- Hyperscale and AI-training datacenters withdraw millions of gallons per day per…
- China controls more than 80% of global rare-earth refining capacity and majorit…
- Rare-earth extraction concentrates ecological and labor-welfare harm at mine si…
- Thermoelectric power generation (coal, gas, nuclear) remains the largest catego…
In practice, alignment-research funding flows to closed frontier labs (Anthropic, OpenAI, DeepMind) and becomes the intellectual property case for why only those labs can be trusted to release. Interpretability-as-gatekeeping is the dominant implementation. Concentration increases, not decreases. The resulting 'safe deployment' norms ratchet against open release precisely because open models cannot be audited by the lab's proprietary stack.
Hundreds of millions to low billions annually at GiveWell-tier cost-effectiveness averts orders of magnitude more DALYs than the speculative, long-horizon, low-probability payoff of interpretability research. The marginal dollar to bednets, vaccines, or LMIC mental-health programs has a measurable suffering-averted ratio; the marginal dollar to alignment research has a prior distribution over maybe-mattering. Default allocation goes to the pipeline with the measured denominator.
- Under-5 child mortality halved between 2000 and the early 2020s, from ~76 to ~3…
- The global extreme-poverty rate ($2.15/day 2017-PPP) fell from ~44% of world po…
- Age-standardized DALY rates vary more than 3x across regions; the highest burden is concentrated in Sub-Saharan Africa (driven by communicable disease and neonatal conditions) and the lowest in high-income East Asia.
- Mental and neurological disorders are the leading cause of years-lived-with-disability (YLD) globally, accounting for roughly 15-16% of total YLDs; depression and anxiety dominate that burden.
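The 'measured denominator' argument is an expected-value comparison. A toy sketch in which every number is a placeholder assumption --- the $100/DALY cost, the 0.1% probability-of-mattering, and the conditional payoff are illustrative, not sourced:

```python
def dalys_averted(budget_usd: float, cost_per_daly_usd: float) -> float:
    """Point estimate: DALYs averted by spending `budget_usd` at a known cost-per-DALY."""
    return budget_usd / cost_per_daly_usd

# Illustrative placeholders only: $1B/year at an assumed $100/DALY
# (GiveWell-tier order of magnitude) vs. a speculative bet modeled as
# an assumed probability-of-mattering times an assumed conditional payoff.
measured = dalys_averted(1e9, 100.0)   # 10 million DALYs, point estimate
speculative = 0.001 * 5e9              # p = 0.1% of a 5-billion-DALY payoff
print(measured, speculative)
```

The argument's force lives in the denominator: the first term is a measured ratio with a point estimate, the second is a product of two priors that can each move by orders of magnitude.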
80+ billion land animals per year, 1-3 trillion aquatic --- alignment research funding does not touch this numerator. Every billion to interpretability is a billion not to alt-protein R&D, slaughter-throughput reduction, or welfare-standard enforcement. The suffering calculus is dominated by non-human numerator terms that interpretability is structurally indifferent to.
Alignment research operationally means 'make the model safer to deploy,' which translates to 'accelerate workforce absorption.' Interpretability does not fund retraining, transition support, or the structural replacement of role and meaning --- it funds the legitimacy case for the deployment that displaces. The welfare harm of displacement is not mitigated, it is oiled.
The US lead is 6-18 months and enterprise/government absorption already lags by years. Public funding for interpretability at scale will be captured by a policy coalition that wants pre-deployment audit regimes, which further widens the absorption gap precisely where national-security urgency is highest. The marginal dollar to mission-software integration and IC deployment beats the marginal dollar to academic interpretability on the national-advantage metric.
Alignment research is silent on the training-data consent violation. Funding interpretability makes the models the creators were not asked about safer to deploy --- it does not unwind the underlying rights violation at the data layer. It launders the harm by making the output more trustworthy while the input remains uncompensated.
Interpretability research produces increasingly sophisticated language for machine 'cognition,' 'values,' and 'deception' --- categories that erode the creator/creature distinction by attributing interiority to instruments. The technical vocabulary of alignment is precisely what blurs the line religious anthropology insists on. Funding it accelerates the philosophical category error regardless of intent.
Contested claims
DoD obligated AI-related contract spending rose substantially 2022-2025, driven by JWCC, Project Maven, and CDAO-managed pilots; precise totals are hampered by inconsistent AI tagging on contract line items.
- Artificial Intelligence and National Security (CRS Report R45178) (modeled_projection, weight 0.80)
locator: AI funding appendix; DoD budget rollups
- USASpending.gov federal contract awards (direct_measurement, weight 0.85)
locator: DoD AI-tagged obligations 2022-2025
- The Intercept coverage of Palantir contracts and DoD AI programs (journalistic_report, weight 0.55)
locator: Investigative pieces on DoD AI pilot failures and miscategorization
- Artificial Intelligence: DoD Needs Department-Wide Guidance to Inform Acquisitions (GAO-22-105834 and follow-ups) (direct_measurement, weight 0.75)
locator: Summary findings on acquisition-pace gaps
No other pure-play US defense-AI software vendor has matched Palantir's contract backlog or combatant-command integration depth; cloud-provider primes (AWS, Microsoft, Google, Oracle via JWCC) supply infrastructure, not mission-software integration.
- (weight 0.75)
locator: Vendor-landscape discussion
- Palantir Technologies Inc. Form 10-K Annual Report (FY 2024) (primary_testimony, weight 0.60)
locator: Competition section, Item 1
- The Intercept coverage of Palantir contracts and DoD AI programs (journalistic_report, weight 0.50)
locator: Coverage framing Palantir as over-sold relative to internal-tool alternatives
Credible 2030 forecasts for US datacenter share of electricity consumption diverge by nearly 2x --- from ~4.6% (IEA/EPRI conservative) to ~9.1% (Goldman Sachs, EPRI high scenario) --- reflecting genuine uncertainty, not measurement error.
- Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption (modeled_projection, weight 0.85)
locator: Scenario table: 4.6%-9.1% by 2030
- 2025/2026 Base Residual Auction Results (direct_measurement, weight 0.75)
locator: 2025/2026 BRA clearing results
- Generational growth: AI, data centers and the coming US power demand surge (modeled_projection, weight 0.70)
locator: Executive summary; 160% growth figure
- Electricity 2024 --- Analysis and Forecast to 2026 (modeled_projection, weight 0.80)
locator: Analysing Electricity Demand; data centres chapter
Frontier-lab and big-tech employees have episodically resisted DoD contracts (Google Maven 2018, Microsoft IVAS 2019, Microsoft/OpenAI IDF deployments 2024), producing temporary pauses but no sustained shift in vendor willingness.
- Google employee open letter opposing Project Maven (primary_testimony, weight 0.90)
locator: Open letter and subsequent Google announcement
- Microsoft employee open letter opposing HoloLens/IVAS contract (primary_testimony, weight 0.85)
locator: Employee open letter, February 2019
- Coverage of OpenAI and Microsoft AI use by Israeli military, 2024 (journalistic_report, weight 0.75)
locator: OpenAI military-use policy-change coverage, 2024
- Alex Karp public interviews and op-eds, 2023-2024 (primary_testimony, weight 0.50)
locator: Karp interviews dismissing employee resistance as inconsequential