{"id":1658,"date":"2025-12-15T13:45:24","date_gmt":"2025-12-15T13:45:24","guid":{"rendered":"http:\/\/ayercut.com\/index.php\/2025\/12\/15\/new-technologies-like-ai-come-with-big-claims-borrowing-the-scientific-concept-of-validity-can-help-cut-through-the-hype\/"},"modified":"2025-12-15T13:45:24","modified_gmt":"2025-12-15T13:45:24","slug":"new-technologies-like-ai-come-with-big-claims-borrowing-the-scientific-concept-of-validity-can-help-cut-through-the-hype","status":"publish","type":"post","link":"http:\/\/ayercut.com\/index.php\/2025\/12\/15\/new-technologies-like-ai-come-with-big-claims-borrowing-the-scientific-concept-of-validity-can-help-cut-through-the-hype\/","title":{"rendered":"New technologies like AI come with big claims \u2013 borrowing the scientific concept of validity can help cut through the hype"},"content":{"rendered":"<figure><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" class=\"lazyload\" data-src=\"https:\/\/images.theconversation.com\/files\/700631\/original\/file-20251105-56-z6vgyb.jpg?ixlib=rb-4.1.0&amp;rect=0%2C37%2C2092%2C1394&amp;q=45&amp;auto=format&amp;w=1050&amp;h=700&amp;fit=crop\"><figcaption><span class=\"caption\">Closely examining the claims companies make about a product can help you separate hype from reality.<\/span> <span class=\"attribution\"><a class=\"source\" href=\"https:\/\/www.gettyimages.com\/detail\/photo\/magnifying-glass-data-royalty-free-image\/2164377469\">Flavio Coelho\/Moment via Getty Images<\/a><\/span><\/figcaption><\/figure>\n<p>Technological innovations can seem relentless. 
In computing, some have proclaimed that \u201ca <a href=\"https:\/\/doi.org\/10.1038\/d41586-023-01273-w\">year in machine learning<\/a> is a century in any other field.\u201d But how do you know whether those advancements are hype or reality?<\/p>\n<p>Failures quickly multiply when there\u2019s a deluge of new technology, especially when these developments haven\u2019t been properly tested or fully understood. Even technological innovations from trusted labs and organizations sometimes result in spectacular failures. Think of <a href=\"https:\/\/www.statnews.com\/2017\/09\/05\/watson-ibm-cancer\/\">IBM Watson<\/a>, an AI program the company hailed as a revolutionary tool for cancer treatment in 2011. However, rather than evaluating the tool based on patient outcomes, IBM used less relevant measures \u2013 possibly even <a href=\"https:\/\/doi.org\/10.1007\/s00146-020-00945-9\">irrelevant ones<\/a>, such as expert ratings. As a result, IBM Watson not only failed to offer doctors reliable and innovative treatment recommendations, but also <a href=\"https:\/\/www.statnews.com\/2018\/07\/25\/ibm-watson-recommended-unsafe-incorrect-treatments\/\">suggested harmful ones<\/a>.<\/p>\n<p>When <a href=\"https:\/\/www.britannica.com\/technology\/ChatGPT\">ChatGPT was released<\/a> in November 2022, interest in AI <a href=\"https:\/\/trends.google.com\/trends\/explore?date=2022-01-01%202025-10-18&amp;geo=US&amp;q=AI&amp;hl=en\">expanded rapidly<\/a> across industry <a href=\"https:\/\/doi.org\/10.1038\/d41586-023-02980-0\">and in science<\/a> alongside ballooning <a href=\"https:\/\/theconversation.com\/is-ai-dominance-inevitable-a-technology-ethicist-says-no-actually-240088\">claims of its efficacy<\/a>. 
But as the vast majority of companies are seeing their <a href=\"https:\/\/futurism.com\/ai-agents-failing-companies\">attempts at incorporating generative AI fail<\/a>, questions about whether the technology does what developers promised are coming to the fore.<\/p>\n<figure class=\"align-center zoomable\"> <a href=\"https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=1000&amp;fit=clip\"><img decoding=\"async\" alt=\"Black screen with IBM Watson logo on a Jeopardy stand with $1,200 stood between two contestants with $0 each\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" sizes=\"(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px\" class=\"lazyload\" data-src=\"https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=754&amp;fit=clip\" data-srcset=\"https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=600&amp;h=401&amp;fit=crop&amp;dpr=1 600w, https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=30&amp;auto=format&amp;w=600&amp;h=401&amp;fit=crop&amp;dpr=2 1200w, https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=15&amp;auto=format&amp;w=600&amp;h=401&amp;fit=crop&amp;dpr=3 1800w, https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=754&amp;h=504&amp;fit=crop&amp;dpr=1 754w, https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=30&amp;auto=format&amp;w=754&amp;h=504&amp;fit=crop&amp;dpr=2 1508w, 
https:\/\/images.theconversation.com\/files\/700626\/original\/file-20251105-56-sq8c99.jpg?ixlib=rb-4.1.0&amp;q=15&amp;auto=format&amp;w=754&amp;h=504&amp;fit=crop&amp;dpr=3 2262w\"><\/a><figcaption> <span class=\"caption\">IBM Watson wowed on Jeopardy, but not in the clinic.<\/span> <span class=\"attribution\"><a class=\"source\" href=\"https:\/\/newsroom.ap.org\/detail\/Jeopardy-TopWinners\/c65add11f4d146738a4d63a8883fc2e8\/photo\">AP Photo\/Seth Wenig<\/a><\/span> <\/figcaption><\/figure>\n<p>In a world of rapid technological change, a pressing question arises: How can people determine whether a new technological marvel genuinely works and is safe to use? <\/p>\n<p>Borrowing from the language of science, this question is really <a href=\"https:\/\/misq.umn.edu\/misq\/article\/doi\/10.25300\/MISQ\/2024\/18064\/3273\/Validity-in-Design-Science\">about validity<\/a> \u2013 that is, the soundness, trustworthiness and dependability of a claim. Validity is the <a href=\"http:\/\/doi.org\/10.17705\/1jais.00594\">ultimate verdict<\/a> of whether a scientific claim accurately reflects reality. Think of it as quality control for science: It helps researchers know whether a medication really cures a disease, a health-tracking app truly improves fitness, or a model of a black hole genuinely describes how it behaves in space.<\/p>\n<p>How to evaluate validity for new technologies and innovations has been unclear, in part because science has mostly focused on validating claims about the natural world. <\/p>\n<p>In our <a href=\"https:\/\/scholar.google.com\/citations?user=_qBDX98AAAAJ&amp;hl=en\">work as researchers<\/a> <a href=\"https:\/\/scholar.google.com\/citations?user=t0ysd44AAAAJ&amp;hl=en\">who study how to<\/a> evaluate science across disciplines, we developed a <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2503.09466\">framework to assess the validity<\/a> of any design, be it a new technology or policy. 
We believe setting clear and consistent standards for validity and learning how to assess it can empower people to make informed decisions about technology \u2013 and determine whether a new technology will truly deliver on its promise.<\/p>\n<h2>Validity is the bedrock of knowledge<\/h2>\n<p>Historically, validity was primarily concerned with ensuring the precision of scientific measurements, such as whether a thermometer correctly measures temperature or a <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2509.09723\">psychological test accurately assesses anxiety<\/a>. Over time, it became clear that there is more than just one kind of validity. <\/p>\n<p>Different scientific fields <a href=\"https:\/\/uk.sagepub.com\/en-gb\/eur\/validity-in-educational-and-psychological-assessment\/book239005\">have their own ways of evaluating validity<\/a>. Engineers test new designs against safety and performance standards. Medical researchers use controlled experiments to verify treatments are more effective than existing options. <\/p>\n<p>Researchers across fields use <a href=\"https:\/\/people.tamu.edu\/%7Ew-arthur\/204\/15A\/PSYC%20204%2015A%20lecture%20notes,%20Topic%2002,%20Research%20Validity.pdf\">different types of validity<\/a>, depending on the kind of claim they\u2019re making. <\/p>\n<p>Internal validity asks whether the relationship between two variables is truly causal. A medical researcher, for instance, might run a <a href=\"https:\/\/theconversation.com\/what-is-a-clinical-trial-a-health-policy-expert-explains-137221\">randomized controlled trial<\/a> to be sure that a new drug led patients to recover rather than some other factor such as the placebo effect. <\/p>\n<p>External validity is about generalization \u2013 whether those results would still hold outside the lab or in a broader or different population. 
An example of low external validity is that treatments that show promise in mice <a href=\"https:\/\/theconversation.com\/expanding-alzheimers-research-with-primates-could-overcome-the-problem-with-treatments-that-show-promise-in-mice-but-dont-help-humans-188207\">don\u2019t always translate<\/a> to people.<\/p>\n<p>Construct validity, on the other hand, is about meaning. Psychologists and social scientists rely on it when they ask whether a test or survey really captures the idea it\u2019s supposed to measure. Does a <a href=\"https:\/\/doi.org\/10.1016\/j.euroecorev.2021.103736\">grit scale<\/a> actually reflect perseverance or just stubbornness? <\/p>\n<p>Finally, ecological validity asks whether something works in the real world rather than just under ideal lab conditions. A behavioral model or AI system might perform brilliantly in simulation but fail once human behavior, noisy data or institutional complexity enter the picture. <\/p>\n<p>Across all these types of validity, the goal is the same: ensuring that scientific tools \u2013 from lab experiments to algorithms \u2013 connect faithfully to the reality they aim to explain.<\/p>\n<h2>Evaluating technology claims<\/h2>\n<p>We developed a method to help researchers across disciplines clearly test the reliability and effectiveness of their inventions and theories. The <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2503.09466\">design science validity framework<\/a> identifies three critical kinds of claims researchers usually make about the utility of a technology, innovation, theory, model or method.<\/p>\n<p>First, a <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2503.09466\">criterion claim<\/a> asserts that a discovery delivers beneficial outcomes, typically by outperforming current standards. These claims justify the technology\u2019s utility by showing clear advantages over existing alternatives. 
<\/p>\n<p>For example, developers of generative AI models such as ChatGPT may see higher engagement with the technology the more it flatters and agrees with the user. As a result, they may program the technology to be more affirming \u2013 a feature <a href=\"https:\/\/doi.org\/10.1007\/978-3-031-92611-2_5\">called sycophancy<\/a> \u2013 in order to <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2510.01395\">increase user retention<\/a>. The AI models meet the criterion claim that users find them <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2510.01395\">more flattering than talking to people<\/a>. However, this does little to improve the technology\u2019s efficacy in tasks such as helping resolve <a href=\"https:\/\/theconversation.com\/do-you-talk-to-ai-when-youre-feeling-down-heres-where-chatbots-get-their-therapy-advice-257732\">mental health issues<\/a> or relationship problems. <\/p>\n<figure> <iframe loading=\"lazy\" width=\"440\" height=\"260\" src=\"https:\/\/www.youtube.com\/embed\/V5-mnu2BDGk?wmode=transparent&amp;start=0\" frameborder=\"0\" allowfullscreen=\"\"><\/iframe><figcaption><span class=\"caption\">AI sycophancy can lead users to break relationships rather than repair them.<\/span><\/figcaption><\/figure>\n<p>Second, a <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2503.09466\">causal claim<\/a> addresses how specific components or features of a technology directly contribute to its success or failure. In other words, it is a claim that shows researchers know what makes a technology effective and exactly why it works.<\/p>\n<p>Looking at AI models and excessive flattery, researchers found that interacting with more sycophantic models <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2510.01395\">reduced users\u2019 willingness to repair<\/a> interpersonal conflict and increased their conviction of being in the right. The causal claim here is that the AI feature of sycophancy reduces a user\u2019s desire to repair conflict. 
<\/p>\n<p>Third, a <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2503.09466\">context claim<\/a> specifies where and under what conditions a technology is expected to function effectively. These claims explore whether the benefits of a technology or system generalize beyond the lab and can reach other populations and settings. <\/p>\n<p>In the same study, researchers examined how excessive flattery affected user actions in other datasets, including the \u201cAm I the Asshole\u201d community on Reddit. They found that AI models were <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2510.01395\">more affirming of user decisions<\/a> than people were, even when the user was describing manipulative or harmful behavior. This supports the context claim that sycophantic behavior from an AI model applies across different conversational contexts and populations.<\/p>\n<h2>Measuring validity as a consumer<\/h2>\n<p>Understanding the validity of scientific innovations and consumer technologies is critical for scientists and the general public. For scientists, it\u2019s a road map to ensure their inventions are rigorously evaluated. And for the public, it means knowing that the tools and systems they depend on \u2013 such as health apps, medications and financial platforms \u2013 are truly safe, effective and beneficial. <\/p>\n<p>Here\u2019s how you can use validity to understand the scientific and technological innovations happening around you.<\/p>\n<p>Because it is difficult to compare every feature of two technologies against each other, focus on the features you value most in a technology or model. For example, do you care more about a chatbot\u2019s accuracy or its privacy protections? Then examine the claims made about that feature and check whether the technology is as good as claimed. <\/p>\n<p>Consider not only the types of claims made for a technology but also which claims are not made. For example, does a chatbot company address bias in its model? 
Noticing which claims are missing is key to knowing whether you\u2019re looking at untested and potentially unsafe hype or a genuine advancement.<\/p>\n<p>By understanding validity, organizations and consumers can cut through the hype and get to the truth behind the latest technologies.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\" alt=\"The Conversation\" width=\"1\" height=\"1\" class=\"lazyload\" data-src=\"https:\/\/counter.theconversation.com\/content\/259030\/count.gif\"> <\/p>\n<p class=\"fine-print\"><em><span>The authors do not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and have disclosed no relevant affiliations beyond their academic appointment.<\/span><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scientists across all fields make various types of claims about their innovations. Validity tests check whether they deliver on what they promise.<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[],"class_list":["post-1658","post","type-post","status-publish","format-standard","hentry","category-europe"],"_links":{"self":[{"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/posts\/1658","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/comments?post=1658"}],"version-history":[{"count":0,"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/posts\/1658\/revisions"}],"wp:attachment":[{"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/media?parent=1658"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/
ayercut.com\/index.php\/wp-json\/wp\/v2\/categories?post=1658"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/ayercut.com\/index.php\/wp-json\/wp\/v2\/tags?post=1658"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}