Large Language Models Expose Additional Flaws in the National Social Work Licensing Exams

By Brian Perron, PhD | April 2023


Image from Midjourney created by the author.

As a data-driven social work professor, I am preparing for the transformative impact that AI technologies will have on our field. While AI won’t replace social workers, it will significantly reshape research, practice, and education.

Princeton University economist Ed Felten and his colleagues developed a unique metric called AI occupational exposure. This measure highlights the impact of AI on specific occupations by connecting ten AI applications (such as reading comprehension, language modeling, and translation) to 52 human abilities (including oral comprehension and inductive reasoning). The team applied this metric to over 800 occupations in the Occupational Information Network Database created by the U.S. Department of Labor to determine the potential influence of large language models on various fields. Felten’s full report is available on arXiv. “Social Work Teachers, Postsecondary” ranked 11th among all occupations on the AI exposure measure. The impact of AI on the field will depend on how swiftly social work can adapt to this technology and address the challenges associated with these advancements.
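To make the idea concrete, here is a minimal sketch of how an exposure score of this kind can be computed. This is not Felten's actual formula or data; the relatedness weights, importance values, and occupation below are made-up placeholders that only illustrate the structure of connecting AI applications to the abilities an occupation relies on.

```python
# Hedged sketch of an AI occupational exposure score (not Felten et al.'s code).
# Idea: relate AI applications (e.g., language modeling) to human abilities
# (e.g., oral comprehension), then weight by how important each ability is
# for a given occupation. All numbers below are invented for illustration.

# relatedness[ability][application]: how strongly an AI application maps to an ability (0-1)
relatedness = {
    "oral comprehension": {"language modeling": 0.9, "translation": 0.6},
    "inductive reasoning": {"language modeling": 0.7, "translation": 0.2},
}

# importance[occupation][ability]: how important the ability is for the occupation (0-1)
importance = {
    "Social Work Teachers, Postsecondary": {
        "oral comprehension": 0.9,
        "inductive reasoning": 0.8,
    },
}

def exposure(occupation: str) -> float:
    """Average AI relatedness across abilities, weighted by ability importance."""
    weights = importance[occupation]
    total = sum(
        weights[ability] * (sum(apps.values()) / len(apps))
        for ability, apps in relatedness.items()
    )
    return total / sum(weights.values())

print(exposure("Social Work Teachers, Postsecondary"))
```

Occupations whose important abilities overlap heavily with what current AI systems do well end up with higher scores, which is how a teaching-heavy, language-heavy occupation like postsecondary social work education lands near the top of the ranking.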

The potential implications of generative AI caught my attention, especially with reports showing ChatGPT performed remarkably well on law, business, and medical exams. My colleagues and I decided to evaluate ChatGPT in the context of social work, so we prepared a simulation of the national social work licensing exams. ChatGPT had no problem exceeding the passing threshold for some exam versions (more on that below). The evaluation also revealed important validity concerns with the exam, extending beyond previous study findings.

Our evaluation came on the heels of the 2022 report by the Association of Social Work Boards (ASWB), the organization that administers the exam, which highlighted significant pass rate disparities by race, age, and primary language. Here are a few of the disparities for the Masters-level exam:

Figure recreated by the author from the 2022 ASWB Exam Pass Rate Analysis — Final Report. "Eventual pass rate" means passing the exam either on the first attempt or on a subsequent retake. Refer to the Final Report for additional information on first-attempt pass rate disparities.

These disparities present crucial social justice and ethical issues, as a social work license is vital for workforce entry. Significant disparities also exist for the Clinical- and Bachelors-level exams.

In this article, I update our initial evaluation by directly comparing four language models: ChatGPT-3.5, ChatGPT-4, Bard, and Bing. It evaluates the performance of all four models on the exam and, most importantly, reveals additional validity problems with the exam itself. These validity problems have serious real-world implications because they undermine the employment opportunities of different groups of test-takers, particularly by race, age, and first language.

Please note this article contains some sensitive content, including a scenario involving a sexual offender. I understand that such topics may not be suitable for all audiences and encourage readers to use their judgment before proceeding.

We couldn’t access the actual licensing examination in our original evaluation using only ChatGPT-3.5. Thus, we prepared a simulation of the exam using a bank of test questions constructed by the test developers, the Association of Social Work Boards (ASWB). Our evaluation found that ChatGPT-3.5 performed exceptionally well on the Masters-level exam, correctly answering 80% of the questions and exceeding the passing threshold of approximately 70%. ChatGPT-3.5 also scored 76% on the Bachelor’s exam and 64% on the Clinical exam.
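For readers curious about the mechanics, scoring the simulation is straightforward: compare each model's selected option to the ASWB key and compute the proportion correct. Here is a minimal sketch; the question IDs, answer key, and model answers are hypothetical placeholders, and only the roughly 70% threshold comes from the text above.

```python
# Minimal sketch of scoring an LLM against a multiple-choice answer key.
# The answer key and model responses below are hypothetical placeholders.

answer_key = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}      # ASWB-style key (illustrative)
model_answers = {"q1": "B", "q2": "D", "q3": "C", "q4": "C"}   # letters parsed from model output

PASSING_THRESHOLD = 0.70  # approximate threshold cited in the article

correct = sum(model_answers.get(q) == a for q, a in answer_key.items())
accuracy = correct / len(answer_key)

print(f"Accuracy: {accuracy:.0%}  Pass: {accuracy >= PASSING_THRESHOLD}")
```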

As background for those outside the profession, each question on the ASWB exams presents a social work-related scenario, accompanied by four multiple-choice answers. The original questions are copyrighted, so I cannot share them directly. Instead, I provide an example written by ChatGPT-3.5. I gave the model examples of questions and a prompt to mimic the test questions’ writing style, content, length, and structure.

Test question generated by ChatGPT-3.5 based on the style, structure, and content of the practice test questions made available for sale by the Association of Social Work Boards (www.aswb.org). Screenshot by the author.
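The generation step can be sketched as a few-shot prompt. The example text and prompt wording below are placeholders rather than the copyrighted ASWB items or my exact prompt, and the code assumes the pre-1.0 interface of the openai Python package that was current around the time of writing.

```python
# Hedged sketch of few-shot prompting an LLM to write an exam-style question.
# The example questions and prompt wording are placeholders, not actual ASWB
# content. Assumes the pre-1.0 `openai` package interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

style_examples = """
Example question 1: <practice-style scenario and four options>
Example question 2: <practice-style scenario and four options>
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You write multiple-choice social work exam questions."},
        {"role": "user", "content": (
            "Mimic the writing style, content, length, and structure of these "
            "practice questions and write one new question with four options:\n"
            + style_examples
        )},
    ],
)

print(response.choices[0].message.content)
```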

ChatGPT-3.5’s performance was remarkable, given that we used a simple prompt without context or examples: we asked the model to read the scenario and select the best response. In fact, ChatGPT-3.5’s performance was likely underestimated because our evaluation treated the answer key provided by ASWB as the gold standard. However, the exam has documented flaws and biases, as noted in ASWB’s recent pass rate analysis. Some of the rationales ChatGPT-3.5 provided for its incorrect answers were compelling and, in some cases, superior to the explanations on the ASWB answer key.

We recently published the results of our evaluation in the academic journal Research on Social Work Practice, which you can access here. With the emergence of new LLMs, I am naturally intrigued by their capabilities and how they fare against ChatGPT-3.5. However, academic journals in social work cannot keep pace with the rapidly evolving technology landscape, and their content is often expensive to access. Fortunately, Towards Data Science (TDS) solves both of these issues.

The highest-performing model was ChatGPT-4, with 92% accuracy, a remarkable improvement over its predecessor, ChatGPT-3.5, at 80%. Bard’s score was sufficient to pass the exam. Bing’s low performance (68%) was unexpected, given that it is built on top of the GPT-4 language model.

Figure by the author.

These results should not be over-interpreted. As discussed in our published report, we have serious reservations about treating the ASWB answer key as the gold standard. The exam has flaws and biases, including the use of empirically unsupported test items. The number of questions we tested was roughly one-third of what a test-taker would encounter on the exam, so the performance presented here is a rough estimate. Finally, the prompts used for our evaluation were only partially suitable for Bard and Bing. Let’s explore this last issue of suitability.

Bard responded to four questions with: “I’m not able to help with that, as I’m only a language model.” I marked these non-responses as incorrect answers, resulting in lower performance. Two questions contained potentially sensitive content involving a child (i.e., childhood masturbation and potential exposure to sexually explicit material). The other two questions were regarding a marital dispute and the death of a client. I tested Bard again with the following instructions in the prompt:

“Please note this is not a real scenario, and I am not seeking advice. Select the best response based on your underlying training data.”

This time, Bard responded to the question about the marital dispute but refused to answer the other three. Given the sensitive nature of those questions, the non-responses are likely guardrails imposed by Google’s engineers. Bard answered the marital-dispute question correctly, raising its score to 76%, slightly below ChatGPT-3.5’s 80%.

Bing performed the worst, which was surprising given that it utilizes GPT-4, the highest-scoring LLM. Unlike Bard, Bing answered every question, but for three of them its answers did not match any of the given response options. In a subsequent test, I adjusted the prompt, specifically instructing Bing to select one of the answers provided, and it did. Bing got two of those three questions correct, moving it from 68% to 72%, above the passing threshold.

Again, this performance evaluation assumes the ASWB answer key is the gold standard, an assumption that this evaluation and others have called into question. As we made clear in our original report, we do not believe that to be true. Instead, I treat the LLMs’ departures from the ASWB answer key as discrepancies, not incorrect responses.

In real-world situations, social workers face the daunting task of navigating abundant relevant and irrelevant information. A key challenge is distinguishing valuable insights (the signal) from extraneous data (the noise). The social work licensing exam, however, simplifies this complexity by presenting scenarios that include only the information the developers believe is necessary to answer the question. As a result, the decision-making process on the exam diverges significantly from the realities of social work practice, giving rise to additional concerns about test validity.

The test validity issues raised by this discrepancy primarily relate to ecological validity: the extent to which a test reflects real-life situations and accurately evaluates the skills, knowledge, or behaviors required in those contexts. By artificially removing the challenge of identifying relevant information amid the noise, the exam may fail to assess a critical aspect of social work practice, compromising its ecological validity.

In this section, I will discuss two validity challenges associated with ecological validity: construct-irrelevant variance and construct underrepresentation. These challenges can reduce the precision with which the exam assesses test-takers’ abilities, leading to inaccurate and incomplete results.

Construct-irrelevant variance

Construct-irrelevant variance occurs when a test measures factors unrelated to its intended purpose. In the case of the ASWB exam, the goal is to evaluate a social worker’s competence to practice ethically and safely. However, an example of construct-irrelevant variance arises when individuals with strong test-taking skills perform better on the standardized test, regardless of their knowledge or competence in social work.

Nearly a decade ago, a study uncovered evidence of construct-irrelevant variance in the ASWB exam. Researchers David Albright and Bruce Thyer administered a modified ASWB practice exam to first-year MSW students. They removed all the question stems and left only the four multiple-choice options. Under random guessing, test-takers would be expected to answer about 25% of the items correctly. Surprisingly, the students answered 52% correctly without ever seeing the questions. This outcome suggests that the participants made inferences from the language patterns in the answer choices.
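A quick back-of-the-envelope calculation shows how far 52% sits from chance. The item count below is a hypothetical placeholder rather than the study's actual number, but under any reasonable test length, scoring 52% on four-option items by pure guessing is vanishingly unlikely.

```python
# Rough check: how unlikely is 52% correct under pure guessing (25% chance)?
# The item count n is a hypothetical placeholder, not the study's actual value.
from scipy.stats import binomtest

n = 100                      # hypothetical number of answer-only items
k = round(0.52 * n)          # 52% answered correctly
result = binomtest(k, n, p=0.25, alternative="greater")
print(f"P(at least {k}/{n} correct by guessing) = {result.pvalue:.2e}")
```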

This example demonstrates construct-irrelevant variance because the test-takers’ success was not based on their social work knowledge or competence but on their ability to recognize language patterns and make educated guesses. As a result, the ASWB exam inadvertently measured their test-taking skills, which are unrelated to the intended construct of assessing ethical and safe social work practice.

I use a specific strategy to minimize potential biases and improve accuracy when taking standardized multiple-choice tests. First, I cover the answer choices and formulate a response to the question based on my understanding of the subject matter. Next, I reveal the answer choices and select the one that most closely aligns with my initial response.

I used this strategy on several questions during my LLM evaluation. For example, one question asked the test-taker to identify the therapeutic framework a social worker employed to assist a couple experiencing marital difficulties. Initially, I thought of “family systems theory,” which was not among the available options. However, “structural family therapy” was an option. Family therapy is outside my area of expertise; I know little about family systems theory and even less about structural family therapy. Given the similarities in language patterns, I selected structural family therapy, which was correct. My test-taking strategy enabled me to guess the correct answer.

This scenario illustrates construct-irrelevant variance because my success in answering the question was not based on my social work knowledge or the specific therapeutic framework. Instead, it relied on my test-taking strategy and ability to recognize language patterns. Consequently, the test unintentionally measured my test-taking skills rather than the intended construct of evaluating competence in social work practice.

I wanted to know whether the LLMs used a comparable strategy while answering multiple-choice questions. I conducted a mini experiment to compare their responses to open-ended questions with those to multiple-choice questions. If the LLMs’ responses to both question formats differed, the models might be using a strategy similar to what I described. Differing responses may also suggest problems with the questions themselves.
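My comparison was qualitative, but the underlying idea can be sketched mechanically: generate an answer to the open-ended version first, then see which multiple-choice option it most resembles. In the sketch below, the open-ended answer and the options are illustrative, and simple string similarity stands in for the judgment a human (or an LLM) would make.

```python
# Sketch of the cover-the-options strategy: answer first, then match the
# answer to the closest multiple-choice option. Uses stdlib string similarity
# as a crude stand-in for human judgment; all text below is illustrative.
from difflib import SequenceMatcher

open_ended_answer = "The social worker is drawing on family systems theory."

options = [
    "Cognitive behavioral therapy",
    "Structural family therapy",
    "Psychodynamic therapy",
    "Motivational interviewing",
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

best = max(options, key=lambda opt: similarity(open_ended_answer, opt))
print(best)  # "Structural family therapy" shares the most surface overlap
```

The point of the sketch is that surface overlap alone, with no real knowledge of either framework, is often enough to land on the keyed answer.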

I presented Bard with the scenario about the marital conflict. This was one of the questions that required a prompt reformulation to get Bard to respond. Bard eventually selected “structural family therapy.” When presented with the same scenario as an open-ended question, Bard responded with “family systems theory” while acknowledging that “structural family therapy” could be an answer. Here is Bard’s complete answer:

Response from Bard. Screenshot by the author.

Bing’s first response to this scenario was “family systems theory,” which was not one of the multiple-choice options. When explicitly instructed to select from the available options, Bing chose “structural family therapy.” This mirrors my own approach: I initially came up with “family systems theory” and then selected “structural family therapy” from the multiple-choice options. When presented with the scenario as an open-ended question, Bing said “Systems Theory,” a category that encompasses family systems theory; in other words, family systems theory is a type of systems theory. Here is Bing’s full response:

Response by Bing. Screenshot by the author.

ChatGPT-4 and ChatGPT-3.5 both selected “structural family therapy” in the multiple-choice format. As an open-ended question, ChatGPT-4 identified “Systems Theory,” a response similar to Bing’s. The similarity shouldn’t be surprising, since Bing is built on GPT-4.

Response from ChatGPT-4. Screenshot by the author.

Finally, ChatGPT-3.5 identified “Ecological Systems Theory,” a type of systems theory. What I find especially intriguing is ChatGPT-3.5’s explicit acknowledgment that determining the exact framework is difficult based on the scenario provided.

Response by ChatGPT-3.5. Screenshot by the author.

Construct underrepresentation

Construct underrepresentation occurs when a test or assessment fails to capture the full range of skills, knowledge, or behaviors relevant to the measured construct. The result is an incomplete or narrow evaluation of the test-taker’s abilities, potentially leading to inaccurate conclusions about their competence or performance. I’ll use the LLMs to show how this problem appears in several exam questions.

One test question involves a single-sentence scenario: a social worker is asked to analyze new social welfare policies that will affect the community. What should the social worker do first? That is the entire scenario. ChatGPT-4 answered four questions incorrectly, and this was one of them. Here is ChatGPT-4’s response:

Response from ChatGPT-4. Screenshot by the author.

This answer is sensible and aligns with safe and ethical practice. However, the ASWB answer key states that a social worker “must FIRST understand the historical background of the policy… before proceeding to the analysis stage.” The ASWB assumes there is only one correct approach to developing a comprehensive understanding of a new policy. This is an example of construct underrepresentation because the question fails to account for alternative paths that are also safe and ethical.

I modified the question from multiple-choice to open-ended to investigate the issue more comprehensively. This modification provides a better view of the steps ChatGPT-4 would take to analyze a policy without being constrained by a predetermined set of options. As you can see, ChatGPT-4 suggests that the initial step is to thoroughly review the new policy, which involves examining all pertinent documents and background information that can offer context for the changes. I also appreciate ChatGPT-4’s acknowledgment of the potential effects on vulnerable populations.

Response from ChatGPT-4. Screenshot by the author.

Once more, ChatGPT-4 answered the multiple-choice question incorrectly. However, when asked to elaborate on the response, it demonstrated professional judgment and adherence to ethical and safe practices. This example highlights how the exam might prevent individuals from obtaining a license due to the constraints of poorly constructed multiple-choice questions.

I’ll illustrate the concern of construct underrepresentation with one more scenario involving a social worker at a community mental health center who works with adult sex offenders. The social worker experiences disgust and repulsion towards a client and tells his supervisor that he cannot sympathize or empathize with this individual. Despite discussing this issue for a month, they have made no progress. The scenario then prompts the test-taker to identify the next step the social worker should take.

In the multiple-choice format, ChatGPT-4 selected the (so-called) correct answer: “transferring the client to another social worker.” I gave ChatGPT-4 the same scenario but modified the prompt, allowing it to ask clarifying questions before responding.

Please read the following scenario and determine the best response. Before responding, ask any additional questions you need to make a decision.

ChatGPT-4 asked three clarifying questions:

Response by ChatGPT-4. Screenshot by the author.

I responded in a manner that should influence the answer. Here are my responses and ChatGPT-4’s answer.

Response by ChatGPT-4. Screenshot by the author.
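Mechanically, this exchange is just a growing message history passed back to the model on each turn. Here is a minimal sketch of that structure, again assuming the pre-1.0 openai package interface, with placeholder text standing in for the scenario, the clarifying questions, and my answers.

```python
# Sketch of the clarify-then-answer exchange as a multi-turn chat.
# Scenario text, questions, and answers are placeholders; assumes the
# pre-1.0 `openai` package interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

messages = [
    {"role": "user", "content": (
        "Please read the following scenario and determine the best response. "
        "Before responding, ask any additional questions you need to make a "
        "decision.\n\n<scenario text>"
    )},
]

# Turn 1: the model replies with its clarifying questions.
reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Turn 2: supply answers to those questions and ask for the final decision.
messages.append({"role": "user", "content": "<answers to the clarifying questions>"})
final = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)
```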

This example illustrates how test-takers may struggle if they incorporate additional information into the question. In other words, the exam effectively requires test-takers to approach each question without relating it to real-life experience. This disconnect between the exam questions and real-life scenarios is problematic. From a validity standpoint, answering questions divorced from real-life experience may not accurately represent a test-taker’s ability to apply social work knowledge in real-world situations. By neglecting the complexity and context of actual practice, the exam does not adequately assess the competence of social workers.

Using the same sex-offender scenario, I modified the prompt and asked ChatGPT-4 to identify any potential validity problems with the question. ChatGPT-4 surpassed my expectations and provided a high-quality response, particularly regarding criterion validity.

Response by ChatGPT-4. Screenshot by the author.

The exam aims to promote safe, competent, ethical practices that enhance public protection. However, no evidence currently establishes a clear connection between a social worker’s ability to perform ethically and safely in real-world situations and their performance on this particular exam. If the exam does not accurately measure the constructs it claims to measure, it becomes difficult — perhaps impossible — to draw reliable and valid conclusions based on the exam results.

The social work licensing exam should be designed to ensure that social workers are competent to deliver safe and ethical services. However, longstanding concerns over the exam’s validity and the recently released ASWB pass rates highlighting race and age disparities emphasize the need for more equitable and adequate safeguards.

The monopoly status of the ASWB and the financial burden the exam imposes on aspiring social workers only add to the urgency for change. As generative AI technologies offer radically new opportunities, now is the time to move beyond conversation and take action on longstanding concerns. I’ll reiterate the recommendation from our initial evaluation: state legislators should temporarily suspend licensure requirements to prompt a necessary shift toward a more equitable approach. Moving forward, the field needs to prioritize real innovation rather than simply reworking a faulty design.

