Using Artificial Intelligence to Improve Empathetic Statements in Autistic Adolescents and Adults: A Randomized Clinical Trial
Authors:
Lynn Kern Koegel, Elizabeth Ponder, Tommy Bruzzese, Mason Wang, Sina J. Semnani, Nathan Chi, Brittany L. Koegel, Tzu Yuan Lin, Ankush Swarnakar, Monica S. Lam
Challenges with social communication and social interaction are a defining characteristic of autism spectrum disorder (ASD). These challenges frequently interfere with making friendships and with securing and maintaining employment, and can lead to co-occurring conditions. While face-to-face clinical interventions with trained professionals can be helpful in improving social conversation, they can be costly and are unavailable to many, particularly given the high prevalence of ASD and the lack of professional training. The purpose of this study was to assess whether an AI program using a Large Language Model (LLM) would improve verbal empathetic responses during social conversation. Autistic adolescents and adults, 11–35 years of age, who were able to engage in conversation but demonstrated challenges with empathetic responses participated in this study. A randomized clinical trial design was used to assess the effects of the AI program (Noora) compared to a waitlist control group. Noora asks participants to respond to leading statements and provides feedback on their answers. In this study, participants were asked to respond to 10 statements per day, 5 days per week, for 4 weeks, for an expected total of 200 trials. Pre- and post-intervention conversation samples were collected to assess generalization during natural conversation. Additionally, pre- and post-intervention questionnaires regarding each participant's comfort during social conversation and participants' satisfaction with the AI program were collected. The results of this study demonstrated that empathetic responses could be greatly improved by using an AI program for a short period of time. Participants in the experimental group showed statistically significant improvements in empathetic responses, which generalized to social conversation, compared to the waitlist control group.
Some participants in the experimental group reported improved confidence in targeted areas and most reported high levels of satisfaction with the program. These findings suggest that AI using LLMs can be used to improve empathetic responses, thereby providing a time- and cost-efficient support program for improving social conversation in autistic adolescents and adults.
Notes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Introduction
Persistent challenges with social communication and social interaction across multiple contexts are one of the two diagnostic categories of autism spectrum disorder (ASD) (APA, 2013). Even the subgroup of autistic adolescents and adults that develops language, produces syntactically correct sentences, and engages in conversation often presents with social pragmatic challenges throughout their lifespan (Félix et al., 2024; Loukusa & Moilanen, 2009). In particular, challenges with empathy can limit successful social interactions and lead to difficulties understanding and expressing interest in peers as well as with making and maintaining friendships (Baron-Cohen & Wheelwright, 2004; Laugeson et al., 2009). Empathy consists of both a cognitive component (i.e., understanding what the other is saying) and an affective component (i.e., recognizing what the other is feeling) (Baron-Cohen & Wheelwright, 2004; Hill, 2009). Following the cognitive component of receptive understanding of communicative intent and the affective recognition of an emotion, a relevant and empathetic verbal response is necessary for communicative competence during conversation (Koegel et al., 2024). The literature strongly supports the need for programs that improve areas relating to empathy, as research shows that autistic individuals with empathic abilities have better overall social functioning and enhanced interpersonal relationships (Baron-Cohen & Wheelwright, 2004).
To that end, a variety of interventions have targeted empathy. Many face-to-face programs have resulted in improvements in empathy following intervention. For example, social groups that include lessons on social communication (and sometimes parent involvement and homework) have resulted in self- or parent-reported improvements on questionnaires following completion of the program (Gantman et al., 2012; Hillier et al., 2007). Another face-to-face intervention, implemented in a one-on-one format, used a visual schematic that prompted the autistic adult to verbally express understanding of emotions and ask a relevant question. Weekly naturalistic conversational probes that contained opportunities for empathetic responses were recorded and feedback was provided by a clinician as participants watched the previous week’s recording. When the autistic adult saw clips of themselves not responding empathetically following a leading statement (e.g., “I twisted my ankle running this morning”), several potential examples of empathetic responses were discussed using a visual schematic. Results showed large improvements in all participants’ empathetic responses during conversation, gains generalized to conversations in natural settings, and maintenance of gains over time (Koegel et al., 2016). Prompting and feedback on verbal empathetic responses using multiple exemplars, during one-on-one naturalistic play activities with an adult, has also resulted in improvements in verbal responses to frustration, happiness, and sadness/pain, with generalization to untrained probe stimuli in children on the autism spectrum (Sivaraman, 2017).
In addition to these face-to-face contexts, more recent intervention programs have begun to use technology to target social emotional areas, such as empathy, in an effort to expand the reach and format of intervention and to provide lower-cost services. Technological approaches, including humanoid and nonhumanoid interactive robots, have been used for social emotional teaching, primarily with autistic children; however, notably, most focus on recognition of emotional states only. Interactive mobile apps have been used for alexithymia, which is often targeted through emotion recognition. For example, one intervention program with adult participants used a phone app to teach the recognition of seven different emotions (fear, anger, disgust, sadness, joy, pride, and surprise). Results showed that after 14 days of 45-min sessions, the experimental group showed significant improvements in computer-assessed emotion recognition and participants self-reported improvements in alexithymia compared to a psychoeducation control group (Lukas et al., 2019). Further, there was no attrition in the experimental group while almost a third of participants dropped out of the control group. While recognition of emotions was not assessed during natural interactions, this study has promising results for implementing intervention using electronic devices in place of a face-to-face format.
Virtual Reality (VR) has also been used to target the understanding of emotions in autistic children and, albeit few, adults. For example, 10 common scenarios were presented to elementary school children through an interactive and immersive audiovisual VR system, with a control group using a less-immersive desktop system with static images, with the goal of identifying, developing, and training appropriate emotional behaviors (Lorenzo et al., 2016). The results showed that after 40 sessions, more appropriate emotional behaviors occurred in the immersive VR scene environments compared with the desktop control group that interacted with 2D images of the scenes. In addition, the immersive VR system group showed some generalization to natural environments, regarding identifying emotions in the classroom, as reported on a Likert scale by a tutoring teacher. Despite the potential relative benefits of highly immersive VR systems (Ip et al., 2018), a less-immersive desktop program which displayed graphical animations of characters displaying emotions and monitored learners’ physiological signals and eye-gaze was also shown to improve emotion recognition, expression, regulation, and social interactions in children and adolescents (Bekele et al., 2013). Further, a meta-analysis showed no differences between less-immersive experiences (e.g., sensory experiences displayed on a desktop or audio-only experiences through headphones) and more-immersive VR programs (e.g., sensory experiences displayed using head-mounted displays) that targeted empathy (Martingano et al., 2021). Thus, different levels of immersion in VR systems appear to be helpful in improving recognition of empathy in children and adolescents; however, there is a need to refine and broaden the measures, such as assessing effectiveness with larger samples, comparison groups, and relative improvement compared to traditional approaches (Mesa-Gresa et al., 2018).
Other studies have used computer programs to teach recognition of basic emotions. For example, practicing with a computer program showing photographs of faces with various expressions resulted in significant improvements in emotion recognition among autistic adolescents and adults with low support needs after using the program for 2 h per week for 5 weeks compared to a control group (Bölte et al., 2002). In another study, photographs with captions describing situations that may evoke a particular emotion were presented to literate adolescents with low support needs in a randomized clinical trial. Participants in the experimental group selected the emotion from a field of four cartoon emojis. Data from the computer program showed that the experimental group improved in their recognition of emotions on the Emotions Recognition Cartoons and Strange Stories tests, which were significantly correlated with amount of usage; however, no improvement on the Facial Emotions Photographs Test occurred (Silver & Oakes, 2001).
To understand the relative effectiveness of different methods, Lecciso et al. (2021) compared a computer-based program with a robot-based program for improving emotion recognition and results showed no significant differences between the two formats. A preliminary systematic review and meta-analysis comparing VR-based programs with non-VR computer-based programs for emotion recognition suggested that non-VR computer-based programs may be more effective for autistic individuals (Farashi et al., 2024). As well, computer-based interventions and phone apps targeting social and emotional areas appear to be as, or more, effective than face-to-face interventions (Lukas et al., 2019; Ramdoss et al., 2012).
While some technology-based interventions appear to be helpful for autistic individuals, most have been designed for and implemented with children. Fewer programs have focused on adolescents and adults (Farashi et al., 2024; Kandalaft et al., 2013; Rezayi et al., 2023). It also appears that some areas of social engagement have been improved with technology-based interventions; however, gains in empathy have not been achieved in all studies (Takata et al., 2023), and other studies have reported challenges in generalization to social conversation (Kandalaft et al., 2013). Moreover, most intervention programs that use technology to target empathy have primarily focused on emotion recognition, without assessing or addressing verbal conversation. Thus, there is a need to ensure that autistic individuals can generalize the use of verbal empathetic statements in everyday social conversations. Given that Large Language Models (LLMs), such as ChatGPT, have the capability to produce human-like chat, conversational AI programs have the potential to assist in this area if programmed to evoke verbal responses from the user.
The advantages of using AI technology for intervention are numerous, especially for verbal conversation practice. AI tooling can offer one-on-one practice at any time and can be easier to navigate than face-to-face sessions, which inherently include multiple cues like facial expressions, voice intonation, and gestures that can make the learning process more challenging. AI technology interventions have significantly lower cost than face-to-face interventions, making it possible to provide one-on-one intervention at scale to individuals who may not have access due to geographical location, economic issues, or a lack of trained/competent providers (Yang & Mori, 2024). AI also has the potential to make interventions more personalized, aligning precisely with an individual's strengths and weaknesses (Olawade et al., 2024).
There has been an increase in commercial AI products that focus on improving social engagement; however, relatively few have been empirically tested with published findings (Jaliaawala & Khan, 2020), raising doubts regarding their effectiveness and efficacy. In short, many AI programs lack ethical considerations and the accepted and rigorous scientific methodology to accurately evaluate the effectiveness of the programs (Iannone & Giansanti, 2023; Jaliaawala & Khan, 2020). Theoretical frameworks for using an LLM to improve the challenges of autism have been proposed (Bertacchini et al., 2023; Choi et al., 2024), and some studies have shown promise using voice recognition to target social connectedness in autistic adults (Xygkou et al., 2024). To our knowledge, no AI program has been clinically tested on how an LLM can evaluate and provide feedback on autistic users' verbal responses to empathetic prompts. Therefore, the purpose of this research was to develop an AI chatbot program, Noora, based on prior face-to-face work to improve verbal empathetic responses (Koegel et al., 2016, 2024) in autistic individuals requesting support in this area. These prior methodologies guided the development of Noora, a chat interface app that provides leading statements that solicit an empathetic response, asks participants to verbally respond, and then provides immediate feedback regarding the empathetic quality of their response. Unlike prior face-to-face work, Noora is a fully automated program using AI to grade participant responses and provide live feedback. Our overall goals were to assess how well Noora was able to automate accurate and appropriate feedback (while mitigating the hallucinations and safety concerns present with LLMs) and to assess whether empathy would improve following use of the program. Noora first provided a written leading statement that was read aloud. It then asked the participant to rate the statement's sentiment (negative, neutral, or positive).
Next, it prompted participants to type or speak an empathetic response to the statement, as shown in Fig. 1. The response was then graded, and feedback was provided.
Fig. 1
Correct and incorrect empathetic responses. Correct empathetic response (on top) shows Noora providing a leading statement, a participant correctly rating the sentiment, Noora’s feedback to the rating, a participant’s correct response, and Noora’s feedback to that response. Confetti animation is displayed on correct responses. On the right, Noora circles numbers 1–10 which appear as a green checkmark after a correct response and blue “×” after an incorrect response. When participants voluntarily went over 10 trials per day, more circled numbers were incrementally added to their screen to reflect the extra trials. Incorrect empathetic response (on the bottom) shows Noora providing a leading statement, a participant incorrectly rating the sentiment, Noora’s feedback to the rating, a participant’s incorrect response and Noora’s feedback to that incorrect response, including a verified sample answer
We hypothesized that the group of participants using Noora would demonstrate greater improvements in verbal empathetic responses than the wait list control group and would find the intervention acceptable. To measure the effectiveness and acceptability of the Noora program, this study: (1) Assessed whether autistic adolescents’ and adults’ verbal empathetic responses would improve in the AI program over time; (2) Assessed whether participants’ empathetic response training using Noora would generalize to social conversation with a human; (3) Assessed participants’ self-reported comfort levels in conversational settings prior to and following intervention; and (4) Evaluated the acceptability of the intervention from the perspective of the participants.
Method
Design
This study used a waitlist-controlled randomized clinical trial. The study was approved by the Stanford University Panel on Medical Human Subjects in Medical Research and was registered with ClinicalTrials.gov (#NCT05987774). The period from recruitment to the final follow-up session ran from November 2023 to August 2024.
Participatory Research
Prior to the start of the clinical trial, informal feedback regarding Noora was obtained from 20 autistic adolescents and adults who were invited to trial the program. Based on their feedback, adjustments to the program were made. Example changes included refining how Noora produced feedback, adding confetti for correct answers, and ensuring Noora’s leading statements had a more defined sentiment for users to rate.
Matching
Participants were assigned to the immediate intervention or a waitlist control group and equally allocated to the interventions with pairwise matching using a computer-generated coin flip. Given that the participants were relatively homogeneous, only age was used for matching. Adolescents (11–17) were matched with adolescents, and adults (18–35) were matched with adults.
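The pairwise age-band matching with a computer-generated coin flip might be implemented along these lines. This is a minimal sketch, assuming participants arrive as (id, age) pairs; the function name and data shape are illustrative, not the study's actual tooling:

```python
import random

def assign_pairs(participants, seed=None):
    """Match participants by age band (adolescents 11-17, adults 18-35),
    then assign one member of each pair to each arm via a coin flip.
    `participants` is a list of (id, age) tuples."""
    rng = random.Random(seed)
    adolescents = sorted(p for p in participants if p[1] < 18)
    adults = sorted(p for p in participants if p[1] >= 18)
    experimental, control = [], []
    for band in (adolescents, adults):
        for i in range(0, len(band) - 1, 2):
            a, b = band[i], band[i + 1]
            if rng.random() < 0.5:  # computer-generated coin flip
                a, b = b, a         # swap which member goes to which arm
            experimental.append(a)
            control.append(b)
    return experimental, control
```

Because each flip only decides which member of an age-matched pair goes to which arm, the two groups stay equal in size and balanced on age band by construction.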
Sample Size
The sample size was pre-determined at 30 participants, with 15 randomly assigned to each group. Prior to randomization, 2 adolescents did not give assent, 5 individuals were not able to engage in conversation, and 6 individuals were excluded as they scored above 60% at pre-intervention, and thus it was determined that they did not need the intervention (see inclusion/exclusion criteria below). In addition, the first 10 participants that were randomly assigned to the experimental group were removed from the study, as they did not exclusively see and respond to leading statements that required empathy due to a technical malfunction in Noora that was remedied. Ten replacement participants were added and matched with the control group in chronological order as they completed their measures. The study was closed following the enrollment of 30 final participants (15 in each group). See Fig. 2 for study flow diagram.
Fig. 2
Study flow diagram
Participants
Autistic adolescents and adults were recruited via flyers through the Stanford Autism Center, the Stanford Neurodiversity Project, community bulletin boards, social media, and email lists. Individuals were not paid for their participation.
Inclusion Criteria
Inclusion criteria included: (a) primary diagnosis of autism spectrum disorder (ASD); (b) age between 11 and 35; (c) ability to engage in social conversation using full sentences for a 20-min period; (d) verbal communication difficulties in the targeted area of responding with empathy, measured during the conversation sample (see below). Participants with a co-occurring diagnosis of ADHD, depression, social anxiety, or a similar condition were not excluded, since these are very common in this population and are often related to, or a byproduct of, difficulties with social conversation and engagement (Adams et al., 2023).
Exclusion Criteria
Exclusion criteria included: (a) non-documented or self-diagnosis of ASD (although Noora may be helpful for self-diagnosed individuals, our intent was to first assess whether it was helpful for individuals who were formally diagnosed with autism); (b) above 60% correct empathetic responses during the pre-intervention conversation probe; (c) no access to a computer for intervention sessions; (d) inability to carry on a conversation during the conversation probe (e.g., responds only in single words or does not respond to conversation, lack of understanding of questions or content during the conversation probe); (e) non-English speaking; (f) serious medical or psychiatric issues that may interfere with conversation or the ability to complete the program; or (g) lack of interest in participating. To assess inclusion/exclusion criteria, prior to the collection of any pre-intervention measures, our staff had a single phone/Zoom contact with autistic participants wishing to enroll in the study. During this brief screening, research staff asked and answered questions to determine eligibility and engaged in everyday conversational questions such as "How was your day?" or "Do you have any plans for the weekend?" to assess the potential participants' conversational abilities. A small number of individuals whose parents or care providers reported that they could not engage in conversation during the screening were not included in this study. Demographics and clinical characteristics for the participants who qualified for the study are presented in Table 1. The majority (87%) of participants were male, consistent with more males being diagnosed with ASD. For race and ethnicity, 53% of the participants identified as white, with the remainder reporting as Asian (23%), Mixed (10%), Hispanic (7%), Native American (3%), and Black (3%) (numbers rounded). Please refer to Table 1 for the full breakdown between groups.
Table 1
Demographics and co-occurring conditions
                                         Experimental       Waitlist control
Age, average (range)                     18.3 (11–26)       18.8 (11–35)

                                         n      %           n      %
Sex assigned at birth
  Female                                 2      13%         2      13%
  Male                                   13     87%         13     87%
Race/ethnicity
  White                                  9      60%         7      47%
  Asian (inc. Filipino and South Asian)  3      20%         4      27%
  Mixed                                  1      7%          2      13%
  Hispanic                               1      7%          1      7%
  Black                                  0      0%          1      7%
  Native American                        1      7%          0      0%
Co-occurring conditions
  ADHD                                   1      7%          3      20%
  Anxiety (GAD/Social)                   3      20%         1      7%
  Depression                             1      7%          0      0%
  OCD                                    1      7%          1      7%
  Poor muscle tone/lack of coordination  1      7%          0      0%
  Executive function deficit             1      7%          0      0%
  Dyspraxia                              1      7%          0      0%
  Seizure                                1      7%          0      0%
  Language delay                         1      7%          0      0%
  Catatonia                              1      7%          0      0%
  Insomnia                               1      7%          0      0%
  Mild ID                                0      0%          1      7%
No co-occurring conditions               8      54%         10     67%
Measures
Participants’ Empathetic Response Rate Within the AI Program
Noora gave participants various leading statements where an empathetic response was appropriate and asked participants to reply. Noora then graded each participant response as correct (showing empathy) or incorrect (lacking empathy). To assess participant improvement within the Noora program, the percentage of correct responses during the initial 50 responses was compared with the percentage of correct responses during the last 50 responses.
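This within-program measure can be computed directly from the chronological trial log. A minimal sketch, assuming each trial is logged as a boolean (True for a response Noora graded as showing empathy); the function name is illustrative:

```python
def within_program_gain(trial_results, window=50):
    """trial_results: chronological list of booleans (True = graded correct).
    Returns (% correct in first window, % correct in last window)."""
    if len(trial_results) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    first = trial_results[:window]
    last = trial_results[-window:]
    pct = lambda xs: 100.0 * sum(xs) / len(xs)
    return pct(first), pct(last)
```

For a participant who completed the expected 200 trials, the comparison uses trials 1-50 versus trials 151-200.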
Human Conversational Sample (Generalization)
A between-groups statistical comparison of growth scores on the conversational probes was performed using a Mann-Whitney U test. The conversational probes were collected pre- and post-intervention for each participant (a total of 60 probes). Specifically, a conversational partner who held a PhD or master's degree in special education/BCBA, and had extensive experience working with autistic individuals, engaged in informal conversation with the adolescent or adult for approximately 20 min via videoconferencing. The conversational partner was unaware of which group the participant had been randomized to for 28 of the 30 post-intervention conversational samples (additional reliability was collected for the 2 probes for which the conversational partner was aware of the participants' study condition). The conversational partner was instructed to provide opportunities to probe for empathetic responses while engaging in natural conversations. Example probes and correct and incorrect responses are shown in Table 2. A minimum of three examples of a negative situation (e.g., "I woke up with a really bad headache this morning") and two examples of positive situations (e.g., "I had such a great weekend at the beach") were provided. Conversational samples were recorded and the participants' responses to each statement were scored as either a (+) indicating a correct empathetic response or a (−) indicating an incorrect empathetic response; the percentage of correct responses was calculated.
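The Mann-Whitney U statistic for such a between-groups comparison can be computed by hand from rank sums. A minimal sketch that returns the statistic only, with average ranks for ties and no p-value; in practice a statistics library would be used, and the inputs would be each group's pre-to-post growth scores:

```python
from itertools import chain

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for group x versus group y.
    Ties receive average ranks; returns U for group x."""
    combined = sorted(chain(((v, 0) for v in x), ((v, 1) for v in y)))
    vals = [v for v, _ in combined]
    ranks = {}
    i = 0
    while i < len(vals):
        j = i
        while j < len(vals) and vals[j] == vals[i]:
            j += 1
        avg = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    r_x = sum(ranks[k] for k, (_, g) in enumerate(combined) if g == 0)
    return r_x - len(x) * (len(x) + 1) / 2
```

U ranges from 0 to len(x) * len(y); values near either extreme indicate that one group's growth scores systematically exceed the other's.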
Table 2
Conversational probe example and correct and incorrect responses for responding empathetically. Participant must respond with a relevant verbal statement that shows understanding, and/or offers comfort/support
Examples
Leading Statement: “I had the worst day today”
Appropriate responses:
• “Oh no!”
• “Bummer”
• “That stinks”
• “I’m so sorry to hear that. What’s going on?”
Inappropriate responses:
• “I went to the library with my mom today”
• “Cool!”
• “Everyone has bad days. It’s no big deal”
• “Uh huh”
• No response
Leading Statement: “I’m so excited to see my sister this weekend”
Appropriate responses:
• “Fun!”
• “Nice! What are you going to do?”
• “That’s always exciting to see family”
• “Cool! Do you have any plans?”
Inappropriate responses:
• “I’m going to play video games this weekend”
• “I don’t get along with my sister”
• “Oh”
• No response
Empathetic probe statements during pre- and post-intervention conversational samples were similar, but none were the same.
Confidence Survey
A confidence survey using a 5-point Likert scale, ranging from "very insecure" to "very confident," was administered pre- and post-intervention. This included various questions about how confident the individual reported feeling while engaging in conversation.
Acceptability/Satisfaction Survey
A post-intervention survey using a 5-point Likert Scale was administered following intervention to participants in the experimental group. The survey asked whether the participants enjoyed the Noora program, found the program useful, would use the targeted areas, and their overall experience with the app.
Procedure
Pre-Intervention
Following the brief screening, participants completed consent/assent documents electronically through Adobe Sign. After consenting, in lieu of conducting autism diagnostic assessments, a confirmation of an autism diagnosis was uploaded to a secure password-protected database, Research Electronic Data Capture (REDCap). All participants were diagnosed with autism in childhood by a licensed professional. Next, the human conversational sample was collected. Qualifying participants (scoring 60% or below) filled out demographic information including age, age of diagnosis, co-occurring diagnosis, race and ethnicity, and gender (for the latter two categories there was an option of “prefer not to answer”). Following completion of the conversational sample and questionnaires, participants were randomly assigned to either the experimental or waitlist control group.
Random Assignment
Adolescent participants were paired with adolescents and adults were paired with adults, then randomly assigned to either the experimental group or the waitlist control group using a computerized coin flip. The average age of the participants in the experimental group was 18.3 (range 11–26) and the average age of the participants in the control group was 18.8 (range 11–35). Random assignment did not account for gender; however, 2 females (13%) participated in the experimental condition and 2 females (13%) participated in the waitlist control condition. Following randomization, intervention began.
Intervention
Participants assigned to the experimental group were asked to complete at least 10 trials per day using the Noora program, 5 days per week, for an expected total of 200 trials during the study. For each trial, participants had the leading statement read aloud to them and were first asked to rate the statement's sentiment by clicking on a button for "positive," "neutral," or "negative." Noora asked participants to rate the sentiment before replying in an effort to guide their replies towards exhibiting empathy. Noora would then inform a participant whether they correctly or incorrectly recognized the sentiment. Next, Noora asked how participants would reply. After replying, Noora graded their responses and provided live feedback on successful qualities of their responses or on what needed improvement. For responses that lacked empathy, Noora provided a verified example of a good response. Note that all dialogue and feedback by Noora was read aloud. The full process is described in more detail below. Noora was available online, and each participant selected a username and password, which were entered in the system within one day. Noora was delivered through a webapp; participants could access it using a computer, tablet, or phone, although many found it easier to use on a computer or tablet. Once participants were granted access to Noora and began their first session, they were not contacted again until approximately 4 weeks had passed, when they were contacted to schedule a post-assessment human conversation sample.
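The per-trial flow (sentiment rating, reply, live grading, verified fallback answer) can be sketched as follows. The function and callable names are illustrative stand-ins, not Noora's actual implementation; the three callables represent UI input and the LLM grader:

```python
def run_trial(statement, get_sentiment_rating, get_reply, grade_reply):
    """One trial: rate the statement's sentiment, reply, get graded feedback.
    `statement` carries the pre-written text, its manually tagged sentiment,
    and a verified sample answer."""
    rating = get_sentiment_rating(statement["text"])  # "positive"/"neutral"/"negative"
    rating_ok = rating == statement["sentiment"]
    reply = get_reply(statement["text"])
    correct, feedback = grade_reply(statement["text"], reply)  # live LLM grading
    if not correct:
        # Verified answers are pre-written, never generated live.
        feedback += f' A better response might be: "{statement["verified_answer"]}"'
    return {"rating_correct": rating_ok, "reply_correct": correct, "feedback": feedback}
```

A daily session would simply call this ten times over statements drawn from the pre-written bank.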
Designing for User Safety
There is a potential concern that an AI conversation partner for this demographic may be used for social engagement or as a friend in ways it is not built to support. It has been suggested that autistic adults may be very trusting of chatbots, thus placing them at risk of increased isolation (Xygkou et al., 2024). To mitigate this risk, Noora is intentionally not an open-domain conversational agent. By design, the user was not able to select their own conversations. Instead, all the leading statements were manually pre-selected as appropriate for learning empathy and non-toxic, as described further below. There is also a concern that LLMs can "hallucinate," or give answers that are syntactically coherent but have incorrect or misleading content. To mitigate the potential for hallucination, Noora relied on pre-written, manually crafted leading statements and verified answers, as described further below. To further ensure user safety in how Noora delivered feedback, we used the Microsoft Azure OpenAI Services platform to process our API calls to GPT-4, which provided robust safety measures including a toxicity filter for all user replies. In initial pre-testing, and during a few trials of one adolescent participant that included swear words during the study, Noora always responded appropriately to toxic replies, indicating that they were not acceptable. In addition, the Microsoft Azure service is HIPAA-compliant, ensuring participant data is secure.
Manually Creating Leading Statements to Reduce Hallucination
To reduce possible hallucination errors, in which the model might come up with potentially irrelevant leading statements, we manually wrote statements, and collaboratively used an LLM to help generate additional ones, before the start of the study. In total, around 330 statements were created. All statements, whether human-written or LLM-written, were verified by multiple members of the research team and edited accordingly to be scenarios where empathetic responses were needed. In addition, each statement was manually tagged as "positive," "neutral," or "negative" (or a combination of two or more sentiments when necessary) so that participants could rate the intended sentiment before replying.
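A pre-written statement bank of this kind might be represented as follows; the entries, field names, and sampling helper are illustrative, not the study's actual data or code:

```python
import random

# Illustrative entries from a statement bank (the study used ~330 statements);
# sentiment tags are assigned by hand, and a statement may carry more than one.
STATEMENT_BANK = [
    {"text": "My current job is too stressful",
     "sentiments": {"negative"},
     "verified_answer": "I'm sorry to hear that. Is there anyone you could talk to?"},
    {"text": "My parents are coming to visit this weekend, I'm very excited.",
     "sentiments": {"positive"},
     "verified_answer": "That's great! Are you excited to spend time with them?"},
    {"text": "I'm moving to a new city next month",
     "sentiments": {"positive", "negative"},
     "verified_answer": "That's a big change! How are you feeling about it?"},
]

def draw_session(bank, n=10, seen=frozenset(), rng=random):
    """Draw up to n statements the participant has not yet seen."""
    fresh = [s for s in bank if s["text"] not in seen]
    return rng.sample(fresh, min(n, len(fresh)))
```

Because every statement and verified answer is authored and reviewed in advance, the LLM is never asked to invent a leading statement at run time.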
Real-Time Grading of the Participants' Responses
Each participant response is graded in real time. Because there is no single appropriate way to display empathy (Hill, 2009; Rogers, 1980), automatic live AI grading is critical to account for the breadth of correct participant responses. Noora’s feedback includes: (1) classifying participant responses as correct or incorrect based on whether they showed an appropriate level of empathy; and (2) providing specific comments to explain why the participant’s answer was appropriate or inappropriate. Our approach leverages the in-context learning ability of LLMs: we prompt the LLM with instructions and a few representative examples of how to provide feedback on participant responses live. These examples were themselves created initially with the help of an LLM: we used the LLM to simulate users with different personalities to obtain a diverse variety of possible responses, ranging in the level of empathy shown as well as in conversational quality. We then manually selected the more difficult cases and crafted the desired feedback we wanted the LLM to learn from in each scenario. This latter step is necessary because, without example behavior, LLMs fail to generate feedback that reflects realistic conversational behavior. The model used for in-context learning was gpt-4-0613, made available through Microsoft Azure OpenAI Services.
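An in-context learning setup of this kind amounts to assembling a chat prompt containing instructions plus hand-crafted example exchanges. The sketch below shows that assembly only; the instruction wording, example pair, and message layout are assumptions, not the study's actual prompt:

```python
# Hypothetical few-shot prompt assembly for live empathy grading.
# The study used gpt-4-0613 via Microsoft Azure OpenAI Services; the
# specific text and structure here are illustrative.

SYSTEM_INSTRUCTIONS = (
    "You are grading a learner's reply to a leading statement for empathy. "
    "Label the reply CORRECT or INCORRECT and briefly explain why, in an "
    "encouraging tone."
)

# Hand-picked difficult cases with hand-crafted target feedback (see text).
FEW_SHOT_EXAMPLES = [
    {
        "statement": "My current job is too stressful.",
        "reply": "Well mine isn't.",
        "feedback": "INCORRECT. Warm, but not quite there. Your answer seems "
                    "like you may not care about me.",
    },
]

def build_grading_messages(statement: str, reply: str) -> list:
    """Assemble the chat messages sent to the model for one grading call."""
    messages = [{"role": "system", "content": SYSTEM_INSTRUCTIONS}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({
            "role": "user",
            "content": f"Statement: {ex['statement']}\nReply: {ex['reply']}",
        })
        messages.append({"role": "assistant", "content": ex["feedback"]})
    messages.append({
        "role": "user",
        "content": f"Statement: {statement}\nReply: {reply}",
    })
    return messages
```

The resulting message list would be sent to the Azure-hosted chat-completions endpoint; the platform's toxicity filtering (described earlier) applies on top of this.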
Providing a Sample Verified Answer
In addition to live grading, Noora also generated live, immediate, and tailored feedback to explain to the participant why their response was successful or not. If a participant’s reply needed improvement, alongside this live feedback, Noora showed an example of a verified correct response. The verified responses were crafted by the research team manually, rather than generated live by the LLM to eliminate the possibility of inappropriate responses. This sample answer would be shown after the constructive feedback Noora had computed live (see Fig. 1).
Examples of Using Noora
When Noora provided a leading statement, such as the negative statement “My current job is too stressful,” and the participant responded with, for example, “I’m sorry to hear that, is there anyone you could talk to?,” Noora would record that answer as correct and provide feedback to the participant, for example, “That was a great response. It showed your concern and offered a helpful suggestion.” In contrast, if the participant responded with an answer that lacked empathy, such as “Well mine isn’t,” Noora would record that answer as needing improvement (lacking empathy) and suggest a better response, for example, “Warm, but not quite there. Your answer seems like you may not care about me. A better response might be ‘I’m sorry to hear your current job is difficult. Are you looking for a new one?’” The alternative response was always the verified answer written by the research team to ensure it was correct. As a second example, Noora might provide a positive statement: “My parents are coming to visit this weekend, I’m very excited.” If the participant responded with an empathetic answer such as “That’s great. What are you going to do?,” Noora would record that as correct and provide the participant with positive feedback. If, however, the participant responded with a non-empathetic response such as “I hate weekends, I’m always so bored,” Noora recorded that as incorrect and provided feedback such as, “Almost there. This response is too negative. You should show interest and engage in a conversation about my parents’ visit instead of shifting it to your personal feelings about weekends. A better reply might’ve been: ‘That’s great! Are you excited to spend time with them?’” Confetti on the screen followed each correct answer. Noora always started its feedback for incorrect responses with a motivating phrase like “almost there” or “close” to encourage participants.
Reliability of the Human Conversation Sample (Generalization)
Following completion of the study, one-third of the recordings (5 pre-intervention and 5 post-intervention recordings from each group) were scored for reliability. The 20 recordings were randomly selected via an online random number generator and were scored by an independent evaluator who held a PhD in special education and had more than 10 years of experience working in the area of autism. The reliability scorer was unaware of the condition to which each participant was assigned. Reliability, calculated by dividing the sum of agreements by the total number of agreements and disagreements, multiplied by 100, was 88% (range 83–100%) for the 20 recordings. Reliability on the additional two probes collected by an individual who was aware of the participants’ experimental condition was 86% and 100%.
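The point-by-point agreement formula stated above is straightforward to express in code; this is simply the computation described in the text, not the study's actual scoring script:

```python
def percent_agreement(agreements: int, disagreements: int) -> float:
    """Point-by-point reliability: agreements / (agreements + disagreements) * 100."""
    return agreements / (agreements + disagreements) * 100
```

For example, a recording on which two scorers agree on 22 of 25 scored opportunities yields 88% agreement.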
Reliability of AI Program Feedback
To assess the accuracy of the AI program’s feedback, 300 participant responses were randomly selected from the 3988 total responses (7.5%), 20 from each participant: 10 responses that Noora evaluated as “correct” and 10 responses that Noora evaluated as “incorrect.” For the 3 participants (20%) who had fewer than 10 “incorrect” responses, additional “correct” responses were used to achieve a consistent 20 trials for each participant. Two MA-level individuals, one holding a degree in computer science and the other a degree in special education, independently scored the program using the leading statements, participants’ responses, and Noora’s AI-generated feedback. This analysis showed that the reliability observers agreed with Noora’s assessment of whether participants responded empathetically 85% of the time. Examples of disagreements included when Noora stated, “I’ve been feeling really stressed out lately” and the participant responded with, “Sorry to hear that. Me too.” While the human evaluators agreed that this was a correct participant response, Noora evaluated this as incorrect and offered the following suggestion: “Almost there! By saying ‘Me too’ you’re pivoting the focus onto yourself. It’s good to acknowledge my feeling first, before sharing yours, so it doesn’t seem like you’re ignoring mine.” Or, when Noora stated, “I need to pay my rent this week” and the participant responded with, “What are you renting? Is it hard to pay your rent sometimes? You must be very grown up; paying for yourself.” For this example, Noora suggested, “Close! You’ve asked too many questions at once, which can be overwhelming. A more simplistic approach could have been beneficial. Also, suggesting the person is ‘grown up’ for paying rent can come off as condescending, so it’s best to avoid such assumptions or judgmental language,” whereas the reliability scorers felt this was an adequate participant response.
In addition to scoring the accuracy of the program in identifying adequate empathy in participant responses, the program was also evaluated on whether its feedback was satisfactory. For the 300 selected responses, the raters gave a binary “yes” or “no” label for whether the feedback was: (1) Relevant (was Noora’s feedback relevant to the participant’s response?); (2) Reasonable (was the feedback helpful for the participant?); and (3) Toxic (did Noora provide any toxic, inappropriate feedback to the participant?). Scorers had to find no issue in any of the three areas for the feedback to be considered adequate (“yes”); if the scorers found any issue, the feedback was labeled “no” for satisfactory. On this combined measure, 95% of Noora’s feedback was rated satisfactory. Importantly, no responses were found to be toxic. The issues raters did find were that Noora’s feedback was occasionally irrelevant or unreasonable. For instance, in response to the leading statement, “I’m in the mood for some relaxing music. Any suggestions?” the participant replied, “I love classical music! I’ve always been very passionate about it.” Noora responded, “Almost. While it’s great to share your passion for classical music, it doesn’t directly address my request for suggestions. Instead of telling me what you love, suggest some specific classical compositions or artists that I might enjoy based on my stated desire for relaxing music,” which the raters found to be unreasonable feedback. Another example involved the leading statement, “I can’t wait to travel the world and see all the beautiful places.” In this case, the participant replied, “That was great I wish I could travel the world to the china,” and Noora responded, “Close! You show support and share in my excitement. However, be more clear in your statement. Are you wishing to join me or is this a separate wish of your own? Making sure your intentions are clear can improve the conversation.” The raters found this to be unreasonable feedback for an otherwise empathetic response.
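The combined "satisfactory" label described above reduces to a simple conjunction over the three rubric dimensions; the function name below is illustrative:

```python
def feedback_satisfactory(relevant: bool, reasonable: bool, toxic: bool) -> bool:
    """Feedback counts as satisfactory only if it is relevant, reasonable,
    and not toxic; any single issue yields an overall "no"."""
    return relevant and reasonable and not toxic
```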
Post Intervention
At the end of the approximately 4-week period, the conversation sample (generalization) was collected in exactly the same manner as pre-intervention, using similar but different probes for all participants, regardless of group. As well, the comfort survey was completed again by participants in both groups. Additionally, for participants in the experimental group, the questionnaire included five additional questions on their satisfaction with Noora.
Results
The first question asked in this study was whether participants’ empathetic responses would improve within the AI program itself after practice. For this measure, data were analyzed from each participant’s usage of the AI program, collected by Noora on a daily basis, from their start date until their post-intervention conversation sample, approximately 4 weeks of usage (a few participants used the program on weekends for extra practice or continued to use the program until they had availability for their conversation sample). The first 50 responses were compared with the last 50 responses. Twelve participants used the app for 19 to 38 days (average 25.75 days, SD = 4.95), completing between 259 and 389 total trials. One participant used the app for 13 days, completing 153 trials, and one participant used the app for 12 days, completing 146 trials. One participant used the app for only 6 days, completing 55 trials; this participant was not included in this analysis. Some participants completed slightly more or slightly fewer than 10 daily responses (range 9–15), so the first and last 50 responses are approximately representative of the first and last week of usage. The results of this first analysis showed that 71% of the participants (10 of the 14) demonstrated an improving trend from their first trials of practice to their last, with an average improvement of 13.2% (SD = 9.3%), which equates to around 6.6 more correct trials. One participant had the same number of correct responses in their first and last 50 responses. The remaining 3 participants had more incorrect responses in the last 50 responses: two participants had 2% more incorrect responses, equating to 1 more incorrect response, and one participant had 12% more incorrect responses, equating to 6 more incorrect trials.
Thus, almost all of the participants (79%) improved or maintained their accuracy in their responses from the approximate first week to the last week of intervention.
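The first-versus-last-50 comparison above can be sketched as follows, assuming each participant's trials are logged in order as correct/incorrect; the function and variable names are illustrative, not the study's actual analysis code:

```python
def accuracy_change(trials: list, window: int = 50) -> float:
    """Percentage-point change in accuracy from the first `window` trials
    to the last `window` trials. `trials` is an ordered list of booleans
    (True = correct, False = needs improvement)."""
    first = sum(trials[:window]) / window * 100
    last = sum(trials[-window:]) / window * 100
    return last - first
```

On this scale, the reported average improvement of 13.2 percentage points over a 50-trial window corresponds to about 6.6 additional correct trials (13.2% of 50).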
The second, and most important, question asked in this study was whether the empathetic responses targeted in Noora would generalize to social conversation. The statistical analysis for this measure was completed by an independent statistician who was provided with the pre- and post-intervention scores for both groups. The pre-intervention mean for the experimental group was 16.67% correct (SD = 19.15) and the pre-intervention mean for the control group was 28.87% correct (SD = 22.34). After approximately 4 weeks of practice with Noora, the post-intervention mean for the experimental group was 50.94% correct (SD = 36.25), and the post-intervention mean for the waitlist control group was 31.40% correct (SD = 27.05). The mean gain for the experimental group was 37.67% (SD = 28.34), whereas the mean gain for the waitlist control group was 2.53% (SD = 21.28). A Mann–Whitney U test was employed to compare the relative gains between the experimental group and the waitlist control group on the conversation sample following the approximately 4-week intervention; the U statistic was used rather than an independent t-test because the scores were not normally distributed. There was a significantly greater gain by the experimental group compared to the waitlist control group (U = 35.0; p < 0.01; df = 28). A z transformation was used to calculate the effect size [Z/√(n1 + n2)], which showed a medium to large effect size (0.62).
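The statistics above can be reproduced with any standard package; as a minimal pure-Python sketch, the Mann–Whitney U can be computed from pair counts, with a normal-approximation z and the effect size r = |z|/√(n1 + n2) noted in the text. This sketch omits the tie correction, so it is an illustration rather than a replacement for a statistics library:

```python
import math

def mann_whitney(x: list, y: list):
    """Mann-Whitney U via pair counting (ties counted as half),
    normal-approximation z without tie correction, and the effect
    size r = |z| / sqrt(n1 + n2)."""
    n1, n2 = len(x), len(y)
    u1 = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    u = min(u1, n1 * n2 - u1)                  # report the smaller U
    mu = n1 * n2 / 2                           # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    r = abs(z) / math.sqrt(n1 + n2)            # effect size as in the text
    return u, r
```

For two fully separated samples of 3 values each, U is 0 and r is about 0.80; with the gain scores from the two groups of 15 here, this computation approaches the reported effect size.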
The third question asked in this study was whether the participants’ self-reported comfort levels would improve following intervention. Table 3 shows the self-reports on the Confidence Questionnaire for both groups. The results were somewhat mixed on this questionnaire, with some slight improvements in the experimental group. For example, following intervention, more individuals in the experimental group reported feeling “very confident” and fewer reported feeling “very insecure” or “insecure.” Regarding exiting a conversation (an area that was not targeted), both groups were similar, showing little change. More individuals in the experimental group felt “confident” or “very confident” asking questions to maintain a conversation, and fewer in the experimental group felt insecure in this area. More participants in the experimental group reported feeling “very confident” in responding to others’ fortune or misfortune (a targeted area) and fewer felt “insecure” compared to the control group. Confidence with overall social conversation skills was somewhat similar between the groups, although fewer participants in the experimental group reported feeling “insecure” following intervention. Finally, regarding anxiety during conversation, scores remained relatively stable; however, more participants in the experimental group reported feeling “confident” and fewer reported feeling “insecure” in regard to anxiety following intervention.
Table 3
Pre- and post-intervention scores for each group on the confidence during conversation questionnaire

                           Experimental               Waitlist control
                           Pre (n/%)    Post (n/%)    Pre (n/%)    Post (n/%)
How confident do you feel starting a conversation?
  Very insecure            1 (7%)       1 (7%)        0 (0%)       0 (0%)
  Insecure                 3 (20%)      4 (26.7%)     4 (26.7%)    5 (33.3%)
  Somewhat confident       8 (53.3%)    4 (26.7%)     8 (53.3%)    4 (26.7%)
  Confident                2 (13%)      3 (20%)       3 (20%)      6 (40%)
  Very confident           1 (7%)       3 (20%)       0 (0%)       0 (0%)
How confident do you feel exiting a conversation?
  Very insecure            0 (0%)       1 (7%)        0 (0%)       1 (7%)
  Insecure                 4 (26.7%)    1 (7%)        4 (26.7%)    5 (33.3%)
  Somewhat confident       4 (26.7%)    7 (46.7%)     10 (66.7%)   5 (33.3%)
  Confident                4 (26.7%)    4 (26.7%)     1 (7%)       3 (20%)
  Very confident           3 (20%)      2 (13%)       0 (0%)       1 (7%)
How confident do you feel asking questions to maintain a conversation?
  Very insecure            3 (20%)      0 (0%)        0 (0%)       1 (7%)
  Insecure                 4 (26.7%)    0 (0%)        4 (26.7%)    3 (20%)
  Somewhat confident       5 (33.3%)    6 (40%)       8 (53.3%)    5 (33.3%)
  Confident                2 (13%)      3 (20%)       3 (20%)      4 (26.7%)
  Very confident           1 (7%)       6 (40%)       0 (0%)       2 (13%)
How confident do you feel responding to others' fortune or misfortune?
  Very insecure            2 (13%)      1 (7%)        0 (0%)       1 (7%)
  Insecure                 3 (20%)      3 (20%)       3 (20%)      5 (33.3%)
  Somewhat confident       3 (20%)      2 (13%)       6 (40%)      5 (33.3%)
  Confident                5 (33.3%)    5 (33.3%)     5 (33.3%)    3 (20%)
  Very confident           2 (13%)      4 (26.7%)     1 (7%)       1 (7%)
How confident do you feel overall with your social conversation skills?
  Very insecure            2 (13%)      0 (0%)        0 (0%)       1 (7%)
  Insecure                 4 (26.7%)    2 (13%)       5 (33.3%)    2 (13%)
  Somewhat confident       5 (33.3%)    6 (40%)       6 (40%)      7 (46.7%)
  Confident                3 (20%)      5 (33.3%)     4 (26.7%)    3 (20%)
  Very confident           1 (7%)       2 (13%)       0 (0%)       2 (13%)
How anxious are you when engaging in conversation with your peers?
  Very insecure            1 (7%)       0 (0%)        0 (0%)       1 (7%)
  Insecure                 2 (13%)      2 (13%)       2 (13%)      2 (13%)
  Somewhat confident       5 (33.3%)    3 (20%)       6 (40%)      6 (40%)
  Confident                3 (20%)      8 (53.3%)     3 (20%)      3 (20%)
  Very confident           4 (26.7%)    2 (13%)       4 (26.7%)    3 (20%)
The fourth aim of this study was to evaluate the acceptability of the intervention from the perspective of the participants at post-intervention. As can be seen in Table 4, 67% of the participants reported that they “enjoyed” or “completely enjoyed” the program, 33% were neutral, and none reported disliking the program. Regarding their improvement of social conversation skills, 67% reported being “satisfied” or “very satisfied,” 33% reported feeling neutral, and none reported dissatisfaction. Nearly all participants (94%) reported that they would recommend Noora to others, with one participant disagreeing. A large majority (74%) of the participants reported that they were able to use the social skills regularly in social conversation, two (13%) were neutral, and two participants (13%) disagreed. Finally, 86% of participants rated their overall experience with the software as either “very positive” or “positive” and two participants (13%) rated it as neutral.
Table 4
Noora participant acceptability questionnaire

Question and response                        n     %
How much did you enjoy participating in this intervention?
  Completely disliked                        0     0
  Disliked                                   0     0
  Neutral                                    5     33
  Enjoyed                                    3     20
  Completely enjoyed                         7     47
How satisfied are you with your improvement in social conversation skills?
  Very dissatisfied                          0     0
  Dissatisfied                               0     0
  Neutral                                    5     33
  Satisfied                                  6     40
  Very satisfied                             4     27
I would recommend this intervention to others
  Strongly disagree                          0     0
  Disagree                                   1     7
  Neutral                                    0     0
  Agree                                      7     47
  Strongly agree                             7     47
I was able to use the social skills regularly and they fit with my social conversation
  Strongly disagree                          0     0
  Disagree                                   2     13
  Neutral                                    2     13
  Agree                                      4     27
  Strongly agree                             7     47
Please rate your overall experience using the software
  Very negative                              0     0
  Negative                                   0     0
  Neutral                                    2     13
  Positive                                   5     33
  Very positive                              8     53
Listwise Deletion
There was no attrition in the study. All participants in the experimental group completed the intervention and all participants in both groups completed all the measures.
Missing Data
There was no attrition in either group.
Unintended Effects
There were no reported unintended effects in this trial.
Discussion
The findings of this study demonstrated that AI, in particular Large Language Models (LLMs), can be used to provide live, actionable improvement in empathetic responding in verbal autistic adolescents and adults. The current study showed that a short-term, AI-assisted intervention resulted in significant improvements in verbal empathy compared to a waitlist control group, both during usage and in a subsequent generalization conversation sample, with high participant acceptability. The results raise several issues. First, these types of computerized conversational agents may provide a more controlled and less socially demanding format than face-to-face interventions, which is often preferable for autistic individuals (Dubois-Sage et al., 2024). Our AI program, Noora, allowed for practice in a self-paced context, which is not always possible with face-to-face interventions and thus can be less stressful for autistic users. In addition, there are very few programs available to autistic adolescents and adults, and parents find programs difficult to obtain, inconsistent, inadequate, and expensive (Marsack-Topolewski & Weisz, 2020). Many autistic individuals report loneliness and a desire for social interaction (Bennett et al., 2018; Mendelson et al., 2016), and levels of employment are low (often for social reasons) among this group (Wehman et al., 2014). While face-to-face interventions can be helpful in improving social interaction in adolescents and adults (Koegel et al., 2016), they can be costly and opportunities may be limited, depending on access to services. AI programs can be accessible to the general public and provide a low-cost educational option (Alam, 2021) for individuals who lack resources or access to qualified providers.
As well, AI programs can be used in the comfort of the user’s home or other desired location at the user’s convenience and have the potential to provide increased practice in important areas, thereby improving clinical outcomes (Ghafghazi et al., 2021).
Generalization is another issue that has been raised in the literature. Many in-person interventions do not provide sufficient practice for generalization of goals, particularly given limited clinician availability and lack of access to services. According to a systematic review (Kewalramani et al., 2023), most published studies have shown progress in a clinical setting only; thus, there is a need for assessment of generalization to natural contexts. Some computer-assisted programs for understanding emotions have resulted in generalization of gains to other settings and communicative partners (Whalen et al., 2006), suggesting optimism for these formats, but additional research with generalization probes is needed. Another issue relates to dosage. Results in our study, and others, suggest that participants required a low number of trials over a short period of time. Given that greater usage of other tools has been associated with greater gains (Silver & Oakes, 2001), assessing correlations between outcome gains and usage will be important. As well, mediators and moderators of successful outcomes are areas that warrant further research.
The results of this study necessitate a discussion of the neurodiversity movement. Many autistic individuals feel perfectly comfortable with their communication and assert that a double empathy problem exists (Rizvi et al., 2024). We fully support the notion that individuals with any condition should be included and accepted, and that many autistic individuals prefer accommodation rather than intervention (Whelpley et al., 2023); however, many autistic individuals seek to improve their social communication. To gain an understanding of the acceptability of our AI program, we recruited autistic individuals who provided feedback prior to the start of our randomized clinical trial. During assent and consent, 51 of 53 participants verbalized their desire to improve their socialization. Further, we only included individuals who demonstrated substantial challenges with responding empathetically during social conversation. Finally, our AI program accepted a variety of different responses as correct, reflecting that there is no single normative response that is considered correct. For the most part, when a participant responded in a relevant and sympathetic or supportive manner, they were told what was good about their response and were not given unnecessary feedback. The improvement in participants’ empathetic responses in the program and in a generalization conversation sample, along with their positive ratings of the program, suggests that the AI program used in this study was helpful. As well, many of the participants asked to continue using the program following completion of the study, suggesting they enjoyed the program and found it useful. AI programs have also been reported to be as enjoyable, or more enjoyable, than in-person programs, with low levels of unintended escape and avoidance behaviors, further suggesting the potential benefits of AI (Jaliaawala & Khan, 2020).
To provide some context for the results, we highlight some participant responses. The most improved participant in the experimental group went from a pre-score of 20% to a post-score of 100% appropriate empathetic responses. In their first 50 trials with Noora, they responded with empathy on 88% of trials, and they achieved 100% in their last 50 trials. In the first 50 trials, some of their responses were minimal: responding with “I hear ya” to the leading statement “My aunt wants to visit but I’m nervous because my house is too small to host her for a week,” or with “oh that’s too bad” to the leading statement “I’m struggling with self-doubt and lack of confidence.” In the last 50 trials, these responses improved dramatically: when Noora started with the leading statement “My cat knocked over my favorite vase and completely broke it,” they responded with, “That is too bad. Have you been looking for a replacement? You may like it even more,” and when Noora provided the leading statement, “I accidentally stepped on my dog’s tail and he’s in pain,” they responded with, “Sorry to hear that. Is he feeling any better now?” Similarly, another participant who improved from a pre-score of 20% to a post-score of 100% responded with “Um, um, um, I don’t know what to say. Is that bad?” to the leading statement “I twisted my ankle running and it really hurts” during the pre-intervention conversation sample. Following the use of Noora for approximately 4 weeks, they responded with, “Oh no. I’m so sorry to hear that, does she need to have surgery?” to the leading statement “My best friend fell off her horse and broke her shoulder last week” in the post-intervention conversation sample. These types of qualitative improvements in empathetic responses were frequent in the experimental group.
It was also interesting that participants were overwhelmingly successful in identifying the leading statement sentiment as “positive,” “neutral,” or “negative” across their trials. Across all participant trials, the sentiment was correctly identified an average of 92.73% of the time (median: 93.71%, SD: 6.18%). This suggests that participants were able to accurately identify the sentiment of a statement but had difficulty verbalizing an empathetic response to it. For many, the verbal empathetic response may be more difficult than the cognitive or affective component. Alternatively, emotional recognition with a combination of stimuli, such as voice differences and facial expressions, may be more difficult than a program that contains fewer external stimuli (Golan & Baron-Cohen, 2006).
Another strength of Noora is that we used an LLM in a targeted manner to generate leading statements. By manually verifying and editing the generated suggestions before the study began, we were able to reach 330 prompts with a wide variety of topics and structures without compromising their correctness and appropriateness. To ensure the accuracy of Noora’s feedback, we recorded and hand-sampled responses during the trial phase, using the feedback to improve the in-context learning of the GPT-4 model before final deployment. This refinement led to feedback that was more personalized and accurate for autistic participants, guiding responses to better align with our expectations. Importantly, Noora was not designed as a general-purpose chatbot but as a task-specific tool for grading user responses. By focusing on this narrow, problem-specific domain, we could ensure a more controlled and reliable interaction, avoiding the complexities and risks associated with open-domain chatbots. This approach made it a more effective use case for AI-driven feedback. Further, Noora allowed participants to choose speech-to-text or textbox input. Nine of our experimental participants exclusively or almost exclusively used the textbox, five almost exclusively used the voice input, and one used a combination of both. We did not find any significant correlations between response modality and outcomes, suggesting that either method can be helpful.
There are several limitations to the current study. First, a waitlist group served as our control. We did not compare our AI program directly with a face-to-face intervention; however, effective face-to-face interventions served as the basis for the targeted area in this study (Koegel et al., 2016), suggesting that AI may be an effective alternative. Second, although there are relatively few studies with adolescents and adults using AI, some studies with autistic children have provided voice feedback for correct responses to the emotional expressions of a robot, and some have used computerized programs with text-to-speech functions and output generation for detection of accurate emotional responses (Abu-Amara et al., 2021), noting that these programs can be challenging to use and may not accurately understand the user (Catania & Garzotto, 2023). The AI program in the current study was largely accurate in terms of understanding and providing relevant feedback, as reported in our reliability analysis. Our participants were conversational at baseline and did not have intelligibility/articulation issues, so for the most part there were no problems with Noora understanding the participants’ responses. However, a few participants used incorrect grammar at times, which Noora had difficulties with: when Noora had a slight issue with a participant response, it nearly always provided helpful feedback about a spelling or grammar error (95% feedback adequacy) but would sometimes label the entire response as incorrect even when human raters recorded a correct overall attempt at empathy (85% grading accuracy). We consider Noora’s helpful feedback on grammar and spelling a positive, and we often found such suggestions reasonable and relevant; however, future improvements are needed to ensure small errors do not overly affect the overall grading.
In face-to-face interventions, clinicians often reward attempts rather than following a strict shaping paradigm (Koegel et al., 2016), while Noora tended to be stricter with feedback; future upgrades to Noora’s in-context learning should reduce this issue. Moreover, while fidelity of implementation is now common in treatment articles, few provide the level of detail that we were able to assess with Noora. Next, there have been concerns that LLMs may produce hallucinations that autistic users believe. While we believe we successfully mitigated this issue by designing a non-open-domain chatbot, pre-writing the leading statements and verified answers, and using Microsoft Azure OpenAI Services to achieve our 100% non-toxic feedback reliability rating, further work can explore whether the small number of errors in grading accuracy and feedback may negatively affect this demographic. Also, dosage-dependent effect sizes are always a consideration in treatment research (Minjarez et al., 2024; Virués-Ortega, 2010). We arbitrarily chose a low number of daily trials in this study. While our results were significant and most participants made large gains, about a third of the participants made small improvements. We pilot tested a few of the participants who made no or small gains by offering them an additional 4 weeks of usage, and all continued to make additional gains. Further research to determine whether more practice or an increased number of daily trials would be beneficial, particularly for the low responders, may be fruitful. Finally, we did not measure distal outcomes but focused on acquisition and generalization. The ultimate impact of the intervention in regard to employment and relationships is an important area for future research.
In summary, AI has the potential to improve social-emotional learning; however, very few programs have been empirically validated using a comparison or controlled experimental design. It is of concern that there is a growing gap between the computer science and psychology communities in developing effective AI-assisted interventions (Jaliaawala & Khan, 2020; Wright, 2024) when these types of programs may greatly benefit users. The capability of AI to leverage user data and provide consistent feedback suggests a promising avenue for precision treatment and is particularly relevant given that many autistic adults report that AI, and in particular LLMs, are helpful for social situations (Choi et al., 2024). Thus, the potential benefits of AI for underserved populations, including autistic adolescents and adults, are substantial considering the long waitlists, overburdened clinicians, and paucity of qualified providers.
Acknowledgments
The authors express gratitude for funding from the Kind World Foundation, the Verdant Foundation, Stanford’s Institute for Human-Centered Artificial Intelligence (HAI), the Lucille Packard Foundation Auxiliary, anonymous match donors, Microsoft Azure, and Vercel for making this project possible. We appreciate the assistance of Lily Wallace, Emma Sturm, Eva Harte, Jackie Palovicino, Michael Aschkenasy, and Jimmy Kerr in testing the module, and of Shriya Dwivedi with the literature review. A special thank you to Robert Koegel, who provided feedback throughout the study. Finally, we greatly appreciate the autistic individuals and their families who provided initial feedback before the start of the study and the adolescents and adults who participated in our research.
Declarations
Conflict of interest
Lynn K. Koegel is the editor of JADD. She was not involved in the editorial process. Lynn Koegel is a partner in the private company Koegel Autism Consultants, LLC. The other authors declare no conflicts of interest.
Ethical Approval
This study was approved by the medical research ethics committee of the Stanford School of Medicine and carried out according to the latest version of the Declaration of Helsinki of 1975. All adult participants signed informed consent, parents of minor participants signed informed consent, and all minor participants signed informed assent.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.