Non-professional programmers are individuals who have basic knowledge of computation (e.g., mathematical operations, linear algebra) but much less than a computer science major or professional engineer. Their limited introductory programming experience makes it hard for them to write and debug source code themselves.
LLMs have demonstrated potential in translating natural language instructions into code. Users can interact with LLMs (e.g., ChatGPT) by posing programming questions and optionally providing input-output examples to specify requirements. When errors are identified or the generated code fails to meet the specified criteria, users often follow up with feedback that prompts the LLM to refine its code. Despite the simplicity of this workflow, the difficulty of accurately pinpointing and articulating errors in the generated code makes it challenging for non-professional users to provide meaningful corrective feedback (Figure 1 Right).
In this work, we present IntelliExplain, which offers a novel human-LLM interaction paradigm that enhances non-professional programmers' experience by enabling them to interact with source code via natural language (NL) explanations (Figure 1 Left). Users interact with IntelliExplain by providing NL corrective feedback on errors they identify from the explanations. The system uses this feedback to revise the code until the user is satisfied with the system's explanation of the code.
The most straightforward approach is to ask the LLM to explain its predicted code. However, this vanilla approach often results in explanations that are lengthy and too technical for non-professional programmers. To address this limitation, we propose two distinct styles of program explanation: Question Restatement for text-to-SQL and Concise Description for Python code generation.
In our preliminary experiments, we observed that a significant portion of LLM errors in text-to-SQL stemmed from a misunderstanding of concepts in the original question. Such mistakes can hardly be caught from a vanilla explanation of the SQL code, which is often full of technical jargon that distracts users from identifying the concepts involved. Observing this challenge, we instead propose to use a "restated question" derived from the source code as the explanation for text-to-SQL programming. A restated question is an NL question generated by the LLM to describe the intent of the model-generated code.
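To make this concrete, a restated question can be elicited from the backend LLM with a short prompt. The sketch below is only an illustration of the idea; the prompt wording and the restate_sql helper are our own assumptions, not IntelliExplain's actual implementation:

# Hypothetical sketch: eliciting a "restated question" for a generated SQL
# query. The prompt wording is illustrative, not IntelliExplain's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def restate_sql(sql: str, schema: str) -> str:
    """Ask the LLM to describe the intent of a SQL query as an NL question."""
    prompt = (
        "Given the database schema and the SQL query below, write the natural "
        "language question that the query answers. Avoid SQL jargon.\n\n"
        f"Schema:\n{schema}\n\nSQL:\n{sql}\n\nRestated question:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

A user who originally asked, say, "How many singers are from France?" can then simply check whether the restated question matches their intent, without reading any SQL.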
In our exploration, we observed that question restatement is most effective for short code snippets, such as SQL queries, and can effectively surface conceptual misunderstanding errors. However, it cannot capture inner logical errors in scenarios involving lengthy generated code and intricate coding tasks, especially when the LLM generates inaccurate code despite correctly understanding the input question. This motivates a more fine-grained view of the code's inner logic. We achieve this by proposing a concise description of the source code, striking a balance between the succinct question restatement and a technical, lengthy line-by-line explanation.
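A concise description can be elicited analogously. Again, this is a sketch under our own assumptions (the prompt wording and helper name are hypothetical), reusing the client from the sketch above:

# Hypothetical sketch: eliciting a "concise description" for generated Python
# code. The prompt wording is illustrative, not IntelliExplain's actual prompt.
def describe_code(code: str) -> str:
    """Ask the LLM for a short, plain-language summary of the code's logic."""
    prompt = (
        "Summarize the logic of the following Python code in a few short, "
        "non-technical sentences. Do not walk through it line by line, and "
        "avoid programming jargon.\n\n" + code
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()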
To utilize our designed explanations for assisting non-expert programmers in coding tasks, we introduce an interaction paradigm consisting of five steps: (1) the user asks a coding question and provides any context necessary for answering it; (2) the LLM predicts an initial code answer; (3) the LLM generates an NL explanation of the current code; (4) the user judges the explanation to determine whether the code is correct and, if any error is found, provides NL feedback for error correction; and (5) the LLM refines its answer based on the user feedback. Steps 3-5 repeat until the user can find no more errors in the explanation.
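The loop below sketches this paradigm end to end. The helper names and prompts are our own labels for the five steps, not IntelliExplain's actual code:

# Hypothetical sketch of the five-step interaction paradigm described above.
# Helper names and prompts are our own illustration, not IntelliExplain's code.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def interaction_loop(question: str, context: str) -> str:
    # Step 2: the LLM predicts an initial code answer.
    code = ask(f"Write code to answer this question.\nContext:\n{context}\nQuestion: {question}")
    while True:
        # Step 3: the LLM explains the current code in plain language.
        explanation = ask(f"Explain concisely, without technical jargon:\n{code}")
        print(explanation)
        # Step 4: the user judges the explanation; blank input means "correct".
        feedback = input("Describe any error you see (leave blank if correct): ")
        if not feedback.strip():
            return code
        # Step 5: the LLM refines the code based on the user's feedback.
        code = ask(f"Revise the code below to fix the issue.\nCode:\n{code}\nIssue: {feedback}")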
We evaluate IntelliExplain through a user study with 20 participants (IRB approved), using GPT-3.5 (turbo-0613) as the backend LLM. The study demonstrates that users with IntelliExplain achieve success rates 11.6% and 25.3% higher than with vanilla GPT-3.5, while requiring 39.0% and 15.6% less time, on text-to-SQL and Python code generation tasks, respectively.
An independent-samples t-test showed that the differences between the two groups are statistically significant in success rate (SQL: t=1.935, p=0.043; Python: t=2.361, p=0.021) and in time spent per question for text-to-SQL (t=-2.611, p=0.014), but not in time spent per question for Python code generation (t=-1.374, p=0.101).
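For reference, such group comparisons can be computed with a standard independent two-sample t-test. The sketch below uses synthetic placeholder numbers, not the actual study measurements:

# Illustrative independent two-sample t-test for comparing the two study
# groups. The arrays are synthetic placeholders, not the actual study data.
from scipy import stats

intelliexplain_success = [0.8, 0.7, 0.9, 0.6, 0.8]  # per-participant success rates
vanilla_success = [0.5, 0.6, 0.7, 0.4, 0.6]

t, p = stats.ttest_ind(intelliexplain_success, vanilla_success)
print(f"t = {t:.3f}, p = {p:.3f}")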
Our explanations generally align precisely with the generated code. In addition, over 50% of text-to-SQL errors and 70% of Python code generation errors in the source code can be found via our designed explanations, as shown in Table 4. We examine cases where errors cannot be easily identified from the explanation and find that this limitation arises from the challenge of encapsulating intricate inner logic in concise explanations. It highlights the difficulty of striking a balance between presenting concise, easy-to-understand NL explanations and presenting the more fine-grained inner logic of the code.
Table 4: Our designed explanation can capture errors in the source code. (a) Text-to-SQL; (b) Python code generation.
Throughout the user study, we observed a variety of feedback from participants using IntelliExplain, which can be broadly classified into three categories.
In Table 6, the success rates of different feedback types illustrate IntelliExplain's ability to integrate human feedback for error correction.
Despite IntelliExplain's success in incorporating human feedback, a notable gap remains in the success rates observed in the user study, especially in text-to-SQL. A closer examination of failed cases revealed several contributing factors.
We conducted a pilot study to determine whether our designed explanations and interaction paradigm remain effective with a more powerful LLM. Our findings indicate that all proposed mechanisms remain operational, with comparable success rates. Further analysis of the quality of explanations generated by different LLMs revealed that GPT-4 additionally produces an explanatory sentence on how to solve the problem in general (highlighted in green), which makes its explanations more comprehensible than GPT-3.5's.
We show one real conversation with IntelliExplain in Figure 3. With IntelliExplain, users can comprehend the source code via the NL explanation and more easily identify potential errors, and IntelliExplain makes corrections based on their feedback. In contrast, when interacting directly with code in vanilla GPT-3.5, non-professional programmers may struggle to understand the source code and fail to identify errors. Vanilla GPT-3.5 may also generate responses that are irrelevant to the user's question (Figure 3 Right, in red).
This project was sponsored by NSF SHF 2311468, the GMU College of Computing and Engineering, and the GMU Department of Computer Science. We thank the Office of Research Integrity and Assurance at GMU for reviewing and approving our Institutional Review Board (IRB) application. We also appreciate comments from students in the GMU NLP and SE labs.
@misc{yan2024intelliexplain,
  title={IntelliExplain: Enhancing Interactive Code Generation through Natural Language Explanations for Non-Professional Programmers},
  author={Hao Yan and Thomas D. LaToza and Ziyu Yao},
  year={2024},
  eprint={2405.10250},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}