IntelliExplain: Enhancing Interactive Code Generation through Natural Language Explanations for Non-Professional Programmers

{hyan5, tlatoza, ziyuyao}@gmu.edu
Department of Computer Science, George Mason University

IntelliExplain is an LLM-powered system that aids non-professional programmers in interactive code generation. It enables them to write code in natural language without requiring direct interaction with source code. The user starts with a question in natural language (NL), accompanied by relevant context (top). IntelliExplain then generates source code and confirms its understanding of the question by presenting an NL explanation to the user. When this understanding is incorrect, the user can provide corrective feedback in NL and instruct the system to correct the error.

Text-to-SQL Demo

Python Code Generation Demo

Updates

  • 2024-05: The code is coming soon!
  • 2024-05: ArXiv version is available. Please check out our preprint.

Motivation

Non-professional programmers are individuals who have basic knowledge of computation (e.g., mathematical operations, linear algebra) but much less than a computer science major or professional engineer. Their limited introductory programming experience makes it hard for them to write and debug source code themselves.


LLMs have demonstrated potential in translating natural language instructions into code. Users can interact with LLMs (e.g., ChatGPT) by posing programming questions and optionally providing input-output samples to specify requirements. When errors are identified or the generated code fails to meet the specified criteria, users often follow up with feedback prompting the LLM to refine the code solution. Despite the simplicity of this workflow, accurately pinpointing and articulating errors in generated code is difficult, which makes it challenging for non-professional users to provide meaningful corrective feedback (Figure 1 Right).


Figure 1: Comparison between using IntelliExplain and vanilla GPT-3.5 for interactive code generation.

In this work, we present IntelliExplain, which offers a novel human-LLM interaction paradigm to enhance non-professional programmers' experience by enabling them to interact with source code via natural language explanations (Figure 1 Left). Users interact with IntelliExplain by providing natural language corrective feedback on errors they identify from the explanations. The system uses this feedback to revise the code, iterating until the user is satisfied with the system's explanation of the code.


Natural Language Explanation

The most straightforward approach is to ask the LLM to explain its predicted code. However, this vanilla approach often results in explanations that are lengthy and too technical for non-professional programmers to read. To address this limitation, we propose two distinct styles of program explanation: Question Restatement for text-to-SQL and Concise Description for Python code generation.


Question Restatement (for Text-to-SQL)

In our preliminary experiments, we observed that a significant portion of LLM errors in text-to-SQL stemmed from a misunderstanding of concepts within the original question. However, such mistakes can hardly be captured from a vanilla explanation of the SQL code, which is often full of technical jargon distracting users from identifying the concepts involved in the code. Observing this challenge, we instead propose to use a "restated question" derived from the source code as the explanation for text-to-SQL programming. A restated question is an NL question generated by the LLM to describe the intent of the model-generated code.
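To make this concrete, a restated question can be produced with a second LLM call over the generated SQL. The sketch below is a minimal illustration assuming the OpenAI Python SDK's chat completions API; the prompt wording and the helper name restate_question are our own for illustration, not IntelliExplain's exact prompts.

from openai import OpenAI

client = OpenAI()

# Illustrative prompt: ask the LLM to restate the intent of its own SQL
# in plain English, avoiding SQL jargon, so a non-professional programmer
# can check it against their original question.
RESTATE_PROMPT = """Below is a database schema and a SQL query over it.
Restate, as a single plain-English question, exactly what this query answers.
Do not use any SQL terminology.

Schema:
{schema}

Query:
{sql}
"""

def restate_question(schema: str, sql: str) -> str:
    """Return an NL restatement of the intent of a generated SQL query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user",
                   "content": RESTATE_PROMPT.format(schema=schema, sql=sql)}],
    )
    return response.choices[0].message.content

The user then compares the restated question against their original question; a mismatch signals a conceptual misunderstanding without the user ever reading SQL.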


Table 1: Question Restatement for text-to-SQL

Concise Description (for Python Code Generation)

In our exploration, we observed that question restatement is more effective for short code snippets, like SQL queries, and can effectively address conceptual misunderstanding errors. However, it cannot capture inner logical errors in lengthy generated code and intricate coding tasks, especially when the LLM produces inaccurate code despite correctly understanding the input question. This calls for a more fine-grained view of the code's inner logic. We achieve this by proposing a concise description of the source code, striking a balance between the succinct question restatement and a technical, lengthy line-by-line explanation.
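The same idea can be sketched for Python: prompt the LLM for a short, step-level summary that sits between a one-sentence restatement and a line-by-line walkthrough. Again, this is a minimal sketch assuming the OpenAI chat completions API; the prompt and the helper name concise_description are illustrative, not the system's actual implementation.

from openai import OpenAI

client = OpenAI()

# Illustrative prompt: request a summary of the code's key logical steps,
# explicitly ruling out both line-by-line detail and heavy jargon.
DESCRIBE_PROMPT = """Describe what the following Python code does in a few
short sentences. Summarize the key logical steps so a reader with only basic
programming knowledge can follow the logic, but do not explain it line by
line and avoid technical jargon.

{code}
"""

def concise_description(code: str) -> str:
    """Return a concise, step-level NL description of generated Python code."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user",
                   "content": DESCRIBE_PROMPT.format(code=code)}],
    )
    return response.choices[0].message.content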


Table 2: Concise Description for Python Code Generation

Interaction Paradigm

To utilize our designed explanations for assisting non-expert programmers in coding tasks, we introduce an interaction paradigm consisting of five steps (sketched in code below):

  1. The user asks a coding question and provides the context necessary for answering it.
  2. The LLM predicts an initial code answer.
  3. The LLM generates an NL explanation for the current code.
  4. The user judges the explanation to determine whether the code is correct; if any error is found in the explanation, the user provides NL feedback for error correction.
  5. The LLM refines its answer based on the user feedback.

Steps 3-5 repeat until the user cannot find any more errors in the explanation.
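The loop can be summarized in code. The sketch below is a minimal illustration assuming the OpenAI chat completions API; the prompts and helper names (ask, interact) are our own simplifications, not the system's exact implementation, and in practice the explanation step would dispatch to question restatement or concise description depending on the task.

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo-0613"

def ask(messages: list[dict]) -> str:
    """One chat-completion call; returns the assistant's reply text."""
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

def interact(question: str, context: str) -> str:
    # Steps 1-2: the user's question (plus context) yields an initial code answer.
    history = [{"role": "user",
                "content": f"Context:\n{context}\n\nWrite code to answer:\n{question}"}]
    code = ask(history)
    history.append({"role": "assistant", "content": code})

    while True:
        # Step 3: generate an NL explanation of the current code.
        explanation = ask([{"role": "user",
                            "content": f"Explain this code concisely for a non-professional programmer:\n{code}"}])
        print("Explanation:", explanation)

        # Step 4: the user judges the explanation; empty feedback means accept.
        feedback = input("Feedback (press Enter if correct): ").strip()
        if not feedback:
            return code

        # Step 5: refine the code based on the user's NL corrective feedback.
        history.append({"role": "user",
                        "content": f"Your code was explained as: {explanation}\n"
                                   f"The user says: {feedback}\n"
                                   f"Revise the code accordingly."})
        code = ask(history)
        history.append({"role": "assistant", "content": code})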


Figure 2: Interaction paradigm of IntelliExplain

User Study and Results

We evaluate IntelliExplain in a user study with 20 participants (IRB approved), using GPT-3.5 (turbo-0613) as the backend LLM. Our user study demonstrates that users with IntelliExplain achieve success rates 11.6% and 25.3% higher than with vanilla GPT-3.5, while also requiring 39.0% and 15.6% less time, on text-to-SQL and Python code generation tasks, respectively.


Overall Performance

IntelliExplain enables users to achieve success rates 11.6% and 25.3% higher than the vanilla GPT-3.5 group in text-to-SQL and Python code generation, respectively. A t-test showed that the differences between the two groups are statistically significant in success rate (SQL: t=1.935, p=0.043; Python: t=2.361, p=0.021) and in time spent per question for text-to-SQL (t=-2.611, p=0.014), but not in time spent per question for Python code generation (t=-1.374, p=0.101).


Table 3: Overall performance using IntelliExplain compared to vanilla GPT-3.5

Our Designed NL Explanations Can Accurately Describe the Source Code

Our explanations generally align precisely with the generated code. In addition, over 50% of text-to-SQL and 70% of Python code generation errors in the source code can be identified via our designed explanations, as shown in Table 4. We delve into cases where errors cannot be easily identified from the explanation and find that this limitation arises from the challenge of encapsulating intricate inner logic into concise explanations. This highlights the challenge of striking a balance between presenting concise, easy-to-understand NL explanations and exposing the more fine-grained inner logic of the code.

(a) Text-to-SQL

(b) Python Code Generation

Table 4: Our designed explanations can capture errors in the source code.


Users Can Provide Effective Feedback Based on the NL Explanation

Throughout the user study, we observed a variety of feedback from participants using IntelliExplain, which can be broadly classified into three categories:

  • Instructions for Error Correction: Users can spot errors in the explanation and suggest how to fix them. This implies the effectiveness of our explanations in aiding non-professional programmers in code understanding and debugging.
  • Question Rephrasing: This type of feedback suggests that users perceive errors in the explanation, attributing them to the underspecified intent of the original question.
  • Step-by-Step Instruction: Users offer detailed step-by-step instructions to guide the model in solving the problem based on their understanding.
Table 5: Types and frequencies (for SQL/Python programming) of feedback provided by users. Errors mentioned in the explanation are marked in red. The diverse feedback received in the user study demonstrates the effectiveness of our explanations in aiding non-professional programmers in both code comprehension and debugging.

IntelliExplain Can Make Corrections Based on User Feedback

In Table 6, the success rates for different feedback types illustrate IntelliExplain's ability to integrate human feedback for error correction.


Table 6: Success rate of IntelliExplain for each feedback type. Percentages are calculated by dividing the number of successful error corrections for each feedback type by the total number of feedback messages of that type per conversation.

Despite IntelliExplain's success in incorporating human feedback, we observed a notable gap in the success rate from the user study, especially in text-to-SQL. A closer examination of failed cases revealed several contributing factors:

  • User feedback was sometimes too vague or abstract and lacked the specificity needed for precise corrections.
  • Misalignment between participants' mental models and the LLM's reasoning led to human feedback based on flawed assumptions.
  • Although explanations can capture small changes in the source code after previous rounds of error correction, users sometimes failed to recognize these changes, leading to mistakes in their subsequent feedback.

Performance of IntelliExplain with GPT-4 as Backbone LLM

We conducted a pilot study to determine whether our designed explanations and interaction paradigm remain effective with a more powerful LLM. Our findings indicate that all proposed mechanisms remain effective, with comparable success rates. Further analysis of the quality of explanations generated by different LLMs revealed that GPT-4 additionally generates an explanatory sentence on how to solve the problem in general (in green), making its explanations more comprehensible than GPT-3.5's.


Table 7: Explanations provided by GPT-3.5 and GPT-4 on the same question for the generated code.

Case Study

We show a real conversation using IntelliExplain in Figure 3. With IntelliExplain, users can comprehend the source code via the NL explanation and more easily identify potential errors, and IntelliExplain makes corrections based on their feedback. In contrast, when interacting directly with code in vanilla GPT-3.5, non-professional programmers may struggle to understand the source code and fail to identify errors. Vanilla GPT-3.5 may also sometimes generate responses that are irrelevant to the user's question (Figure 3 Right, red background).


Figure 3: Case study from Python Code Generation using IntelliExplain and vanilla GPT-3.5.

Acknowledgments

This project was sponsored by NSF SHF 2311468, GMU College of Computing and Engineering, and GMU Department of Computer Science. We appreciate the Office of Research Integrity and Assurance at GMU for their work in reviewing and approving our Institutional Review Board (IRB) application. We also appreciate comments from students in GMU NLP and SE labs.

BibTeX

@misc{yan2024intelliexplain,
      title={IntelliExplain: Enhancing Interactive Code Generation through Natural Language Explanations for Non-Professional Programmers}, 
      author={Hao Yan and Thomas D. Latoza and Ziyu Yao},
      year={2024},
      eprint={2405.10250},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}