While reports of ChatGPT successfully solving live coding interview problems (and supposedly passing a Google interview for an L3 position) have been quite enthusiastic, it's important to note two characteristics of the model:
The output has a random element. The same input can yield both correct and incorrect results.
ChatGPT is trained on data published online before September 2021. The task likely becomes easier if the problem description and its solution were available before this date.
I was intrigued to find out how much of ChatGPT's performance is genuine problem-solving ability versus sheer randomness or "memorization of the correct solution." You can see the short answer in the title, but let's proceed to the details.
I randomly picked problems from Leetcode along two independent parameters: difficulty level (easy/hard) and publication date (before/after September 2021). I selected 13 problems from each of the 4 categories, 52 problems in total. Each problem description was fed to ChatGPT-4 with a prompt describing a "live coding interview" setting. I expected the output to be Python code that passes the given problem's tests. If the code didn't pass, I gave the model exactly one opportunity to fix it.
Why Leetcode? Simply put, its popularity has soared to the point where it's virtually a household name: thousands of interview candidates use it for preparation, apparently with success. Leetcode provides problem descriptions, solution examples, and automatic tests for submitted code — just what we need.
I decided to exclude problems that incorporated images in their descriptions (as I was uncertain how to input an image into the chat), those with more downvotes than upvotes (a sign of a lame problem), and all problems classified as medium difficulty.
Python was chosen as it's not only widely popular in production applications but also sufficiently succinct to solve algorithmic problems.
Initially, I chose a smaller set of problems; eventually, I added a few more to each category to make the results statistically more reliable.
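For intuition on why the sample size matters here: a 95% normal-approximation (Wald) confidence interval for an observed pass rate shrinks as the per-category count grows. A minimal sketch, using the 46% rate reported later for pre-2021 hard problems as the example proportion (the comparison at n=26 is purely illustrative):

```python
import math

def wald_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# With 13 problems per category, an observed 46% rate is quite uncertain:
lo, hi = wald_ci(0.46, 13)    # roughly (0.19, 0.73)

# Doubling the sample would narrow the interval noticeably:
lo2, hi2 = wald_ci(0.46, 26)  # roughly (0.27, 0.65)
```

Even with the extra problems, 13 per category keeps the intervals wide, so the percentages below are best read as rough tendencies rather than precise rates.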
In my experiment, I got the following acceptance rate (essentially the percentage of correct solutions), counting the one additional fix attempt:
|  | Easy | Hard |
|---|---|---|
| Before September 2021 | 100% | 46% |
| After September 2021 | 69% | 0% |
| All time | 85% | 23% |
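Since each category contains exactly 13 problems, the reported percentages map back to whole-number pass counts. A quick sanity check — the per-cell counts here are reconstructed from the reported rates, not stated explicitly in the post:

```python
# Reconstructed pass counts, assuming 13 problems per cell as described above.
N = 13
cells = {
    ("before", "easy"): 13,  # 13/13 -> 100%
    ("before", "hard"): 6,   # 6/13  -> 46%
    ("after", "easy"): 9,    # 9/13  -> 69%
    ("after", "hard"): 0,    # 0/13  -> 0%
}

for (period, level), passed in cells.items():
    print(f"{period}/{level}: {passed}/{N} = {passed / N:.0%}")

# The all-time row aggregates both periods (26 problems per column):
easy_total = cells[("before", "easy")] + cells[("after", "easy")]
hard_total = cells[("before", "hard")] + cells[("after", "hard")]
print(f"all-time easy: {easy_total}/26 = {easy_total / 26:.0%}")  # 85%
print(f"all-time hard: {hard_total}/26 = {hard_total / 26:.0%}")  # 23%
```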
(A link to the complete list of problems, solutions, and all dialogues can be found at the end of the post)
ChatGPT-4 solved all easy problems and nearly half of the hard problems it likely had access to, given they were published before its cut-off date. However, the model only solved 69% of the easy problems it hadn't seen before, as those were published after the cut-off date. It was unable to tackle any new hard problems. From this, we can deduce that while some memorization does occur, the model doesn't precisely recall all problems and their respective solutions.
These results might be considered expected if we were discussing a human. Indeed, easy problems are easier to solve than hard ones (duh); and already-seen problems are easier than unfamiliar ones. Intuitively, this reasoning should also hold true for the results generated by the language model.
Out of 25 incorrect submissions, ChatGPT successfully fixed its code only once. Interestingly, it's unclear why the correct answer wasn't produced on the first attempt: my fix request contained no additional details beyond pointing out that the previously proposed code was broken.
How good is ChatGPT-4 in an interview context?
Let's envision a scenario where an interview at a hypothetical company involves solving two easy Leetcode problems, typical for junior or trainee positions. Here, ChatGPT-4's odds of acing the interview hover around 72%. However, when faced with a mix of one easy and one hard problem, these chances plummet to a mere 20%. And if presented with two hard problems — an approximation of senior-level interviews in big tech — the success rate falls even further to approximately 5%. Claims that ChatGPT "outperforms 90% of programmers" could be considered exaggerated, though the validity depends significantly on the programmer sample used.
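The arithmetic behind these estimates is simply multiplying the all-time rates, treating each problem as an independent trial:

```python
# Back-of-envelope interview odds, assuming each problem is an independent
# trial with the all-time acceptance rates from the experiment
# (85% on easy problems, 23% on hard ones).
p_easy, p_hard = 0.85, 0.23

two_easy = p_easy * p_easy        # ~0.72: two easy problems (junior screen)
easy_plus_hard = p_easy * p_hard  # ~0.20: one easy + one hard problem
two_hard = p_hard * p_hard        # ~0.05: two hard problems (senior-style)
```

The independence assumption is a simplification: problems within one interview may be correlated in style and difficulty, so the true odds could differ somewhat.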
With non-zero odds of passing an interview, it's plausible to speculate that big tech companies are already testing ChatGPT against their problem databases to see which problems it can crack, and drawing conclusions accordingly. It's time to reflect again on what traits we truly value in a candidate during live coding. After all, the sole ability to output code covers only some of the essential skills, and it's often not the most critical one.
ChatGPT does have the ability to solve algorithmic problems, which is not simply a result of luck or "remembering" solutions found online. Its proficiency in live coding interviews is currently likely limited to entry-level grades. However, it has non-zero chances of success even in interviews for senior positions. It's an impressive feat for a general-purpose language model that's not specifically trained to code.
It's reasonable to anticipate that future models will significantly outperform ChatGPT-4 in solving algorithmic problems. This will undoubtedly reshape how companies conduct live coding interviews overall.
Summary table with the results of my experiment and all chat transcripts:
https://docs.google.com/spreadsheets/d/1C2b9ai-DBD4AwgRNEJuiLaGPUvwNie_1e-0vb6N5ohc/edit?usp=sharing
Cover image generated by DALL-E 2.