
AI is changing how professionals work—but is it ready for Indian tax law? To find out, Taxmann.AI teamed up with IIT Kharagpur to conduct the LE-BTL Benchmark Study, testing 12 leading LLMs on Income Tax, GST, FEMA, and more. Here's what they found.
Table of Contents
- Introduction
- Key Findings
- The Methodology
- Domains Covered
- The Findings
- How to Overcome the Shortcomings of LLMs?
1. Introduction
In a first-of-its-kind research collaboration, Taxmann.AI, in partnership with IIT Kharagpur, conducted India’s inaugural benchmarking of 12 leading Large Language Models (LLMs)[1]. The study was conducted by a curated panel of AI researchers and senior tax professionals to evaluate the accuracy and reasoning capabilities of these models within the intricate landscape of Indian tax law.
Because global benchmarks often ignore the unique statutory and judicial nuances of the Indian legal system, the study introduced a purpose-built IRAC+ evaluation framework to test models on issue identification, rule application, and professional justification.
The results reveal that advanced proprietary models, such as GPT o3 and Gemini 2.5 Pro, consistently outperform open-weight and lightweight alternatives on complex tax queries. The methodology employed in the study is a structured, multi-layered approach designed to address the specific complexities of the Indian legal system: frequent amendments, layered statutes, and contradictory judicial precedents.
2. Key Findings
The study yielded the following principal findings:
- Proprietary models (GPT and Gemini families) consistently outperform open-weight and lightweight models across all evaluated dimensions.
- LLMs exhibit an inverse relationship between question complexity and accuracy, scoring lower on simple questions while performing better on complex, multi-layered legal problems.
- “Persona” prompting (assigning the model a specific role) reliably improves model performance. However, “few-shot” (providing sample questions and answers) prompting produces inconsistent results and may degrade performance.
- Open-weight models such as DeepSeek V3 fail to cross the 50% accuracy threshold on specialised legal tasks.
- GPT-5 falls within the mid-tier accuracy band and fails to deliver the depth of structured reasoning expected of frontier models.
- While most models perform adequately on Issue and Rule Identification, they struggle significantly with Application of Law and Justification, which emerge as the primary bottlenecks.
- GPT-4o was validated as an automated judge, exhibiting a very high rank-correlation (0.97) with human expert evaluations.
- Although the LLM-as-a-judge showed mild score inflation, it preserved consistent relative rankings across models, demonstrating scalability for benchmarking.
- Performance varied significantly by domain. Models performed strongly in tax law but poorly in niche areas such as FEMA and Accounting Standards.
- Top-tier models demonstrated high internal stability, with minimal divergence between best- and worst-case scores. Lower-tier models exhibited high volatility.
- The LLM-as-a-judge is slightly more “optimistic” (lenient) in its absolute scores than human experts.
- No model exceeded 80% accuracy; even the strongest proprietary systems remained capped at approximately 70–73%.
3. The Methodology
Four core components define the methodology:
- The IRAC+ Evaluation Framework
- Benchmark Construction and Dataset
- Experimental Setup
- Scoring Mechanism
3.1 The IRAC+ Evaluation Framework
To move beyond generic language benchmarks, the study extended the traditional IRAC framework into IRAC+, recognising that standard IRAC inadequately captures the reasoning demands of complex tax scenarios. The six evaluated dimensions were:
- Issue Identification: Identifying the core legal or factual controversy.
- Rule Identification: Citing relevant statutes, circulars, notifications, and case laws.
- Application of Law: Applying the identified rules to specific facts, including tax reasoning and handling exceptions.
- Conclusion: Delivering a logical, defensible, and actionable final outcome.
- Interpretation: A new dimension added to test the understanding of legislative intent, amendments, and specific statutory context.
- Justification: A new dimension added to evaluate the ability to construct legally persuasive reasoning (e.g., drafting grounds for appeal).
3.2 Benchmark Construction and Dataset
The benchmark consists of over 100 expert-validated questions covering Indian Direct Taxes, Indirect Taxes (GST), FEMA (Foreign Exchange Management Act), Accounting Standards, and judicial precedents.
To ensure objective evaluation, every question was paired with a high-quality reference answer drafted by experienced Chartered Accountants and tax consultants. These “gold standard” answers served as the ground truth against which model outputs were scored.
3.3 Experimental Setup
Each model was tested using three distinct prompting strategies to analyse how instruction tuning affects performance:
- Base (Zero-shot): The question was posed in a zero-shot manner with no additional context.
- Persona Prompt: The model was assigned a specific role (e.g., “You are a Tax expert…”) to trigger domain-specific behaviour.
- Few-shot Persona Prompt: The model was given the persona, context, and an illustrative example to guide its reasoning.
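The three strategies can be sketched as simple prompt templates. The persona wording and worked example below are illustrative placeholders, not the study's actual prompts:

```python
def build_prompt(question: str, strategy: str = "base") -> str:
    """Assemble a prompt under one of the three strategies described above.

    The persona text and the sample Q&A are illustrative stand-ins,
    not the prompts used in the LE-BTL study.
    """
    persona = "You are a Tax expert specialising in Indian tax law."
    example = (
        "Q: Is interest on a savings account taxable under the Income Tax Act, 1961?\n"
        "A: Yes; it is taxable as 'Income from Other Sources', subject to the\n"
        "deduction available under section 80TTA.\n"
    )
    if strategy == "base":
        return question                        # zero-shot: question only
    if strategy == "persona":
        return f"{persona}\n\n{question}"      # role assignment, no examples
    if strategy == "few_shot_persona":
        return f"{persona}\n\nExample:\n{example}\n{question}"
    raise ValueError(f"unknown strategy: {strategy}")


print(build_prompt("What is the GST rate on restaurant services?", "persona"))
```

In a real harness, the same question would be run under all three strategies against each model, and the responses scored independently.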
3.4 Scoring Mechanism
The study used an LLM-as-a-Judge approach, with GPT-4o as the primary judge. The judge was provided with the candidate’s answer, the ground-truth reference, and a strict scoring rubric, and assigned a score of 1 to 5 for each of the six IRAC+ dimensions using specific checklists.
- 1 (Inadequate): Covers less than 50% of the required response.
- 3 (Usable): Aligned with ground truth and covers 80% of the response; immediately usable after minor edits.
- 5 (Gold Standard): Reserved for answers with explicit discussion of counterarguments, caveats, and verbatim statutory extracts.
To mitigate the inherent biases of LLM judges (such as a preference for longer answers), the methodology employed a Hybrid Approach. This involved cross-validating the results using a subject-matter human expert and an alternative LLM Judge (Gemini 2.5 Flash). The study found a very high correlation (0.97) between the rankings provided by the LLM judge and the human experts, validating the methodology’s reliability.
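A rank-correlation check of this kind can be sketched with a plain Spearman computation. The per-model scores below are invented toy numbers, not the study's data:

```python
def spearman_rho(x, y):
    """Spearman rank correlation (assumes no tied values, for simplicity)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Toy average scores for six models: LLM judge vs. human expert.
judge_scores = [4.2, 3.9, 3.1, 2.8, 2.2, 1.9]
human_scores = [4.0, 3.5, 2.9, 3.0, 2.1, 1.7]
print(round(spearman_rho(judge_scores, human_scores), 2))  # → 0.94
```

A value near 1.0 means the judge ranks models in nearly the same order as the human panel, even if its absolute scores run higher; this is the sense in which the study's 0.97 figure validates the judge.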
4. Domains Covered
The study covered the following domain areas:
- Income Tax (Income Tax Act, 1961 and Income Tax Rules, 1962)
- GST (Central Goods and Services Tax Act, 2017)
- FEMA
- Accounting Standards
- Judicial Precedents (conflicting interpretations by different Tribunals, High Courts, and the Supreme Court)
- Circulars and Notifications
- Double Taxation Avoidance Agreements (DTAA)
5. The Findings
5.1 Clear Stratification of Models
The study identified three distinct performance clusters based on overall accuracy, with a marked dominance of proprietary models over open-weight ones.
- Top-Tier (Accuracy > 50%): This tier is occupied exclusively by proprietary models. GPT o3 Pro, GPT o3, and Gemini 2.5 Pro consistently led the rankings.
- Mid-Tier (Accuracy 30% – 50%): This group includes Gemini Flash 2.5, GPT o4 mini, and the open-weight model DeepSeek V3. These models often struggled to maintain consistency across complex legal tasks. Notably, GPT-5 (specifically in “No Prompt” and “Persona Prompt” settings) fell into this tier, failing to deliver the detailed logical reasoning expected of a newer model.
- Low-Tier (Accuracy < 30%): Models such as Grok 3, GPT-4o, and GPT-4o mini performed poorly, often failing to understand the nuances required for tax compliance.
5.2 Scoring Dynamics by Prompting Strategy
The study found that how a model is prompted significantly alters its scoring potential, though the effect is not uniform across all models.
- Persona Prompting: Assigning a specific role (e.g., “You are a Tax Expert”) improved performance across all tiers.
- The Few-Shot Dilemma: While few-shot prompting (providing examples) helped under-aligned models like DeepSeek V3 and GPT o1, it actually reduced the performance of top-tier models like GPT o3 and Gemini 2.5 Pro. The study suggests that for high-reasoning models, static examples may cause “context saturation,” crowding out the model’s latent reasoning capabilities or introducing noise.
5.3 IRAC+ Dimension Scores
Models generally scored well on Issue Identification and Rule Identification (syntactic tasks). However, scores dropped sharply for Application of Law and Justification. These dimensions require multi-step logic and statutory adaptation, making them the primary differentiators between strong and weak models.
Top-tier models exhibited a very tight divergence (0.23%) between their “Interpretation” and “Conclusion” scores. This indicates that when these models correctly interpreted the law, they almost always reached the correct conclusion. In contrast, lower-tier models displayed high volatility, often interpreting a rule correctly but failing to apply it to a logical conclusion.
5.4 The “Complexity Paradox”
A counterintuitive finding was the relationship between question complexity and accuracy. Models frequently achieved higher accuracy on “Complex” questions than on “Simple” ones. For example, Top-Tier models averaged ~60-70% on complex questions but often scored lower on simple queries.
This suggests that current LLMs may be over-optimised for intricate reasoning or “overfitted” to complex training data, inadvertently sacrificing the ability to handle basic recall and straightforward comprehension tasks.
5.5 Topic-Specific Variance
Scoring varied significantly depending on the legal subject matter:
- Direct Tax: This was the strongest category, with top models achieving up to 72.84% accuracy.
- FEMA and Accounting Standards: These niche areas saw lower scores and higher dispersion among models (mostly 40%–60%).
5.6 Bias in the LLM-as-a-Judge
The study found that LLM judges exhibited bias on several fronts:
- They assigned higher scores to longer answers, regardless of their actual correctness.
- They favoured answers listed first in the evaluation prompt.
- They showed a preference for responses generated by their own model family (e.g., a GPT-4 judge preferring GPT-generated answers).
- They are more lenient than human reviewers, assigning systematically higher scores. While human experts penalise superficial legal fluency, LLM judges may award high scores to responses that appear well-structured or fluent but lack legal depth.
- They can be easily manipulated by “hacks” or superficial formatting choices rather than the substance of the reasoning.
- Inserting specific tokens (such as a colon “:”) or boilerplate phrases like “Solution:” or “Thought process:” can systematically trigger false positives.
- Slight rewording of the evaluation prompt can yield different scores, making reproducibility difficult.
- They often report high confidence in their judgments even when they are wrong. This phenomenon undermines trust, especially in high-stakes domains like tax law.
- They failed to identify hallucinations in judicial precedents.
- They favour fluent writing over legal accuracy.
To mitigate these biases, the study employed a Hybrid Approach, using human experts to “anchor” the benchmark and cross-validating results with a secondary LLM judge (Gemini 2.5 Flash) to ensure the rankings were reliable despite these inherent systemic flaws.
6. How to Overcome the Shortcomings of LLMs?
Here are the key methods to overcome these shortcomings:
- Implement a Hybrid Evaluation Framework: To overcome the biases and reliability issues of LLM-as-a-Judge, the study proposes a Hybrid Approach that combines AI scalability with human rigour. Use subject-matter experts to evaluate dimensions like reasoning depth and citation fidelity, rather than generic fluency.
- Separate Deterministic from Generative Tasks: LLMs often fail at precise numeric calculation and formal rule evaluation. Limit the LLM’s role to tasks it excels at, such as rule selection, summarisation of statutes, and interpretation of text, rather than relying on it for mathematical logic.
- Move from Static to Dynamic Prompting: The study found that while Persona Prompting consistently improves performance, Few-Shot Prompting can sometimes degrade performance in top-tier models due to “context saturation” or noise. Instead of using static few-shot examples that remain unchanged, systems should use dynamic, context-aware few-shot examples tailored to the specific task or input. This prevents crowding out the model’s latent reasoning capabilities with irrelevant signals.
- Integrate Retrieval Augmented Generation (RAG): LLMs lack knowledge of real-time amendments and specific circulars, leading to outdated or hallucinated advice. Systems must toggle between the LLM’s inherent knowledge and retrieval tools that give direct access to the latest statutes, notifications, and case laws. The retrieval datasets must be continuously updated to reflect the latest statutes, case law, etc., as models cannot “learn” these from static training weights alone.
- Human-in-the-loop: Given the risk of “overconfident errors” and hallucinations regarding judicial precedents, fully autonomous deployment is unsafe. Deployment must follow a human-in-the-loop model, with legal experts validating outputs, particularly for “Application” and “Justification,” which are identified as the primary bottlenecks to AI accuracy. Maintain meticulous audit trails and clearly separate deterministic outputs from generative ones to verify the provenance of every conclusion.
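Dynamic, context-aware few-shot selection can be sketched with a simple lexical-overlap retriever. Production systems would use embedding similarity; the example bank and questions below are invented for illustration:

```python
def select_examples(question, example_bank, k=2):
    """Pick the k stored examples most similar to the incoming question.

    A toy stand-in for embedding-based retrieval: score each stored
    example by word overlap with the question, then keep the top k.
    This is what replaces a fixed, static few-shot block.
    """
    q_words = set(question.lower().split())

    def overlap(example):
        return len(q_words & set(example["q"].lower().split()))

    ranked = sorted(example_bank, key=overlap, reverse=True)
    return ranked[:k]


# Illustrative example bank; answers abbreviated.
bank = [
    {"q": "What is the GST rate on restaurant services?", "a": "..."},
    {"q": "Is savings interest taxable under the Income Tax Act?", "a": "..."},
    {"q": "Can input tax credit be claimed on motor vehicles under GST?", "a": "..."},
]
chosen = select_examples("What GST rate applies to outdoor catering services?", bank, k=1)
print(chosen[0]["q"])
```

Because the selected examples change with each query, the prompt carries only signals relevant to the task at hand, which is the mitigation the study suggests for context saturation in top-tier models.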
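A minimal retrieve-then-generate loop can be sketched as follows. The two-provision corpus and the echo stub standing in for a model API are illustrative assumptions, not part of the study:

```python
def answer_with_rag(question, corpus, llm):
    """Retrieve the most relevant provisions, then ask the model to
    answer strictly from them rather than from its training weights."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc["text"].lower().split())),
        reverse=True,
    )
    context = "\n".join(doc["text"] for doc in scored[:2])
    prompt = (
        "Answer strictly from the provisions below; say 'not found' otherwise.\n\n"
        f"Provisions:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)


# Tiny illustrative corpus; a real system would index current statutes,
# notifications, and case law, refreshed as amendments are issued.
corpus = [
    {"id": "s80TTA", "text": "Section 80TTA allows a deduction on savings account interest."},
    {"id": "s17-5", "text": "Section 17(5) CGST blocks input tax credit on certain motor vehicles."},
]

# Stub LLM so the sketch runs end to end; it simply echoes the final prompt line.
echo_llm = lambda prompt: prompt.splitlines()[-1]
print(answer_with_rag("Is input tax credit blocked on motor vehicles?", corpus, echo_llm))
```

The design point is the grounding instruction: by confining the model to retrieved text, the system reduces reliance on stale training data and makes hallucinated citations easier to catch.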
Disclaimer:
All comparative statements, rankings, scores, and performance observations presented herein arise solely from a controlled, methodology-based benchmarking exercise conducted as part of the LE-BTL research study using the IRAC+ evaluation framework. The results are based on specific test conditions, datasets, prompts, evaluation criteria, and scoring parameters defined exclusively for the purposes of this study.
References to third-party artificial intelligence models, products, services, trademarks, or brand names are made strictly for academic and research-based comparison. All such names and marks remain the property of their respective owners. Nothing contained herein shall be construed as implying any affiliation, endorsement, sponsorship, partnership, or commercial relationship with any third-party AI provider.
The findings reflect performance only within the defined scope and conditions of the study and do not constitute a general, definitive, ongoing, or future assessment of any AI model’s capabilities, accuracy, reliability, compliance, or suitability for any particular purpose. Actual performance may vary materially depending on system updates, deployment environments, user inputs, and other variables.
This publication is intended solely for informational and research purposes and does not constitute legal, technical, commercial, regulatory, or professional advice.
[1] Jain, Nitish and Wadhwa, Naveen and Goyal, Pawan and Ghosh, Saptarshi and Pawar, Sankalp and Shinde, Abhishek and Boinepally, Karthik and Malpani, Vrinda V and K, Raaga, LLM Evaluations for Bharat Tax Laws (‘LE-BTL’). A Framework to Evaluate and Benchmark the Accuracy of Large Language Models (‘LLMs’) in the Context of the Indian Tax Laws (November 19, 2025). Available at SSRN: https://ssrn.com/abstract=5941734 or http://dx.doi.org/10.2139/ssrn.5941734
The post Taxmann.AI X IIT Kharagpur LLM Evaluation | LE-BTL Benchmark Study appeared first on Taxmann Blog.




