Advanced Answer Generation Techniques for Airline Chatbots
Answer generation strategies and evaluation methodologies for developing reliable, hallucination-free airline chatbot responses that maintain accuracy while balancing user experience.


Abstract
AI-powered chatbots are transforming customer service in aviation, where accuracy isn't just important, it's critical. This document explores techniques and evaluation methods for building reliable answer generation systems within airline chatbots.
The challenge?
Large language models love to hallucinate. They misrepresent context, invent facts, and ignore retrieved information when generating responses. In aviation, where incorrect baggage policies or flight change rules can cost airlines millions, this isn't acceptable.
We examine practical, low-cost strategies based on prompt engineering, and we design evaluation frameworks tailored to retrieval-augmented generation (RAG) pipelines. Our analysis draws on recent research in hallucination detection, answer faithfulness, and LLM evaluation. We propose methodologies for domain-specific dataset creation and benchmarking, essential tools currently missing from airline-specific resources.
The goal is simple: guide implementation of robust, scalable chatbots capable of supporting real-world airline operations without the hallucination headaches.
1. Problem Statement
Designing a reliable QA chatbot for the airline industry introduces several domain-specific and technical challenges that must be addressed from the outset:
- High requirement for accuracy: Airline customer queries often involve time-sensitive or policy-critical information (e.g., flight changes, baggage rules, refunds). Incorrect answers can lead to user frustration, compliance risks, or financial penalties.
- LLM limitations: While large language models are strong at generating fluent responses, they frequently:
  - Hallucinate facts or make up information.
  - Ignore important context retrieved by the RAG system.
  - Rely on pretraining knowledge instead of grounded content.
- Lack of domain-specific benchmarks: Most available QA benchmarks (e.g., Natural Questions, MedHallu, FinanceBench) are either open-domain or targeted at medical, legal, or financial domains. No established datasets or benchmarks exist for airline-specific use cases.
- Need for scalable evaluation: A practical evaluation strategy must:
  - Detect hallucinations.
  - Measure faithfulness and correctness of answers.
  - Support automated scoring for rapid iteration.
  - Integrate with CI/CD pipelines for ongoing QA during development.
- Feasibility over complexity: Many high-performance QA pipelines rely on complex models, fine-tuning, or multi-stage reranking. These are expensive to build and maintain. Early-stage airline chatbots need cost-effective techniques that can be deployed using general-purpose models and cleanly structured prompts.
- Custom dataset requirement: To measure quality accurately in this domain, a tailored dataset must be built using airline documents, user-relevant questions, and annotated answers, enabling internal benchmarking that reflects real-world usage.
2. Research Questions
A. Answer Generation Techniques
- What are the main existing approaches for generating accurate and reliable answers in a retrieval-augmented chatbot?
- Which attributes (e.g., hallucination reduction, cost-efficiency, implementation complexity, compatibility with RAG) are used to compare answer generation techniques?
- How do these techniques perform across those attributes in terms of quality, scalability, and robustness in an airline domain?
- Which techniques should be selected for the first version of the airline chatbot and why are they most appropriate for early deployment?
B. Evaluation Methods
- What are the primary evaluation methods and benchmarks used to assess the performance of QA chatbots, particularly in detecting hallucinations and measuring answer quality?
- Which criteria (e.g., faithfulness, correctness, hallucination rate, robustness, LLM-human agreement) are used to compare evaluation methods and datasets?
- How do selected benchmarks and evaluation strategies perform across those criteria, especially for domain-specific or RAG-based systems?
- Which evaluation flow and tools should be implemented for the airline QA chatbot, and why do they best support accuracy, maintainability, and CI/CD integration?
Techniques
In the architecture of a QA chatbot, the answer generation step plays a critical role in shaping the user experience and ensuring factual correctness. Once a user's intent is identified and relevant documents are retrieved, the system must generate a coherent, accurate, and contextually grounded response. This is particularly important in domains like aviation, where misinformation about policies, legal terms, or procedures can have significant operational and reputational consequences.

While large language models (LLMs) have shown strong capabilities in generating fluent text, they are also prone to hallucinations, bias toward pre-trained knowledge, and misinterpretation of retrieved context. To address these challenges, a range of techniques has emerged, primarily in two areas: prompt construction, which aims to guide the model's behavior through carefully crafted instructions, and response generation, which involves mechanisms for improving how the model synthesizes information. This section explores those techniques, compares their strengths and limitations, and justifies the most suitable strategies for the development of an airline-specific QA chatbot.
Attributes for comparison
Attribute | Description | Relevance to airline chatbot | References |
---|---|---|---|
Hallucination Mitigation | Evaluates how well a technique prevents the generation of hallucinated or fabricated information by ensuring all factual claims are grounded in the retrieved documents. | Incorrect answers about baggage allowances, visa rules, or rebooking policies can lead to customer dissatisfaction, regulatory issues, or operational mistakes | arXiv:2409.13385 arXiv:2405.07437 arXiv:2303.18223 |
Answer Relevance & Completeness | How directly and fully the answer addresses the user's question, including all required details from the relevant sources. | Incomplete or partially correct answers (e.g., mentioning baggage size but not weight) can lead to confusion and repeat queries during travel planning or check-in | arXiv:2409.13385 arXiv:2405.07437 arXiv:2309.01431 |
Latency impact | How the technique affects system responsiveness and the computation time of the generation process | Timely responses are critical during high-stress situations where customers expect near-instant replies | arXiv:2504.09037 arXiv:2303.18223 |
Robustness to Irrelevant or Conflicting Information | Measures a technique's ability to handle "noise" in the retrieved context | Policy documents may contain outdated or duplicate clauses; robust techniques avoid errors by focusing only on the version that applies to the user's context | arXiv:2409.13385 arXiv:2405.07437 arXiv:2309.01431 |
Cost impact | Evaluates the overall resource and financial burden associated with implementing and operating the technique | Scaling requires delivering high accuracy without inflating operational budgets | arXiv:2504.09037 arXiv:2303.18223 |
Evaluation Ease | How easily the technique's performance can be evaluated using standard metrics or evaluation pipelines | Fast evaluation cycles help teams verify quality before production updates | |
Implementation Complexity | Technical effort required to implement, integrate, deploy and maintain the technique | Teams must balance performance with simplicity for fast delivery of products. | |
Data dependency | How dependent a technique is on training or fine-tuning datasets | Low data dependency or easy update mechanisms ensure that chatbots remain accurate and compliant without constant retraining | |
Comparison table
Technique | Paper | Hallucination mitigation | Answer Relevance | Latency impact | Robustness to Irrelevant or Conflicting Information | Cost impact | Implementation Complexity | Data dependency | Paper date | Cited By |
---|---|---|---|---|---|---|---|---|---|---|
Select most relevant documents from retrieval results | REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering | Medium | High | Medium (Extra steps) | High | Medium-high (Fine-tuning costs + extra inference steps) | High (requires fine-tuned models and datasets) | Fine-tuned models | 2024-11 | 38 |
Context compression | Context Embeddings for Efficient Answer Generation in RAG | Medium-high | High | Low | High | Medium-high (Fine-tuning costs + extra inference steps) | High (fine-tuning for decoder/encoder) | Fine-tuned models | 2024-10 | 19 |
Compression and Selective Augmentation | RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation | Medium-high | High | Medium | High | Medium-high (Fine-tuning costs + extra inference steps) | High (fine-tuning for 2 compressors) | Fine-tuned | 2023-10 | 176 |
Self-check to increase quality of generated output | Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy | Medium-high | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Low (Prompt adjustments. Examples and code in paper) | None | 2023-10 | 310
Self-Correction | Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation | High (iterative RAG-generation synergy) | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Medium (Paper includes prompts and algorithms) | None | 2024-03 | 214
Prompting for complex tasks | TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks | Medium-high | High | Low | High | Low | Low (prompt improvement) | None | 2023-10 | 80 |
Self-Reflexion | Chain-of-Verification Reduces Hallucination in Large Language Models | High | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Low (papers include prompt templates) | None | 2023-09 | 514 |
Proposed Answer Generation Strategy for the Airline QA Chatbot
In evaluating how to generate high-quality responses for the airline QA chatbot, our research focused on identifying techniques that not only improve factual accuracy and reduce hallucinations but also align with constraints of cost, development effort, and scalability. We reviewed a broad set of strategies grouped into two practical categories: prompt construction and response generation.
Recent literature and benchmark studies show that many state-of-the-art improvements depend on complex architectures, such as multi-stage reranking pipelines or fine-tuned generation models. While these methods can significantly improve faithfulness and contextual accuracy, they often require substantial engineering effort and rely on large, domain-specific models. This makes them impractical for early-stage deployment, where simplicity, speed, and cost-efficiency are critical.
Instead, we found strong support for prompt-based methods that leverage general-purpose, cost-effective LLMs like GPT-4o-mini. These techniques offer a high return on investment by enhancing model behavior through structured prompting without needing model retraining. Among the most promising are Chain-of-Verification, which prompts the model to internally validate its output before responding, and Self-Correction, which encourages detection of contradictions or unsupported claims. Approaches like ReAct and Self-Ask further improve reasoning by breaking complex queries into sub-steps or decisions. We also identified the effective use of Few-Shot prompting, especially when paired with Chain-of-Thought (CoT) reasoning, as a lightweight method to steer answer generation toward reliable, structured outputs.
Given the need for fast, cost-efficient, and transparent development in the first iteration of the airline QA chatbot, we recommend the following stepwise adoption of answer generation techniques:
- Prompt Engineering Best Practices
  - Use clear, structured prompts with system-level instructions emphasizing accuracy, faithfulness, and domain knowledge preference over generic completions.
  - Incorporate user intent explicitly and guide the model to avoid speculation.
- Chain-of-Verification (CoVe)
  - Start with inline self-verification prompts that ask the model to validate the factual basis before finalizing an answer (a minimal sketch follows this list).
  - Future expansion: use multi-stage verification with RAG-based re-checks or document comparisons.
- Prompt Tactics During Tuning
  - Integrate and test ReAct, CoT, Self-Ask, and other tactics as enhancements based on use-case needs.
  - Use A/B testing and evaluation benchmarks (described in the Evaluation section) to guide iteration.
- Add Self-Correction Layer
  - Introduce a second-generation pass that prompts the LLM to detect contradictions or unsupported claims in its own response.
  - This step may be triggered selectively for high-stakes queries (e.g., cancellations, legal clauses).
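To make the structured-prompt, Chain-of-Verification, and self-correction steps concrete, the sketch below wires them around a general-purpose chat model. It is a minimal illustration under stated assumptions, not the production implementation: the `complete()` helper, the prompts, and the `gpt-4o-mini` model choice are placeholders for demonstration.

```python
# Minimal sketch of the structured-prompt, Chain-of-Verification, and selective
# self-correction steps above. The `complete()` helper, the prompts, and the
# model name are illustrative assumptions, not the production implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an airline customer-support assistant. Answer ONLY from the "
    "provided context. If the context does not contain the answer, say you "
    "do not know. Never speculate about fares, baggage rules, or legal terms."
)

def complete(user_prompt: str) -> str:
    """One chat completion under the shared system prompt (model is a placeholder)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def answer_with_verification(question: str, context: str, high_stakes: bool = False) -> str:
    # Draft an answer grounded in the retrieved context.
    draft = complete(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # Chain-of-Verification: check each factual claim against the context
    # before the answer is finalized.
    verified = complete(
        f"Context:\n{context}\n\nDraft answer:\n{draft}\n\n"
        "List each factual claim in the draft, verify it against the context, "
        "then rewrite the answer keeping only supported claims."
    )

    # Self-correction pass, triggered selectively for high-stakes queries
    # (e.g., cancellations, legal clauses).
    if high_stakes:
        verified = complete(
            f"Context:\n{context}\n\nAnswer:\n{verified}\n\n"
            "Does this answer contradict itself or the context? If so, fix it; "
            "otherwise return it unchanged."
        )
    return verified
```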
Evaluation methods
Robust evaluation of a QA chatbot is essential to ensure that its responses are accurate, relevant, and trustworthy. A critical challenge in modern AI systems, particularly those using large language models (LLMs), is the phenomenon of hallucination, where the system generates fluent but factually incorrect or fabricated information. In high-stakes domains like aviation, such hallucinations can result in misleading information for passengers, regulatory issues, and ultimately, loss of user trust. Thus, evaluation must go beyond surface-level fluency and measure faithfulness, relevance, and robustness to user intent and noisy input.
Attributes for comparison
Attribute | Description | Relevance for Airline Chatbot |
---|---|---|
Metrics | The evaluation methods used to score chatbot answers or retrieval quality. | Choosing the right metrics ensures accurate, useful, and safe answers in customer-facing systems. |
Integration/Use Complexity | The effort required to implement and use the metric in a real system. | Low-complexity metrics help teams iterate and deploy faster in production airline use cases. |
Generated Dataset | Whether the metric requires a benchmark or labeled dataset for evaluation. | Important for testing on controlled scenarios like fare rules, baggage policy, or FAQs. |
Reproducible Dataset Generation for Airlines | Can the evaluation data be generated specifically for airline use cases. | Enables continuous domain-specific evaluation, aligned with routes, services, or regulations. |
Paper Date | The year the metric was introduced in the literature. | Helps gauge maturity: newer metrics may be more capable, older ones more proven. |
Cited By | Number of academic or industry citations referencing the metric. | Indicates trust, adoption, and stability of the metric in research and production settings. |
Evaluation Metrics
Metric | Description | Relevance for Airline Chatbot |
---|---|---|
Exact Match (EM) | Measures if the generated answer matches the reference exactly. | Useful for strict responses like flight numbers or booking codes where precision is critical. |
F1 Score | Balances precision and recall between predicted and reference answers. | Important for evaluating short factual responses like baggage rules or visa requirements. |
BLEU | Assesses n-gram overlap between generated and reference texts. | Limited use for open-ended answers but helpful for checking structured language consistency. |
ROUGE | Measures recall-based n-gram overlap, especially in summarization tasks. | Useful for summarizing flight policies or condensing customer service responses. |
BERTScore | Uses BERT embeddings to compare semantic similarity between responses. | Effective for detecting semantic accuracy in varied natural language phrasing from users. |
LLM as Judge | Uses a language model to evaluate response quality holistically. | Highly relevant for evaluating nuanced responses in complex airline queries and dynamic situations. |
Faithfulness | Measures if the answer is supported by the retrieved context. | Crucial to prevent hallucinated or unsupported claims in regulatory or safety-related answers. |
Correctness | Evaluates factual accuracy of the answer. | Vital to ensure legal and operational accuracy in flight information or service policies. |
Relevance | Assesses how well the answer addresses the user's query. | Ensures responses are aligned with customer intent in high-stakes travel scenarios. |
Hallucination | Tracks invented or incorrect content not grounded in source documents. | Critical to avoid misinformation about itineraries, costs, or regulations. |
Others | Placeholder for task-specific or emerging metrics. | Allows flexibility to incorporate domain-specific criteria like tone or empathy. |
Noise Robustness | Tests model resilience to typos, slang, or speech-to-text artifacts. | Important for voice-based interactions or noisy chat inputs in multilingual airline settings. |
Negative Rejection | Assesses the system's ability to reject unanswerable or irrelevant queries. | Key for maintaining trust by not fabricating answers to security, legal, or operational questions. |
Counterfactual Robustness | Tests if the system changes output appropriately with minimal input change. | Ensures reliability and consistency in close variants of customer questions (e.g., flight times). |
References for metrics:
- https://arxiv.org/pdf/2501.00269 (pp. 2-3)
- https://arxiv.org/pdf/2309.01431v2 (pp. 3-5)
- https://arxiv.org/pdf/2312.10997 (pp. 12, 14)
- https://arxiv.org/pdf/2405.07437 (pp. 9-12)
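For reference, the sketch below implements the two simplest metrics in the table, Exact Match and token-level F1, using the common SQuAD-style normalization. It is a minimal illustration rather than a full evaluation harness (no multi-reference handling).

```python
# Minimal Exact Match and token-level F1, with SQuAD-style normalization.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a strict field such as a baggage allowance.
print(exact_match("23 kg checked bag", "23kg checked bag"))          # 0.0 (tokens differ)
print(round(f1_score("23 kg checked bag", "23kg checked bag"), 2))   # ~0.57
```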
Comparison table
Relevant benchmarks
Benchmark | Area | Exact Match (EM) | F1 score | BLEU | ROUGE | BERTScore | LLM Judge | Faithfulness | Correctness | Relevance | Hallucination | Others | Noise robustness | Negative rejection | Counterfactual Robustness | Integration/Use Complexity | Generated Dataset | Reproducible Dataset Generation for Airlines | Dataset creation complexity | Dataset notes | Paper Date | Cited by |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FaithBench | Answer Generation / summarization hallucinations | ✅ | Medium (create a new module) | No (Vectara's leaderboard) | Yes | High (requires manual annotations per summary) | - Uses Vectara's hallucinations leaderboard | 2024-10 | 8 | |||||||||||||
RAGAS | Answer Generation / RAG- Automated evaluation | ✅ | ✅ | ✅ | Low (installable Py package) | Yes | Yes | Medium (create golden set of questions with human revision) | - WikiEval (generated from 50 wikipedia pages) - questions are generated by ChatGPT passing docs - Answers are generated by ChatGPT - Human annotations on QA | 2024-03 | 662 | |||||||||||
RAG Bench | RAG system | ✅ | ✅ | Medium (create a new module) | Yes | Yes | Medium (create golden set of questions with human revision) | - 11K questions from 12 existing domain-specific datasets - Generate answers with GPT - LLM as annotator (93% human agreement for DelucionQA) | 2024-06 | 38
From 'Hallucination Free?' paper | RAG system | ✅ | ✅ | ✅ | non-feasible (human evaluation) | Yes | No | mismatch domain | - 202 questions in 4 categories - Too tight to legal domain | 2024-05 | 118 | |||||||||||
HaluEval 2.0 | RAG system | ✅ | ✅ | Medium (create a new module) | Yes | Yes | Medium (create golden set of questions with human revision) | - Take questions from 6 domain-specified dataset - Generate 3 response using ChatGPT per question - 8770 questions - human annotations | 2024-01 | 150 | ||||||||||||
Retrieval-Augmented Generation Benchmark (RGB) | RAG system | ✅ | ✅ | ✅ | Hard (unclear way to implement the benchmark) | Yes | Partial (unclear ways to implement the benchmark) | Medium (create golden set of questions with human revision) | - 600 generated questions + 200 for information integration + 200 robustness - 50% English, 50% Chinese | 2023-12 | 664
Domain-RAG | RAG system abilities (6) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium (human annotations) | Yes | Yes | Medium (create golden set of questions with human revision) | - Generates 7 datasets (extractive, conversational, structured, faithful, noisy, multi-doc) for each ability - Process for each dataset creation: - use ChatGPT and pass external documents if necessary to generate pairs questions-answers - Responses are filtered-out manually | 2024-06 | 22 |
Benchmark Landscape: What the Research Reveals
During our investigation, we studied a wide range of evaluation methodologies and datasets, spanning from foundational benchmarks like Natural Questions to more advanced evaluations such as DO-RAG. Despite an exhaustive review of domain-specific resources, no benchmarks tailored to the airline industry were found. Most domain-specific evaluation datasets center around medicine, law, and finance. Examples include:
- Med-HALT and MedHallu – for detecting hallucinations in medical QA.
- FinanceBench – evaluates open-book financial QA performance.
- Hallucination-Free? (Stanford University) – one of the few studies assessing commercial legal chatbots using manually curated questions and evaluations. While insightful, its manual process and proprietary data make it infeasible for reuse in a scalable airline chatbot benchmark.
Two comprehensive surveys, arXiv:2404.12041v3 and arXiv:2405.07437, helped map the evolving landscape of hallucination evaluation, scoring techniques, and annotator reliability. These reviews confirmed a strong trend: modern evaluation workflows increasingly rely on LLMs as annotators and scorers, primarily due to their adaptability, speed, and cost-effectiveness. These LLMs are used to assess correctness, detect hallucinations, score fluency, and measure alignment with user intent.
From these findings, we distilled a strategy to design an evaluation framework suitable for our airline-specific, RAG-based chatbot, emphasizing automated, scalable techniques while incorporating human-in-the-loop validation for critical subsets.
Evaluation Framework and Strategy for the Airline QA Chatbot
The selection of evaluation methods was driven by the need to assess both the accuracy of question answering and the presence of hallucinations in generated responses. Among the many options, we prioritized frameworks that offer a clear methodology for scoring faithfulness, annotating hallucinations, and integrating seamlessly into development pipelines. A key factor in the comparison was the availability of tools or guidance for dataset creation, particularly those suited for RAG-based systems. Since no existing benchmarks target the airline domain, the selection focused on transferable techniques and workflows that could be adapted to build our own domain-specific dataset and benchmark, rather than relying on out-of-the-box solutions.
We divided the evaluation pipeline into two main phases: Dataset Creation and Benchmark Execution.
1. Dataset Creation
A carefully constructed dataset is the foundation of any meaningful evaluation. Our process includes the following stages:
1.1 Data Collection
- Extract authoritative content from airline websites, policies, FAQs, and internal documentation. Sources are filtered for factual accuracy and categorized (e.g., baggage, check-in, cancellations, legal, claims, etc.).
1.2 Question Generation
- Use LLMs (e.g., GPT-4) to generate representative user questions for each category.
- Perform a human review to filter out ambiguous, unanswerable, or overlapping questions.
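A hedged sketch of this question-generation step is shown below, assuming an OpenAI-style chat client; the prompt, model name, category label, and source file are illustrative placeholders, and generated questions still pass through the human review described above.

```python
# Illustrative sketch of step 1.2: drafting candidate questions per category
# from a source document chunk. Prompt, model, and categories are placeholders.
from openai import OpenAI

client = OpenAI()

def draft_questions(doc_chunk: str, category: str, n: int = 5) -> list[str]:
    prompt = (
        "You are preparing an evaluation set for an airline QA chatbot.\n"
        f"Category: {category}\n"
        f"Source text:\n{doc_chunk}\n\n"
        f"Write {n} realistic customer questions that this text can answer, "
        "one per line, without numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some diversity helps dataset coverage
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

# Usage (hypothetical source document):
questions = draft_questions(open("baggage_policy.txt").read(), "baggage")
```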
1.3 Answer Generation & Candidate Retrieval
- For each validated question, generate three candidate answers.
- Backtrack from each answer to the question it would most likely address, using semantic similarity matching; this simulates document-based retrieval in RAG systems (inspired by RAGAS). A minimal sketch follows below.
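The sketch below illustrates one way to implement this backtracking check with sentence embeddings, loosely following the RAGAS answer-relevancy idea; the embedding model and acceptance threshold are assumptions.

```python
# Sketch of the backtracking check in step 1.3: compare the original question to
# questions reverse-generated from the candidate answer via embedding similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def backtrack_score(original_question: str, reverse_generated_questions: list[str]) -> float:
    """Mean cosine similarity between the original question and the questions an
    LLM generated back from the candidate answer (higher = more relevant)."""
    q_emb = encoder.encode(original_question, convert_to_tensor=True)
    rq_emb = encoder.encode(reverse_generated_questions, convert_to_tensor=True)
    return util.cos_sim(q_emb, rq_emb).mean().item()

score = backtrack_score(
    "How many checked bags are included in Economy Light?",
    ["How many bags can I check with an Economy Light fare?",
     "What is the checked baggage allowance for Economy Light?"],
)
keep_pair = score >= 0.7  # placeholder acceptance threshold
```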
1.4 Human Annotation (First Phase)
- Select a subset of question-answer pairs for manual annotation. Annotators label:
- Correctness: Is the answer factually accurate?
- Hallucination: Are any unsupported claims made?
- Compare human judgment with LLM-generated content to assess human-LLM agreement, helping validate the use of LLMs as proxy annotators.
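A minimal sketch of this agreement check, assuming binary hallucination labels and using Cohen's kappa as the agreement statistic (the labels shown are purely illustrative):

```python
# Compare human and LLM hallucination labels on the annotated subset.
from sklearn.metrics import cohen_kappa_score

# 1 = hallucination present, 0 = answer fully grounded (per annotated QA pair)
human_labels = [0, 1, 0, 0, 1, 0, 0, 1]
llm_labels   = [0, 1, 0, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"human-LLM agreement (Cohen's kappa): {kappa:.2f}")
# A kappa well below ~0.6 would argue against using the LLM as a proxy annotator.
```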
1.5 Dataset Expansion (Second Phase)
- Extend the dataset with more diverse QA pairs using semi-automated generation.
- Add non-relevant questions and ambiguous queries to evaluate negative rejection and noise robustness.
- Annotate with tools like FaithBench, RAGAS, and HaluEval 2.0 to assess faithfulness (factuality with respect to retrieved context), fluency, and truthfulness.
2. Benchmark Definition & Integration
With a curated dataset, the benchmark process includes two evolving phases:
2.1 Initial Benchmarking
- Integrate a flexible evaluation framework such as RAGAS, which allows granular scoring of faithfulness, context relevance, and hallucination.
- In addition, each QA pair is evaluated using:
  - Traditional metrics: BLEU, ROUGE, METEOR, BERTScore.
  - LLM-based evaluators (e.g., GPT-4 or Claude) that assign scores (1–10) for criteria like fluency, correctness, and alignment with context.
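As an illustration of the LLM-based evaluators mentioned above, the sketch below scores a QA pair on a 1–10 scale for fluency, correctness, and context alignment. The judge prompt, model name, and JSON schema are assumptions for demonstration, not a prescribed format.

```python
# Sketch of an LLM-as-judge evaluator returning 1-10 scores as JSON. A production
# harness would add retries and parsing guards.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the chatbot answer on a 1-10 scale for each criterion.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Return only JSON: {{"fluency": int, "correctness": int, "context_alignment": int}}"""

def judge(question: str, context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    # Assumes the judge returns bare JSON, as instructed in the prompt.
    return json.loads(response.choices[0].message.content)

scores = judge(
    "Can I change my flight for free?",
    "Flex fares allow one free date change; Light fares do not.",
    "Yes, all fares allow one free change.",  # should score low on correctness
)
```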
2.2 Advanced Benchmarking
- Include domain-adapted evaluations for:
  - Hallucination rate
  - Faithfulness to retrieved context
  - User intent alignment
  - Handling of irrelevant or misleading queries
- Introduce stress tests for negative rejection (e.g., chatbot says “I don’t know” instead of hallucinating) and noise robustness (performance with adversarial or incomplete inputs).
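The sketch below shows how such stress tests might be scripted; the `ask_chatbot` interface, refusal markers, and pass criteria are assumptions rather than a fixed specification.

```python
# Sketch of the stress tests in step 2.2: out-of-scope questions should be refused
# rather than answered, and noisy phrasings of a known question should still succeed.

REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot help with that")

def is_refusal(answer: str) -> bool:
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def negative_rejection_rate(ask_chatbot, unanswerable_questions: list[str]) -> float:
    """Share of out-of-scope questions the bot correctly refuses (higher is better)."""
    refusals = sum(is_refusal(ask_chatbot(q)) for q in unanswerable_questions)
    return refusals / len(unanswerable_questions)

def noise_robustness(ask_chatbot, noisy_variants: list[str], expected_fact: str) -> float:
    """Share of noisy rewrites (typos, slang, ASR artifacts) whose answer still
    contains the expected fact."""
    hits = sum(expected_fact.lower() in ask_chatbot(q).lower() for q in noisy_variants)
    return hits / len(noisy_variants)
```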
Integration into CI/CD
To ensure ongoing reliability:
- The benchmark suite is run after every major change to the retrieval or generation components.
- A snapshot-based evaluation is included as part of the CI/CD pipeline.
- Thresholds are defined for acceptance, particularly for hallucination rate, negative response handling, and faithfulness score; regressions trigger alerts or rollback mechanisms.
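A minimal sketch of such a gate as a pytest check, assuming the benchmark run writes a `benchmark_snapshot.json` summary; the snapshot path, metric names, and threshold values are placeholders to be tuned per release.

```python
# CI/CD gate: fail the pipeline when the benchmark snapshot regresses past thresholds.
import json

THRESHOLDS = {
    "faithfulness": 0.85,        # minimum acceptable mean score
    "negative_rejection": 0.90,  # minimum acceptable refusal rate on unanswerable queries
    "hallucination_rate": 0.05,  # maximum acceptable rate
}

def load_snapshot(path: str = "benchmark_snapshot.json") -> dict:
    with open(path) as f:
        return json.load(f)

def test_benchmark_thresholds():
    results = load_snapshot()
    assert results["faithfulness"] >= THRESHOLDS["faithfulness"]
    assert results["negative_rejection"] >= THRESHOLDS["negative_rejection"]
    assert results["hallucination_rate"] <= THRESHOLDS["hallucination_rate"]
```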
Evaluation Summary
Despite the absence of off-the-shelf airline-specific benchmarks, our investigation revealed transferable strategies and adaptable toolkits from domains like medicine, finance, and law. By combining structured dataset creation, automated LLM-assisted scoring, and benchmark integration into development pipelines, we propose a scalable, domain-sensitive, and high-fidelity evaluation methodology tailored to the realities of an airline QA chatbot. This ensures that the system is not only performant but also trustworthy, resilient, and grounded in facts.
Conclusions
Ensuring accurate and faithful responses in a QA chatbot is a multi-faceted challenge that requires a careful balance between model capabilities, infrastructure feasibility, and evaluation reliability. Our investigation has shown that prompt-based answer generation techniques, such as Chain-of-Verification, Self-Correction, and structured prompting, offer a practical and scalable way to enhance response quality without incurring the costs of fine-tuning or complex model pipelines. At the same time, in the absence of an airline-specific benchmark, we propose a custom evaluation flow grounded in methodologies from adjacent domains. By focusing on hallucination detection, faithfulness scoring, and dataset creation strategies, and by integrating frameworks like RAGAS or HaluEval 2.0, the proposed evaluation process can be embedded into CI/CD pipelines to maintain quality over time. Together, the selected techniques and evaluation methods form a solid foundation for building a trustworthy, airline-grade AI chatbot that aligns with both user expectations and enterprise constraints.