
Advanced Answer Generation Techniques for Airline Chatbots

Answer generation strategies and evaluation methodologies for developing reliable, hallucination-free airline chatbot responses that maintain accuracy while balancing user experience.

Ernesto Carrasco
Applied AI Engineer, Kaiban
Published: August 9, 2025
Read time: 25 min

Abstract

AI-powered chatbots are transforming customer service in aviation, where accuracy isn't just important; it's critical. This article explores techniques and evaluation methods for building reliable answer generation systems within airline chatbots.

The challenge?

Large language models love to hallucinate. They misrepresent context, invent facts, and ignore retrieved information when generating responses. In aviation, where incorrect baggage policies or flight change rules can cost airlines millions, this isn't acceptable.

We examine practical, low-cost strategies built on prompt engineering, and we design evaluation frameworks tailored to retrieval-augmented generation (RAG) pipelines. Our analysis draws on recent research in hallucination detection, answer faithfulness, and LLM evaluation. We propose methodologies for domain-specific dataset creation and benchmarking, essential tools that are missing from current airline-specific resources.

The goal is simple: guide implementation of robust, scalable chatbots capable of supporting real-world airline operations without the hallucination headaches.

1. Problem Statement

Designing a reliable QA chatbot for the airline industry introduces several domain-specific and technical challenges that must be addressed from the outset:

  • High requirement for accuracy: Airline customer queries often involve time-sensitive or policy-critical information (e.g., flight changes, baggage rules, refunds). Incorrect answers can lead to user frustration, compliance risks, or financial penalties.
  • LLM limitations: While large language models are strong at generating fluent responses, they frequently:
    • Hallucinate facts or make up information.
    • Ignore important context retrieved by the RAG system.
    • Rely on pretraining knowledge instead of grounded content.
  • Lack of domain-specific benchmarks: Most available QA benchmarks (e.g., Natural Questions, MedHallu, FinanceBench) are either open-domain or targeted at medical, legal, or financial domains. No established datasets or benchmarks exist for airline-specific use cases.
  • Need for scalable evaluation: A practical evaluation strategy must:
    • Detect hallucinations.
    • Measure faithfulness and correctness of answers.
    • Support automated scoring for rapid iteration.
    • Integrate with CI/CD pipelines for ongoing QA during development.
  • Feasibility over complexity: Many high-performance QA pipelines rely on complex models, fine-tuning, or multi-stage reranking. These are expensive to build and maintain. Early-stage airline chatbots need cost-effective techniques that can be deployed using general-purpose models and cleanly structured prompts.
  • Custom dataset requirement: To measure quality accurately in this domain, a tailored dataset must be built using airline documents, user-relevant questions, and annotated answers, enabling internal benchmarking that reflects real-world usage.

2. Research Questions

A. Answer Generation Techniques

  1. What are the main existing approaches for generating accurate and reliable answers in a retrieval-augmented chatbot?
  2. Which attributes (e.g., hallucination reduction, cost-efficiency, implementation complexity, compatibility with RAG) are used to compare answer generation techniques?
  3. How do these techniques perform across those attributes in terms of quality, scalability, and robustness in an airline domain?
  4. Which techniques should be selected for the first version of the airline chatbot and why are they most appropriate for early deployment?

B. Evaluation Methods

  1. What are the primary evaluation methods and benchmarks used to assess the performance of QA chatbots, particularly in detecting hallucinations and measuring answer quality?
  2. Which criteria (e.g., faithfulness, correctness, hallucination rate, robustness, LLM-human agreement) are used to compare evaluation methods and datasets?
  3. How do selected benchmarks and evaluation strategies perform across those criteria, especially for domain-specific or RAG-based systems?
  4. Which evaluation flow and tools should be implemented for the airline QA chatbot, and why do they best support accuracy, maintainability, and CI/CD integration?

Techniques

In the architecture of a QA chatbot, the answer generation step plays a critical role in shaping the user experience and ensuring factual correctness. Once a user's intent is identified and relevant documents are retrieved, the system must generate a coherent, accurate, and contextually grounded response. This is particularly important in domains like aviation, where misinformation about policies, legal terms, or procedures can have significant operational and reputational consequences.

While large language models (LLMs) have shown strong capabilities in generating fluent text, they are also prone to hallucinations, bias toward pre-trained knowledge, and misinterpretation of retrieved context. To address these challenges, a range of techniques has emerged, primarily in two areas: prompt construction, which aims to guide the model's behavior through carefully crafted instructions, and response generation, which involves mechanisms for improving how the model synthesizes information. This section explores those techniques, compares their strengths and limitations, and justifies the most suitable strategies for the development of an airline-specific QA chatbot.

Attributes for comparison

| Attribute | Description | Relevance to airline chatbot | References |
|---|---|---|---|
| Hallucination Mitigation | Evaluates how well a technique prevents the generation of hallucinated or fabricated information by ensuring all factual claims are grounded in the retrieved documents. | Incorrect answers about baggage allowances, visa rules, or rebooking policies can lead to customer dissatisfaction, regulatory issues, or operational mistakes. | arXiv:2409.13385, arXiv:2405.07437, arXiv:2303.18223 |
| Answer Relevance & Completeness | How directly and fully the answer addresses the user's question, including all required details from the relevant sources. | Incomplete or partially correct answers (e.g., mentioning baggage size but not weight) can lead to confusion and repeat queries during travel planning or check-in. | arXiv:2409.13385, arXiv:2405.07437, arXiv:2309.01431 |
| Latency Impact | How the technique affects system responsiveness and the computation time of the generation process. | Timely responses are critical during high-stress situations where customers expect near-instant replies. | arXiv:2504.09037, arXiv:2303.18223 |
| Robustness to Irrelevant or Conflicting Information | Measures a technique's ability to handle "noise" in the retrieved context. | Policy documents may contain outdated or duplicate clauses; robust techniques avoid errors by focusing only on the correct version for the user's context. | arXiv:2409.13385, arXiv:2405.07437, arXiv:2309.01431 |
| Cost Impact | Evaluates the overall resource and financial burden associated with implementing and operating the technique. | Scaling requires delivering high accuracy without inflating operational budgets. | arXiv:2504.09037, arXiv:2303.18223 |
| Evaluation Ease | How easily the technique's performance can be evaluated using standard metrics or evaluation pipelines. | Fast evaluation cycles help teams verify quality before production updates. | — |
| Implementation Complexity | Technical effort required to implement, integrate, deploy, and maintain the technique. | Teams must balance performance with simplicity for fast product delivery. | — |
| Data Dependency | How dependent a technique is on training or fine-tuning datasets. | Low data dependency or easy update mechanisms keep the chatbot accurate and compliant without constant retraining. | — |

Comparison table

| Technique | Paper | Hallucination mitigation | Answer relevance | Latency impact | Robustness to irrelevant or conflicting information | Cost impact | Implementation complexity | Data dependency | Paper date | Cited by |
|---|---|---|---|---|---|---|---|---|---|---|
| Select most relevant documents from retrieval results | REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering | Medium | High | Medium (extra steps) | High | Medium-high (fine-tuning costs + extra inference steps) | High (requires fine-tuned models and datasets) | Fine-tuned models | 2024-11 | 38 |
| Context compression | Context Embeddings for Efficient Answer Generation in RAG | Medium-high | High | Low | High | Medium-high (fine-tuning costs + extra inference steps) | High (fine-tuning for decoder/encoder) | Fine-tuned models | 2024-10 | 19 |
| Compression and selective augmentation | RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation | Medium-high | High | Medium | High | Medium-high (fine-tuning costs + extra inference steps) | High (fine-tuning for 2 compressors) | Fine-tuned models | 2023-10 | 176 |
| Self-check to increase quality of generated output | Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy | Medium-high | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Low (prompt adjustments; examples and code in paper) | None | 2023-10 | 310 |
| Self-Correction | Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation | High (iterative RAG-generation synergy) | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Medium (paper includes prompts and algorithms) | None | 2024-03 | 214 |
| Prompting for complex tasks | TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks | Medium-high | High | Low | High | Low | Low (prompt improvement) | None | 2023-10 | 80 |
| Self-Reflexion | Chain-of-Verification Reduces Hallucination in Large Language Models | High | High | Medium (iterative process) | High | Medium (inference cost for each iteration) | Low (paper includes prompt templates) | None | 2023-09 | 514 |

Proposed Answer Generation Strategy for the Airline QA Chatbot

In evaluating how to generate high-quality responses for the airline QA chatbot, our research focused on identifying techniques that not only improve factual accuracy and reduce hallucinations but also align with constraints of cost, development effort, and scalability. We reviewed a broad set of strategies grouped into two practical categories: prompt construction and response generation.

Recent literature and benchmark studies show that many state-of-the-art improvements depend on complex architectures, such as multi-stage reranking pipelines or fine-tuned generation models. While these methods can significantly improve faithfulness and contextual accuracy, they often require substantial engineering effort and rely on large, domain-specific models. This makes them impractical for early-stage deployment, where simplicity, speed, and cost-efficiency are critical.

Instead, we found strong support for prompt-based methods that leverage general-purpose, cost-effective LLMs like GPT-4o-mini. These techniques offer a high return on investment by enhancing model behavior through structured prompting without needing model retraining. Among the most promising are Chain-of-Verification, which prompts the model to internally validate its output before responding, and Self-Correction, which encourages detection of contradictions or unsupported claims. Approaches like ReAct and Self-Ask further improve reasoning by breaking complex queries into sub-steps or decisions. We also identified the effective use of Few-Shot prompting, especially when paired with Chain-of-Thought (CoT) reasoning, as a lightweight method to steer answer generation toward reliable, structured outputs.
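As an illustration of the Few-Shot plus CoT pattern, the template below shows one way such a prompt might be assembled. The worked example, policy text, and placeholder names are invented for this sketch and would be replaced with curated, verified airline content.

```python
# Illustrative few-shot + Chain-of-Thought prompt template. The worked example
# and policy text are invented; in practice they come from verified airline docs.
FEW_SHOT_COT_TEMPLATE = """You answer airline policy questions using ONLY the given context.
Think step by step, then give a short final answer.

Example
Context: Economy Light fares include one personal item (40x30x20 cm). Checked bags cost extra.
Question: Can I check a suitcase for free on an Economy Light fare?
Reasoning: The fare includes only a personal item; checked baggage is listed as a paid extra.
Answer: No. Economy Light includes one personal item only; checked bags must be purchased.

Now the real query
Context: {context}
Question: {question}
Reasoning:"""

# Fill the template with retrieved context and the user's question.
prompt = FEW_SHOT_COT_TEMPLATE.format(
    context="Flight changes are free up to 24 hours before departure on Flex fares.",
    question="Can I change my Flex fare flight two days before departure at no cost?",
)
```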

Given the need for fast, cost-efficient, and transparent development in the first iteration of the airline QA chatbot, we recommend the following stepwise adoption of answer generation techniques (a combined prompt sketch follows the list):

  1. Prompt Engineering Best Practices
    • Use clear, structured prompts with system-level instructions emphasizing accuracy, faithfulness, and domain knowledge preference over generic completions.
    • Incorporate user intent explicitly and guide the model to avoid speculation.
  2. Chain-of-Verification (CoVe)
    • Start with inline self-verification prompts that ask the model to validate the factual basis of its answer before finalizing it.
    • Future expansion: use multi-stage verification with RAG-based re-checks or document comparisons.
  3. Prompt Tactics During Tuning
    • Integrate and test ReAct, CoT, Self-Ask, and other tactics as enhancements based on use-case needs.
    • Use A/B testing and evaluation benchmarks (described in the Evaluation section) to guide iteration.
  4. Add Self-Correction Layer
    • Introduce a second-generation pass that prompts the LLM to detect contradictions or unsupported claims in its own response.
    • This step may be triggered selectively for high-stakes queries (e.g., cancellations, legal clauses).
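To make this stepwise plan concrete, the sketch below ties steps 1, 2, and 4 together: a structured system prompt, an inline Chain-of-Verification pass, and a self-correction pass triggered only for high-stakes queries. It assumes the OpenAI Python client and the gpt-4o-mini model mentioned above; the prompt wording, helper names, and high-stakes keyword list are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of the recommended prompting pipeline (steps 1, 2, and 4).
# Assumptions: OpenAI Python client; gpt-4o-mini as the general-purpose model
# named in the text; all prompt wording and helper names are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

SYSTEM_PROMPT = (
    "You are an airline customer-support assistant. Answer ONLY from the "
    "provided policy excerpts. If the excerpts do not contain the answer, "
    "say you don't know. Never speculate about fares, baggage, or legal terms."
)

def ask(messages: list[dict]) -> str:
    """Single chat completion call with deterministic decoding."""
    resp = client.chat.completions.create(model=MODEL, messages=messages, temperature=0)
    return resp.choices[0].message.content

def generate_answer(question: str, context: str) -> str:
    """Step 1: structured prompt with retrieved context and explicit user intent."""
    user = f"Context:\n{context}\n\nCustomer question: {question}\n\nAnswer faithfully."
    return ask([{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user}])

def verify_answer(question: str, context: str, draft: str) -> str:
    """Step 2: inline Chain-of-Verification pass over the draft answer."""
    user = (
        f"Context:\n{context}\n\nQuestion: {question}\nDraft answer: {draft}\n\n"
        "List each factual claim in the draft, check it against the context, "
        "then output a corrected final answer that keeps only supported claims."
    )
    return ask([{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user}])

def self_correct(context: str, answer: str) -> str:
    """Step 4: second pass that flags contradictions or unsupported claims."""
    user = (
        f"Context:\n{context}\n\nAnswer under review: {answer}\n\n"
        "Does the answer contradict itself or the context? If yes, rewrite it; "
        "if not, return it unchanged."
    )
    return ask([{"role": "user", "content": user}])

# Illustrative trigger list for the selective self-correction layer.
HIGH_STAKES = ("cancel", "refund", "legal", "compensation")

def answer_pipeline(question: str, context: str) -> str:
    draft = generate_answer(question, context)
    checked = verify_answer(question, context, draft)
    if any(k in question.lower() for k in HIGH_STAKES):
        checked = self_correct(context, checked)
    return checked
```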

Evaluation methods

Robust evaluation of a QA chatbot is essential to ensure that its responses are accurate, relevant, and trustworthy. A critical challenge in modern AI systems, particularly those using large language models (LLMs), is the phenomenon of hallucination, where the system generates fluent but factually incorrect or fabricated information. In high-stakes domains like aviation, such hallucinations can result in misleading information for passengers, regulatory issues, and ultimately, loss of user trust. Thus, evaluation must go beyond surface-level fluency and measure faithfulness, relevance, and robustness to user intent and noisy input.

Attributes for comparison

| Attribute | Description | Relevance for airline chatbot |
|---|---|---|
| Metrics | The evaluation methods used to score chatbot answers or retrieval quality. | Choosing the right metrics ensures accurate, useful, and safe answers in customer-facing systems. |
| Integration/Use Complexity | The effort required to implement and use the metric in a real system. | Low-complexity metrics help teams iterate and deploy faster in production airline use cases. |
| Generated Dataset | Whether the metric requires a benchmark or labeled dataset for evaluation. | Important for testing on controlled scenarios like fare rules, baggage policy, or FAQs. |
| Reproducible Dataset Generation for Airlines | Whether the evaluation data can be generated specifically for airline use cases. | Enables continuous domain-specific evaluation, aligned with routes, services, or regulations. |
| Paper Date | The year the metric was introduced in the literature. | Helps gauge maturity; newer metrics may be more capable, older ones more proven. |
| Cited By | Number of academic or industry citations referencing the metric. | Indicates trust, adoption, and stability of the metric in research and production settings. |

Evaluation Metrics

| Metric | Description | Relevance for airline chatbot |
|---|---|---|
| Exact Match (EM) | Measures if the generated answer matches the reference exactly. | Useful for strict responses like flight numbers or booking codes where precision is critical. |
| F1 Score | Balances precision and recall between predicted and reference answers. | Important for evaluating short factual responses like baggage rules or visa requirements. |
| BLEU | Assesses n-gram overlap between generated and reference texts. | Limited use for open-ended answers but helpful for checking structured language consistency. |
| ROUGE | Measures recall-based n-gram overlap, especially in summarization tasks. | Useful for summarizing flight policies or condensing customer service responses. |
| BERTScore | Uses BERT embeddings to compare semantic similarity between responses. | Effective for detecting semantic accuracy in varied natural language phrasing from users. |
| LLM as Judge | Uses a language model to evaluate response quality holistically. | Highly relevant for evaluating nuanced responses in complex airline queries and dynamic situations. |
| Faithfulness | Measures if the answer is supported by the retrieved context. | Crucial to prevent hallucinated or unsupported claims in regulatory or safety-related answers. |
| Correctness | Evaluates factual accuracy of the answer. | Vital to ensure legal and operational accuracy in flight information or service policies. |
| Relevance | Assesses how well the answer addresses the user's query. | Ensures responses are aligned with customer intent in high-stakes travel scenarios. |
| Hallucination | Tracks invented or incorrect content not grounded in source documents. | Critical to avoid misinformation about itineraries, costs, or regulations. |
| Others | Placeholder for task-specific or emerging metrics. | Allows flexibility to incorporate domain-specific criteria like tone or empathy. |
| Noise Robustness | Tests model resilience to typos, slang, or speech-to-text artifacts. | Important for voice-based interactions or noisy chat inputs in multilingual airline settings. |
| Negative Rejection | Assesses the system's ability to reject unanswerable or irrelevant queries. | Key for maintaining trust by not fabricating answers to security, legal, or operational questions. |
| Counterfactual Robustness | Tests if the system changes output appropriately with minimal input change. | Ensures reliability and consistency in close variants of customer questions (e.g., flight times). |

References for metrics:

  • https://arxiv.org/pdf/2501.00269 (pp. 2-3)
  • https://arxiv.org/pdf/2309.01431v2 (pp. 3-5)
  • https://arxiv.org/pdf/2312.10997 (pp. 12, 14)
  • https://arxiv.org/pdf/2405.07437 (pp. 9-12)
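To make the string-overlap metrics above concrete, here is a small, self-contained example of Exact Match and token-level F1. The normalization is simplified relative to standard SQuAD-style scoring (which also strips articles and punctuation), and the example strings are invented.

```python
# Worked example for two string-overlap metrics: Exact Match (EM) and token F1.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Simplified normalization: lowercase and whitespace tokenization.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)        # overlapping tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "carry-on bags may weigh up to 10 kg"
prediction = "your carry-on bag may weigh up to 10 kg"
print(exact_match(prediction, reference))          # 0 - not an exact match
print(round(f1_score(prediction, reference), 2))   # 0.82 - high token overlap
```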

Comparison table

Relevant benchmarks

| Benchmark | Area | Integration/use complexity | Generated dataset | Reproducible dataset generation for airlines | Dataset creation complexity | Dataset notes | Paper date | Cited by |
|---|---|---|---|---|---|---|---|---|
| FaithBench | Answer generation / summarization hallucinations | Medium (create a new module) | No (Vectara's leaderboard) | Yes | High (requires manual annotations per summary) | Uses Vectara's hallucination leaderboard | 2024-10 | 8 |
| RAGAS | Answer generation / RAG, automated evaluation | Low (installable Python package) | Yes | Yes | Medium (create a golden set of questions with human revision) | WikiEval (generated from 50 Wikipedia pages); questions generated by ChatGPT from the documents; answers generated by ChatGPT; human annotations on QA pairs | 2024-03 | 662 |
| RAGBench | RAG system | Medium (create a new module) | Yes | Yes | Medium (create a golden set of questions with human revision) | 11K questions from 12 existing domain-specific datasets; answers generated with GPT; LLM as annotator (93% human agreement for DelucionQA) | 2024-06 | 38 |
| From the "Hallucination Free?" paper | RAG system | Not feasible (human evaluation) | Yes | No | Domain mismatch | 202 questions in 4 categories; tied closely to the legal domain | 2024-05 | 118 |
| HaluEval 2.0 | RAG system | Medium (create a new module) | Yes | Yes | Medium (create a golden set of questions with human revision) | Takes questions from 6 domain-specific datasets; generates 3 responses per question using ChatGPT; 8,770 questions; human annotations | 2024-01 | 150 |
| Retrieval-Augmented Benchmark | RAG system | Hard (unclear how to implement the benchmark) | Yes | Partial (unclear how to implement the benchmark) | Medium (create a golden set of questions with human revision) | 600 generated questions + 200 for information integration + 200 for robustness; 50% English, 50% Chinese | 2023-12 | 664 |
| Domain-RAG | RAG system abilities (6) | Medium (human annotations) | Yes | Yes | Medium (create a golden set of questions with human revision) | Generates 7 datasets (extractive, conversational, structured, faithful, noisy, multi-doc) covering each ability; for each dataset, ChatGPT generates question-answer pairs from external documents where needed; responses are filtered manually | 2024-06 | 22 |

Benchmark Landscape: What the Research Reveals

During our investigation, we studied a wide range of evaluation methodologies and datasets, spanning from foundational benchmarks like Natural Questions to more advanced evaluations such as DO-RAG. Despite an exhaustive review of domain-specific resources, no benchmarks tailored to the airline industry were found. Most domain-specific evaluation datasets center around medicine, law, and finance. Examples include:

  • Med-HALT and MedHallu – for detecting hallucinations in medical QA.
  • FinanceBench – evaluates open-book financial QA performance.
  • Hallucination-Free? (Stanford University) – one of the few studies assessing commercial legal chatbots using manually curated questions and evaluations. While insightful, its manual process and proprietary data make it infeasible for reuse in a scalable airline chatbot benchmark.

Two comprehensive surveys, arXiv:2404.12041v3 and arXiv:2405.07437, helped map the evolving landscape of hallucination evaluation, scoring techniques, and annotator reliability. These reviews confirmed a strong trend: modern evaluation workflows increasingly rely on LLMs as annotators and scorers, primarily due to their adaptability, speed, and cost-effectiveness. These LLMs are used to assess correctness, detect hallucinations, score fluency, and measure alignment with user intent.

From these findings, we distilled a strategy to design an evaluation framework suitable for our airline-specific, RAG-based chatbot, emphasizing automated, scalable techniques while incorporating human-in-the-loop validation for critical subsets.

Evaluation Framework and Strategy for the Airline QA Chatbot

The selection of evaluation methods was driven by the need to assess both the accuracy of question answering and the presence of hallucinations in generated responses. Among the many options, we prioritized frameworks that offer a clear methodology for scoring faithfulness, annotating hallucinations, and integrating seamlessly into development pipelines. A key factor in the comparison was the availability of tools or guidance for dataset creation, particularly those suited for RAG-based systems. Since no existing benchmarks target the airline domain, the selection focused on transferable techniques and workflows that could be adapted to build our own domain-specific dataset and benchmark, rather than relying on out-of-the-box solutions.

We divided the evaluation pipeline into two main phases: Dataset Creation and Benchmark Execution.

1. Dataset Creation

A carefully constructed dataset is the foundation of any meaningful evaluation. Our process includes the following stages:

1.1 Data Collection

  • Extract authoritative content from airline websites, policies, FAQs, and internal documentation. Sources are filtered for factual accuracy and categorized (e.g., baggage, check-in, cancellations, legal, claims, etc.).

1.2 Question Generation

  • Use LLMs (e.g., GPT-4) to generate representative user questions for each category (a generation sketch follows this list).
  • Perform a human review to filter out ambiguous, unanswerable, or overlapping questions.
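A minimal sketch of this question-generation step, assuming the OpenAI Python client; the model choice, prompt wording, category handling, and the human_review stub are illustrative assumptions rather than a fixed design.

```python
# Sketch of LLM-assisted question generation per category (section 1.2).
import json
from openai import OpenAI

client = OpenAI()

def generate_questions(category: str, source_text: str, n: int = 10) -> list[str]:
    """Ask the model for n candidate passenger questions grounded in one document."""
    prompt = (
        f"You are preparing an evaluation set for an airline chatbot.\n"
        f"Category: {category}\n"
        f"Policy excerpt:\n{source_text}\n\n"
        f"Write {n} distinct questions a passenger might realistically ask "
        f"that can be answered from the excerpt. Return a JSON list of strings."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # In practice, add retry/validation: the model may wrap output in code fences.
    return json.loads(resp.choices[0].message.content)

def human_review(questions: list[str]) -> list[str]:
    """Stub for the manual pass that drops ambiguous, unanswerable, or
    overlapping questions (e.g., export to an annotation tool or spreadsheet)."""
    return questions
```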

1.3 Answer Generation & Candidate Retrieval

  • For each validated question, generate three candidate answers.
  • Backtrack each answer to identify the source question that would most likely yield it, using semantic similarity matching. This simulates document-based retrieval in RAG systems (inspired by RAGAS); a matching sketch follows.
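One way this backtracking step might look, assuming the sentence-transformers package; the embedding model name and similarity threshold are illustrative choices.

```python
# Sketch of the answer-to-question backtracking step (section 1.3): for each
# generated answer, find the validated question it most plausibly responds to.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def backtrack(answers: list[str], questions: list[str], threshold: float = 0.5):
    """Return (answer, best-matching question or None, similarity score) triples."""
    a_emb = model.encode(answers, convert_to_tensor=True)
    q_emb = model.encode(questions, convert_to_tensor=True)
    sims = util.cos_sim(a_emb, q_emb)            # shape: (n_answers, n_questions)
    matches = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        score = float(row[j])
        # Low-similarity pairs are flagged (None) for manual review instead of auto-linking.
        matches.append((answers[i], questions[j] if score >= threshold else None, score))
    return matches
```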

1.4 Human Annotation (First Phase)

  • Select a subset of question-answer pairs for manual annotation. Annotators label:
    • Correctness: Is the answer factually accurate?
    • Hallucination: Are any unsupported claims made?
  • Compare human judgments with LLM-generated labels to assess human-LLM agreement, helping validate the use of LLMs as proxy annotators (an agreement-check sketch follows).
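A small sketch of that agreement check, assuming scikit-learn; the binary label scheme and the example arrays are invented for illustration.

```python
# Sketch of the human-vs-LLM agreement check (section 1.4).
from sklearn.metrics import cohen_kappa_score

# 1 = answer labeled correct / no hallucination, 0 = otherwise (illustrative data)
human_labels = [1, 1, 0, 1, 0, 1, 1, 0]
llm_labels   = [1, 1, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(human_labels, llm_labels)
agreement = sum(h == l for h, l in zip(human_labels, llm_labels)) / len(human_labels)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```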

1.5 Dataset Expansion (Second Phase)

  • Extend the dataset with more diverse QA pairs using semi-automated generation.
  • Add non-relevant questions and ambiguous queries to evaluate negative rejection and noise robustness.
  • Annotate with tools like FaithBench, RAGAS, and HaluEval 2.0 to assess faithfulness (factuality with respect to retrieved context), fluency, and truthfulness.

2. Benchmark Definition & Integration

With a curated dataset, the benchmark process includes two evolving phases:

2.1 Initial Benchmarking

  • Integrate a flexible evaluation framework such as RAGAS, which allows granular scoring of faithfulness, context relevance, and hallucination (a minimal usage sketch follows this list).
  • Additional options include:
    • ARES – for robust scoring with reference-free metrics.
    • DeepEval – for integration into CI/CD workflows and regression detection.
  • Each QA pair is evaluated using:
    • Traditional metrics: BLEU, ROUGE, METEOR, BERTScore.
    • LLM-based evaluators (e.g., GPT-4 or Claude) that assign scores (1–10) for criteria like fluency, correctness, and alignment with context.
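Below is a minimal sketch of scoring a QA pair with RAGAS. It follows the classic ragas.evaluate() entry point; column names and metric imports differ between ragas versions, so treat it as a starting point rather than a guaranteed interface. The sample data is invented.

```python
# Sketch: scoring a handful of QA pairs with RAGAS (classic 0.1.x-style API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

samples = {
    "question": ["How many cabin bags can I bring on a basic fare?"],
    "answer": ["One cabin bag up to 10 kg is included in the basic fare."],
    "contexts": [["Basic fare includes one cabin bag with a maximum weight of 10 kg."]],
    "ground_truth": ["One cabin bag of up to 10 kg is included."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # per-metric scores; export to a report or CI artifact as needed
```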

2.2 Advanced Benchmarking

  • Include domain-adapted evaluations for:
    • Hallucination rate
    • Faithfulness to retrieved context
    • User intent alignment
    • Handling of irrelevant or misleading queries
  • Introduce stress tests for negative rejection (e.g., the chatbot says “I don’t know” instead of hallucinating) and noise robustness (performance with adversarial or incomplete inputs); a rejection-test sketch follows.
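A minimal sketch of such a negative-rejection stress test; the unanswerable questions, refusal markers, and the chatbot() callable are illustrative assumptions.

```python
# Sketch of a negative-rejection stress test (section 2.2): out-of-scope or
# unanswerable questions should produce an explicit refusal, not a fabricated answer.
UNANSWERABLE = [
    "What is the pilot's home address for flight KB123?",
    "Will my flight next month be delayed?",
    "Can you change the EU air passenger rights regulation for me?",
]
REFUSAL_MARKERS = ("i don't know", "i cannot", "i'm not able", "please contact")

def test_negative_rejection(chatbot) -> float:
    """Return the fraction of unanswerable queries the chatbot correctly declines."""
    declined = 0
    for q in UNANSWERABLE:
        reply = chatbot(q).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            declined += 1
    return declined / len(UNANSWERABLE)
```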

Integration into CI/CD

To ensure ongoing reliability:

  • The benchmark suite is run after every major change to the retrieval or generation components.
  • A snapshot-based evaluation is included as part of the CI/CD pipeline.
  • Thresholds are defined for acceptance, particularly for hallucination rate, negative response handling, and faithfulness score. A regression triggers alerts or rollback mechanisms (a threshold-gate sketch follows).
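A minimal sketch of such a CI/CD threshold gate; the metric names, threshold values, and results-file format are illustrative assumptions.

```python
# Sketch of a CI/CD threshold gate: fail the build when benchmark scores regress.
import json
import sys

THRESHOLDS = {
    "faithfulness": 0.90,        # minimum acceptable score
    "negative_rejection": 0.95,  # minimum acceptable score
    "hallucination_rate": 0.05,  # maximum acceptable rate
}

def gate(results_path: str = "benchmark_results.json") -> int:
    with open(results_path) as f:
        scores = json.load(f)
    failures = []
    for metric, limit in THRESHOLDS.items():
        value = scores[metric]
        ok = value <= limit if metric == "hallucination_rate" else value >= limit
        if not ok:
            failures.append(f"{metric}={value} (limit {limit})")
    if failures:
        print("Benchmark regression detected:", "; ".join(failures))
        return 1  # non-zero exit code fails the CI job
    print("All benchmark thresholds met.")
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```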

Evaluation Summary

Despite the absence of off-the-shelf airline-specific benchmarks, our investigation revealed transferable strategies and adaptable toolkits from domains like medicine, finance, and law. By combining structured dataset creation, automated LLM-assisted scoring, and benchmark integration into development pipelines, we propose a scalable, domain-sensitive, and high-fidelity evaluation methodology tailored to the realities of an airline QA chatbot. This ensures that the system is not only performant but also trustworthy, resilient, and grounded in facts.

Conclusions

Ensuring accurate and faithful responses in a QA chatbot is a multi-faceted challenge that requires a careful balance between model capabilities, infrastructure feasibility, and evaluation reliability. Our investigation has shown that prompt-based answer generation techniques, such as Chain-of-Verification, Self-Correction, and structured prompting, offer a practical and scalable way to enhance response quality without incurring the costs of fine-tuning or complex model pipelines. At the same time, in the absence of an airline-specific benchmark, we propose a custom evaluation flow grounded in methodologies from adjacent domains. By focusing on hallucination detection, faithfulness scoring, and dataset creation strategies, and by integrating frameworks like RAGAS or HaluEval 2.0, the proposed evaluation process can be embedded into CI/CD pipelines to maintain quality over time. Together, the selected techniques and evaluation methods form a solid foundation for building a trustworthy, airline-grade AI chatbot that aligns with both user expectations and enterprise constraints.

Related Topics

answer generation · airline chatbots · LLM hallucination prevention · chatbot evaluation · airline AI · RAG systems · prompt engineering · aviation technology · AI trust · regulatory compliance

About the Author

Ernesto Carrasco

Applied AI Engineer, Kaiban

I'm a passionate and driven professional who truly believes in the power of hard work, collaboration, and making things happen. Throughout my career, I've had the chance to be part of complex and incredible projects, but nothing excites me more than being part of a movement that's transforming air travel through AI.
