Finetuning LLMs to Draft Parliamentary Responses

Introduction

This repo contains the code for the experiment set-up to fine-tune LLMs on Hansard written parliamentary questions and answers, as described in this blog post. The set-up consists of the following stages, as shown in the diagram below:

  • Data Collection and Preprocessing
  • Fine-tuning
  • Evaluation

[Overall Workflow diagram]

Data Collection and Preprocessing

  1. Consolidating written questions and answers from Hansard: Crawl Hansard data for written questions and answers (code not in this repo). Store the results in a jsonl file (e.g. written_question_answers.jsonl) with each row having these fields (a loading sketch is shown below):

    • title: Title of parliamentary question
    • question: Written question that was asked
    • answer: Answer to question
    • filename: Reference to date of sitting in the format sitting_YYYY-MM-DD

    E.g. {"title": "Proportion of Public Service Agencies that Provide Unpaid Leave for Parents with Multiple or Pre-term Babies", "question": "Mr Louis Ng Kok Kwang asked the Prime Minister (a) what is the number and percentage of public service agencies that provide unpaid leave for parents with multiple or pre-term babies; (b) what is the take-up rate of such leave for each year in the past five years; and (c) if data on the take-up rate has not been collected, whether the Prime Minister's Office will start collecting this data. ", "answer": "In the past five years, an average of 25 officers in the Civil Service utilised this leave provision each year.", "filename": "sitting_2023-02-22"}

  2. Classify answers based on whether they answer the questions directly, using GPT-4o: Run python data_generation/classify_parliamentary_questions.py --model_type "gpt4" --input_file_path <path_to_input_file> --output_file_path <path_to_save_classified_file> to generate a file indicating whether each question was answered, together with summary points of the answer. Output will be stored in <path_to_save_classified_file> (e.g. written_question_answers_processed.jsonl).

  3. Generate hypothetical documents as context for answered questions: Run python data_generation/generate_hypotherical_documents.py --model_type "gpt4" --input_file_path <path_to_classified_file> --output_file_path <path_to_save_file_with_hypothetical_docs> to get the hypothetical documents for each record

  4. Split the data into train and test splits, and format them for fine-tuning: Run python data_generation/train_test_split_format.py --input_file_path <path_to_file_with_hypothetical_docs> --output_dir_path <path_to_dir_to_save_train_test_split>. <path_to_dir_to_save_train_test_split> will then contain the following files (a rough sketch of this step is shown below):

    1. hansard_answered_questions_train.csv: Train data (pandas DataFrame saved as CSV)
    2. hansard_answered_questions_test.csv: Test data (pandas DataFrame saved as CSV)
    3. hansard_answered_questions_llama3_formatted_train.jsonl: Train data formatted for fine-tuning
    4. hansard_answered_questions_llama3_formatted_test.jsonl: Test data formatted for fine-tuning
    5. hansard_answered_questions_llama3_formatted_test_no_response.jsonl: Test data formatted for fine-tuning, with the response removed (used as prompts for the fine-tuned models to complete)

The processes can also be executed by running data_generation/generate_data.sh.
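
The snippet below is a minimal sketch of how the crawled written_question_answers.jsonl from Step 1 can be loaded and sanity-checked before the later steps run; the field names follow the schema above, but the checking logic itself is not part of this repo.

```python
import json

# Load the crawled written question-and-answer records (one JSON object per line).
records = []
with open("written_question_answers.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

# Sanity-check that every record carries the fields expected by the later steps.
expected_fields = {"title", "question", "answer", "filename"}
for i, record in enumerate(records):
    missing = expected_fields - record.keys()
    if missing:
        raise ValueError(f"Record {i} is missing fields: {missing}")

print(f"{len(records)} records loaded, e.g. {records[0]['filename']}: {records[0]['title']}")
```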
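
For Step 4, the sketch below only illustrates the split-and-save flow; the actual split ratio, random seed, column names (e.g. the hypothetical-document field), and the Llama-3 prompt template are defined in data_generation/train_test_split_format.py, and the helper and file names used here are placeholders.

```python
import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder input name for <path_to_file_with_hypothetical_docs>.
df = pd.read_json("written_question_answers_with_hypothetical_docs.jsonl", lines=True)

# 80/20 split; the ratio and seed used by the repo's script may differ.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("hansard_answered_questions_train.csv", index=False)
test_df.to_csv("hansard_answered_questions_test.csv", index=False)

def to_llama3_record(row, include_response=True):
    # Hypothetical formatting helper: the real prompt template lives in
    # data_generation/train_test_split_format.py.
    prompt = f"Question: {row['question']}\nContext: {row.get('hypothetical_document', '')}\n"
    record = {"input": prompt}
    if include_response:
        record["output"] = row["answer"]
    return record

with open("hansard_answered_questions_llama3_formatted_train.jsonl", "w", encoding="utf-8") as f:
    for _, row in train_df.iterrows():
        f.write(json.dumps(to_llama3_record(row)) + "\n")
```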

Fine-tuning and Inference

  1. Fine-tune LLMs: Fine-tune models using axolotl with the prompt-formatted train and test files from Step 4 above. These are the config files I used for Llama-3-8B and Gemma-2-9B.

  2. Generate answers using fine-tuned models: After the models have been fine-tuned, use axolotl's inference scripts to generate responses for the test data, using the file hansard_answered_questions_llama3_formatted_test_no_response.jsonl as input. Use <|eot_id|> as the token for early stopping. Save the responses in the generation field as follows, where input contains the original prompt-formatted input used for generation (a hedged generation sketch is shown below): {"generation": <fine-tuned model's generated answer>, "input": <original prompt-formatted input>}

  3. Generate answers using a pre-trained LLM: Get GPT-4o's zero-shot and one-shot answers to the same questions by running python pretrained_prediction/generate_answers_pretrained.py --model_type "gpt4" --one_shot True/False --input_path <path_to_hansard_answered_questions_test_df> --output_file_path <path_to_save_gpt4_answers>. pretrained_prediction/batch_generate_answers_pretrained.sh shows an example each of zero-shot and one-shot generation.
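
As a rough illustration of Step 2, the sketch below loads a fine-tuned Llama-3 checkpoint with Hugging Face transformers (rather than axolotl's own inference scripts), stops generation at <|eot_id|>, and writes predictions in the {"generation", "input"} format described above. The checkpoint path, output file name, and the name of the prompt field in the no-response test file are assumptions.

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "outputs/llama3-8b-hansard"  # hypothetical path to the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

# <|eot_id|> marks the end of a turn in the Llama-3 chat format; use it for early stopping.
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

with open("hansard_answered_questions_llama3_formatted_test_no_response.jsonl") as f_in, \
     open("llama3_finetuned_predictions.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        prompt = record["input"]  # assumed field name for the prompt-formatted input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=512, eos_token_id=eot_id)
        # Keep only the newly generated tokens, not the prompt.
        generation = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        f_out.write(json.dumps({"generation": generation, "input": prompt}) + "\n")
```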

Evaluation

  1. Combine the predictions into a single pandas DataFrame: Use the notebook evaluation/combine_predictions.ipynb to combine the prediction files into one pandas DataFrame and save it as a single csv file, consolidated_predictions.csv. A notebook is used instead of a script so that merging errors from the different files are easy to troubleshoot.

  2. Select the better answer using a pre-trained LLM: Use Gemini to select between two answers by running python evaluation/llm_preference_eval.py --eval_llm_type "gemini" --output_dir <path_to_output_dir> --input_file_name <path_to_consolidated_prediction_csv_file> --answer_key_1 <field_for_first_llm_answer> --answer_key_2 <field_for_second_llm_answer> --output_file_prefix <llm1_name_llm2_name>. Output will be a jsonl file with the fields from the input DataFrame and an additional evaluation field containing the JSON string output from the evaluation LLM, e.g. {"evaluation": "{\"winner\": \"a\", \"reason\": \"Answer A is factually correct as per the context and also matches the model answer. Answer B is factually correct but uses more words to convey the same information and is not as concise as the model answer.\"}"}. A sketch for parsing this output into win rates is shown after this list.

  3. Compute Ragas scores using a pre-trained LLM: Use Claude-3 Sonnet (Gemini results in too many JSON parsing errors for the Ragas library) to generate the Ragas scores by running python evaluation/ragas_eval.py --eval_llm_type "claude" --output_dir <path_to_output_dir> --input_file_name <path_to_consolidated_prediction_pandas_df_file> --answer_key <field_for_llm_answer> --output_file_prefix <llm_name>. Output will be a pandas DataFrame with per-record scores for each of the following metrics (a hedged Ragas sketch is shown after this list):

    • faithfulness
    • answer_relevancy
    • answer_similarity
    • answer_correctness
  4. Parse output to get scores: Use the notebook evaluation/evaluation_comparison.ipynb to parse the output and view the scores.
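
As an example of working with the preference-evaluation output from Step 2, the sketch below parses the evaluation field and tallies the win rate for each answer; the file name is a placeholder, and records whose JSON string does not parse are counted separately.

```python
import json
from collections import Counter

import pandas as pd

# Placeholder file name for one pairwise comparison produced by llm_preference_eval.py.
df = pd.read_json("llama3_vs_gpt4_preference_eval.jsonl", lines=True)

winners = Counter()
for raw in df["evaluation"]:
    try:
        verdict = json.loads(raw)          # e.g. {"winner": "a", "reason": "..."}
        winners[verdict["winner"]] += 1
    except (json.JSONDecodeError, KeyError, TypeError):
        winners["unparseable"] += 1

total = sum(winners.values())
for answer, count in winners.items():
    print(f"{answer}: {count} ({count / total:.1%})")
```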
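
For Step 3, evaluation/ragas_eval.py wraps the Ragas library; the sketch below shows roughly how the four metrics can be computed with Ragas directly, assuming the consolidated DataFrame provides question, answer, context, and reference-answer columns. The column names here are placeholders, the judge LLM must be configured separately (the repo's script uses Claude-3 Sonnet), and the exact Ragas API differs slightly across versions.

```python
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, answer_similarity, faithfulness

df = pd.read_csv("consolidated_predictions.csv")

# Column names below are placeholders; map them to the actual fields in the consolidated file.
data = Dataset.from_dict({
    "question": df["question"].tolist(),
    "answer": df["llama3_finetuned_answer"].tolist(),        # the field passed via --answer_key
    "contexts": [[c] for c in df["hypothetical_document"]],  # Ragas expects a list of contexts per row
    "ground_truth": df["answer"].tolist(),                   # the original Hansard answer
})

# evaluate() calls a judge LLM under the hood; configure it before running.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness])
scores_df = scores.to_pandas()
print(scores_df[["faithfulness", "answer_relevancy", "answer_similarity", "answer_correctness"]].mean())
```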
