Instruction Tuning LLMs for Vietnamese Dataset

Introduction

A new instruction dataset for fine-tuning large language models in the general and medical domains. It was built by collecting and translating data from other public sources.

Brief statistics:

instruct_merged.jsonl: Instruction dataset. It contains 52k samples from Alpaca and 170k samples from GPT4All, translated into Vietnamese.

translated_health_200k.jsonl: Medical instruction dataset, collected from ChatDoctor.

Figure: A few data samples

How to download

Link Drive

Link Github

Paper

Authors:

Vu-Thuan Doan, Quoc-Truong Truong, Duc-Vu Nguyen, Vinh-Tiep Nguyen, Thuy-Ngan Nguyen Luu

Name of paper:

Efficient Finetuning Large Language Models For Vietnamese Chatbot

Name of journal or conference:

MAPR-2023

Year:

2023

Optional: code to load data

instruct_merged.jsonl:

wget https://storage.googleapis.com/doanthuan/data/instruct_merged.jsonl 

translated_health_200k.jsonl:

wget https://storage.googleapis.com/doanthuan/data/translated_health_200k.jsonl 
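
Once a file is downloaded, it can be read line by line as JSON Lines. The following is a minimal sketch, assuming each non-empty line holds one JSON object and the file sits in the current directory. The helper name `load_jsonl` is hypothetical and not part of the original release, and the record field names are not documented here, so the example prints the first record rather than assuming a schema.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, one parsed object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a corrupted download is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err
    return records


if __name__ == "__main__":
    # Assumption: the file was fetched with the wget command above.
    data_file = Path("instruct_merged.jsonl")
    if data_file.exists():
        samples = load_jsonl(data_file)
        print(f"Loaded {len(samples)} samples")
        if samples:
            print(samples[0])  # inspect the first record to see the actual fields
```

The same function should work for translated_health_200k.jsonl by changing the path. Reading with an explicit UTF-8 encoding matters here because the samples contain Vietnamese text.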
