A problem encountered when using torchtext's Multi30k dataset.
Error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 37: invalid start byte
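For context, Python raises this error when it is asked to decode bytes that are not valid UTF-8; 0x80 is a continuation byte and cannot start a character. The sketch below uses only the standard library to reproduce the same message and to report the offset of the first undecodable byte. The helper name `first_invalid_utf8_offset` is hypothetical, and the traceback does not say which file torchtext was reading.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# 37 ASCII bytes followed by a stray 0x80 byte (made-up sample data).
sample = b"a" * 37 + b"\x80" + b"rest"

try:
    sample.decode("utf-8")
    message = ""
except UnicodeDecodeError as err:
    # Same wording as the error reported above.
    message = str(err)

print(message)
print(first_invalid_utf8_offset(sample))
```

Running a check like this on the files cached under the dataset root might help show whether a downloaded file is binary or corrupted rather than plain text.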
Relevant code (see the full code for details):
from torchtext import datasets
from torchtext.vocab import build_vocab_from_iterator
train, val, test = datasets.Multi30k(language_pair=("de", "en"))
vocab_src = build_vocab_from_iterator(
    iterator=yield_tokens(train+val+test, tokenizer_de, index=0),
    min_freq=2,
    specials=["<s>", "</s>", "<blank>", "<unk>"]
)
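`yield_tokens` and `tokenizer_de` are defined in the full code and do not appear in this excerpt. Below is a minimal sketch of what such a helper might look like, assuming each dataset item is a `(de, en)` pair of strings. The whitespace tokenizer is a hypothetical stand-in for the real German tokenizer.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def yield_tokens(
    data_iter: Iterable[Sequence[str]],
    tokenizer: Callable[[str], List[str]],
    index: int,
) -> Iterator[List[str]]:
    """Yield the token list of the sentence at position `index` of each pair."""
    for pair in data_iter:
        yield tokenizer(pair[index])


def whitespace_tokenizer(text: str) -> List[str]:
    # Hypothetical stand-in; the original code uses its own tokenizer_de.
    return text.split()


# Made-up sentence pairs, only for illustration.
pairs = [("Ein Mann läuft .", "A man runs ."), ("Zwei Hunde .", "Two dogs .")]
german_tokens = list(yield_tokens(pairs, whitespace_tokenizer, index=0))
```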
Relevant library versions:
pytorch 2.1.2 py3.11_cuda12.1_cudnn8_0 pytorch
pytorch-cuda 12.1 hde6ce7c_5 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 2.1.2 pypi_0 pypi
torchdata 0.7.1 py311 pytorch
torchtext 0.16.2 py311 pytorch
torchvision 0.16.2 pypi_0 pypi
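The problem appears to be caused by a torchtext update; someone reported it in the related GitHub repository as recently as two weeks ago (related link). Trying torchtext 0.16.1 still produces the error.
The other two machine translation datasets also have problems. Switching to IWSLT2016 or IWSLT2017 raises the following errors: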
IWSLT2016:
HTTPError: 404 Client Error: Not Found for url: https://drive.usercontent.google.com/download?id=1l5y6Giag9aRPwGtuZHswh3w5v3qEz8D8
This exception is thrown by __iter__ of GDriveReaderDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)
IWSLT2017:
HTTPError: 404 Client Error: Not Found for url: https://drive.usercontent.google.com/download?id=12ycYSzLIG253AFN35Y6qoyf9wtkOjakp
This exception is thrown by __iter__ of GDriveReaderDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)
The problem occurs when running this line:
train, val, test = datasets.Multi30k(language_pair=("de", "en"))
I tried the original repository's setup under torch 1.11.0, but it still fails:
Exception: Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.
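So I gave up on fixing the bug by rolling back versions and stayed on torch 2.1.2 as the runtime environment. Inspired by the idea of handling each split separately, I changed the code as follows. This version runs, but adding the test split still raises an error, for reasons unknown.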
train = datasets.Multi30k(root='.data', split='train', language_pair=('de', 'en'))
val = datasets.Multi30k(root='.data', split='valid', language_pair=('de', 'en'))
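One way to narrow down which split fails is to load each split on its own and record the exception instead of stopping at the first error. The sketch below is only an illustration under stated assumptions. `load_splits_individually` is a hypothetical helper that accepts any callable that builds a split. It forces iteration with `list(...)` on the assumption that the dataset may be lazy and only fail when read. A fake loader stands in for torchtext so the example runs offline.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple


def load_splits_individually(
    make_split: Callable[[str], Iterable[Any]],
    split_names: Sequence[str],
) -> Tuple[Dict[str, List[Any]], Dict[str, Exception]]:
    """Try each split separately; return (loaded data, errors) keyed by split."""
    loaded: Dict[str, List[Any]] = {}
    errors: Dict[str, Exception] = {}
    for name in split_names:
        try:
            # list(...) forces the read, so lazy datasets fail here too.
            loaded[name] = list(make_split(name))
        except Exception as exc:  # broad on purpose: we only want to record it
            errors[name] = exc
    return loaded, errors


def fake_make_split(name: str) -> List[Tuple[str, str]]:
    # Hypothetical loader that mimics the reported behaviour: only "test" fails.
    if name == "test":
        raise UnicodeDecodeError("utf-8", b"\x80", 0, 1, "invalid start byte")
    return [("Hallo", "Hello")]


loaded, errors = load_splits_individually(fake_make_split, ["train", "valid", "test"])
for name, exc in errors.items():
    print(f"{name}: {type(exc).__name__}: {exc}")
```

With torchtext, `make_split` could be `lambda split: datasets.Multi30k(root='.data', split=split, language_pair=('de', 'en'))`, mirroring the two calls above.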