
When reading a PDF file with PDFMiner in Python 3, how do I save an LTImage object, i.e. how do I save the image?

0
Bounty: 100 园豆 [Unresolved question]

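I am using PDFMiner in Python to parse a PDF.
One of the layout types it produces is LTFigure.
I already know that LTImage objects can be extracted from an LTFigure.
My question: how do I save an LTImage, i.e. the image itself, to a file?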

凉云 | Beginner, level 1 | 园豆: 106
Asked: 2019-02-20 09:43

Hello, has the problem been solved?

金克丝, 4 years ago
All answers (3)
0
三人乐乐 | 园豆: 4819 (Veteran, level 4) | 2019-03-06 10:52

Thanks, but no. One is about reading from a PDF and the other is about web scraping, so they are not really related.

凉云 | 园豆: 106 (Beginner, level 1) | 2019-03-20 10:53
-1

import os
import sys
from binascii import b2a_hex

from pdfminer.layout import LTFigure, LTImage, LTTextBox, LTTextLine


def parse_lt_objs(lt_objs, page_number, images_folder, text=None):
    """Iterate through the list of LT* objects and capture the text or image data contained in each."""
    text_content = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            # text
            text_content.append(lt_obj.get_text())
        elif isinstance(lt_obj, LTImage):
            # an image, so save it to the designated folder and note its place in the text
            saved_file = save_image(lt_obj, page_number, images_folder)
            if saved_file:
                # use an HTML-style <img /> tag to mark the position of the image within the text
                text_content.append('<img src="' + os.path.join(images_folder, saved_file) + '" />')
            else:
                print("Error saving image on page", page_number, repr(lt_obj), file=sys.stderr)
        elif isinstance(lt_obj, LTFigure):
            # LTFigure objects are containers for other LT* objects, so recurse through the children
            text_content.append('<Figure src="tt" />')
            # This line raises an error. Do you know why? The message says lt_obj has no `objs` attribute.
            text_content.append(parse_lt_objs(lt_obj.objs, page_number, images_folder, text_content))
    return '\n'.join(text_content)

def save_image(lt_image, page_number, images_folder):
    """Try to save the image data from this LTImage object and return the file name if successful."""
    result = None
    if lt_image.stream:
        # read the raw stream once and reuse it for both the type check and the write
        file_stream = lt_image.stream.get_rawdata()
        file_ext = determine_image_type(file_stream[0:4])
        if file_ext:
            file_name = ''.join([str(page_number), '_', lt_image.name, file_ext])
            if write_file(images_folder, file_name, file_stream, flags='wb'):
                result = file_name
    return result

def determine_image_type(stream_first_4_bytes):
    """Find the image file type by comparing the first 4 (or 2) bytes against known magic numbers."""
    file_type = None
    # b2a_hex (from binascii) returns bytes in Python 3, so compare against bytes literals
    bytes_as_hex = b2a_hex(stream_first_4_bytes)
    if bytes_as_hex.startswith(b'ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == b'89504e47':
        file_type = '.png'
    elif bytes_as_hex == b'47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith(b'424d'):
        file_type = '.bmp'
    return file_type

def write_file(folder, filename, filedata, flags='w'):
    """Write the file data to the folder and filename combination.

    flags: 'w' for write text, 'wb' for write binary; use 'a' instead of 'w' for append.
    """
    result = False
    if os.path.isdir(folder):
        try:
            # the with-block closes the file even if the write fails
            with open(os.path.join(folder, filename), flags) as file_obj:
                file_obj.write(filedata)
            result = True
        except IOError:
            pass
    return result
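
The magic-number check in determine_image_type can also be done on the raw bytes directly, without the hex conversion, which avoids any str/bytes mismatch under Python 3. The following is only a sketch of that idea: the names `MAGIC_NUMBERS` and `sniff_image_extension` are made up for illustration and are not part of PDFMiner, and the signature table is not exhaustive.

```python
# Leading bytes of a file -> extension. Illustrative table, not exhaustive.
MAGIC_NUMBERS = (
    (b"\xff\xd8", ".jpeg"),  # JPEG start-of-image marker
    (b"\x89PNG", ".png"),    # hex 89504e47
    (b"GIF8", ".gif"),       # hex 47494638
    (b"BM", ".bmp"),         # hex 424d
)


def sniff_image_extension(data):
    """Return a file extension if `data` starts with a known image signature, else None."""
    for signature, extension in MAGIC_NUMBERS:
        if data.startswith(signature):
            return extension
    return None


print(sniff_image_extension(b"\xff\xd8\xff\xe0"))  # JPEG header bytes
print(sniff_image_extension(b"x\x9c"))             # not a known image signature
```

Note that the raw stream of an image embedded in a PDF is only a complete image file in some cases (JPEG data is the common one). For other encodings the stream typically holds compressed pixel data with no file header, so a check like this returns None and nothing is saved. pdfminer.six also ships an `ImageWriter` class in `pdfminer.image` that may be worth trying for those cases.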
According to the documentation, this should work.
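
On the `objs` error mentioned in the code comment: in current PDFMiner releases the layout containers (LTFigure included) keep their children in a private list and expose them through iteration, so looping over the container itself is the usual way to reach the children. Below is a minimal sketch of a recursive walk built on that assumption. `iter_layout` and `find_all` are hypothetical helper names, and the `FakeFigure` / `FakeImage` classes are stand-ins so the example runs without PDFMiner installed.

```python
def iter_layout(container):
    """Yield every object nested under `container`, depth first."""
    for obj in container:
        yield obj
        # Containers are iterable; leaf objects such as images are not.
        if hasattr(obj, "__iter__"):
            yield from iter_layout(obj)


def find_all(container, kind):
    """Collect every nested object that is an instance of `kind`."""
    return [obj for obj in iter_layout(container) if isinstance(obj, kind)]


# Stand-in classes for illustration only; they are not PDFMiner types.
class FakeFigure(list):
    pass


class FakeImage:
    def __init__(self, name):
        self.name = name


page = FakeFigure([FakeImage("Im0"), FakeFigure([FakeImage("Im1")])])
print([image.name for image in find_all(page, FakeImage)])  # ['Im0', 'Im1']
```

With PDFMiner itself, the same call would presumably look like `find_all(page_layout, LTImage)`, where `page_layout` is the page layout object returned by the aggregator device. In parse_lt_objs, the equivalent change would be to pass `lt_obj` rather than `lt_obj.objs` to the recursive call.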

XuPeppy | 园豆: 202 (Novice, level 2) | 2019-05-24 14:27
0

I have given up on this. Thanks, everyone.

凉云 | 园豆: 106 (Beginner, level 1) | 2019-05-24 15:02