How to Read PDF Text Content with Python

PHPz

Published: 2016-06-13 11:06:30

Original


To read the text content of a PDF with Python: open the Python script, use the PDFMiner tool to read the PDF's text, and then print the extracted content.

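Processing PDFs is a common task in Python. For Python 3, pdfminer3k is a very good tool.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.

PDFMiner lets you obtain the exact location of text on a page, along with other information such as fonts and lines. It includes a PDF converter that can transform PDF files into other formats such as HTML. It also has an extensible PDF parser that can be used for purposes other than text analysis.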

pip install pdfminer3k

First, to cover most people's needs, here is a fairly general script that reads the text from a PDF:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    """Return the text of a PDF file object (opened in binary mode) as a list of lines."""
    # Resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # Device that writes the text of each page into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # Split the extracted text into lines
    lines = content.split("\n")
    return lines


if __name__ == '__main__':
    with open('t1.pdf', 'rb') as my_pdf:
        print(read_pdf(my_pdf))

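My main goal was to pull certain key information out of the PDF, so I needed to find something those pieces of information have in common. Fortunately, every line that carries key information contains '//', so I only need to find the lines that contain '//'. That led to the script below.

It can be used as-is. Here is the script: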

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # Resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # Device that writes the text of each page into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # Split the extracted text into lines
    lines = content.split("\n")

    # Units to keep, counted in the order their headers appear
    units = [1, 2, 3, 5, 7, 8, 9, 11, 12, 13]
    # Header lines begin with a form feed followed by 'UNIT '
    header = '\x0cUNIT '
    count = 0
    flag = False  # True while inside a unit we want to keep
    with open('words.txt', 'w') as text:
        for line in lines:
            if line.startswith(header):
                count += 1
                flag = count in units
                if flag:
                    print(line)
                    text.write(line + '\n')
            if '//' in line and flag:
                # Keep the text before '//' and drop anything up to the last '. '
                text_line = line.split('//')[0].split('. ')[-1]
                print(text_line)
                text.write(text_line + '\n')


def _main():
    with open('t1.pdf', 'rb') as my_pdf:
        read_pdf(my_pdf)


if __name__ == '__main__':
    _main()

In fact, the key step is the assignment to lines, which splits the extracted text on "\n". Print lines and you can see everything that was read from the PDF.
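The header string '\x0cUNIT ' in the script suggests that the extracted text marks page boundaries with a form feed character ("\x0c"). Assuming that holds for your PDF, the following sketch splits the extracted text into pages before splitting it into lines. The helper names split_pages and non_empty_lines and the sample string are made up for illustration, and nothing here calls PDFMiner itself.

```python
def split_pages(content):
    """Split extracted text into pages on the form feed character."""
    pages = content.split("\x0c")
    # Assumption: the text ends with a form feed, which leaves an empty last piece.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def non_empty_lines(page):
    """Return the lines of one page, stripped, with blank lines dropped."""
    return [line.strip() for line in page.split("\n") if line.strip()]


# Hypothetical extracted text: two pages, each terminated by a form feed.
sample = "UNIT 1\nfirst line\n\n\x0cUNIT 2\nsecond line\n\x0c"

for number, page in enumerate(split_pages(sample), start=1):
    print(number, non_empty_lines(page))
```

Splitting by page first makes it easier to inspect one page at a time instead of scanning a single long list of lines.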

This reduces PDF processing to ordinary string processing, so the rest of the script needs little further explanation.
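To make the string-processing step easier to try without a PDF, here is a minimal sketch that pulls the filtering logic into a standalone function. The function name extract_entries, its parameters, and the sample lines are hypothetical; the logic is meant to mirror the loop in the script above, returning the results as a list instead of writing them to words.txt.

```python
def extract_entries(lines, units, header="\x0cUNIT ", marker="//"):
    """Collect headers of the selected units and the text before `marker`."""
    results = []
    count = 0         # number of unit headers seen so far
    selected = False  # True while inside a unit listed in `units`
    for line in lines:
        if line.startswith(header):
            count += 1
            selected = count in units
            if selected:
                results.append(line)
        if marker in line and selected:
            # Keep the text before the marker and drop anything up to the last ". ".
            results.append(line.split(marker)[0].split(". ")[-1])
    return results


# Hypothetical lines, shaped like the output of read_pdf.
sample_lines = [
    "\x0cUNIT 1",
    "1. apple//noun",
    "a line without the marker",
    "\x0cUNIT 2",
    "1. banana//noun",
    "\x0cUNIT 3",
    "2. cherry//noun",
]

print(extract_entries(sample_lines, units={1, 3}))
```

Keeping the filter separate from the PDF reading code means it can be checked against a few hand-written lines before it is run on a real document.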
