Example Python code for a simple PDF table scraper (code snippet)

author     2022-12-27     464


# 1. Add some necessary libraries
import scraperwiki
import urllib2, lxml.etree

# 2. The URL/web address where we can find the PDF we want to scrape
url = 'http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf'

# 3. Grab the file and convert it to an XML document we can work with
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)

# 4. Have a peek at the XML (click the "more" link in the Console to preview it).
print lxml.etree.tostring(root, pretty_print=True)

# 5. How many pages in the PDF document?
pages = list(root)
print "There are",len(pages),"pages"
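Before the commented-out steps below, it helps to see the shape of the XML that `pdftoxml` returns: one `<page>` element per PDF page, containing `<text>` elements with pixel-position attributes. A minimal Python 3 sketch against a hand-written snippet (the element names and `left`/`top` attributes mirror `pdftohtml -xml` output, but the values are illustrative assumptions, not data from the real PDF), using the stdlib `xml.etree` in place of `lxml`:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for scraperwiki.pdftoxml() output.
xmldata = b"""<pdf2xml>
  <page number="1">
    <text left="60" top="100">Bangladesh</text>
    <text left="200" top="100">Acme Textiles</text>
  </page>
  <page number="2">
    <text left="60" top="100">India</text>
  </page>
</pdf2xml>"""

root = ET.fromstring(xmldata)
pages = list(root)                # each child of the root is one <page>
print("There are", len(pages), "pages")
for page in pages:
    for el in page:
        if el.tag == "text":      # position attributes come back as strings
            print(el.text, el.attrib)
```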

'''
# 6. Iterate through the elements in each page, and preview them
for page in pages:
    for el in page:
        if el.tag == "text":
            print el.text, el.attrib

# REPLACE STEP 6 WITH THE FOLLOWING
# 7. We can use the positioning attributes in the XML data to help us regenerate the rows and columns
for page in pages:
    for el in page:
        if el.tag == "text":
            if int(el.attrib['left']) < 100: print 'Country:', el.text,
            elif int(el.attrib['left']) < 250: print 'Factory name:', el.text,
            elif int(el.attrib['left']) < 500: print 'Address:', el.text,
            elif int(el.attrib['left']) < 1000: print 'City:', el.text,
            else:
                print 'Region:', el.text

# REPLACE STEP 7 WITH THE FOLLOWING
# 8. Rather than just printing out the data, we can generate and display a data structure representing each row.
#    We can also skip the first page, the title page that doesn't contain any of the tabulated information we're after.
for page in pages[1:]:
    for el in page:
        if el.tag == "text":
            if int(el.attrib['left']) < 100: data = {'Country': el.text}
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data

# REPLACE STEP 8 WITH THE FOLLOWING
# 9. This really crude hack ignores data values that correspond to column headers.
#    A more elegant solution would be to ignore elements in the first table row on each page.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = {'Country': el.text}
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data

# REPLACE STEP 9 WITH THE FOLLOWING
# 10. A crude way of adding data to the database - write each row as we scrape it.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = {'Country': el.text}
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=data)

# REPLACE STEP 10 WITH THE FOLLOWING
# 11. A more efficient way of writing to the database might be to write all the records scraped from a page one page at a time.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
bigdata=[]
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = {'Country': el.text}
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                bigdata.append( data.copy() )
    scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=bigdata)
    bigdata=[]

'''
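The row-reconstruction idea used in steps 7 to 11 (bucket each text element into a column by its `left` coordinate, and emit a completed row once the last column fills) can be sketched on its own in Python 3. The tuples below are made-up stand-ins for `<text>` elements; the thresholds are copied from the script above:

```python
# Each tuple stands in for one <text> element: (left coordinate, text content).
elements = [
    (60, "Bangladesh"), (200, "Acme Textiles"),
    (300, "1 Mill Road"), (600, "Dhaka"), (1100, "Dhaka Division"),
]

rows = []
data = {}
for left, text in elements:
    if left < 100:
        data = {'Country': text}      # first column: a new row starts here
    elif left < 250:
        data['Factory name'] = text
    elif left < 500:
        data['Address'] = text
    elif left < 1000:
        data['City'] = text
    else:
        data['Region'] = text         # last column: the row is complete
        rows.append(data.copy())      # copy, so later rows don't overwrite it

print(rows)
```

Note the `data.copy()`: appending the dict itself would leave every list entry pointing at the same object, which is why the original script copies too.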
# REPLACE STEP 11 WITH THE FOLLOWING
# 12. If necessary, and because we are appending rows incrementally rather than saving on unique keys,
#     we may need to clear the database table before we write to it.
#     A utility function can help us do that.

def dropper(table):
    if table!='':
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass

dropper('fabvarn')

skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
bigdata=[]
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = {'Country': el.text}
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                bigdata.append( data.copy() )
    scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=bigdata)
    bigdata=[]
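Outside ScraperWiki, the drop-then-batch-save pattern of steps 11 and 12 maps onto the standard library's sqlite3 module. A minimal sketch; the in-memory database, table schema and sample rows here are assumptions for illustration, not part of the original scraper:

```python
import sqlite3

conn = sqlite3.connect(':memory:')    # in-memory database, just for the sketch

def dropper(conn, table):
    # stdlib analogue of the dropper() utility above
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')

dropper(conn, 'fabvarn')
conn.execute('CREATE TABLE fabvarn ("Country" TEXT, "Factory name" TEXT)')

# One executemany() call per scraped page, mirroring the bigdata batches.
bigdata = [('Bangladesh', 'Acme Textiles'), ('India', 'Beta Mills')]
conn.executemany('INSERT INTO fabvarn VALUES (?, ?)', bigdata)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM fabvarn').fetchone()[0]
print(count)
```

Batching with `executemany()` serves the same purpose as collecting rows into `bigdata` before calling `scraperwiki.sqlite.save`: one database round-trip per page instead of one per row.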
