# 1. Add some necessary libraries
import scraperwiki
import urllib2, lxml.etree
# 2. The URL/web address where we can find the PDF we want to scrape
url = 'http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf'
# 3. Grab the file and convert it to an XML document we can work with
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)
# 4. Have a peek at the XML (click the "more" link in the Console to preview it).
print lxml.etree.tostring(root, pretty_print=True)
# 5. How many pages in the PDF document?
pages = list(root)
print "There are",len(pages),"pages"
'''
# 6. Iterate through the elements in each page, and preview them
for page in pages:
    for el in page:
        if el.tag == "text":
            print el.text, el.attrib
# REPLACE STEP 6 WITH THE FOLLOWING
# 7. We can use the positioning attributes in the XML data to help us regenerate the rows and columns
for page in pages:
    for el in page:
        if el.tag == "text":
            # The trailing commas keep the cells of one row on the same output line
            if int(el.attrib['left']) < 100: print 'Country:', el.text,
            elif int(el.attrib['left']) < 250: print 'Factory name:', el.text,
            elif int(el.attrib['left']) < 500: print 'Address:', el.text,
            elif int(el.attrib['left']) < 1000: print 'City:', el.text,
            else:
                # Region is the last column, so end the line here
                print 'Region:', el.text
# REPLACE STEP 7 WITH THE FOLLOWING
# 8. Rather than just printing out the data, we can generate and display a data structure representing each row.
# We can also skip the first page, the title page that doesn't contain any of the tabulated information we're after.
for page in pages[1:]:
    for el in page:
        if el.tag == "text":
            # The leftmost column starts a new row, so start a new dict
            if int(el.attrib['left']) < 100: data = { 'Country': el.text }
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                # The rightmost column completes the row
                data['Region'] = el.text
                print data
# REPLACE STEP 8 WITH THE FOLLOWING
# 9. This really crude hack ignores data values that correspond to column headers.
# A more elegant solution would ignore the elements in the first table row on each page.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = { 'Country': el.text }
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
# REPLACE STEP 9 WITH THE FOLLOWING
# 10. A crude way of adding data to the database - write each row as we scrape it.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = { 'Country': el.text }
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                # Save the completed row straight away
                scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=data)
# REPLACE STEP 10 WITH THE FOLLOWING
# 11. A more efficient way of writing to the database is to collect the records scraped from each page
# and write them one page at a time.
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
bigdata=[]
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = { 'Country': el.text }
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                # Store a copy of the completed row
                bigdata.append( data.copy() )
    # Write all the rows from this page in one go, then start a fresh batch
    scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=bigdata)
    bigdata=[]
'''
# REPLACE STEP 11 WITH THE FOLLOWING
# 12. If necessary, and because we are saving without unique keys (so each run appends rows
# rather than replacing them), we may need to clear the database table before we write to it.
# A utility function can help us do that.
def dropper(table):
    if table!='':
        # Ignore the error raised if the table does not exist yet
        try: scraperwiki.sqlite.execute('drop table "'+table+'"')
        except: pass
dropper('fabvarn')
skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
bigdata=[]
for page in pages[1:]:
    for el in page:
        if el.tag == "text" and el.text not in skiplist:
            if int(el.attrib['left']) < 100: data = { 'Country': el.text }
            elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
            elif int(el.attrib['left']) < 500: data['Address'] = el.text
            elif int(el.attrib['left']) < 1000: data['City'] = el.text
            else:
                data['Region'] = el.text
                print data
                bigdata.append( data.copy() )
    # Write all the rows from this page in one go, then start a fresh batch
    scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=bigdata)
    bigdata=[]
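
Step 9 notes that a more elegant solution would ignore the first table row on each page instead of matching header text. The following is a minimal sketch of one way that might look, not part of the original scraper. It assumes that all cells in one table row share the same `top` value, which may not hold for every PDF. `SAMPLE_XML`, `COLUMNS`, `column_for` and `rows_from_page` are hypothetical names, and the sample values are made up. It uses only the standard library and Python 3 syntax, so it can be tried without ScraperWiki.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of the XML produced in step 3.
SAMPLE_XML = """<pdf2xml>
<page number="1"><text top="100" left="50">Title page</text></page>
<page number="2">
<text top="40" left="50">COUNTRY</text>
<text top="40" left="120">FACTORY NAME</text>
<text top="40" left="300">ADDRESS</text>
<text top="40" left="600">CITY</text>
<text top="40" left="1050">REGION</text>
<text top="80" left="50">Exampleland</text>
<text top="80" left="120">Example Factory</text>
<text top="80" left="300">1 Example Road</text>
<text top="80" left="600">Exampleville</text>
<text top="80" left="1050">Example Region</text>
</page>
</pdf2xml>"""

# Upper bound on 'left' for each column, same thresholds as the scraper.
COLUMNS = [(100, 'Country'), (250, 'Factory name'), (500, 'Address'), (1000, 'City')]

def column_for(left):
    """Map a horizontal position to a column name."""
    for limit, name in COLUMNS:
        if left < limit:
            return name
    return 'Region'

def rows_from_page(page):
    """Group text elements by 'top', then drop the first (header) row."""
    rows = {}
    for el in page:
        if el.tag != 'text':
            continue
        top = int(el.attrib['top'])
        rows.setdefault(top, {})[column_for(int(el.attrib['left']))] = el.text
    ordered = [rows[top] for top in sorted(rows)]
    return ordered[1:]  # assumption: the first row on each page is the header

root = ET.fromstring(SAMPLE_XML)
for page in list(root)[1:]:  # skip the title page, as in step 8
    for row in rows_from_page(page):
        print(row)
```

Grouping by `top` also removes the reliance on the Region cell arriving last, at the cost of depending on consistent vertical positions within a row.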