Extracting specified PDF tables and saving them to Excel

Time: 2019-10-26 02:14:56

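Abstract: This article introduces a program that extracts table content from PDF files. It first shows how the program is used, then describes the design approach and the code details.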

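0. Requirements

The PDFs contain a large number of tables, and only tables of specified types need to be extracted. These tables are identified mainly by keywords in the table header and in the table contents.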

1. Sample PDFs

Sample PDFs for download: Sample 1, Sample 2, Sample 3

2. Extraction rules

The extraction rules are specified in an Excel file, as in the following example:
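The rule spreadsheet itself is not reproduced here. As a rough illustration only, the sketch below assumes the rules end up in a dictionary with the five keys that the Extractor code in section 6 reads (`in-header`, `not-in-header`, `in-table`, `not-in-table`, `in-page`). The sample patterns and the helper name `header_matches` are hypothetical and are not taken from the original program.

```python
import re

# Hypothetical rule set: the keys mirror those read by the Extractor class,
# the values are made-up examples.
rules = {
    'in-header': [r'Key financial indicators'],
    'not-in-header': [r'continued'],
    'in-table': ['Revenue'],
    'not-in-table': ['Parent company'],
    'in-page': [r'Annual report'],
}


def header_matches(line, rules):
    """True if the line matches any 'in-header' pattern and no 'not-in-header' pattern."""
    wanted = any(re.search(p, line) for p in rules['in-header'])
    unwanted = any(re.search(p, line) for p in rules['not-in-header'])
    return wanted and not unwanted


print(header_matches('2. Key financial indicators', rules))           # True
print(header_matches('Key financial indicators (continued)', rules))  # False
```

The header rules are treated as regular expressions, while the table rules are plain substrings, which matches how the code in section 6 applies them.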

3. Sample extraction result

The extracted results are saved to an Excel file, as follows:
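The layout of the result workbook is not reproduced here. The following is a minimal sketch of one way such results could be written out, assuming each extracted table is a dictionary with the `page`, `table-id` and `table` keys used in section 6. The function names, the sheet name and the row layout are assumptions made for illustration.

```python
def tables_to_rows(tables):
    """Flatten extracted tables into sheet rows, prefixing each row with page and table id."""
    rows = []
    for item in tables:
        for row in item['table']:
            # pdfplumber reports empty cells as None; write them as empty strings.
            cells = ['' if cell is None else cell for cell in row]
            rows.append([item['page'], item['table-id']] + cells)
    return rows


def save_rows(rows, out_path):
    """Write the rows to a legacy .xls workbook with xlwt."""
    import xlwt  # third-party; imported lazily so the helper above runs without it

    book = xlwt.Workbook()
    sheet = book.add_sheet('tables')
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            sheet.write(r, c, value)
    book.save(out_path)


sample = [{'page': 3, 'method': 'exact', 'table-id': '3-1',
           'table': [['Item', 'Amount'], ['Revenue', None]]}]
print(tables_to_rows(sample))
# [[3, '3-1', 'Item', 'Amount'], [3, '3-1', 'Revenue', '']]
```

Calling `save_rows(tables_to_rows(sample), 'result.xls')` would write the two rows to a workbook. Note that xlwt produces the older .xls format.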

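4. Usage

First prepare the Demo.xlsx file (download) and download the PDFparser.exe program (download), and place both in the same directory. Then put the PDF files in a folder of any name, say xxx, place the xxx folder in that same directory alongside the two files, and double-click the program to run it.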
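How the executable locates the PDF files is not shown in the article. A plausible sketch, assuming it scans every subfolder that sits next to the program, is given below. The function name `find_pdfs` is hypothetical.

```python
import os


def find_pdfs(base_dir):
    """Return the sorted paths of all .pdf files found in the subfolders of base_dir."""
    found = []
    for entry in os.listdir(base_dir):
        folder = os.path.join(base_dir, entry)
        if not os.path.isdir(folder):
            continue  # skip Demo.xlsx, the executable and other top-level files
        for root, _dirs, files in os.walk(folder):
            for name in files:
                if name.lower().endswith('.pdf'):
                    found.append(os.path.join(root, name))
    return sorted(found)
```

With the layout described above, `find_pdfs('.')` would return the PDF files inside xxx and ignore files placed directly beside the program.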

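5. Code overview

The program uses the pdfplumber module to parse PDFs and obtain tables and text.
The program uses the xlwt and xlrd modules to write and read Excel files.
The program uses a multi-process plus multi-thread mode to speed up processing.
The program uses the re module for Python regular expressions.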
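The article does not show how processes and threads are combined. One common arrangement, sketched here purely as an assumption, splits the file list into chunks, hands each chunk to a worker process, and lets each process run a small thread pool. `handle_pdf` is a hypothetical placeholder for the real per-file work.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def handle_pdf(path):
    """Hypothetical per-file job; a real run would parse the PDF here."""
    return os.path.basename(path)


def split_into_chunks(items, n):
    """Split items into at most n interleaved chunks, dropping empty ones."""
    return [chunk for chunk in (items[i::n] for i in range(n)) if chunk]


def process_chunk(paths, max_threads=4):
    """Run handle_pdf over one chunk with a thread pool inside a single process."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(handle_pdf, paths))


def process_all(paths, n_procs=2):
    """Fan the chunks out to worker processes and flatten the results."""
    chunks = split_into_chunks(paths, n_procs)
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [name for chunk in results for name in chunk]
```

When used in a script, `process_all(paths)` should be called under an `if __name__ == '__main__':` guard, which process pools require on Windows. Because the chunks are interleaved, the flattened results do not keep the input order.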

6. Code details

PDF parsing

# 该类用来实现PDF表格和文字内容的提取class Extractor(object):def __init__(self, file_path, rules):''':param file_path:PDF file path:param rules: extract rules'''self.file_path = file_pathself.rules = rules

<span class="token comment"># 加载PDF文件</span><span class="token keyword">def</span> <span class="token function">parse_pages</span><span class="token punctuation">(</span>self<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">try</span><span class="token punctuation">:</span>pages <span class="token operator">&#61;</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>pdf <span class="token operator">&#61;</span> pdfplumber<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>file_path<span class="token punctuation">)</span><span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">&#39;parse file:{} page num:{}&#39;</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>basename<span class="token punctuation">(</span>self<span class="token punctuation">.</span>file_path<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pdf<span class="token punctuation">.</span>pages<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pdf<span class="token punctuation">.</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>tables <span class="token operator">&#61;</span> page<span class="token punctuation">.</span>extract_tables<span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token keyword">continue</span>pages<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">&#39;text&#39;</span><span class="token punctuation">:</span> page<span class="token punctuation">.</span>extract_text<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">&#39;tables&#39;</span><span class="token punctuation">:</span> tables<span class="token punctuation">,</span> <span class="token string">&#39;page&#39;</span><span class="token punctuation">:</span> index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> pages<span class="token keyword">except</span> Exception <span class="token keyword">as</span> e<span class="token punctuation">:</span><span class="token keyword">print</span><span class="token punctuation">(</span>e<span class="token punctuation">)</span><span class="token keyword">return</span> <span class="token boolean">None</span><span class="token comment"># 提取特定类型表头的表格&#xff0c;规则有rules参数指定</span><span class="token keyword">def</span> <span class="token function">extract_table_with_specific_header</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span 
class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token comment"># print(&#39;no-page...&#39;)</span><span class="token keyword">return</span>target_tables <span class="token operator">&#61;</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token comment"># 遍历所有页面</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>text <span class="token operator">&#61;</span> page<span class="token punctuation">[</span><span class="token string">&#39;text&#39;</span><span class="token punctuation">]</span>tables <span class="token operator">&#61;</span> page<span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span>page_id <span class="token operator">&#61;</span> page<span class="token punctuation">[</span><span class="token string">&#39;page&#39;</span><span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token number">1</span>lines <span class="token operator">&#61;</span> re<span class="token punctuation">.</span>split<span class="token punctuation">(</span>r<span class="token string">&#39;\n&#43;&#39;</span><span class="token punctuation">,</span> text<span class="token punctuation">)</span><span class="token comment"># 遍历当前页面的所有行</span><span class="token keyword">for</span> ind<span class="token punctuation">,</span> line <span class="token keyword">in</span> <span class="token 
builtin">enumerate</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token comment"># 判定表头符合规则的表格</span><span class="token keyword">if</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> rule <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;in-header&#39;</span><span class="token punctuation">]</span> <span class="token keyword">if</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>rule<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">and</span> \<span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> rule <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;not-in-header&#39;</span><span class="token punctuation">]</span> <span class="token keyword">if</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>rule<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&#61;&#61;</span> <span class="token number">0</span><span class="token punctuation">:</span><span class="token keyword">if</span> ind <span class="token operator">&gt;&#61;</span> 
<span class="token builtin">len</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token keyword">break</span>cnt <span class="token operator">&#61;</span> ind <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token builtin">next</span> <span class="token operator">&#61;</span> lines<span class="token punctuation">[</span>ind <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token keyword">if</span> ind <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>lines<span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">2</span> <span class="token operator">and</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>r<span class="token string">&#39;单位[&#xff1a;:]&#39;</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token builtin">next</span> <span class="token operator">&#61;</span> lines<span class="token punctuation">[</span>ind <span class="token operator">&#43;</span> <span class="token number">2</span><span class="token punctuation">]</span>cnt <span class="token operator">&#43;&#61;</span> <span class="token number">1</span><span class="token keyword">for</span> ti<span class="token punctuation">,</span> table <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token operator">not</span> table<span 
class="token punctuation">:</span><span class="token keyword">continue</span>first <span class="token operator">&#61;</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>word <span class="token keyword">for</span> word <span class="token keyword">in</span> table<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> word <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token comment"># 表格是完整的情况</span><span class="token keyword">if</span> first <span class="token operator">&#61;&#61;</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span>tables<span class="token punctuation">[</span>ti<span class="token punctuation">]</span> <span class="token operator">&#61;</span> <span class="token boolean">False</span><span class="token keyword">if</span> index <span class="token operator">&#43;</span> <span class="token number">1</span> <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">and</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token 
punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span><span class="token punctuation">:</span>table_next <span class="token operator">&#61;</span> pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>fi <span class="token operator">&#61;</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>item <span class="token keyword">for</span> item <span class="token keyword">in</span> table_next<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> item <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token keyword">if</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> fi<span class="token punctuation">)</span> <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> 
<span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;text&#39;</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>table <span class="token operator">&#43;&#61;</span> table_nexttarget_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">&#39;page&#39;</span><span class="token punctuation">:</span> page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;method&#39;</span><span class="token punctuation">:</span> <span class="token string">&#39;exact&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;table&#39;</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">&#39;table-id&#39;</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token comment"># 表格可能不完整的情况</span><span class="token 
keyword">elif</span> first <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> <span class="token builtin">next</span><span class="token punctuation">)</span><span class="token punctuation">:</span>tables<span class="token punctuation">[</span>ti<span class="token punctuation">]</span> <span class="token operator">&#61;</span> <span class="token boolean">False</span><span class="token keyword">if</span> index <span class="token operator">&#43;</span> <span class="token number">1</span> <span class="token operator">&lt;</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">and</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&gt;</span> <span class="token number">0</span><span class="token punctuation">:</span>table_next <span class="token operator">&#61;</span> pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span>fi <span class="token operator">&#61;</span> <span 
class="token string">&#39;&#39;</span><span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token punctuation">[</span>item <span class="token keyword">for</span> item <span class="token keyword">in</span> table_next<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token keyword">if</span> item <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token keyword">if</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> fi<span class="token punctuation">)</span> <span class="token keyword">in</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span>r<span class="token string">&#39;\s&#43;&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;&#39;</span><span class="token punctuation">,</span> pages<span class="token punctuation">[</span>index <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">&#39;text&#39;</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>table <span class="token operator">&#43;&#61;</span> table_nexttarget_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token 
string">&#39;page&#39;</span><span class="token punctuation">:</span> page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;method&#39;</span><span class="token punctuation">:</span> <span class="token string">&#39;guess&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;table&#39;</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">&#39;table-id&#39;</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">&#43;</span> <span class="token string">&#39;-&#39;</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> target_tables<span class="token comment"># 提取表格中存在指定类型信息的表格&#xff0c;规则由参数rules指定</span><span class="token keyword">def</span> <span class="token function">extract_table_with_specific_info</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span 
class="token punctuation">:</span><span class="token comment"># print(&#39;no-page...&#39;)</span><span class="token keyword">return</span>target_tables <span class="token operator">&#61;</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span><span class="token punctuation">:</span>tables <span class="token operator">&#61;</span> page<span class="token punctuation">[</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">]</span>page_id <span class="token operator">&#61;</span> page<span class="token punctuation">[</span><span class="token string">&#39;page&#39;</span><span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token keyword">for</span> ti<span class="token punctuation">,</span> table <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>tables<span class="token punctuation">)</span><span class="token punctuation">:</span>st <span class="token operator">&#61;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>table<span class="token punctuation">)</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;in-table&#39;</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&#61;&#61;</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token 
number">1</span> <span class="token keyword">for</span> in_tab <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;in-table&#39;</span><span class="token punctuation">]</span> <span class="token keyword">if</span> in_tab <span class="token keyword">in</span> st<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;not-in-table&#39;</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">&#61;&#61;</span> <span class="token builtin">sum</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1</span> <span class="token keyword">for</span> not_tab <span class="token keyword">in</span> self<span class="token punctuation">.</span>rules<span class="token punctuation">[</span><span class="token string">&#39;not-in-table&#39;</span><span class="token punctuation">]</span> <span class="token keyword">if</span> <span class="token operator">not</span> not_tab <span class="token keyword">in</span> st<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>target_tables<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token punctuation">{<!-- --></span><span class="token string">&#39;page&#39;</span><span class="token punctuation">:</span> page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;method&#39;</span><span class="token 
punctuation">:</span> <span class="token string">&#39;content-in-table&#39;</span><span class="token punctuation">,</span> <span class="token string">&#39;table&#39;</span><span class="token punctuation">:</span> table<span class="token punctuation">,</span><span class="token string">&#39;table-id&#39;</span><span class="token punctuation">:</span> <span class="token builtin">str</span><span class="token punctuation">(</span>page_id <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>ti <span class="token operator">&#43;</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token keyword">return</span> target_tables<span class="token comment"># 提取存在指定关键词的页面&#xff0c;关键词有rules指定</span><span class="token keyword">def</span> <span class="token function">extract_specific_page</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> pages<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> pages <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token operator">or</span> <span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">1</span><span class="token punctuation">:</span><span class="token comment"># print(&#39;no-page...&#39;)</span><span class="token keyword">return</span>target_pages <span class="token operator">&#61;</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> page 
in enumerate(pages):
            text = page['text']
            page_id = page['page']
            if len(self.rules['in-page']) == sum([1 for rule in self.rules['in-page'] if re.search(rule, text)]):
                target_pages.append({'page': page_id, 'text': text})
        return target_pages

    # Run all of the steps above and return the extraction result
    def run(self):
        pages = self.parse_pages()
        if pages is None or len(pages) < 1:
            print('parse pdf error:', os.path.basename(self.file_path))
            return
        try:
            target_1 = self.extract_table_with_specific_header(pages)
            target_2 = self.extract_table_with_specific_info(pages)
            target_3 = self.extract_specific_page(pages)
            tables = []
            s = []
            for table in target_1:
                if table['table-id'] not in s:
                    s.append(table['table-id'])
                    tables.append(table)
            for table in target_2:
                if table['table-id'] not in s:
                    s.append(table['table-id'])
                    tables.append(table)
            return tables, target_3
        except Exception as e:
            print(e)
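The page filter above keeps a page only when every `in-page` rule matches its text, and `run` merges the two table lists while skipping any `table-id` it has already seen. Here is a minimal sketch of both ideas. The helper names `page_matches_all` and `merge_unique` are hypothetical and not part of the original program, and the sketch assumes `table-id` values are hashable (for example strings or integers).

```python
import re


def page_matches_all(rules, text):
    """Return True when every regex in `rules` is found somewhere in `text`."""
    # An empty rule list counts as a match, like the len(...) == sum(...) test.
    return all(re.search(rule, text) for rule in rules)


def merge_unique(*table_lists):
    """Concatenate table lists, keeping the first table seen for each 'table-id'."""
    seen = set()  # hypothetical replacement for the list `s`; assumes hashable ids
    merged = []
    for tables in table_lists:
        for table in tables:
            if table['table-id'] not in seen:
                seen.add(table['table-id'])
                merged.append(table)
    return merged
```

Using a set for the seen ids keeps the membership test cheap when a PDF contains many tables; the order of the merged result is the same as in the loop-based version.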

Loading Excel and caching results

# This class loads the Excel rules, walks the folder to collect PDF file paths, and caches results
class Util():
    def __init__(self, folder, out, demo):
        self.folder = folder
        self.out = out
        self.demo = demo

    # Load the Demo file and read the rules
    def load_demo(self):
        print('load demo Excel:', os.path.basename(self.demo))
        book = xlrd.open_workbook(self.demo)
        sheet = book.sheet_by_index(0)
        in_header = sheet.col_values(0, 2, sheet.nrows)
        not_in_header = sheet.col_values(1, 2, sheet.nrows)
        in_table = sheet.col_values(2, 2, sheet.nrows)
        not_in_table = sheet.col_values(3, 2, sheet.nrows)
        in_page = sheet.col_values(4, 2, sheet.nrows)
        rules = {'in-header': in_header, 'not-in-header': not_in_header, 'in-table': in_table,
                 'not-in-table': not_in_table, 'in-page': in_page}
        # Drop cells that are empty or contain only whitespace
        for k, v in rules.items():
            rules[k] = [i for i in v if not re.sub(r'\s+', '', i) == '']
        return rules

    # Collect the PDF files by walking the folder tree
    def load_folder(self):
        print('load folder:', self.folder)
        paths = []
        for dirpath, dirnames, filenames in os.walk(self.folder):
            for file in filenames:
                path = os.path.join(dirpath, file)
                if os.path.isfile(path) and (os.path.splitext(path)[1] == '.pdf' or os.path.splitext(path)[1] == '.PDF'):
                    paths.append(path)
        return paths

    # Cache the results
    def save_tmp(self, info, name, code, year):
        print('save tmp file:', name)
        if not os.path.isdir('tmp'):
            os.mkdir('tmp')
        tables = info[0]
        pages = info[1]
        # Excel style
        style = xlwt.XFStyle()
        # background color
        pattern = xlwt.Pattern()
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 5
        style.pattern = pattern
        # border
        borders = xlwt.Borders()
        borders.left = xlwt.Borders.THICK
        borders.right = xlwt.Borders.THICK
        borders.top = xlwt.Borders.THICK
        borders.bottom = xlwt.Borders.THICK
        style.borders = borders
        # font
        font = xlwt.Font()
        font.name = 'Times New Roman'
        font.bold = True
        font.underline = False
        font.italic = False
        style.font = font
        # sheet style-2
        style2 = xlwt.XFStyle()
        # background color
        pattern = xlwt.Pattern()
        pattern.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern.pattern_fore_colour = 22
        style2.pattern = pattern
        # border
        borders = xlwt.Borders()
        borders.left = xlwt.Borders.THICK
        borders.right = xlwt.Borders.THICK
        borders.top = xlwt.Borders.THICK
        borders.bottom = xlwt.Borders.THICK
        style2.borders = borders
        # font
        font = xlwt.Font()
        font.name = 'Times New Roman'
        font.bold = True
        font.underline = False
        font.italic = False
        style2.font = font
        # sheet style-3
        style3 = xlwt.XFStyle()
        # background color
        pattern2 = xlwt.Pattern()
        pattern2.pattern = xlwt.Pattern.SOLID_PATTERN
        pattern2.pattern_fore_colour = 3
        style3.pattern = pattern2
        # border
        borders = xlwt.Borders()
        borders.left = xlwt.Borders.THICK
        borders.right = xlwt.Borders.THICK
        borders.top = xlwt.Borders.THICK
        borders.bottom = xlwt.Borders.THICK
        style3.borders = borders
        # font
        font = xlwt.Font()
        font.name = 'Times New Roman'
        font.bold = True
        font.underline = False
        font.italic = False
        style3.font = font
        # Write the data to Excel
        book = xlwt.Workbook()
        sheet1 = book.add_sheet('tables')
        sheet2 = book.add_sheet('pages')
        for ind, page in enumerate(pages):
            page_num = page['page']
            text = page['text']
            sheet2.write(ind, 0, name)
            sheet2.write(ind, 1, code)
            sheet2.write(ind, 2, year)
            sheet2.write(ind, 3, page_num)
            sheet2.write(ind, 4, 'search page')
            sheet2.write(ind, 5, text, style)
        # save table
        i = 0
        for ti, table in enumerate(tables):
            page = table['page']
            method = table['method']
            table_content = table['table']
            # The cell style depends on how the table was matched
            if method == 'exact':
                sty = style
            elif method == 'guess':
                sty = style2
            else:
                sty = style3
            for index, row in enumerate(table_content):
                sheet1.write(i, 0, name)
                sheet1.write(i, 1, code)
                sheet1.write(i, 2, year)
                sheet1.write(i, 3, page)
                sheet1.write(i, 4, method)
                for ind, one in enumerate(row):
                    sheet1.write(i, 5 + ind, one if one is not None else '', sty)
                i += 1
            # Leave one empty row between tables
            i += 1
        book.save(os.path.join('tmp', name + '.tmp.xls'))
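The folder walk above accepts only the exact extensions `.pdf` and `.PDF`, so a file named `report.Pdf` would be skipped. The sketch below shows one way to make the check case-insensitive. The function name `find_pdfs` is hypothetical, and sorting the result is an added convenience rather than something the original program does.

```python
import os


def find_pdfs(folder):
    """Recursively collect paths of files whose extension is .pdf in any letter case."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in filenames:
            # Lower-casing the extension covers .pdf, .PDF, .Pdf and so on.
            if os.path.splitext(filename)[1].lower() == '.pdf':
                paths.append(os.path.join(dirpath, filename))
    return sorted(paths)
```

Because `os.walk` only yields regular directory entries in `filenames`, the separate `os.path.isfile` check is left out here.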

The extraction function

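# Complete workflow for a single file, from loading the file to caching the result.
# To run the program in a single thread, simply call this function from the main function.
def run(rules, file, util):
    extractor = Extractor(file, rules)
    info = extractor.run()
    # The file name is expected to contain a 6-digit code and an 8-digit date
    code = re.findall(r'\d{6}', os.path.basename(file))[0]
    year = re.findall(r'\d{8}', os.path.basename(file))[0]
    if info is None or len(info[0]) < 1:
        # Record files that produced no result
        with open('noResult.txt', 'a', encoding='utf-8') as fp:
            fp.write(file + '\n')
    else:
        util.save_tmp(info, os.path.basename(file), code, year)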
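The function above takes the first 6-digit and the first 8-digit run from the file name and indexes `[0]` directly, which raises `IndexError` when a name contains neither. The pattern `\d{6}` can also match the first six digits of the 8-digit date if the date comes first. A minimal sketch of a stricter parser follows. The name `parse_code_and_date` and the sample file names are hypothetical, and the real naming scheme of the PDFs is an assumption.

```python
import re

# Lookarounds make each pattern match a digit run of exactly that length.
CODE_RE = re.compile(r'(?<!\d)\d{6}(?!\d)')
DATE_RE = re.compile(r'(?<!\d)\d{8}(?!\d)')


def parse_code_and_date(filename):
    """Return (code, date) found in `filename`; either item is None when missing."""
    code = CODE_RE.search(filename)
    date = DATE_RE.search(filename)
    return (code.group() if code else None, date.group() if date else None)
```

A caller could then write files with a missing code or date to `noResult.txt` instead of letting the worker crash.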

Multiprocessing + multithreading

# -------------------
# The two functions below enable a multithreading + multiprocessing mode to speed up execution.
# For CPU-bound tasks, more processes are better (up to the number of CPU cores on the machine).
# -----------------
# Multithreading: each call starts one thread per entry in files, but the threads can only run on a single CPU core
# multiple threads
def batch_processor(func, rules, files, util):
    thread_pool = []
    for index, file in enumerate(files):
        th = threading.Thread(target=func, args=(rules, file, util))
        # print('running thread:', th.name)
        th.start()
        thread_pool.append(th)
    for th in thread_pool:
        # print('waiting for thread:', th.name)
        th.join()
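`batch_processor` starts one raw thread per file and waits for all of them. As a point of comparison, here is a small sketch that uses the standard library's `ThreadPoolExecutor` instead of `threading.Thread`. This is a swapped-in technique, not the author's code, and the name `run_in_threads` and the default of 8 workers are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(func, items, max_workers=8):
    """Apply `func` to every item with at most `max_workers` threads.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))
```

With this approach the thread count stays bounded no matter how many files a batch holds, and an exception raised inside a worker is re-raised when its result is consumed rather than being printed and lost.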

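# Multiprocessing: start 4 processes, each of which runs multiple threads. Use as many processes as the CPU has cores;
# most machines are dual-core with four logical processors, so 4 processes can keep the CPU fully busy for the best throughput
# multiple processors
def multi_processor_run(func, sub_func, files, rules, util):
    pool = multiprocessing.Pool(processes=4)
    cnt = 0
    batch_size = 5
    # Hand the files to the pool in batches of batch_size
    while cnt < len(files):
        rear = cnt + batch_size
        if rear > len(files):
            rear = len(files)
        batch = files[cnt + 0:rear]
        pool.apply_async(func, (sub_func, rules, batch, util))
        cnt += batch_size
    pool.close()
    pool.join()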
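The `while` loop above slices `files` into consecutive batches of five before submitting each batch to the pool. The same slicing can be sketched as a small helper. The name `split_batches` is hypothetical and not part of the original program.

```python
def split_batches(items, batch_size):
    """Return consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    return [items[start:start + batch_size]
            for start in range(0, len(items), batch_size)]
```

One thing to keep in mind with `pool.apply_async`: it returns an `AsyncResult`, and an exception raised inside a worker only surfaces when `.get()` is called on that object. Since the loop above discards the return value, a failing batch would go unnoticed unless the worker function reports errors itself.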

Consolidating and saving the results

# This function merges all temporary Excel files cached in the local tmp folder into a single Excel file
# re-format result
def re_format(sheet_size):
    print('re-format file...')
    files = os.listdir('tmp')
    paths = []
    new_book = xlwt.Workbook()
    for file in files:
        if os.path.isfile(os.path.join('tmp', file)) and '.tmp.xls' in file:
            paths.append(os.path.join('tmp', file))

    style = xlwt.XFStyle()
    # background color
    pattern = xlwt.Pattern()
    pattern.pattern = xlwt.Pattern.SOLID_PATTERN
    pattern.pattern_fore_colour = 5
    style.pattern = pattern
    # border
    borders = xlwt.Borders()
    borders.left = xlwt.Borders.THICK
    borders.right = xlwt.Borders.THICK
    borders.top = xlwt.Borders.THICK
    borders.bottom = xlwt.Borders.THICK
    style.borders = borders
    # font
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True
    font.underline = False
    font.italic = False
    style.font = font
    # sheet style-2
    style2 = xlwt.XFStyle()
    # background color
    pattern = xlwt.Pattern()
    pattern.pattern = xlwt.Pattern.SOLID_PATTERN
    pattern.pattern_fore_colour = 22
    style2.pattern = pattern
    # border
    borders = xlwt.Borders()
    borders.left = xlwt.Borders.THICK
    borders.right = xlwt.Borders.THICK
    borders.top = xlwt.Borders.THICK
    borders.bottom = xlwt.Borders.THICK
    style2.borders = borders
    # font
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True
    font.underline = False
    font.italic = False
    style2.font = font
    # sheet style-3
    style3 = xlwt.XFStyle()
    # background color
    pattern2 = xlwt.Pattern()
    pattern2.pattern = xlwt.Pattern.SOLID_PATTERN
    pattern2.pattern_fore_colour = 3
    style3.pattern = pattern2
    # border
    borders = xlwt.Borders()
    borders.left = xlwt.Borders.THICK
    borders.right = xlwt.Borders.THICK
    borders.top = xlwt.Borders.THICK
    borders.bottom = xlwt.Borders.THICK
    style3.borders = borders
    # font
    font = xlwt.Font()
    font<span class="token punctuation">.</span>name <span class="token 
operator">&#61;</span> <span class="token string">&#39;Times New Roman&#39;</span>font<span class="token punctuation">.</span>bold <span class="token operator">&#61;</span> <span class="token boolean">True</span>font<span class="token punctuation">.</span>underline <span class="token operator">&#61;</span> <span class="token boolean">False</span>font<span class="token punctuation">.</span>italic <span class="token operator">&#61;</span> <span class="token boolean">False</span>style3<span class="token punctuation">.</span>font <span class="token operator">&#61;</span> fonttab_cnt <span class="token operator">&#61;</span> <span class="token number">1</span>page_cnt <span class="token operator">&#61;</span> <span class="token number">1</span>tab_rows <span class="token operator">&#61;</span> <span class="token number">0</span>page_rows <span class="token operator">&#61;</span> <span class="token number">0</span>sheet2 <span class="token operator">&#61;</span> new_book<span class="token punctuation">.</span>add_sheet<span class="token punctuation">(</span><span class="token string">&#39;pages-&#39;</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">)</span>sheet1 <span class="token operator">&#61;</span> new_book<span class="token punctuation">.</span>add_sheet<span class="token punctuation">(</span><span class="token string">&#39;tables-&#39;</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0</span><span 
class="token punctuation">,</span> <span class="token string">&#39;File&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;Code&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">&#39;Date&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">,</span> <span class="token string">&#39;Page&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">,</span> <span class="token string">&#39;Method&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token string">&#39;File&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1</span><span 
class="token punctuation">,</span> <span class="token string">&#39;Code&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">&#39;Date&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">,</span> <span class="token string">&#39;Page&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">,</span> <span class="token string">&#39;Method&#39;</span><span class="token punctuation">)</span><span class="token keyword">for</span> index<span class="token punctuation">,</span> <span class="token builtin">file</span> <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>paths<span class="token punctuation">)</span><span class="token punctuation">:</span>book <span class="token operator">&#61;</span> xlrd<span class="token punctuation">.</span>open_workbook<span class="token punctuation">(</span><span class="token builtin">file</span><span class="token punctuation">)</span>sheet <span class="token operator">&#61;</span> book<span class="token punctuation">.</span>sheet_by_name<span class="token punctuation">(</span><span class="token string">&#39;tables&#39;</span><span class="token punctuation">)</span>sheet_pages <span class="token operator">&#61;</span> book<span class="token 
punctuation">.</span>sheet_by_name<span class="token punctuation">(</span><span class="token string">&#39;pages&#39;</span><span class="token punctuation">)</span>tab_rows <span class="token operator">&#43;&#61;</span> sheet<span class="token punctuation">.</span>nrowspage_rows <span class="token operator">&#43;&#61;</span> sheet_pages<span class="token punctuation">.</span>nrows<span class="token keyword">for</span> row <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>sheet<span class="token punctuation">.</span>nrows<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>sheet<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">5</span><span class="token punctuation">:</span>sty1 <span class="token operator">&#61;</span> <span class="token boolean">None</span><span class="token keyword">elif</span> sheet<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span class="token string">&#39;exact&#39;</span><span class="token punctuation">:</span>sty1 <span class="token operator">&#61;</span> style<span class="token keyword">elif</span> sheet<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span 
class="token string">&#39;guess&#39;</span><span class="token punctuation">:</span>sty1 <span class="token operator">&#61;</span> style2<span class="token keyword">elif</span> sheet<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span class="token string">&#39;content-in-table&#39;</span><span class="token punctuation">:</span>sty1 <span class="token operator">&#61;</span> style3<span class="token keyword">else</span><span class="token punctuation">:</span>sty1 <span class="token operator">&#61;</span> <span class="token boolean">None</span><span class="token keyword">for</span> col<span class="token punctuation">,</span> val <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>sheet<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> col <span class="token operator">&gt;</span> <span class="token number">4</span><span class="token punctuation">:</span><span class="token keyword">if</span> sty1 <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">:</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span>tab_cnt<span class="token punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">,</span> sty1<span class="token punctuation">)</span><span class="token keyword">else</span><span class="token punctuation">:</span>sheet1<span class="token punctuation">.</span>write<span class="token 
punctuation">(</span>tab_cnt<span class="token punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">)</span><span class="token keyword">else</span><span class="token punctuation">:</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span>tab_cnt<span class="token punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">)</span>tab_cnt <span class="token operator">&#43;&#61;</span> <span class="token number">1</span>tab_cnt <span class="token operator">&#43;&#61;</span> <span class="token number">1</span><span class="token keyword">for</span> row <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>sheet_pages<span class="token punctuation">.</span>nrows<span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>sheet_pages<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token number">5</span><span class="token punctuation">:</span>sty2 <span class="token operator">&#61;</span> <span class="token boolean">None</span><span class="token keyword">elif</span> sheet_pages<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span class="token string">&#39;exact&#39;</span><span class="token punctuation">:</span>sty2 <span class="token operator">&#61;</span> style<span class="token keyword">elif</span> sheet_pages<span class="token 
punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span class="token string">&#39;guess&#39;</span><span class="token punctuation">:</span>sty2 <span class="token operator">&#61;</span> style2<span class="token keyword">elif</span> sheet_pages<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span> <span class="token operator">&#61;&#61;</span> <span class="token string">&#39;content-in-table&#39;</span><span class="token punctuation">:</span>sty2 <span class="token operator">&#61;</span> style3<span class="token keyword">else</span><span class="token punctuation">:</span>sty2 <span class="token operator">&#61;</span> <span class="token boolean">None</span><span class="token keyword">for</span> col<span class="token punctuation">,</span> val <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>sheet_pages<span class="token punctuation">.</span>row_values<span class="token punctuation">(</span>row<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span><span class="token keyword">if</span> col <span class="token operator">&gt;</span> <span class="token number">4</span><span class="token punctuation">:</span><span class="token keyword">if</span> sty2 <span class="token keyword">is</span> <span class="token operator">not</span> <span class="token boolean">None</span><span class="token punctuation">:</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span>page_cnt<span class="token 
punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">,</span> sty2<span class="token punctuation">)</span><span class="token keyword">else</span><span class="token punctuation">:</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span>page_cnt<span class="token punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">)</span><span class="token keyword">else</span><span class="token punctuation">:</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span>page_cnt<span class="token punctuation">,</span> col<span class="token punctuation">,</span> val<span class="token punctuation">)</span>page_cnt <span class="token operator">&#43;&#61;</span> <span class="token number">1</span>page_cnt <span class="token operator">&#43;&#61;</span> <span class="token number">1</span><span class="token keyword">if</span> tab_rows <span class="token operator">&gt;&#61;</span> sheet_size<span class="token punctuation">:</span>tab_rows <span class="token operator">&#61;</span> <span class="token number">0</span>tab_cnt <span class="token operator">&#61;</span> <span class="token number">1</span>sheet1 <span class="token operator">&#61;</span> new_book<span class="token punctuation">.</span>add_sheet<span class="token punctuation">(</span><span class="token string">&#39;tables-&#39;</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>index<span class="token punctuation">)</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token string">&#39;File&#39;</span><span class="token 
punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;Code&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">&#39;Date&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">,</span> <span class="token string">&#39;Page&#39;</span><span class="token punctuation">)</span>sheet1<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">,</span> <span class="token string">&#39;Method&#39;</span><span class="token punctuation">)</span><span class="token keyword">if</span> page_rows <span class="token operator">&gt;&#61;</span> sheet_size<span class="token punctuation">:</span>page_rows <span class="token operator">&#61;</span> <span class="token number">0</span>page_cnt <span class="token operator">&#61;</span> <span class="token number">1</span>sheet2 <span class="token operator">&#61;</span> new_book<span class="token punctuation">.</span>add_sheet<span class="token punctuation">(</span><span class="token string">&#39;pages-&#39;</span> <span class="token operator">&#43;</span> <span class="token builtin">str</span><span class="token punctuation">(</span>index<span 
class="token punctuation">)</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token string">&#39;File&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">,</span> <span class="token string">&#39;Code&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">,</span> <span class="token string">&#39;Date&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">,</span> <span class="token string">&#39;Page&#39;</span><span class="token punctuation">)</span>sheet2<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">4</span><span class="token punctuation">,</span> <span class="token string">&#39;Method&#39;</span><span class="token punctuation">)</span>new_book<span class="token punctuation">.</span>save<span class="token punctuation">(</span><span class="token string">&#39;tables.xls&#39;</span><span class="token punctuation">)</span>
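
The sheet rollover rule used when merging results (a new sheet is started once the accumulated row count reaches the limit passed to re_format) can be illustrated without any Excel library. The following is only a minimal sketch under stated assumptions: the function name assign_sheets and the consecutive sheet numbering are hypothetical and chosen for illustration, whereas the program itself names new sheets after the file index.

```python
def assign_sheets(rows_per_file, sheet_size):
    """Return, for each input file, the number of the sheet its rows land on.

    Hypothetical helper for illustration only. Rows from one file are never
    split across sheets; a new sheet starts only after the running row count
    of the current sheet has reached sheet_size.
    """
    assignments = []
    sheet_no = 0
    rows_in_sheet = 0
    for n_rows in rows_per_file:
        # The whole file goes onto the current sheet first ...
        assignments.append(sheet_no)
        rows_in_sheet += n_rows
        # ... and only then is the limit checked, so the limit is "soft".
        if rows_in_sheet >= sheet_size:
            sheet_no += 1
            rows_in_sheet = 0
    return assignments


# Five per-PDF result files with these row counts, limit of 5000 rows per sheet
print(assign_sheets([3000, 2500, 100, 4900, 200], 5000))  # [0, 0, 1, 1, 2]
```

Because the check happens after a file has been written, a sheet can end up with more rows than the limit (5500 in the first sheet above). Since the .xls format allows at most 65,536 rows per sheet, the limit should leave enough headroom for the largest single file.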

Program entry point

# Program entry point: main function
if __name__ == '__main__':
    # This call is needed so that the multiprocessing module works correctly when the
    # program is packaged as an exe on Windows. It is unnecessary when the code runs
    # in the Python interpreter, but leaving it in does no harm.
    multiprocessing.freeze_support()

    # Parameters required by the program
    base_dir = r'./'  # working directory: the directory that contains this program
    out_path = os.path.join(base_dir, 'result.xls')  # output file name
    demo = os.path.join(base_dir, 'Demo.xlsx')  # Demo file name

    # Create noResult.txt, which records the names of PDF files that produced no result
    with open('noResult.txt', 'w', encoding='utf-8') as fp:
        fp.write(str(datetime.now()) + '\n')

    # Initialize the Util class
    util = Util(os.path.join(base_dir, 'test'), out_path, demo)
    rules = util.load_demo()
    folder = util.load_folder()

    # Run in multi-process mode; to run in single-threaded mode only,
    # this call can be replaced with the run function
    multi_processor_run(batch_processor, run, folder, rules, util)

    # Save the results: 5000 means a single sheet holds at most 5000 rows;
    # beyond that a new sheet is created
    # save...
    re_format(5000)

    # Remove temporary files. While the program runs, they are stored in the tmp
    # folder of the current directory, one Excel file per PDF. re_format merges
    # them into a single Excel file. To keep the per-PDF results, comment out
    # the line below.
    shutil.rmtree('tmp')
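
As an alternative to commenting out the final clean-up line when the per-PDF results should be kept, the deletion can be placed behind a flag. This is only a sketch: the helper name cleanup_tmp and its keep parameter are hypothetical and not part of the original program.

```python
import shutil
from pathlib import Path


def cleanup_tmp(tmp_dir="tmp", keep=False):
    """Delete the temporary result folder unless keep is True.

    Hypothetical helper for illustration only.
    Returns True if the folder was removed, False otherwise.
    """
    path = Path(tmp_dir)
    # Nothing to do if the caller wants to keep the files or the folder is absent
    if keep or not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

Calling cleanup_tmp('tmp', keep=True) leaves the per-PDF Excel files in place, while cleanup_tmp('tmp') removes them. Unlike a bare shutil.rmtree('tmp'), the helper does not raise an error when the folder does not exist.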

7. Download location of the complete project code

/yooongchun/PDFParser/blob/master/PDFTable/ExtractTables.py
