2022-08-26
Converting PDF to Word with Python
Today I tried converting PDF files to Word with Python. The code is adapted from other people's work (see the references at the end):

    import logging
    import os
    import re
    from io import StringIO

    from docx import Document
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    # process_pdf belongs to the older pdfminer API (for example the pdfminer3k package).
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf

    # Keep pdfminer's log output quiet.
    logging.Logger.propagate = False
    logging.getLogger().setLevel(logging.ERROR)


    def read_from_pdf(file_path):
        """Extract the text of a PDF file and return it as a single string."""
        with open(file_path, 'rb') as file:
            resource_manager = PDFResourceManager()
            return_str = StringIO()
            lap_params = LAParams()
            device = TextConverter(resource_manager, return_str, laparams=lap_params)
            process_pdf(resource_manager, device, file)
            device.close()
            content = return_str.getvalue()
            return_str.close()
            return content


    def save_text_to_word(content, file_path):
        """Write each line of the text as its own paragraph in a .docx file."""
        doc = Document()
        for line in content.split('\n'):
            paragraph = doc.add_paragraph()
            paragraph.add_run(remove_control_characters(line))
        doc.save(file_path)


    def remove_control_characters(content):
        """Drop the control characters (code points 0-31)."""
        mpa = dict.fromkeys(range(32))
        return content.translate(mpa)


    def pdf_to_word(pdf_file_path, word_file_path):
        content = read_from_pdf(pdf_file_path)
        # A line break between two word characters becomes a space.
        content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
        # A word hyphenated across a line break is joined back together.
        content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
        content = re.sub(r' \n ', '', content)
        # Remaining line breaks are removed unless the line ends with a period.
        content = re.sub(r'([^.])\n', r'\1', content)
        # Remove the "(cid:NN)" placeholders left for glyphs that could not be mapped.
        content = re.sub(r'\(cid:\d{1,2}\)', '', content)
        # Save the fully cleaned text.
        save_text_to_word(content, word_file_path)


    if __name__ == "__main__":
        # Convert every .pdf file in the "pdf" directory to a .docx next to it.
        root_path = 'pdf'
        for file in os.listdir(root_path):
            file_name, extension_name = os.path.splitext(file)
            if extension_name != '.pdf':
                continue
            pdf_file = os.path.join(root_path, file)
            word_file = os.path.join(root_path, file_name + '.docx')
            print('Processing:', file)
            pdf_to_word(pdf_file, word_file)
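The text cleanup done in pdf_to_word can be tried on its own, without a PDF or any third-party package. The following is only a minimal sketch: it applies three of the substitutions from the code above to a made-up sample string, and the helper name clean_extracted_text and the sample text are assumptions for illustration, not part of the original code.

```python
import re


def remove_control_characters(text):
    # Map code points 0-31 to None so that str.translate drops them.
    return text.translate(dict.fromkeys(range(32)))


def clean_extracted_text(text):
    # Hypothetical helper: a subset of the substitutions used in pdf_to_word.
    # A line break between two word characters becomes a space.
    text = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', text)
    # A word hyphenated across a line break is joined back together.
    text = re.sub(r'-\n([0-9a-zA-Z_])', r'\1', text)
    # "(cid:NN)" placeholders are removed.
    text = re.sub(r'\(cid:\d{1,2}\)', '', text)
    return text


# Made-up sample: a wrapped line, a hyphenated word, a cid placeholder
# and a form feed character (a control character, code point 12).
sample = ("PDF text is often wrapped\nmid-sentence and hyphen-\nated."
          "(cid:12)\n\x0cNext paragraph.")

cleaned = clean_extracted_text(sample)
# Each remaining line would become one Word paragraph.
paragraphs = [remove_control_characters(line) for line in cleaned.split('\n')]
print(paragraphs)
```

Note that remove_control_characters also strips tabs, so words separated only by a tab would be glued together; replacing tabs with spaces first may be preferable for such input.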
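I tested this on macOS. I plan to share how to parse scanned PDFs another day. This approach has clear limits: it only carries over the extracted text, so the layout, images and tables of the PDF are not preserved.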
References
[1] Python code for batch-converting PDF to Word.
[2] Converting PDF to Word with Python. https://jianshu.com/p/49c5abfee649