Can Practical Scripts Do Batch Extraction?

wen Practical Scripts 2026-06-10 11

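Contents:

Can Practical Scripts Do Batch Extraction?

Batch-extracting specific text from multiple PDF/Word/Excel files
Batch-extracting data from web pages (scraping)
Batch-extracting text from images (OCR)
Batch-extracting files from multiple archives
Batch-exporting file names/paths to Excel
Batch-extracting error messages from log files
Summary: How do you choose the most practical method?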

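Absolutely. Batch extraction is one of the things scripts do best. Which script to use depends on what you want to extract (text, files, data, images) and where it comes from (web pages, PDFs, Excel files, folders).

Below are the most common and practical batch-extraction script examples, covering different scenarios: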

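Batch-extracting specific text from multiple PDF/Word/Excel files

Scenario: you have 100 resume PDFs and want to pull the mobile number and email address out of each one.

Script (Python + pdfplumber + re):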

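import os
import re

import pdfplumber

def extract_info_from_pdf(pdf_path):
    """Return (phones, emails) found in the text layer of one PDF."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for a page with no text layer (e.g. a scan)
            pages.append(page.extract_text() or "")
    text = "\n".join(pages)
    # Mainland China mobile numbers: 11 digits, not embedded in a longer digit run
    phones = re.findall(r'(?<!\d)1[3-9]\d{9}(?!\d)', text)
    # Email addresses
    emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
    return phones, emails

# Process every PDF in the folder
folder = "/path/to/your/pdf_folder"
for filename in os.listdir(folder):
    if filename.lower().endswith(".pdf"):
        phones, emails = extract_info_from_pdf(os.path.join(folder, filename))
        print(f"{filename}: phones={phones}, emails={emails}")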

Tools: if you would rather not write code, use Tabula (PDF table extraction) or the batch export feature in Adobe Acrobat Pro.
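For 100 resumes, a spreadsheet-friendly file is usually more useful than printed output. The sketch below shows one way to de-duplicate the matches and write one CSV row per file. The helper names (`unique`, `extract_contacts`, `write_contacts_csv`), the column names, and the sample string are made up for illustration; it runs on an in-memory string rather than a real PDF, so it needs only the standard library.

```python
import csv
import io
import re

# Lookarounds keep an 11-digit run inside a longer number (e.g. an ID) from matching.
PHONE_RE = re.compile(r'(?<!\d)1[3-9]\d{9}(?!\d)')
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

def extract_contacts(text):
    """Return (phones, emails) found in already-extracted text."""
    return unique(PHONE_RE.findall(text)), unique(EMAIL_RE.findall(text))

def write_contacts_csv(rows, out):
    """rows: iterable of (filename, phones, emails); out: a writable text file."""
    writer = csv.writer(out)
    writer.writerow(['file', 'phones', 'emails'])
    for filename, phones, emails in rows:
        writer.writerow([filename, ';'.join(phones), ';'.join(emails)])

# Demo on an in-memory string (hypothetical sample data, not a real resume).
sample = "Tel: 13812345678 / 13812345678, mail: zhang@example.com, ID 110101199001011234"
phones, emails = extract_contacts(sample)
buffer = io.StringIO()
write_contacts_csv([("resume_01.pdf", phones, emails)], buffer)
print(buffer.getvalue())
```

To use it on real files, pass the text returned by your PDF extraction step to `extract_contacts` and open a real file in place of the `io.StringIO` buffer.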

Batch-extracting data from web pages (scraping)

Scenario: you want to collect the name and price of every product on an e-commerce site.

Script (Python + requests + BeautifulSoup):

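import csv

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
all_data = []
for page in range(1, 11):  # scrape pages 1-10
    url = base_url.format(page)
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # The tag and class names below are placeholders; adjust them to the target site
    products = soup.find_all('div', class_='product-item')
    for product in products:
        name_tag = product.find('h2')
        price_tag = product.find('span', class_='price')
        if name_tag is None or price_tag is None:
            continue  # skip items that are missing a name or a price
        all_data.append([name_tag.get_text(strip=True), price_tag.get_text(strip=True)])
# Write the results to CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'price'])
    writer.writerows(all_data)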

Tools: an easier route is Web Scraper (a browser extension) or Octoparse (a visual scraper).
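Scraped prices arrive as text such as "¥1,299.00", which cannot be sorted or summed until it is converted to a number. Here is a minimal sketch of that cleanup step. `parse_price` is a hypothetical helper, the sample strings are invented, and the code assumes commas are thousands separators (which is not true for every locale).

```python
import re
from decimal import Decimal

# First run of digits, with optional thousands commas and a decimal part.
PRICE_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')

def parse_price(text):
    """Return the first number in a price string as a Decimal, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Assumption: commas are thousands separators, so they can simply be dropped.
    return Decimal(match.group().replace(',', ''))

for raw in ['¥1,299.00', '  $ 45.5 ', 'Price: 88', 'Sold out']:
    print(repr(raw), '->', parse_price(raw))
```

`Decimal` is used rather than `float` so that currency amounts keep their exact decimal digits.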

Batch-extracting text from images (OCR)

Scenario: you have 100 screenshots or scanned documents and want the text they contain.

Script (Python + pytesseract):

import os

import pytesseract
from PIL import Image

def extract_text_from_image(image_path):
    # lang='chi_sim' requires the Simplified Chinese Tesseract language data
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang='chi_sim')

folder = "/path/to/your/images"
for filename in os.listdir(folder):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        image_path = os.path.join(folder, filename)
        text = extract_text_from_image(image_path)
        # Save next to the image as <name>.txt
        txt_path = os.path.splitext(image_path)[0] + '.txt'
        with open(txt_path, 'w', encoding='utf-8') as f:
            f.write(text)
        print(f"Extracted: {filename}")

Tools: the online tool OnlineOCR.net and the desktop application ABBYY FineReader both support batch processing.
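OCR output for Chinese text can contain stray spaces between characters, which breaks later searching. The sketch below shows one possible cleanup pass, assuming you only want to close gaps between CJK characters and leave spaces in Latin text alone. `clean_ocr_text` and the sample string are illustrative, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Basic CJK Unified Ideographs block; rarer characters outside it are not covered.
CJK = r'\u4e00-\u9fff'
GAP_BETWEEN_CJK = re.compile(rf'(?<=[{CJK}])[ \t]+(?=[{CJK}])')

def clean_ocr_text(text):
    """Remove spaces between Chinese characters and trim each line."""
    lines = (GAP_BETWEEN_CJK.sub('', line).strip() for line in text.splitlines())
    return '\n'.join(lines)

raw = "批 量 提 取  文 字\nOCR test 2026 \n"
print(clean_ocr_text(raw))
```

The pattern matches only spaces and tabs, so line breaks in the OCR output are preserved.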

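Batch-extracting files from multiple archives

Scenario: you have 100 zip files, each containing one image, and want to unpack them all into a single folder in one go.

Script (Python + zipfile):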

import os
import zipfile

folder = "/path/to/your/zips"
output = "/path/to/output"
for filename in os.listdir(folder):
    if filename.lower().endswith('.zip'):
        with zipfile.ZipFile(os.path.join(folder, filename), 'r') as zip_ref:
            # Files that share a name across archives overwrite each other here
            zip_ref.extractall(output)
        print(f"Extracted: {filename}")

Tools: 7-Zip and WinRAR both support batch extraction (select all, then right-click and extract).
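Extracting every archive into one folder means that files with the same name (for example, an `image.png` in each zip) silently replace one another. One possible way around that is to give each archive its own subfolder, as sketched below. `extract_each_to_subfolder` is a hypothetical helper, and the demo builds two throwaway archives in a temporary directory so it can run anywhere.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_each_to_subfolder(zip_folder, output_folder):
    """Extract every .zip in zip_folder into output_folder/<archive name>/."""
    output_folder = Path(output_folder)
    extracted = []
    for zip_path in sorted(Path(zip_folder).glob('*.zip')):
        target = output_folder / zip_path.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted

# Demo with two throwaway archives that both contain "photo.png".
with tempfile.TemporaryDirectory() as tmp:
    zips, out = Path(tmp) / 'zips', Path(tmp) / 'out'
    zips.mkdir()
    for name in ('a', 'b'):
        with zipfile.ZipFile(zips / f'{name}.zip', 'w') as archive:
            archive.writestr('photo.png', f'data from {name}')
    extract_each_to_subfolder(zips, out)
    result = sorted(p.relative_to(out).as_posix() for p in out.rglob('*') if p.is_file())
    print(result)
```

If you really do need a single flat folder, an alternative is to prefix each extracted file with its archive name instead.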

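Batch-exporting file names/paths to Excel

Scenario: you want a list of every file in a folder so that you can check it against another record.

Script (Python):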

import os

import pandas as pd

folder = "/path/to/your/folder"
data = []
# os.walk also descends into subfolders
for root, dirs, files in os.walk(folder):
    for file in files:
        full_path = os.path.join(root, file)
        size = os.path.getsize(full_path)
        data.append([file, full_path, size])
df = pd.DataFrame(data, columns=['File name', 'Full path', 'Size (bytes)'])
# Writing .xlsx requires the openpyxl package
df.to_excel('file_list.xlsx', index=False)

Tools: on Windows, the command dir /s /b /a-d > filelist.txt is enough (/a-d leaves directories out of the list); on Mac/Linux, use find . -type f > filelist.txt.
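If installing pandas just for a file list feels heavy, a CSV written with the standard library opens in Excel as well. The following is a sketch under that assumption. `write_file_list` and the column names are illustrative, and the demo runs against a temporary folder created on the spot.

```python
import csv
import tempfile
from pathlib import Path

def write_file_list(folder, csv_path):
    """Write name, path and size of every file under folder to csv_path."""
    folder = Path(folder)
    rows = [
        (p.name, str(p), p.stat().st_size)
        for p in sorted(folder.rglob('*'))
        if p.is_file()
    ]
    # utf-8-sig adds a BOM, which helps Excel detect UTF-8 for non-ASCII names.
    with open(csv_path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'full_path', 'size_bytes'])
        writer.writerows(rows)
    return len(rows)

# Demo on a throwaway folder with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'docs'
    (root / 'sub').mkdir(parents=True)
    (root / 'a.txt').write_text('hello', encoding='utf-8')
    (root / 'sub' / 'b.txt').write_text('hi', encoding='utf-8')
    listing = Path(tmp) / 'file_list.csv'
    count = write_file_list(root, listing)
    with open(listing, newline='', encoding='utf-8-sig') as f:
        table = list(csv.reader(f))
    print(count, table[0])
```

Keep the output file outside the folder being scanned; otherwise the listing ends up describing itself on the next run.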

Batch-extracting error messages from log files

Scenario: you have dozens of log files and want every line that contains "ERROR".

Script (Python):

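import os

folder = "/path/to/logs"
output_file = "errors.txt"
with open(output_file, 'w', encoding='utf-8') as out:
    for filename in sorted(os.listdir(folder)):
        if filename.endswith('.log'):
            # errors='replace' keeps one badly encoded byte from aborting the run
            with open(os.path.join(folder, filename), 'r', encoding='utf-8', errors='replace') as f:
                for line in f:
                    if 'ERROR' in line:
                        # Prefix each hit with the file it came from
                        out.write(f"[{filename}] {line}")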

Tools: Linux/Mac users can do it with a single command:

grep "ERROR" /path/to/logs/*.log > errors.txt
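A plain substring test also matches names such as NO_ERRORS_FOUND, and a raw dump of lines does not tell you which file is the noisiest. The sketch below tightens the match with a word boundary and counts hits per file. `iter_error_lines`, `count_errors`, and the sample log lines are all invented for illustration; real logs may need a different pattern.

```python
import re
from collections import Counter

# \b keeps identifiers such as NO_ERRORS_FOUND from matching.
ERROR_RE = re.compile(r'\bERROR\b')

def iter_error_lines(lines):
    """Yield (line_number, line without trailing newline) for ERROR lines."""
    for number, line in enumerate(lines, start=1):
        if ERROR_RE.search(line):
            yield number, line.rstrip('\n')

def count_errors(logs):
    """logs maps a file name to its lines; returns a Counter of errors per file."""
    return Counter({name: sum(1 for _ in iter_error_lines(lines))
                    for name, lines in logs.items()})

# Hypothetical sample data standing in for real log files.
sample_logs = {
    'app.log': ['2026-06-10 INFO start\n', '2026-06-10 ERROR db timeout\n',
                '2026-06-10 ERROR retry failed\n'],
    'web.log': ['2026-06-10 WARN slow\n', '2026-06-10 INFO NO_ERRORS_FOUND\n'],
}
hits = list(iter_error_lines(sample_logs['app.log']))
counts = count_errors(sample_logs)
print(hits)
print(counts.most_common())
```

Because `iter_error_lines` accepts any iterable of lines, an open file object can be passed to it directly.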

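How do you choose the most practical method?

Scenario	Recommended approach
PDF/Word/Excel extraction	Python + pdfplumber/pandas
Web data extraction	Browser extension (Web Scraper)
Image OCR	Online tool or desktop software
Archive/file batch processing	Built-in system commands or Python
Log/text batch extraction	Built-in `grep` or Python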

What kind of content do you want to extract? Tell me the source and format, and I can give you a more precise, ready-to-run script.