Python 自動化拆分筆記

自動化拆分筆記，釋放 Zettelkasten 筆記法的真正力量

背景

在這個資訊爆炸的時代，管理和整理知識變得尤為重要。Zettelkasten 筆記法提倡將知識分割成最小的單位來促進學習和創新。這種方法強調將大量的資訊分解成易於管理和檢索的小部分。然而，當面對一份龐大的 Markdown 文件時，手動將其拆分成多個小單位既耗時又低效。因此，開發一個自動化工具來分割 Markdown 文件變得非常必要，以便更好地遵循 Zettelkasten 的原則。

方法

在這裡，我們介紹了一個名為split_by_h2的 Python 腳本，由 Hsieh-Ting Lin 開發。這個腳本的目的是自動化地將一份 Markdown 文件分割成由二級標題（H2）定義的多個小單位。這個過程大致可以分為以下幾個步驟：

讀取 Markdown 文件：腳本首先讀取指定的 Markdown 文件，並將其內容儲存起來。
處理內容：通過正則表達式的應用，腳本刪除特定的段落（例如，與同級標題相關的內容），並根據一級和二級標題分割文件內容。
生成新的 Markdown 文件：對於每個二級標題及其相應的內容，腳本創建一個新的 Markdown 文件。每個新文件包括一個 YAML 頭部、來源信息、標題和內容，並在文末添加與其他文件的連接列表，以促進相互之間的聯繫。
更新原文件：在原始 Markdown 文件中，腳本更新一級標題下的內容，並在底部添加一個連接到新創建文件的列表，方便跨文件導航。

import os
import re
from datetime import datetime


class MarkdownProcessor:
    def __init__(self, file_path):
        self.file_path = file_path
        self.original_filename = os.path.basename(file_path).replace(".md", "")
        self.content = self._read_markdown_file()

    def _read_markdown_file(self):
        with open(self.file_path, "r", encoding="utf-8") as f:
            return f.read()

    def _create_new_md_file(self, title, content, yaml_extra=""):
        current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        yaml_header = f"""---
title: "{title}"
date: "{current_time}"
enableToc: false
tags:
  - building
---

"""
        info_block = f"> [!info]\n>\n> 🌱來自：[[{self.original_filename}]]\n\n"
        return yaml_header + info_block + f"# {title}\n\n{content}"

    def _search_siblings_to_next_heading(
        self, pattern=r"### Siblings.*?(?=\n[#]+ |$)", flags=re.DOTALL
    ):
        match = re.search(pattern, self.content, flags=flags)
        return match.group(0).strip() if match else ""

    def _delete_siblings_from_text(
        self, pattern=r"### Siblings.*?(?=\n[#]+ |$)", flags=re.DOTALL
    ):
        self.content = re.sub(pattern, "", self.content, flags=flags)

    def _get_content_before_first_h2(self):
        match = re.search(r"##(?!#)", self.content)
        if match:
            return self.content[: match.start()].strip()
        else:
            return self.content.strip()

    # def _process_markdown_content(self):
    #     self._delete_siblings_from_text()
    #     pre_h1_content, h1_and_following_content = self.content.split("# ", 1)
    #     h1_title, post_h1_content = h1_and_following_content.split("\n", 1)
    #     between_h1_h2 = self._get_content_before_first_h2()
    #     level_2_heading_pattern = re.compile(r"## (.+?)\n(.*?)(?=\n## |\Z)", re.DOTALL)
    #     level_2_headings = level_2_heading_pattern.findall(post_h1_content)
    #     return pre_h1_content, h1_title, between_h1_h2, level_2_headings
    def _process_markdown_content(self):
        self._delete_siblings_from_text()
        # 使用正則表達式直接分割和提取所需內容
        pre_h1_content, h1_title, post_h1_content = re.split(
            r"#\s(.+?)\n", self.content, 1
        )
        between_h1_h2 = self._get_content_before_first_h2()
        # 提取二級標題及其後的內容
        level_2_headings = re.findall(
            r"##\s(.+?)\n(.*?)(?=\n## |\Z)", post_h1_content, re.DOTALL
        )
        return pre_h1_content, h1_title, between_h1_h2, level_2_headings

    def save_new_markdown_files(self):
        pre_h1_content, h1_title, between_h1_h2, level_2_headings = (
            self._process_markdown_content()
        )
        new_md_files = {
            heading: self._create_new_md_file(heading, content)
            for heading, content in level_2_headings
        }

        wikilink_list = "\n".join(
            [
                f"- [[{heading.lower().replace(' ', '_')}.md|{heading}]]"
                for heading in new_md_files
            ]
        )

        for filename, content in new_md_files.items():
            with open(
                f"{filename.lower().replace(' ', '_')}.md", "w", encoding="utf-8"
            ) as f:
                to_write = f"{content}\n\n### Siblings\n\n{wikilink_list}\n\n"
                f.write(to_write)

        main_content_updated = (
            f"{pre_h1_content}# {h1_title}\n{between_h1_h2}\n\n{wikilink_list}\n\n"
        )
        self._update_original_file(main_content_updated)

    def _update_original_file(self, content):
        with open(self.file_path, "w", encoding="utf-8") as f:
            f.write(content)


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: python script.py <input_file>")
        sys.exit(1)
    file_path = sys.argv[1]
    processor = MarkdownProcessor(file_path)
    processor.save_new_markdown_files()

討論

使用split_by_h2腳本自動化分割 Markdown 文件具有多方面的好處。首先，它加速了將大文件拆分成 Zettelkasten 方法推薦的小單位的過程，從而節省了大量時間和精力。其次，這種方法增強了資訊的可檢索性和可連接性，使得知識的組織和管理更加高效。最後，自動化工具的使用減少了人為錯誤，確保了分割過程的一致性和準確性。

結論

split_by_h2腳本是一個強大的工具，用於將大型 Markdown 文件自動化地分割成遵循 Zettelkasten 筆記法的小單位。通過簡化分割過程，它不僅提高了知識管理的效率，而且還增強了學習和創新的能力。此腳本對於那些尋求優化他們的筆記和知識管理系統的人來說，是一個有價值的資源。

Published Mar 6, 2024

血液腫瘤科專研醫師，任職於和信治癌中心醫院。林協霆醫師 on Twitter