Blog TOC Generator (Python)

1. Purpose
2. How it works
3. Code
4. Integration

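1. Purpose

I previously wrote a script for batch-processing blog images (the code is on GitHub here, and the blog post is here). With little to do over the May Day holiday, I wrote a tool that generates the table of contents (TOC) for a blog post. I no longer have to edit the TOC by hand, entry by entry, which is a real gain in efficiency. For this older post, I copied every heading and link into the TOC one at a time, which took some perseverance, as the figure below shows.

So, to raise "productivity" and free up time for things I actually want to do, I wrote this script. In principle it is not limited to my own blog: for any content written strictly in Markdown syntax, the script can extract the headings and generate a TOC automatically.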

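2. How it works

Overall, the principle is very simple. First, read the Markdown content and find the headings. Next, generate the id of the HTML tag for each heading according to fixed rules; this id is the anchor that a TOC entry jumps to when clicked. Then build the URL of the current post from its file name, following the rules Jekyll uses. After that, join the post URL and the heading anchor in Markdown link syntax to form each TOC entry. Finally, use Markdown nested lists to lay out the multi-level TOC.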
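The steps above can be sketched in a few lines. This is a minimal Python 3 illustration, not the script itself (the real script is in the next section). The names `heading_to_entry` and `STRIP_CHARS`, the example heading, and the example post URL are assumptions made for illustration; the set of stripped characters simply mirrors the punctuation the script handles.

```python
# Characters dropped when building the anchor id (mirrors the script's delete* helpers).
STRIP_CHARS = ".()`、“”[]+&（）/,，："


def heading_to_entry(line, base_url):
    """Turn a Markdown heading line into (level, Markdown link)."""
    hashes, _, title = line.strip().partition(" ")
    level = hashes.count("#")
    title = title.strip()
    # Anchor id: drop punctuation, turn spaces into hyphens, lowercase.
    anchor = "".join(ch for ch in title if ch not in STRIP_CHARS)
    anchor = anchor.replace(" ", "-").lower()
    return level, "[{}]({}#{})".format(title, base_url, anchor)


level, link = heading_to_entry(
    "### 2.1 Read the File",
    "http://example.com/blog/2018/05/01/TOC.html",  # hypothetical post URL
)
print(level, link)
```

The displayed text keeps the original heading, while only the anchor part is normalised, which is the same split the script makes between `title_for_show` and the id.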

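3. Code

The code is below; it is also on GitHub, in a repository named TOCgenerator. Note that it targets Python 2 (it relies on `raw_input` and on `decode` for byte strings).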

# coding=utf-8

def deleteLeftParentheses(string):
    if string.__contains__('('):
        left_col = string.find('(')
        string = string[:left_col] + string[left_col + 1:]
        return string


def deleteRightParentheses(string):
    if string.__contains__(')'):
        right_col = string.find(')')
        string = string[:right_col] + string[right_col + 1:]
        return string


def deleteDot(string):
    if string.__contains__('.'):
        dot = string.find('.')
        string = string[:dot] + string[dot + 1:]
        return string


def deleteSingleQuoteMark(string):
    if string.__contains__('`'):
        pie = string.find('`')
        string = string[:pie] + string[pie + 1:]
        return string


def deleteLeftDoubleQuoteMark(string):
    if string.__contains__('“'):
        left_double = string.find('“')
        string = string[:left_double] + string[left_double + 3:]
        return string


def deleteRightDoubleQuoteMark(string):
    if string.__contains__('”'):
        right_double = string.find('”')
        string = string[:right_double] + string[right_double + 3:]
        return string


def deleteDunHao(string):
    if string.__contains__('、'):
        dun = string.find('、')
        string = string[:dun] + string[dun + 3:]
        return string


def deleteLeftBracket(string):
    if string.__contains__('['):
        left_bracket = string.find('[')
        string = string[:left_bracket] + string[left_bracket + 1:]
        return string


def deleteRightBracket(string):
    if string.__contains__(']'):
        right_bracket = string.find(']')
        string = string[:right_bracket] + string[right_bracket + 1:]
        return string


def deleteAdd(string):
    if string.__contains__('+'):
        add_plus = string.find('+')
        string = string[:add_plus] + string[add_plus + 1:]
        return string


def deleteAnd(string):
    if string.__contains__('&'):
        And = string.find('&')
        string = string[:And] + string[And + 1:]
        return string


def deleteZuoKuohao(string):
    if string.__contains__('（'):
        zuo = string.find('（')
        string = string[:zuo] + string[zuo + 3:]
        return string


def deleteYouKuohao(string):
    if string.__contains__('）'):
        you = string.find('）')
        string = string[:you] + string[you + 3:]
        return string


def deleteXiexian(string):
    if string.__contains__('/'):
        xie = string.find('/')
        string = string[:xie] + string[xie + 1:]
        return string


def deleteDouhao(string):
    if string.__contains__(','):
        dou = string.find(',')
        string = string[:dou] + string[dou + 1:]
        return string


def deleteDouhaoZH(string):
    if string.__contains__('，'):
        dou = string.find('，')
        string = string[:dou] + string[dou + 3:]
        return string


def deleteMaohaoZH(string):
    if string.__contains__('：'):
        mao = string.find('：')
        string = string[:mao] + string[mao + 3:]
        return string


def replaceDunHao(string, replace):
    if string.__contains__('、'):
        dun = string.find('、')
        # The enumeration comma (、) is a Chinese character that takes 3 bytes in UTF-8, so skip 3 bytes
        string = string[:dun] + replace + string[dun + 3:]
        return string


def replaceSharp(string, replace):
    if string.__contains__('#'):
        sharp = string.find('#')
        string = string[:sharp] + replace + string[sharp + 1:]
        return string


def replaceSpace(string, replace):
    if string.__contains__(' '):
        space = string.find(' ')
        string = string[:space] + replace + string[space + 1:]
        return string


def replaceColon(string, replace):
    if string.__contains__('：'):
        colon = string.find('：')
        string = string[:colon] + replace + string[colon + 3:]
        return string


def getTitle(raw_str):
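    # 1. Split the leading '#' marks from the heading text and determine the heading level
    res = raw_str.split(' ')
    level = res[0].count('#')
    # Re-join the heading text so that headings containing several spaces are not cut short
    title = ""
    for i in range(1, res.__len__()):
        # Put back the spaces inside the heading that split() removed
        title = title + " " + res[i]
    # Strip the leading space from the heading
    title = title.lstrip()
    # Strip the trailing newline from the heading
    title = title.strip('\n')
    title_for_show = title

    # 2. Delete special characters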
    while title.__contains__('.'):
        title = deleteDot(title)
    while title.__contains__('('):
        title = deleteLeftParentheses(title)
    while title.__contains__(')'):
        title = deleteRightParentheses(title)
    while title.__contains__('`'):
        title = deleteSingleQuoteMark(title)
    while title.__contains__('、'):
        title = deleteDunHao(title)
    while title.__contains__('“'):
        title = deleteLeftDoubleQuoteMark(title)
    while title.__contains__('”'):
        title = deleteRightDoubleQuoteMark(title)
    while title.__contains__('['):
        title = deleteLeftBracket(title)
    while title.__contains__(']'):
        title = deleteRightBracket(title)
    while title.__contains__('+'):
        title = deleteAdd(title)
    while title.__contains__('&'):
        title = deleteAnd(title)
    while title.__contains__('（'):
        title = deleteZuoKuohao(title)
    while title.__contains__('）'):
        title = deleteYouKuohao(title)
    while title.__contains__('/'):
        title = deleteXiexian(title)
    while title.__contains__(','):
        title = deleteDouhao(title)
    while title.__contains__('，'):
        title = deleteDouhaoZH(title)
    while title.__contains__('：'):
        title = deleteMaohaoZH(title)
    while title.__contains__(' '):
        title = replaceSpace(title, '-')

    # 3. Convert uppercase English letters to lowercase
    res = title.lower()

    return level, title_for_show, res


def getLink(base, string):
    while string.__contains__('、'):
        string = replaceDunHao(string, '-')
    while string.__contains__('#'):
        string = replaceSharp(string, '-')
    while string.__contains__(' '):
        string = replaceSpace(string, '-')
    while string.__contains__('：'):
        string = replaceColon(string, '-')
    return base + "#" + string


def getBase(year, month, day, filename, part):
    res = part + "/" + year + "/" + month + "/" + day + "/" + filename + ".html"
    while res.__contains__('、'):
        res = replaceDunHao(res, '-')
    while res.__contains__('#'):
        res = replaceSharp(res, '-')
    while res.__contains__(' '):
        res = replaceSpace(res, '-')
    while res.__contains__('：'):
        res = replaceColon(res, '-')
    return res


def splitInfo(file_path):
    # The file name has a fixed format: xxxx-xx-xx-xxxxx.md
    index = file_path.rfind('\\')
    filename = file_path[index + 1:]
    temp = filename.split('-')
    year = filename[0:4]
    month = filename[5:7]
    day = filename[8:10]
    filename = filename[11:]
    return year, month, day, filename


def generateTOC(level, content):
    if level == 0:
        content = "- " + content
    elif level == 1:
        content = "\t- " + content
    elif level == 2:
        content = "\t\t- " + content
    elif level == 3:
        content = "\t\t\t- " + content
    elif level == 4:
        content = "\t\t\t\t- " + content
    elif level == 5:
        content = "\t\t\t\t\t- " + content
    elif level == 6:
        content = "\t\t\t\t\t\t- " + content
    return content


def execFunction(input_path):
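    # Common part of the blog URL; replace it with your own
    part = "http://zhaoxuhui.top/blog"

    # Decide whether the path is typed in manually or passed in automatically
    flag = raw_input("Auto input file path?y/n\n")
    if flag == "y":
        path = input_path
    else:
        path = raw_input("Input path of file:\n")

    # Use decode() to handle file names that contain Chinese characters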
    f = open(path.decode('utf8'), 'r')
    headers = []
    lines = []
    line = f.readline()
    lines.append(line)
    while line:
        line = f.readline()
        lines.append(line)
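        # Headings are detected by counting '#' marks. Level-1 and level-2 headings are
        # generally not used in the blog (too large and unattractive), so any line that
        # contains two consecutive '#' marks is treated as a heading
        if line.__contains__("##"):
            headers.append(line.decode('utf-8').encode('utf-8'))
    f.close()
    # If no heading was found, exit the program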
    if headers.__len__() == 0:
        print("No title.")
        exit()

    if flag != "y":
        correct_path = path[:-3]
    else:
        # Get the correct name with the `_auto` suffix removed
        correct_path = path[:-8] + ".md"
    year, month, day, name = splitInfo(correct_path)
    base = getBase(year, month, day, name, part)

    formatted_title = []
    links = []
    bookmarks = []
    format_mark = []
    levels = []
    for item in headers:
        # Get the id that corresponds to the heading
        lev, show, res = getTitle(item)
        formatted_title.append(res)
        # Build the URL from the post's address and the heading id
        link = getLink(base, res)
        links.append(link)
        levels.append(lev)
        # Assemble the hyperlink in Markdown format
        content = "[" + show + "](" + link + ")"
        bookmarks.append(content)

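    # Find the highest heading level (smallest number) and use it as the top list level,
    # because many posts start directly at level 3 or even level 4.
    # Leaving empty tiers for levels 1 and 2 would look bad, so level 3 or 4 becomes the first tier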
    min_level = (min(levels))
    for i in range(levels.__len__()):
        levels[i] = levels[i] - min_level

    # Generate the TOC in Markdown syntax according to each heading's level
    for i in range(bookmarks.__len__()):
        format_mark.append(generateTOC(levels[i], bookmarks[i]))

    new_lines = []
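    # Following my own post format, insert the TOC after the first 12 lines
    for i in range(12):
        new_lines.append(lines[i])
    # Write the TOC
    for item in format_mark:
        new_lines.append(item + "\n")
    # Add a divider between the TOC and the body
    new_lines.append("<hr style=\"margin:0em 0em 1.75em 0em;\">\n")
    # Custom separator between TOC and body, matched by the blog template
    new_lines.append("<!--break-->")
    # Write the rest of the body (index 11 was already copied above, so that line appears twice)
    for i in range(11, lines.__len__()):
        new_lines.append(lines[i])

    # Write the regenerated post content to an .md file (saved under the `_toc.md` name built below)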
    out_toc = ""
    for item in new_lines:
        out_toc = out_toc + item
    save_path = correct_path + "_toc.md"
    fout = open(save_path.decode('utf8'), 'w')
    fout.writelines(out_toc)
    fout.close()

    print("Success!")

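The code hardly needs a separate test: the TOC at the top of this post was generated by this very script. I also picked up quite a few small lessons while writing it, and they are all recorded in the comments: how to strip leading spaces from a string, how to strip a trailing newline, how to handle paths that contain Chinese characters, character encoding, and other small issues.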
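One of those encoding details is worth spelling out. The `+ 3` offsets in the helper functions exist because the script runs on Python 2, where text read from a UTF-8 file is a byte sequence and each of the full-width punctuation marks it handles occupies three bytes. The sketch below is written for Python 3 as an assumption (the original is Python 2), and `strip_punctuation` is a hypothetical name; it shows the byte lengths, the byte-level removal, and the equivalent removal on Unicode strings.

```python
# In UTF-8 the full-width punctuation handled by the script is 3 bytes each,
# which is why the Python 2 helpers skip 3 positions in a byte string.
for ch in "、“”（），：":
    assert len(ch.encode("utf-8")) == 3

# Byte-level removal, as the Python 2 script effectively does it.
raw = "a、b".encode("utf-8")
pos = raw.find("、".encode("utf-8"))
removed = raw[:pos] + raw[pos + 3:]


def strip_punctuation(title, chars="、“”（），："):
    """Remove the given characters from a Unicode string (Python 3 str)."""
    return title.translate({ord(c): None for c in chars})


print(removed, strip_punctuation("原理：读取、解析"))
```

On Unicode strings every character has length 1, so no byte offsets are needed at all.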

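To be more thorough, the TOC depth could be capped, for example at three levels, with anything deeper left out. That would be easy to implement with a few extra checks in the code. However, a reader may be interested in exactly the content under some level-4 heading. If the TOC stopped at level 3, that reader could miss it or spend a long time searching. Weighing this, I decided to show every heading in the TOC. The TOC may end up long and a little unattractive, but compared with reading efficiency and the information given to the reader, appearance matters less.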
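For readers who do want a depth limit, the check could look roughly like the sketch below. It is written for Python 3 and is not part of the script; `build_toc` and `max_depth` are hypothetical names, and the levels are assumed to be already normalised so that the shallowest heading is level 0, as the script does before calling `generateTOC`.

```python
def build_toc(entries, max_depth=None):
    """entries: list of (level, markdown_link) with level 0 as the top level.

    Returns the TOC lines, one tab of indentation per level. When max_depth
    is set, entries nested deeper than that many levels are skipped.
    """
    lines = []
    for level, link in entries:
        if max_depth is not None and level >= max_depth:
            continue
        lines.append("\t" * level + "- " + link)
    return lines


entries = [(0, "[A](#a)"), (1, "[B](#b)"), (2, "[C](#c)"), (3, "[D](#d)")]
print("\n".join(build_toc(entries, max_depth=3)))
```

Leaving `max_depth` at `None` keeps every heading, which matches the choice made in this post.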

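The code inserts an explicit separator between the TOC and the body. Without it, the Jekyll engine automatically takes the first part of the Markdown file as the excerpt (I set it to the first 150 characters), so every post summary would consist of TOC entries, which is clearly not what I want. The separator splits the TOC from the body so that Jekyll reads only the body part, and the summaries look the same as before the TOC was added. The split is written in Liquid; Jekyll runs the line below when it generates the pages and separates the content accordingly.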

# for next post
page.next.content  || split:'<!--Break-->' | last | strip_html | replace:"#","" | truncate:150

# for previous post
page.previous.content  || split:'<!--Break-->' | last | strip_html | truncate:150

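The Next Post line additionally removes the `#` character because, for reasons I do not understand, `page.next.content` returns the raw content of the Markdown file, including Markdown formatting marks. Every heading would therefore carry a run of `#` marks, which spoils the summary. So all `#` marks are stripped and do not appear in the "next page" summary, where they would carry no meaning anyway. The text returned by `page.previous.content`, by contrast, has already been rendered and contains no Markdown marks. This is puzzling, and I have not yet found out why it happens, so I settled for this workaround to make the "next page" summary look a little better.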
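To make the filter chain concrete, here is a rough Python 3 emulation of what `split | last | strip_html | replace | truncate` does to a post. It is only an approximation for illustration: `excerpt` is a hypothetical helper, the tag stripping is a crude regular expression rather than Liquid's real `strip_html`, and the truncation follows Liquid's documented behaviour of counting the trailing ellipsis toward the limit.

```python
import re


def excerpt(content, marker="<!--break-->", limit=150):
    """Approximate the Liquid chain used for the next/previous post summary."""
    body = content.split(marker)[-1]          # split: marker | last
    body = re.sub(r"<[^>]*>", "", body)       # crude stand-in for strip_html
    body = body.replace("#", "")              # replace: "#", ""
    if len(body) > limit:                     # truncate: limit
        body = body[: limit - 3] + "..."
    return body


post = "- [Intro](#intro)\n<hr>\n<!--break-->## Intro\nSome <b>text</b>."
print(excerpt(post))
```

String splitting is case-sensitive, so the marker passed to the split has to match the marker written into the post character for character.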

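For more on Liquid, see its official website and this blog post, which together give a basic introduction.

4. Integration

To simplify the publishing workflow, I combined this TOC script with the image-processing script written earlier into a single script, which is more convenient to use. The glue code is short; it is shown here and is also on GitHub, in a repository named BlogFormatter.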

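import toc
import BlogImages

# Default value so the TOC step still works when the image step is skipped
# (in that case answer "n" in execFunction and type the path manually).
out_res = ""

flag0 = raw_input("Auto generate IMG?y/n\n")
if flag0 == "y":
    print("Format images...\n")
    out_res = BlogImages.execImgs()

flag1 = raw_input("\nInsert TOC?y/n\n")
if flag1 == "y":
    print("\nInsert TOC...\n")
    toc.execFunction(out_res)
else:
    exit()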

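The two functions can be used together or separately: run only one of them, or both, as needed.

[Update 2019-09-12]

To fit the blog's new build workflow, I upgraded the original code to automate the entire process, which is more convenient than before. All the code is in the `new` folder of the GitHub project mentioned above. It covers five steps: image formatting, TOC insertion, index file generation, file copying, and GitHub commit.

Image formatting covers renaming static images (jpg, png), animated images (gif), and videos (mp4, avi), scaling and compressing static images, and generating and inserting the corresponding tags. TOC insertion generates the table of contents for the document. Index file generation supports the random-explore feature. File copying copies the files to the specified directory. GitHub commit pushes the changes online.

[End of update]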


May 1,2018 9411 words 34 min
