SuperMemo Practice Loop (4): Interactive Processing of Web Page Material

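Continuing from the previous installment, this one explains how to process web page material in SuperMemo. First, a recap of the workflow from earlier:

As the figure above shows, earlier installments covered using Obsidian to link knowledge points: creating new nodes in Obsidian and linking them to existing ones is how we gather and broaden our learning material. This article takes that part of the workflow and puts it into engineering practice. If the concepts or the workflow are still unclear, read my earlier article first for an overview. It is linked below:

一只小胖子: SuperMemo Practice Loop (1): Learning Workflow and Time Management

Our learning material usually comes in four forms: PDFs, videos, notes, and web pages. PDFs, videos, and notes are fairly easy to handle: once you have the resource, tidy up its path information and put it into SM (see the article above for the steps). Web pages are more troublesome, so this article focuses on how to process web page material.

In everyday study, web pages are a rich source of information. Common sources include RSS feeds, WeChat official account subscriptions, Zhihu follows, search engines, and browser bookmarks. All of this arrives as URLs. Filing those URLs into bookmark folders in bulk is awkward: you agonize over how to name the folders, and once the URLs pile up the content becomes redundant in all the usual ways.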

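Here is a screenshot of the final result. (What this approach does and why it matters: you search by keyword to process web pages in bulk, and for each page you want to add to your learning queue you save a record with the button at the bottom left. The resulting file can be copied as-is, or renamed with a web page extension and handled directly in SuperMemo, which makes processing web page material much more efficient.)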
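As a rough illustration of the last step, here is a minimal sketch of turning saved records into an HTML page that could then be imported. It assumes each record is available as a (keyword, title, url) tuple; the function name `records_to_html` and the sample data are hypothetical and are not part of the original script, which simply appends text lines to a file.

```python
import html


def records_to_html(records):
    """Build a small HTML page with one linked paragraph per saved record."""
    rows = []
    for keyword, title, url in records:
        rows.append(
            '<p>[{}] <a href="{}">{}</a></p>'.format(
                html.escape(keyword),
                html.escape(url, quote=True),  # escape & and quotes inside the attribute
                html.escape(title),
            )
        )
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"


# Hypothetical sample records, for illustration only
records = [
    ("python", "A <b> title", "https://example.com/a?x=1&y=2"),
    ("简书", "Title B", "https://example.com/b"),
]
page = records_to_html(records)
print(page)
```

Escaping the title and URL keeps a stray `<` or `&` in a page title from breaking the generated markup.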

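In this article we process a batch of URLs and build an interactive web page that lets us quickly search, categorize, and organize them. The Python packages involved are streamlit, pyecharts, and whoosh: streamlit drives the interactive script, pyecharts renders the word cloud, and whoosh provides full-text search. The official links are listed below.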
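The word cloud is driven by word frequencies. The sketch below shows only that counting step with the standard library, assuming the text has already been split into tokens (the script itself uses jieba for segmentation). The function name `top_words` and the sample tokens are illustrative assumptions.

```python
from collections import Counter


def top_words(tokens, stop_words, limit=10):
    """Return the most frequent tokens as (word, count) pairs, skipping stop words."""
    kept = []
    for token in tokens:
        word = token.strip()
        if word and word not in stop_words:
            kept.append(word)
    return Counter(kept).most_common(limit)


# Hypothetical, already-segmented tokens
tokens = ["SuperMemo", "的", "网页", "SuperMemo", "和", "网页", "SuperMemo", " "]
stop_words = {"的", "和"}
freq = top_words(tokens, stop_words)
print(freq)
```

A list of (word, count) pairs in this shape is what the script later passes to pyecharts as `data_pair`.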

Streamlit, "The fastest way to build and share data apps": streamlit.io/

Whoosh: https://pypi.org/project/Whoosh/

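Step 1: Collect multiple URLs. This demo uses Edge with the Copy All URLs extension. Installation and usage are shown in the figure:

Using the extension to collect multiple URLs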
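To make the expected input concrete, here is a minimal sketch of parsing the extension's HTML-format output into (url, title) pairs. It assumes one anchor tag per entry separated by `<br/>`, which is the shape the script below splits on; the helper name `parse_links` and the sample links are hypothetical.

```python
import html
import re

# Assumed entry shape: <a href="https://example.com/a">Title A</a><br/>
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)


def parse_links(url_texts):
    """Return a list of (url, title) pairs found in the pasted HTML."""
    pairs = []
    for chunk in url_texts.split("<br/>"):
        match = LINK_RE.search(chunk)
        if match is None:
            continue  # skip blank chunks and text that is not a link
        url, title = match.groups()
        pairs.append((url.strip(), html.unescape(title).strip()))
    return pairs


sample = (
    '<a href="https://example.com/a">Title A &amp; more</a><br/>\n'
    '<a href="https://example.com/b">Title B</a><br/>\n'
)
links = parse_links(sample)
print(links)
```

The non-greedy patterns stop at the first closing quote and the first `</a>`, so extra attributes or a second link in the same chunk do not get swallowed into the URL.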

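Step 2: Paste the collected links into the script file and run it from the command line with streamlit run Gist2.py. The program opens a web page automatically, which is the page shown in the screenshot above.

First paste in the collected URLs

The page opens automatically once the script is running

Then search by keyword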
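As a simplified picture of what the keyword search does, the sketch below filters (url, title) pairs with a plain case-insensitive substring match. This is a stand-in for illustration only: the script itself uses a Whoosh index with jieba's ChineseAnalyzer, which tokenizes the text rather than matching substrings. The name `search_titles` and the sample data are assumptions.

```python
def search_titles(links, keyword):
    """Return the (url, title) pairs whose title contains the keyword."""
    needle = keyword.casefold()
    if not needle:
        return []  # mirror the script's behavior of requiring a query
    return [(url, title) for url, title in links if needle in title.casefold()]


# Hypothetical sample data
links = [
    ("https://example.com/a", "SuperMemo 入门 - 简书"),
    ("https://example.com/b", "Streamlit tutorial"),
    ("https://example.com/c", "增量阅读笔记 - 简书"),
]
hits = search_titles(links, "简书")
print(hits)
```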

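In the settings menu at the top right you can enable wide mode and run-on-save mode.

Settings: wide mode and run on save

Step 3: Here is the code. Install the Python packages it needs, paste in your URLs, and run the script from the command line.

The latest version of the Gist script is on GitHub: https://gist.github.com/ef56f43040244978fd2714608dc3d115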

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Batch analysis and processing of web pages
# Author: 一只小胖子
# Version: V0.1
# Zhihu: https://www.zhihu.com/people/lxf-8868
# Usage:
# 1. Use the Copy All URLs extension to collect multiple page addresses
# 2. Run from the command line: streamlit run Gist2.py

# Paste the output of Copy All URLs between the triple quotes
url_texts = """ 提示: 在这里放置多个网址信息"""

# ===== 1. Generate a word cloud with pyecharts =====
# Reference: 朱卫军
# Link: https://zhuanlan.zhihu.com/p/113312256
# https://blog.csdn.net/zx1245773445/article/details/98043120
import jieba
from collections import Counter
import pyecharts.options as opts
from pyecharts.charts import WordCloud


# # Read the content source and return the text
# def get_text(goods, evaluation):
#     if evaluation == '好评':
#         evaluation = 1
#     else:
#         evaluation = 0
#     path = 'excel/comments.csv'
#     with open(path, encoding='utf-8') as f:
#         data = pd.read_csv(f)
#     # Product categories
#     types = data['类型'].unique()
#     # Get the text
#     # text = data[(data['类型']==goods)&(data['标签']==evaluation)]['内容'].values.tolist()
#     text = data['内容'].values.tolist()
#     text = str(text)[1:-1]  # strip the surrounding []
#     print(types)
#     return text
#
#
# stext = get_text('1', '好评')
# print(stext)
#
# Segment the text with jieba (on loading dictionaries: https://zhuanlan.zhihu.com/p/41032295)
def split_word(text):
    word_list = list(jieba.cut(text))
    print(len(word_list))
    # Drop meaningless words and symbols, using a stop-word list I compiled myself
    # (this assumes the file is saved as UTF-8)
    with open('停用词库.txt', encoding='utf-8') as f:
        meaningless_word = set(f.read().splitlines())
        # print(meaningless_word)
    result = []
    # Filter the words
    for i in word_list:
        if i not in meaningless_word:
            result.append(i.replace(' ', ''))
    return result


# Using collections: https://zhuanlan.zhihu.com/p/108713135
# Count word frequencies
def word_counter(words):
    # Count frequencies with Counter
    words_counter = Counter(words)
    # Convert the Counter into a list of (word, count) pairs
    words_list = words_counter.most_common(2000)
    return words_list


# Build the word cloud
def word_cloud(data):
    (
        WordCloud().add(
            series_name="热点分析",
            # Data to plot
            data_pair=data,
            # Gap between words
            word_gap=5,
            # Font size range
            word_size_range=[15, 80],
            shape="cursive",
            # Optional background image; omit the argument to use the default background
            # mask_image='购物车.jpg')
        ).set_global_opts(
            # title_opts=opts.TitleOpts(
            #     title="热点分析", title_textstyle_opts=opts.TextStyleOpts(font_size=12)
            # ),
            tooltip_opts=opts.TooltipOpts(is_show=True),
        ).render("basic.html")  # write the chart out as HTML
    )


# [Test demo]:
# stext = ''' '书籍1做父母一定要有刘墉这样的心态，不断地学习，不断地进步，不断地给自己补充新鲜血液，让自己保持.',
# '书籍1作者真有英国人严谨的风格，提出观点、进行论述论证，尽管本人对物理学了解不深，但是仍然能感受到.书籍', '1作者长篇大论借用详细报告数据处理工作和计算结果支持其新观点。为什么荷兰曾经县有欧洲最高的生产.. 1',
# '书籍1作者在战几时之前用了“拥抱"令人叫绝.日本如果没有战败，就有会有美军的占领，没胡官僚主义的延.书籍1作者在少年时即喜阅读，能看出他精读了无数经典，因而他有一个庞大的内心世界。他的作品最难能可贵..',
# '书籍1作者有一种专业的谨慎，若能有幸学习原版也许会更好，简体版的书中的印刷错误比较多，影响学者理解.',
# '书籍1作者用诗一样的语言把如水般清澈透明的思想娓娓道来，像一个经验丰富的智慧老人为我们解开一个又一.书籍1作者提出了一种工作和生活的方式，作为咨询界的元老，不仅能提出理念，而且能够身体力行地实践，并.'
# sword = split_word(stext)
# print(sword)
# word_stat = word_counter(sword)
# print(word_stat)
# word_cloud(word_stat)
# show_WordCounter()

# ===== 2. Full-text search with Whoosh =====
# Reference: 酷python
# Link: https://zhuanlan.zhihu.com/p/172348363
# https://www.cnblogs.com/mydriverc/articles/4136754.html
import os, errno
from whoosh.qparser import QueryParser, MultifieldParser
# from whoosh.fields import TEXT, SchemaClass
from whoosh.query import compound, Term, Query
from whoosh.index import create_in
from whoosh.index import open_dir
from whoosh.fields import *
from jieba.analyse import ChineseAnalyzer
import html
import re
import json
import streamlit as st


# On Python 3.x (x >= 2), os.makedirs also accepts a third argument, exist_ok; when it is true the call behaves like mkdir -p.
# However, if mode is given and the target directory already exists with different permissions, an OSError is raised
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5 (except OSError, exc: for Python <2.5)
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise


# Store the schema information in the index directory
index_dir = 'es/index_dir_1/'
if not os.path.exists(index_dir):
    mkdir_p(index_dir)


# Like defining a MySQL table: declare which fields to store and what type each one is
class ArticleSchema(SchemaClass):
    title = TEXT(stored=True, analyzer=ChineseAnalyzer())
    content = TEXT(stored=True, analyzer=ChineseAnalyzer())
    author = TEXT(stored=False, analyzer=ChineseAnalyzer())


# create_in creates the index inside index_dir. Documents must be added according to the schema you defined.
# Once the index exists, adding documents is like inserting rows into a MySQL table.
schema = ArticleSchema()
ix = create_in(index_dir, schema, indexname='article_index')
if not ix:
    ix = open_dir(index_dir, indexname='article_index')

# Process the documents
writer = ix.writer()
s_url_arr = url_texts.split("<br/>")
print("url待处理项: {}".format(len(s_url_arr)))
for i in range(len(s_url_arr)):
    # HTML link format
    # reg_arr = re.findall("">(\w.*)</a><br/>", s_url_arr[i])
    # Reset for every entry so that an entry without a link neither raises
    # NameError nor reuses the values from the previous entry
    reg_href = reg_text = ""
    if "href" in s_url_arr[i]:
        reg_href = re.findall('href="(.*)"', s_url_arr[i])[0]
        reg_text = re.findall(">(.*)<", s_url_arr[i])[0]
    # Other formats
    #
    if reg_href or reg_text:
        # print(reg_href, html.unescape(reg_text))
        # Updating would also add duplicate content!
        # writer.update_document(title= reg_href, author="admin") # , content=html.unescape(reg_text)) # add_document
        # Add the content
        reg_title = html.unescape(reg_text)  # .encode('unicode-escape')
        writer.add_document(title=reg_href, author="admin", content=reg_title)  # add_document
    # print(json.dumps(json_str, sort_keys=True, indent=4, separators=(',', ': '),ensure_ascii=False))

# Delete documents
# Because "path" is marked as unique, calling update_document with path = u"/a" will
# delete any existing document where the path field contains /a
writer.delete_by_term("author", "admin")
writer.commit()

# iframe width, height, and scrolling
r_width = 1200
r_height = 400
r_scrolling = True


# Show the word cloud
def show_WordCounter():
    # Read the chart that word_cloud() rendered to disk
    with open("./basic.html", encoding="utf-8") as f:
        st_file_arr_str = f.read()
    # Display the word cloud
    st.components.v1.html(st_file_arr_str, width=r_width, height=r_height, scrolling=r_scrolling)


# Text input and display
search_key = "简书"
search_key = st.text_input('[1].请输入查询关键词:', search_key)
# st.write('你输入的关键词为:', search_key)
# st.text('输入关键词为:' + search_key)
if not ix:
    ix = open_dir(index_dir, indexname='article_index')
title_lists = []
content_list = []
href_title_dict = {}
with ix.searcher() as searcher:
    # author_query = [Term('author', 'admin'), Term('author', 'admin')]
    # content_query = [Term('content', 'python'), Term('content', 'jupyter')]
    # query = compound.Or([compound.Or(author_query), compound.Or(content_query)])
    # content_query = [Term("content", "playwright"), Term("content", "jupyter")]
    # query = compound.Or(content_query)

    # Multi-condition queries
    # query = QueryParser("content", ix.schema).parse("简书")
    # query = MultifieldParser(["content"], ix.schema).parse("知乎")  # default_set()
    # query = _NullQuery()

    # Fetch every document
    results = searcher.documents()
    # print(results)
    content_all = []
    for data in results:
        content_all.append(data["content"])
    sword = split_word("".join(content_all))
    print(sword)
    word_stat = word_counter(sword)
    print(word_stat)

    # Generate the word cloud
    word_cloud(word_stat)
    # Display the word cloud
    show_WordCounter()

    if not search_key:
        st.error("请输入查询条件!")
    else:
        # Search by keyword
        query = MultifieldParser(["content"], ix.schema).parse(search_key)
        print("查询条件:", query)
        results = searcher.search(query)
        # print(results[0].fields())
        print(query, '一共发现%d份文档。' % len(results))

        # Highlighting
        # if len(results) > 0:
        #     data = results[0]
        #     text = data.highlights("content","title")
        #     print(text)

        for data in results:
            # json_text = json.dumps(data.fields()["title"], ensure_ascii=False)
            # print(data.fields()["title"])
            reg_href = data.fields()["title"]
            reg_title = data.fields()["content"]
            # Highlighted display in the page
            # reg_title = data.highlights("content")
            if reg_href not in title_lists and reg_title not in content_list:
                title_lists.append(reg_href)
                content_list.append(reg_title)
                href_title_dict[reg_title] = reg_href
            # print(data.fields())

ix.close()
st.text("总共查询到 {} 项".format(len(href_title_dict)))

# Content to write
reg_href_s = ""  # record for the selected URL
save_file_path = "备注数据.txt"

# Select box
select_box_list = list(href_title_dict.keys())
if len(select_box_list) > 0:
    reg_title = st.selectbox('[2].选择要打开的网址:', select_box_list)
    reg_href = href_title_dict[reg_title]
    reg_href_s = "{} : {}   {}".format(search_key, reg_title, reg_href)
    st.text('当前选择: {}'.format(reg_href))
    # Either of the following two approaches can load the page
    # url_display = f'<embed type="text/html" src="' + reg_href + '" width="1200" height="600">'  # iframe
    # st.markdown(url_display, unsafe_allow_html=True)
    st.components.v1.iframe(reg_href, width=r_width, height=r_height, scrolling=r_scrolling)

    # Save button
    if st.button('保存此条记录'):
        # Create the file with a header line on first use (files are read and written as UTF-8)
        if not os.path.exists(save_file_path):
            with open(save_file_path, 'w', encoding='utf-8') as file_:
                file_.writelines("搜索项 ----- 标题 ------ 链接 -----")
        reg_href_s2_arr = []  # content to write
        with open(save_file_path, 'r', encoding='utf-8') as file_:
            # Do not add duplicates
            search_arr = re.findall(re.escape(reg_href_s), "".join(file_.readlines()), re.I | re.M)
            print(search_arr)
            if len(search_arr) == 0:
                reg_href_s2_arr.append("\n" + reg_href_s + "<p>")
                st.write("写入成功!")
            elif len(search_arr) == 1 and str(search_arr[0]).strip(" ") == "":
                st.write("无查询值!")
            else:
                st.write("已经存在!")
        with open(save_file_path, 'a', encoding='utf-8') as file_:
            if len(reg_href_s2_arr) > 0:
                file_.writelines("".join(reg_href_s2_arr))
        with open(save_file_path, "r", encoding='utf-8') as file_:
            st_content = ("\n".join(file_.readlines()))  # <br>
            # st.components.v1.html(st_content)  # highlighted display in the page
            st.write(st_content)

# Show the saved records
if st.button('加载默认文件'):
    if os.path.exists(save_file_path):
        with open(save_file_path, "r", encoding='utf-8') as file_:
            st.write("\n".join(file_.readlines()))
    else:
        st.write("还未保存记录,请先保存!")

# Clear the saved records
if st.button('清空文件内容'):
    # https://blog.csdn.net/weixin_36118143/article/details/111988403
    if os.path.exists(save_file_path):
        os.remove(save_file_path)
    else:
        # os.mknod(save_file_path)
        pass
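
The save button above avoids duplicates by searching the whole file with a regular expression. As an alternative sketch, the same idea can be expressed as an exact line comparison. The helper name `append_if_new`, the one-record-per-line layout, and the temporary demo file are assumptions made for illustration and differ slightly from the script's own file format.

```python
import os
import tempfile


def append_if_new(path, record):
    """Append record as one line unless an identical line already exists."""
    existing = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            existing = {line.rstrip("\n") for line in f}
    if record in existing:
        return False  # already saved, nothing written
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
    return True


# Demo against a throwaway file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "notes.txt")
record = "简书 : Title A   https://example.com/a"
first = append_if_new(demo_path, record)
second = append_if_new(demo_path, record)
print(first, second)
```

Comparing whole lines means a record whose text happens to be a substring of a longer record is still saved, which the regex search over the joined file would treat as already present.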

Step 4: Other interactive options include ipywidgets, Streamlit, and Plotly Dash. Some further reference links worth reading:

https://www.biaodianfu.com/streamlit.html

Streamlit: a framework for building Python machine learning tools (Python学习网): www.py.cn/toutiao/15437.html

I'm 一只小胖子, a chubby guy who loves learning. If you love learning too and are interested in SuperMemo, feel free to share and comment!
