Reference material: https://github.com/karpathy/arxiv-sanity-preserver#arxiv-sanity-preserver
This is a search engine for research papers.
Here is the introduction from the project README:
arxiv sanity preserver
This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.
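Roughly, the introduction says that this search engine is quite smart: to follow the latest progress in whichever field you care about, you only need to change the categories in fetch_papers.py. It also comes across as an impressive piece of machine learning work, and so on.
Registration takes only a few seconds, about as long as it takes you to type, and once you are in you see the main interface.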
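Code layout
The code has two main parts:
Indexing code. This uses the Arxiv API to download the most recent papers in any categories you like, then downloads all the papers, extracts all the text, and creates a tfidf vector from the content of each paper. This code therefore handles the back-end scraping and computation: building the database of arxiv papers, computing content vectors, creating thumbnails, training SVMs for users, and so on.
User interface. This is a web server (based on Flask/Tornado/sqlite) that lets you search the database, filter for similar papers, and so on.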
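To make the indexing idea more concrete, here is a minimal sketch of tfidf vectors and similarity ranking in plain Python. It is not the project's code (according to the dependency list, the project relies on scikit-learn's tfidf vectorizer); the function names tfidf_vectors, cosine and most_similar, the smoothed idf formula and the toy paper titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            # Smoothed idf (an assumed variant): rare terms get larger weights.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors):
    """Indices of all other documents, most similar to the query first."""
    others = [i for i in range(len(vectors)) if i != query_index]
    return sorted(
        others,
        key=lambda i: cosine(vectors[query_index], vectors[i]),
        reverse=True,
    )


if __name__ == "__main__":
    papers = [
        "convolutional networks for image classification",
        "recurrent networks for language modeling",
        "image classification with deep convolutional networks",
    ]
    print(most_similar(0, tfidf_vectors(papers)))
```

Sorting papers "by similarity to any paper" then amounts to calling most_similar with that paper's index. A real deployment would presumably use sparse matrices rather than Python dicts.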
Dependencies
Several: You will need numpy, feedparser (to process xml files), scikit learn (for tfidf vectorizer, training of SVM), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil, and scipy. And sqlite3 for database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:
$ virtualenv env # optional: use virtualenv
$ source env/bin/activate # optional: use virtualenv
$ pip install -r requirements.txt
You may also need ImageMagick and pdftotext, which on Ubuntu can be installed with sudo apt-get install imagemagick poppler-utils. That is a lot of dependencies.
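With this many dependencies, it can help to check what is missing before running anything. The sketch below is my own helper, not part of the project: missing_dependencies is a hypothetical name, the module list uses the usual import names (sklearn for scikit learn, dateutil for python-dateutil), and the executable names convert and pdftotext are assumptions about what the Ubuntu packages install.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above.
PYTHON_MODULES = [
    "numpy", "feedparser", "sklearn", "flask", "flask_limiter",
    "tornado", "dateutil", "scipy", "sqlite3",
]
# Assumed executable names for ImageMagick and poppler-utils.
EXTERNAL_TOOLS = ["convert", "pdftotext"]


def missing_dependencies(modules=PYTHON_MODULES, tools=EXTERNAL_TOOLS):
    """Return (missing_modules, missing_tools) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    modules, tools = missing_dependencies()
    print("Missing Python modules:", modules or "none")
    print("Missing command-line tools:", tools or "none")
```

Newer ImageMagick releases may expose a magick command instead of convert, so an entry in the "missing tools" list is a hint to investigate rather than proof that the tool is absent.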
The workflow is as follows; it is best to run the steps in order:
- Run fetch_papers.py to query arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
- Run download_pdfs.py, which iterates over all papers in parsed pickle and downloads the papers into folder pdf
- Run parse_pdf_to_text.py to export all text from pdfs to files in txt
- Run thumb_pdf.py to export thumbnails of all pdfs to thumb
- Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p, tfidf_meta.p and sim_dict.p pickle files.
- Run buildsvm.py to train SVMs for all users (if any), exports a pickle user_sim.p
- Run make_cache.py for various preprocessing so that server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time ever you're starting arxiv-sanity, which initializes an empty database).
- Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here - https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
  - Start the mongodb server with - sudo service mongod start.
  - Verify if the server is running in the background : The last line of /var/log/mongodb/mongod.log file must be - [initandlisten] waiting for connections on port <port>
- Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
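The first step is where you choose your categories, so here is a sketch of what such a query could look like. This is not the actual contents of fetch_papers.py, which this post does not show: build_query_url and page_urls are hypothetical helpers, and the page size of 100 is arbitrary. The parameter names search_query, sortBy, start and max_results come from the public arXiv API. The code only builds URLs and makes no network requests.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Build an arXiv API URL for the given categories, e.g. ["cs.CV", "cs.AI"]."""
    search_query = " OR ".join(f"cat:{category}" for category in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,
        "max_results": max_results,
    }
    # Keep ":" readable; spaces are encoded as "+".
    return ARXIV_API + "?" + urlencode(params, safe=":")


def page_urls(categories, start_index, total, page_size=100):
    """URLs for every page from start_index up to (not including) total."""
    return [
        build_query_url(categories, start=offset, max_results=page_size)
        for offset in range(start_index, total, page_size)
    ]


if __name__ == "__main__":
    # Resume at result 200 after being rate limited, as with --start-index.
    for url in page_urls(["cs.CV", "stat.ML"], start_index=200, total=500):
        print(url)
```

Resuming after a rate limit is then just a matter of starting again from the offset of the last page that succeeded, which is the idea behind the --start-index argument.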
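Optional: you can also run twitter_daemon.py in a screen session. It uses your Twitter API credentials (stored in twitter.txt) to periodically search Twitter for mentions of papers in the database, and writes the results to twitter.p.
The author also mentions a simple shell script that runs these commands one by one. He runs it every day to fetch new papers, merge them into the database, and recompute all the tfidf vectors and classifiers. More details on this process are given below.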
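protip: numpy/BLAS: the script analyze.py does a lot of heavy lifting with numpy. The author recommends carefully setting up numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With 25,000 papers and 5,000 users, the scripts take a few hours to run on his computer with a BLAS-linked numpy.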
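One way to see how numpy was built is to read the output of numpy.show_config(). The sketch below captures that output as text; numpy_build_info and mentions_optimized_blas are hypothetical helper names, and scanning the text for names such as openblas or mkl is only a rough heuristic of mine, not an official check.

```python
import io
from contextlib import redirect_stdout

# Library names to look for; this list is an assumption, not exhaustive.
BLAS_HINTS = ("openblas", "mkl", "blis", "accelerate")


def numpy_build_info():
    """Return numpy's build configuration as text, or None if numpy is missing."""
    try:
        import numpy as np
    except ImportError:
        return None
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        np.show_config()  # prints BLAS/LAPACK build details
    return buffer.getvalue()


def mentions_optimized_blas(config_text):
    """Heuristic: does the configuration text name a known BLAS library?"""
    lowered = config_text.lower()
    return any(hint in lowered for hint in BLAS_HINTS)


if __name__ == "__main__":
    info = numpy_build_info()
    if info is None:
        print("numpy is not installed")
    else:
        print(info)
        print("Looks BLAS-linked:", mentions_optimized_blas(info))
```

If the heuristic says no, it is still worth reading the printed configuration by hand before concluding anything.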
Running online
If you'd like to run the flask server online (e.g. AWS) run it as python serve.py --prod.
You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
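For the "random text", one option is Python's secrets module. This is a sketch under my own assumptions: write_secret_key is a hypothetical helper, and 32 random bytes written as hex is my choice, since the README only asks for random text. How serve.py reads the file is not shown in this post.

```python
import secrets
from pathlib import Path


def write_secret_key(path="secret_key.txt", n_bytes=32):
    """Create the key file if it does not exist yet and return the key."""
    key_file = Path(path)
    if key_file.exists():
        # Keep an existing key so that old sessions are not invalidated.
        return key_file.read_text()
    key = secrets.token_hex(n_bytes)  # 2 * n_bytes hexadecimal characters
    key_file.write_text(key)
    return key


if __name__ == "__main__":
    write_secret_key()
    print("secret_key.txt is ready")
```

Keep this file out of version control, since anyone who has it could forge session data.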
Current workflow
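The author says the process is not yet fully automatic. So how has he kept the site alive until now? He uses a script that runs the following update after the new arxiv papers come out (~midnight PST):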
python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
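The author does this with a shell script. As an alternative sketch in Python (my own, not from the repository), the same sequence can be run so that it stops at the first failing step instead of continuing with stale data. run_pipeline and its runner parameter are hypothetical names; the runner exists only so that the function can be tried without launching the real scripts.

```python
import subprocess
import sys

# The update steps, in the order listed above.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "parse_pdf_to_text.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping")
            break
        completed.append(script)
    return completed
```

Calling run_pipeline() from the repository directory would execute the seven steps in order; scheduling it (for example with cron) is left out here because the post does not say how the author triggers his script.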
The author runs the server in a screen session: use screen -S serve to create it (or -r to reattach to it) and then run:
python serve.py --prod --port 80
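The server will load the new files and begin hosting the site. Note that on some systems you can't use port 80 without sudo. The two options are to use iptables to reroute ports, or to use setcap to elevate the permissions of the python interpreter that runs serve.py. In that case the author recommends being careful with permissions and perhaps using a virtualenv. (I don't fully understand this setup; presumably the concern is something like data leakage.)
Since I have not yet studied Python systematically, I don't dare to try this out casually for now.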
ImageMagick
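One of the dependencies mentioned here feels almost like a cheat tool (Meitu plus CamScanner in one?): http://www.imagemagick.org/script/index.php
It is also free, open-source software. The current version at the time of writing is ImageMagick 7.0.9-2, and it runs on Linux, Windows, Mac OS X, iOS, Android OS, and others.
See Examples of ImageMagick Usage for how to accomplish tasks with ImageMagick from the command line. See also Fred's ImageMagick Scripts, which include a large collection of command-line scripts for geometric transforms, blurring, sharpening, edge detection, noise removal, and color manipulation. You can also look at Magick.NET, which lets you use ImageMagick without installing the client.
Download and installation: http://www.imagemagick.org/script/download.php
The other dependency is pdftotext, a tool that reads PDFs and converts them to text. It is built on modified code from the open-source XpdfReader project: http://www.xpdfreader.com/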