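Developing in Python is quite different from developing in many other languages. Like Ruby and Perl, Python is an interpreted language, so developers can use an interactive programming environment (a REPL) to test and run code in real time. Because there is no compile step, Python lends itself to rapid development and debugging of prototypes. Like Scala and JavaScript, Python includes many practical tools that support script-style development. At the same time, like Java and C++, Python is a highly scalable object-oriented language that supports modular programming, not just a tool for running simple scripts.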

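Python is commonly used for quick one-off scripts, for web applications built on large, scalable frameworks such as Django, for data processing with Celery, and for scientific computing, which accounts for a large share of Python use. Python is lightweight and efficient and ships by default on many operating systems, which makes it a natural first choice for data analysis, data processing, and other analytical tasks.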

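However, one major drawback of Python is that there is no complete, prescribed development workflow, and consequently no standard IDE or development framework. Most Python references teach you how to use the language for scripting and completely ignore a key point: how to build a large Python project. This article walks through a workflow for building large, data-oriented projects in Python.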

Development Environment

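So what do you need to develop data projects successfully? Just two things:

A text editor: Notepad++, Vim, Emacs, TextWrangler, or similar will all do. (Translator's note: or Sublime.)
A terminal, with your environment variables set up properly. (Translator's note: add the path to Python to your PATH environment variable.)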

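Yes, just these two! There are of course many development environments with debugging, code completion, and syntax highlighting. In the end, though, they all just combine a text editor and a terminal and add some useful features on top. If you insist on using an IDE, here are some I would suggest:

IDLE - Windows users will probably be familiar with this one, since it is usually where they wrote their first Python program. It is very simple, but it comes bundled with Python and works reasonably well.
Komodo Edit - A free IDE from ActiveState that offers many tools and useful features.
PyCharm - Commercial, but well worth the price; it feels just like IntelliJ.
Aptana Studio - Focused on web development, but includes built-in Python support.
Spyder - Focused on scientific computing.
iPython - An interactive development environment that can save the Python code and data you run.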

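However, even if you use one of these tools, you will still come back to the basic development workflow described below. Sublime Text 3 has many subtle but powerful features, and its syntax highlighting combined with pdb (the Python debugger) and the command line covers most needs, so many developers use it as their primary tool.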

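As your project grows, you will also start using the following tools:

Git/Github.com - version control and code hosting.
pip - the package manager for third-party Python tools and libraries.
virtualenv and virtualenvwrapper - virtual development environments, so that the package dependencies of different projects don't get tangled up.

There are many other useful development aids, but these three are among the most important and most commonly used in Python development today, and I cover them further below.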

Third-Party Libraries

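During development you will inevitably use third-party libraries to some degree, especially for data processing, where you need tools such as Numpy and Pandas. Installing these libraries on your system usually only requires pip, Python's package manager. Using pip will save you a lot of trouble and time, though of course you first have to install it on your machine!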

requests.py is a very simple HTTP library that makes it easy to request data from the web. To install it, just run the following command:

$ pip install requests

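Installing, uninstalling, and upgrading are all done with the pip command. pip freeze lists the Python libraries installed on your system. To search for available libraries, go to the Python Package Index (PyPI).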
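To make the idea behind pip freeze more concrete, here is a minimal sketch that lists installed distributions from inside Python. It is an approximation, not pip's own implementation: it uses the standard library's importlib.metadata (Python 3.8 or later), and the function name installed_packages is made up for this illustration.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for the distributions Python can see."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            pins.add("{}=={}".format(name, version))
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in installed_packages():
        print(pin)
```

The output depends entirely on what is installed in the interpreter that runs the script, which is exactly why the next section isolates each project in its own environment.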

Virtual Environments

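As you develop more and more, you will find that certain versions of tools or libraries are hard to get running: a particular project requires particular versions of a library or tool, and sometimes these conflict with the libraries used by other projects. When you develop for both Python 2 and Python 3, even Python itself can be the problem, and there is a (small) chance of breaking your system while you work.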

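The solution is to give each project its own dedicated virtual environment and develop inside it. A virtual environment is a directory containing a specific version of Python, pip, and the project's third-party packages. It is activated and deactivated from the command line, and you can create as many environments as you need. An environment can also be made to match a specific production environment (usually Linux).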
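Since an environment is switched on and off from the command line, it is easy to lose track of which interpreter is running. The sketch below is one common way to check from inside Python; the function name in_virtualenv is made up for this illustration, and the check relies on the sys.base_prefix and sys.real_prefix attributes rather than on anything virtualenvwrapper provides.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Environments made by the stdlib venv module and by recent virtualenv
    # releases leave sys.base_prefix pointing at the original installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    legacy = hasattr(sys, "real_prefix")
    return legacy or sys.prefix != base_prefix


if __name__ == "__main__":
    print("interpreter prefix:", sys.prefix)
    print("inside a virtual environment:", in_virtualenv())
```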

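Virtualenvwrapper is another library that lets you manage multiple virtual environments and associate each one with a specific project. This tool is just as essential. Install both with the following command: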

$ pip install virtualenv virtualenvwrapper

Then edit the .profile file in your home directory and add the following lines at the end:

export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Projects
source /usr/local/bin/virtualenvwrapper.sh

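All of your virtual environments will be stored in a hidden directory called .virtualenvs, and your project directory is where your code lives; I discuss this further below. To make things more convenient, I have also created many aliases for the virtualenv scripts, which you can find in Ben's VirtualEnv Cheat Sheet.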

NOTE: The steps may differ for Windows users.

The Code Creation Workflow

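There are two ways to create and execute code:

Write the code in a text file, then execute it with python.
Write the code in a text file, then import it into the interactive environment (the REPL).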

Generally speaking, developers do both. Python programs are intended to be executed on the command line via the python binary, and the thing that is executed is usually an entry point to a much larger library of code that is imported. The difference between importing and execution is subtle, but as you do more Python it becomes more important.

With either of these workflows, you create your code in as modular a fashion as possible and, during the creation process, you execute it in one of the methods described above to check it's working. Most Python developers are back and forth between their terminal and the editor, and can do fine grained testing of every single line of code as they're writing it. This is the rapid prototyping aspect of Python.
So let's start with a simple example.
Open a terminal window (see your specific operating system for instructions on how to do this).
NOTE: Commands are in bash (Linux/Mac) or Windows Powershell

Create a workspace for yourself. A workspace, in this sense, is just an empty directory where you can get ready to start doing development work. You should probably also keep your various projects (here, a synonym for workspace) in their own directory as well, for now we'll just call it "Projects" and assume it is in your home directory. Our first project will be called "myproject", but you'd just name this whatever you'd like.
$ cd ~/Projects
$ mkdir myproject
$ cd myproject

Let's create our first Python script. You can either open your favorite editor and save the file into your workspace (the ~/Projects/myproject directory), or you can touch it and then open that file with your editor.
$ touch foo.py

PRO TIP: If you're using Sublime Text 3 and have the subl command line tool installed (see the Sublime Text installation instructions), you can use the following command to open up the current directory in the editor:
$ subl . &

I use this so much that I've aliased the command to e.

So here's where you should be: You should have a text editor open and editing the file at ~/Projects/myproject/foo.py, and you should have a terminal window open whose current working directory is ~/Projects/myproject. You're now ready to develop. Add the following code to foo.py:

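#!/usr/bin/env python

import csv


def dataset(path):
    # Yield every row of the CSV file at `path`, converting the third item to an int.
    # newline='' is what the csv module expects on Python 3 (Python 2 code used mode 'rU').
    with open(path, newline='') as data:
        reader = csv.reader(data)
        for row in reader:
            row[2] = int(row[2])
            yield row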

This code is very simple. It just implements a function that accepts a path and returns an iterator so that you can access every row of a CSV file, while also converting the third item in every row to an integer.
PRO TIP: The #! (pronounced "shebang") line must appear at the very beginning of an executable Python script with nothing before it. It will tell your computer that this is a Python file and execute the script correctly if run from the command line as a standalone app. This line doesn't need to appear in library modules, that is, Python code that you plan to import rather than execute.
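The generator behavior of dataset can be tried without touching the filesystem. The sketch below is a variant, here called parse_rows (a name made up for this illustration), that accepts any iterable of CSV lines instead of a path. It is written for Python 3, and the two sample lines follow the calorie data used in this walkthrough.

```python
import csv
import io

SAMPLE = "butter,tbsp,102\ncheddar cheese,slice,113\n"


def parse_rows(lines):
    """Yield each CSV row with its third field converted to an int."""
    for row in csv.reader(lines):
        row[2] = int(row[2])
        yield row


rows = parse_rows(io.StringIO(SAMPLE))  # nothing has been read yet
first = next(rows)                      # reads and converts only the first row
remaining = list(rows)                  # drains the rest of the generator

print(first)
print(remaining)
```

Because the function yields rows one at a time, a large file is never loaded into memory all at once; each row is parsed only when the caller asks for it.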

Create some data so that we can use our function. Let's keep all of our data in a fixtures directory in our project.
$ mkdir fixtures
$ touch fixtures/calories.csv

Using your editor, add this data to the calories.csv file:
butter,tbsp,102
cheddar cheese,slice,113
whole milk,cup,148
hamburger,item,254

Ok, now it's time to use our code. First, let's try to execute the code in the interpreter. Open up the REPL as follows:
$ python
>>>

You should now be presented with the Python prompt (>>>). Anything you type in now should be in Python, not bash. Always note the prompts in the instructions. A prompt with $ means type in command line instructions (bash), a prompt that says >>> means type in Python on the REPL, and if there is no prompt, you're probably editing a file. Import your code:

>>> from foo import dataset
>>> for row in dataset('fixtures/calories.csv'):
...     print(row[0])
...
butter
cheddar cheese
whole milk
hamburger
>>>

A lot happened here, so let's inspect it. First, when you imported the dataset function from foo, Python looked in your current working directory and found the foo.py file, and that's where it imported it from. Where you are on the command line and what your Python path is matters!
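If it is ever unclear which file an import will pick up, Python can be asked directly. The following is a small sketch using importlib.util.find_spec from the standard library; the helper name module_origin is made up for this illustration.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The first search path entry: usually '' (the current directory) in the
    # REPL, or the script's own directory when a file is executed.
    print(repr(sys.path[0]))
    print(module_origin("csv"))  # a standard library module
    print(module_origin("foo"))  # None unless a foo.py is importable from here
```

Running this from ~/Projects/myproject and from some other directory should give different answers for foo, which is the point made above about the working directory.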
When you import the dataset function the way we did, the module is loaded and executed all at once and provided to the interpreter's namespace. You can now use it by writing a for loop to go through every row and print the first item. Note the ... prompt. This means that Python is expecting an indented block. To exit the block, hit enter twice. The print results appear right on the screen, and then you're returned to the prompt.
But what if you make a change in the code, for example, capitalizing the first letter of the words in first item of each row? The changes you write in your file won't show up in the REPL. This is because Python has already loaded the code once. To get the changes, you either have to exit the REPL and restart or you have to import in a different way:

>>> import foo
>>> for row in foo.dataset('fixtures/calories.csv'):
...

Now you can reload the foo module and get your code changes:

>>> from importlib import reload  # needed on Python 3; reload is a builtin on Python 2
>>> reload(foo)
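The caching behavior is easy to observe in a throwaway script. The sketch below assumes Python 3 and uses a temporary module named reload_demo (a made-up name): it writes the module to disk, imports it, edits the file, and shows that only importlib.reload picks up the edit.

```python
import importlib
import os
import sys
import tempfile


def demo_reload():
    """Return a module function's result before an edit, after it, and after a reload."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reload_demo.py")
        with open(path, "w") as handle:
            handle.write("def label():\n    return 'butter'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()  # make sure the new file is noticed
            module = importlib.import_module("reload_demo")
            before = module.label()
            # Edit the file on disk, as you would in your editor.
            with open(path, "w") as handle:
                handle.write("def label():\n    return 'butter'.capitalize()\n")
            # A second import just returns the module already held in sys.modules.
            module = importlib.import_module("reload_demo")
            cached = module.label()
            importlib.reload(module)  # re-executes the edited source
            after = module.label()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("reload_demo", None)
    return before, cached, after


print(demo_reload())
```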

This can get pretty unwieldy as code gets larger and more changes happen, so let's shift our development strategy over to executing Python files. Inside foo.py, add the following to the end of the file:
if __name__ == '__main__':
    for row in dataset('fixtures/calories.csv'):
        print(row[0])

To execute this code, you simply type the following on the command line:
$ python foo.py
butter
cheddar cheese
whole milk
hamburger

The if __name__ == '__main__': statement means that the code will only get executed if the file is run directly, not imported. In fact, if you open up the REPL and type in import foo, nothing will be printed to your screen. This is incredibly useful. It means that you can put test code inside your script as you're developing it without worrying that it will interfere with your project. Not only that, it documents to other developers how the code in that file should be used and provides a simple test to check to make sure that you're not creating errors.
In larger projects, you'll see that most developers put test and debugging code under so called "ifmain" statements at the bottom of their files. You should do this too!
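To see the "ifmain" guard at work without juggling two terminals, here is a hedged sketch that runs the same source twice with the standard library's runpy module: once under the name __main__, as the python binary would, and once under another name, which stands in for an import here. The file name demo.py and its contents are made up for this illustration.

```python
import os
import runpy
import tempfile

SOURCE = '''
RESULT = []

def greet():
    return "hello"

if __name__ == "__main__":
    RESULT.append(greet())
'''


def run_both_ways():
    """Return RESULT after running SOURCE as a script and under an import-like name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.py")
        with open(path, "w") as handle:
            handle.write(SOURCE)
        as_script = runpy.run_path(path, run_name="__main__")
        as_import = runpy.run_path(path, run_name="demo")
    return as_script["RESULT"], as_import["RESULT"]


print(run_both_ways())
```

Only the first run executes the guarded block; in the second, greet is defined but never called, which is what happens when another module imports the file.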

With this example, hopefully you have learned the workflow for developing Python programs both by executing scripts and using "ifmain" as well as importing and reloading scripts in the REPL. Most developers use both methods interchangeably, using whatever is needed at the time.
Structuring Larger Projects
Ok, so how do you write an actual Python program and move from experimenting with short snippets of code to larger programs? The first thing you have to do is organize your code into a project. Unfortunately there is really nothing to do this for you automatically, but most developers follow a well known pattern that was introduced by Zed Shaw in his book Learn Python the Hard Way.
In order to create a new project, you'll implement the "Python project skeleton," a set of directories and files that belong in every single project you create. The project skeleton is very familiar to Python developers, and you'll quickly start to recognize it as you investigate the code of other Python developers (which you should be doing). The basic skeleton is implemented inside of a project directory, which is stored in your workspace as described above. The directory structure is as follows (for an example project called myproject):
$ myproject
.
├── README.md
├── LICENSE.txt
├── requirements.txt
├── setup.py
├── bin
|   └── myapp.py
├── docs
|   ├── _build
|   ├── conf.py
|   ├── index.rst
|   └── Makefile
├── fixtures
├── foo
|   └── __init__.py
└── tests
    └── __init__.py

This is a lot, but don't be intimidated. This structure implements many tools including packaging for distribution, documentation with Sphinx, testing, and more.
Let's go through the pieces one by one. Project documentation is the first part, implemented as the README.md and LICENSE.txt files. The README file is a markdown document in which you can add developer-specific documentation for your project. The LICENSE can be any open source license, or a Copyright statement in the case of proprietary code. Both of these files are typically generated for you if you create your project in Github. If you do create your project in Github, you should also use the Python .gitignore that Github provides, which helps keep your repositories clean.
The setup.py script is a Python setuptools or distutils installation script and will allow you to configure your project for deployment. It will use requirements.txt to specify the third party dependencies required to implement your project. Other developers will also use these files to create their development environments.
The docs directory contains the Sphinx documentation generator. Python documentation is written in reStructuredText, a markup language similar to Markdown. This documentation should be more extensive and should be for both users and developers. The bin directory will contain any executable scripts you intend to build. Data scientists also typically have a fixtures directory in which to store data files.
The foo and tests directories are actually Python packages since they contain the __init__.py file. You'll put your code in foo and your tests in tests. Once you start developing inside your foo directory, note that when you open up the REPL, you have to import everything from the 'foo' namespace. You can put import statements in your __init__.py files to make things easier to import as well. You can still also execute your scripts in the foo directory using the "ifmain" method.
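For a feel of how little is involved, the skeleton described above can be created with a few lines of Python. This is only a sketch: create_skeleton and the SKELETON list are made up for this illustration, the files are created empty, and the contents of docs are left to Sphinx.

```python
import tempfile
from pathlib import Path

# Paths taken from the skeleton above; a trailing "/" marks a directory.
SKELETON = [
    "README.md", "LICENSE.txt", "requirements.txt", "setup.py",
    "bin/myapp.py", "docs/", "fixtures/", "foo/__init__.py", "tests/__init__.py",
]


def create_skeleton(root):
    """Create empty placeholder files and directories under root."""
    root = Path(root)
    for entry in SKELETON:
        target = root / entry
        if entry.endswith("/"):
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
    return sorted(str(p.relative_to(root)) for p in root.rglob("*"))


if __name__ == "__main__":
    # Build the skeleton in a temporary directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_skeleton(Path(tmp) / "myproject"):
            print(path)
```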
Setting Up Your First Project
You don't have to manually create the structure above; many tools will help you build this environment. For example, the Cookiecutter project will help you manage project templates and quickly build them. The sphinx-quickstart command will generate your documentation directory. Github will add the README.md and LICENSE.txt stubs. Finally, pip freeze will generate the requirements.txt file.
Starting a Python project is a ritual, however, so I will take you through my process for starting one. Light a candle, roll up your sleeves, and get a coffee. It's time.
Inside of your Projects directory, create a directory for your workspace (project). Let's pretend that we're building a project that will generate a social network from emails, we'll call it "emailgraph."
$ mkdir ~/Projects/emailgraph
$ cd ~/Projects/emailgraph

Initialize your repository with Git.
$ git init

Initialize your virtualenv with virtualenvwrapper.
$ mkvirtualenv -a $(pwd) emailgraph

This will create the virtual environment in ~/.virtualenvs/emailgraph and automatically activate it for you. At any time and at any place on the command line, you can issue the workon emailgraph command and you'll be taken to your project directory (the -a flag specifies that this is the project directory for this virtualenv).

Create the various directories that you'll require:
(emailgraph)$ mkdir bin tests emailgraph docs fixtures

And then create the various files that are needed:
(emailgraph)$ touch tests/__init__.py
(emailgraph)$ touch emailgraph/__init__.py
(emailgraph)$ touch setup.py README.md LICENSE.txt .gitignore
(emailgraph)$ touch bin/emailgraph-admin.py

Generate the documentation using sphinx-quickstart:
(emailgraph)$ sphinx-quickstart

You can safely use the defaults, but make sure that you do accept the Makefile at the end to quickly and easily generate the documentation. This should create an index.rst and conf.py file in your docs directory.

Install nose and coverage to begin your test harness:
(emailgraph)$ pip install nose coverage

Open up the tests/__init__.py file with your favorite editor, and add the following initialization tests:
import unittest


class InitializationTests(unittest.TestCase):

    def test_initialization(self):
        """
        Check the test suite runs by affirming 2+2=4
        """
        self.assertEqual(2+2, 4)

    def test_import(self):
        """
        Ensure the test suite can import our module
        """
        try:
            import emailgraph
        except ImportError:
            self.fail("Was not able to import the emailgraph")

From your project directory, you can now run the test suite, with coverage as follows:
(emailgraph)$ nosetests -v --with-coverage --cover-package=emailgraph \
    --cover-inclusive --cover-erase tests

You should see two tests passing along with a 100% test coverage report.
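If nose is not available, the same kind of sanity check can be run with the standard library alone. The sketch below swaps nosetests for the unittest runner, so there is no coverage report; it repeats only the first test, since the import test needs the emailgraph package on the path, and run_tests is a helper name made up for this illustration.

```python
import io
import unittest


class InitializationTests(unittest.TestCase):

    def test_initialization(self):
        """Check the test suite runs by affirming 2+2=4."""
        self.assertEqual(2 + 2, 4)


def run_tests():
    """Run InitializationTests; return (all passed, number run, text report)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(InitializationTests)
    report = io.StringIO()  # capture the runner's report instead of writing to stderr
    result = unittest.TextTestRunner(stream=report, verbosity=2).run(suite)
    return result.wasSuccessful(), result.testsRun, report.getvalue()


if __name__ == "__main__":
    passed, count, text = run_tests()
    print(text)
    print("passed:", passed, "tests run:", count)
```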

Open up the setup.py file and add the following lines:

#!/usr/bin/env python
raise NotImplementedError("Setup not implemented yet.")

Setting up your app for deployment is the topic of another post, but this will alert other developers to the fact that you haven't gotten around to it yet.

Create the requirements.txt file using pip freeze:
(emailgraph)$ pip freeze > requirements.txt
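The resulting file is plain text with one pinned package per line, which makes it easy to inspect from Python. The following sketch reads such pins into a dictionary; parse_pins is a name made up for this illustration, the example content is hypothetical, and lines that are not simple name==version pins are skipped rather than interpreted.

```python
def parse_pins(text):
    """Map package names to versions for every 'name==version' line in text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # blank lines, comments, and unpinned entries are skipped
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


# Hypothetical content; a real requirements.txt depends on what is installed.
example = "examplepkg==1.0.0\n# a comment\notherpkg==2.3\n"
print(parse_pins(example))
```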

Finally, commit all the work you've done on emailgraph to the repository.
(emailgraph)$ git add --all
(emailgraph)$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

    new file:   LICENSE.txt
    new file:   README.md
    new file:   bin/emailgraph-admin.py
    new file:   docs/Makefile
    new file:   docs/conf.py
    new file:   docs/index.rst
    new file:   emailgraph/__init__.py
    new file:   requirements.txt
    new file:   setup.py
    new file:   tests/__init__.py

(emailgraph)$ git commit -m "Initial repository setup"

With that you should have your project all set up and ready to go. Get some more coffee; it's time to start work!
Conclusion
With this post, hopefully you've discovered some best practices and workflows for Python development. Structuring both your code and projects this way will help keep you organized and will also help others quickly understand what you've built, which is critical when working on projects involving more than one person. More importantly, this project structure is the preparation for deployment and the base for larger applications and professional, production grade software. Whether you're scripting or writing apps, I hope that these workflows will be useful.
If you'd like to explore further how to include professional grade tools into your Python development, check out some of the following tools:
Travis-CI is a continuous integration service that will automatically run your test harness when you commit to Github. It will make sure that all of your tests are passing before you push to production!
Waffle.io will turn your Github issues into a full Agile board allowing you to track milestones and sprints, and better coordinate your team.
Pylint will automatically check for good coding standards, error detection, and even draw UML diagrams for your code!

If you're having trouble with anything we've covered or you find any errors, please leave us a comment! Also, all developers are as different as they are the same, so if you have a workflow that you think others would benefit from, please let us know in the comments!

Further Reading

Learn Python the Hard Way
Learning Python, 5th Edition
Programming Python
Practical Data Science Cookbook
Python for you and me
Easy-Python
Awesome Python
Python Free Books
Python Ecosystem An Introduction
Full Stack Python
Talk Python FM
