UCI Machine Learning Repository Usage Guide (repost) (2012-04-16 16:26:52)

UCI Machine Learning Repository Usage Guide

The UCI Machine Learning Repository is at: http://archive.ics.uci.edu/ml/

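The repository has been updated continuously up to 2010. Anyone studying artificial intelligence is likely to need it, and it is a standard resource for reading papers, writing papers, and testing algorithms. Its data sets cover daily life, engineering, and science, and they range from small to large, with the largest holding several hundred thousand records.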

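UCI data can be read with MATLAB's dlmread or textread. Non-numeric class labels must first be replaced with numbers (for example 1/2/3); otherwise the fields cannot be read as numeric values and are treated as characters.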
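The label replacement described above can also be scripted rather than done by hand. The following is a minimal sketch in Python (not MATLAB), assuming the class label is the last comma-separated field of each row; the function name `encode_labels` and the in-order numbering 1, 2, 3 are illustrative choices, not part of the UCI documentation.

```python
def encode_labels(lines):
    """Replace the last (class) field of each comma-separated row with an integer.

    Codes are assigned 1, 2, 3, ... in order of first appearance.
    Returns the rewritten rows and the label-to-code mapping.
    """
    codes = {}
    encoded = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        *values, label = line.split(",")
        # Reuse the existing code for a known label, otherwise assign the next one.
        code = codes.setdefault(label, len(codes) + 1)
        encoded.append(",".join(values + [str(code)]))
    return encoded, codes


# Sample rows in the iris.data format.
rows = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "7.0,3.2,4.7,1.4,Iris-versicolor",
    "6.3,3.3,6.0,2.5,Iris-virginica",
    "4.9,3.0,1.4,0.2,Iris-setosa",
]
encoded_rows, label_codes = encode_labels(rows)
print(encoded_rows)
print(label_codes)
```

Writing `encoded_rows` back to a text file, one row per line, gives an all-numeric file that a purely numeric reader should accept.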

UCI Repository Usage Guide

Reposted from: http://www.aiseminar.cn/bbs/thread-37-1-1.html

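This directory contains data sets and domain theories (the latter are annotated as such in the brief listing below) that have been or can be used to evaluate learning algorithms.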

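Each data file (*.data) contains many individual records described as attribute-value pairs. The corresponding *.info file contains extensive documentation. (Some files _generate_ databases; they do not have *.data files.) In addition to the data sets and domain theories, the utilities directory contains utilities that may be useful when working with these data sets.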

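The contents of the repository can be viewed and remotely copied over the web at http://www.ics.uci.edu/~mlearn/MLRepository.html. Alternatively, the data can be obtained via FTP from ftp://ftp.ics.uci.edu using anonymous login; the databases are in the pub/machine-learning-databases directory.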

Note:

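UCI is always looking for additional data, which can be written to the incoming sub-directory. Please contribute your data, with documentation. Thanks. The contribution procedure is described in the DOC-REQUIREMENTS file. Presently, most data sets use the following format: one instance per line, no spaces, attribute values separated by commas (","), and missing values denoted by a question mark ("?"). After making a contribution, please also notify the site librarian: ml-repository@ics.uci.edu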
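The format described above (one instance per line, comma-separated values, "?" for missing values) is simple to parse. Below is a minimal Python sketch; the function name `parse_record`, the choice of `None` for missing values, and the sample row with a "?" are assumptions made for illustration only.

```python
def parse_record(line, missing="?"):
    """Split one comma-separated record; map the missing marker to None.

    Fields that parse as numbers become floats; anything else stays a string.
    """
    fields = []
    for raw in line.strip().split(","):
        if raw == missing:
            fields.append(None)
            continue
        try:
            fields.append(float(raw))
        except ValueError:
            fields.append(raw)  # e.g. a class label such as "Iris-setosa"
    return fields


# Hypothetical row with a missing second attribute.
record = parse_record("5.1,?,1.4,0.2,Iris-setosa")
print(record)
```

Keeping missing values as `None` leaves it to the caller to drop the record or impute a value, which depends on the algorithm being tested.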

The following uses the Iris data set from UCI as an example:

The ucidata\iris folder contains three files:

Index

iris.data

iris.names

Index is the folder listing; it lists all the files in the folder. For iris, the contents of Index are:

Index of iris

18 Mar 1996 105 Index

08 Mar 1993 4551 iris.data

30 May 1989 2604 iris.names

iris.data is the Iris data file. Its contents look like this:

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa

……

7.0,3.2,4.7,1.4,Iris-versicolor

6.4,3.2,4.5,1.5,Iris-versicolor

6.9,3.1,4.9,1.5,Iris-versicolor

……

6.3,3.3,6.0,2.5,Iris-virginica

5.8,2.7,5.1,1.9,Iris-virginica

7.1,3.0,5.9,2.1,Iris-virginica

……

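As shown above, attribute values are separated by commas with no spaces (5.1,3.5,1.4,0.2,). The last column is the class that the row's attributes belong to, i.e. the decision attribute (for example, Iris-setosa).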
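The row layout just described (numeric attributes first, decision attribute last) can be separated into features and a label as follows. This is a minimal Python sketch using the standard csv module; `load_rows` and the inline `SAMPLE` text are illustrative names, and the sample holds only a few rows copied from the excerpt above.

```python
import csv
import io

# A few rows in the iris.data format.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def load_rows(handle):
    """Return a list of (features, label) pairs from comma-separated text."""
    samples = []
    for row in csv.reader(handle):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        *features, label = row
        samples.append(([float(v) for v in features], label))
    return samples


samples = load_rows(io.StringIO(SAMPLE))
features, label = samples[0]
print(features, label)
print(len(samples))
```

To read a real file instead, pass an open file handle, for example `open("iris.data", newline="")`, assuming the file has been downloaded to the working directory.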

iris.names describes the Iris data: its title, sources, past usage, relevant information, number of instances, attributes, and so on. An excerpt:

……

7. Attribute Information:

1. sepal length in cm

2. sepal width in cm

3. petal length in cm

4. petal width in cm

5. class:

-- Iris Setosa

-- Iris Versicolour

-- Iris Virginica

……

9. Class Distribution: 33.3% for each of 3 classes.

For examples of using this data set, see other papers or later content on this site.

The corresponding English original:

============================================================================

This is the UCI Repository Of Machine Learning Databases and Domain Theories

4 December 1995

ftp.ics.uci.edu: pub/machine-learning-databases

http://www.ics.uci.edu/~mlearn/MLRepository.html

Librarian: Patrick M. Murphy (ml-repository@ics.uci.edu)

111 databases and domain theories (36MB)

============================================================================

This directory contains data sets and domain theories (the latter have been

annotated as such in the following brief listing) that have been or can be

used to evaluate learning algorithms. Each data file (*.data) contains

individual records described in terms of attribute-value pairs. The

corresponding *.info file contains voluminous documentation. (Some files

_generate_ databases; they do not have *.data files.)

In addition to data sets and domain theories, the "utilities/" directory

contains utilities that you may find useful when using datasets in this

repository.

The contents of this repository can be viewed and remotely copied over

the web. The address is http://www.ics.uci.edu/~mlearn/MLRepository.html.

Alternatively, the contents of this repository can be remotely copied via

ftp to ftp.ics.uci.edu. Enter "anonymous" for user id, and e-mail address

(user@host) for password. These databases can be found by executing

"cd pub/machine-learning-databases".

Notes:

1. We're always looking for additional databases, which can be

written to the sub-directory named "/incoming". Please send yours, with

documentation. Thanks -- See DOC-REQUIREMENTS for suggested documentation

procedures. Presently, most databases have the following format: 1

instance per line, no spaces, commas separate attribute values, and

missing values are denoted by "?". Also, please notify the site librarian

(ml-repository@ics.uci.edu) after making a donation.

2. Ivan Bratko requested that the databases he donated from the Ljubljana

Oncology Institute (e.g., breast-cancer, lymphography, and primary-tumor)

have restricted access. We are allowed to share them with academic

institutions upon request. These databases (like several others) require

that proper citations be made in published articles that use them.

Citation requirements are in each database's corresponding *.doc file.

To access any of these databases, send email to ml-repository@ics.uci.edu.

To aid you in deciding if you want any of these databases, the

documentation files are available.

3. An archive server may now be used to receive via e-mail files in this

repository. Installed on ics, it provides email access to files in

our anonymous ftp/uucp area (~ftp). If people have no other access to

our archives, then they can send mail to:

archive-server@ics.uci.edu

Commands to the server may be given in the body. Some commands are:

help

send

find

The help command replies with a useful help message.

If you publish material based on databases obtained from this repository,

then, in your acknowledgements, please note the assistance you received by

using this repository. Thanks -- this will help others to obtain the same

data sets and replicate your experiments. We suggest the following pseudo-APA

reference format for referring to this repository (LaTeX'd):

Murphy,~P.~M., \& Aha,~D.~W. (1994). {\it UCI Repository of machine

learning databases} [http://www.ics.uci.edu/~mlearn/MLRepository.html].

Irvine, CA: University of California, Department of Information and Computer

Science.

Patrick M. Murphy (Repository Librarian)

----------------------------------------------------------------------

Brief Overview of Databases and Domain Theories:

Quick Listing:

1. annealing (David Sterling and Wray Buntine)

2. Artificial Characters Database & DT (donated by Attilio Giordana)

3-4. audiology (Ray Bareiss and Bruce Porter, used in Protos)

1. Original Version

2. Standardized-Attribute Version of the Original.

5. auto-mpg (from CMU StatLib library)

6. autos (Jeff Schlimmer)

7. badges (Haym Hirsh)

8. balance-scale (Tim Hume)

9. balloons (Michael Pazzani)

10. breast-cancer (Ljubljana Institute of Oncology, restricted access)

11. breast-cancer-wisconsin (Wisconsin Breast Cancer D'base, Olvi Mangasarian)

1. Original version

2. Diagnostic data set

3. Prognostic data set

12. bridges (Yoram Reich)

13-21. chess

1. Partial generator of Quinlan's chess-end-game data (kr-vs-kn) (Schlimmer)

2. Shapiro's endgame database (kr-vs-kp) (Rob Holte)

3. king-rook-vs-king (Michael Bain, Arthur van Hoff)

4-9. Six domain theories (Nick Flann)

22. Bach Chorales (time-series) database (Darrell Conklin)

23. Connect-4 Database (John Tromp)

24-25. Credit Screening Database

1. Japanese Credit Screening Data and domain theory (Chiharu Sano)

2. Credit Card Application Approval Database (Ross Quinlan)

26. Ein-Dor and Feldmesser's cpu-performance database (David Aha)

27. Diabetes Data (Serdar Uckun, AIM-94)

28. dgp-2 data generation program (Powell Benedict)

29. Document Understanding (Donato Malerba)

30. Nine small EBL domain theories and examples in sub-directory ebl

31. Evlin Kinney's echocardiogram database (Steven Salzberg)

32. flags (Richard Forsyth)

33. function-finding (Cullen Schafer's 352 case studies)

34. glass (Vina Spiehler)

35. hayes-roth (from Hayes-Roth^2's paper)

36-39. heart-disease (Robert Detrano)

40. hepatitis (G. Gong)

41. horse colic database (Mary McLeish & Matt Cecile)

42. (Boston) Housing database (from CMU StatLib library)

43. ICU data (Serdar Uckun, AIM-94)

44. Image segmentation database (Carla Brodley)

45. ionosphere information (Vince Sigillito)

46. iris (R.A. Fisher, 1936)

47. isolet (Ron Cole and Mark Fanty's database donated by Tom Dietterich)

48. kinship (J. Ross Quinlan)

49. labor-negotiations (Stan Matwin)

50-51. led-display-creator (from the CART book)

52. lenses (Cendrowska's database donated by Benoit Julien)

53. letter-recognition database (created and donated by David Slate)

54. liver-disorders (BUPA Medical's database donated by Richard Forsyth)

55. logic-theorist (Paul O'Rorke)

56. lung cancer (Stefan Aeberhard)

57. lymphography (Ljubljana Institute of Oncology, restricted access)

58-59. mechanical-analysis (Francesco Bergadano)

1. Original Mechanical Analysis Data Set

2. PUMPS DATA SET

60. mobile robots (donated by Klingspor, Morik and Rieger)

61-64. molecular-biology

1. promoter sequences (Towell, Shavlik, & Noordewier, domain theory also)

2. splice-junction sequences (Towell, Noordewier, & Shavlik,

domain theory also)

3. protein secondary structure database (Qian and Sejnowski)

4. protein secondary structure domain theory (Jude Shavlik & Rich Maclin)

65. MONK's Problems (donated by Sebastian Thrun)

66. Moral Reasoner Database (donated by James Wogulis)

67. mushroom (Jeff Schlimmer)

68. MUSK databases (2) (donated by Tom Dietterich)

69. othello domain theory (Tom Fawcett)

70. Page Blocks Classification (Donato Malerba)

71. Pima Indians diabetes diagnoses (Vince Sigillito)

72. Postoperative Patient data (Jerzy W. Grzymala-Busse)

73. Primary Tumor (Ljubljana Institute of Oncology, restricted access)

74. Qualitative Structure Activity Relationships (QSARs) (Ross King)

75. Quadruped Animals (John H. Gennari)

76. Servo data (Ross Quinlan)

77. shuttle-landing-control (Bojan Cestnik)

78. solar flare (Gary Bradshaw)

79-80. soybean (from Ryszard Michalski's groups)

81. space shuttle databases (David Draper)

82. spectrometer (Infra-Red Astronomy Satellite Project Database, John Stutz)

83. Sponge Database (Iosune Uriz and Marta Domingo)

84. Statlog Project databases (7) (from Ross King,...)

85. Student Loan relational database (from Michael Pazzani)

86. tic-tac-toe endgame database (Turing Institute, David W. Aha)

87-97. thyroid-disease (Garavan Institute, J. Ross Quinlan; Stefan Aeberhard)

98. trains database (David Aha & Eric Bloedorn)

99-104. Undocumented databases: sub-directory undocumented

1. Economic sanctions database (domain theory included, Mike Pazzani)

2. Cloud cover images (Philippe Collard)

3. DNA secondary structure (Qian and Sejnowski, donated by Vince Sigillito)

4. Nettalk data (Sejnowski and Rosenberg, taken from connectionist-bench)

5. Sonar data (Gorman and Sejnowski, taken from connectionist-bench)

6. Vowel data (Qian, Sejnowski and Turney, taken from connectionist-bench)

105. university (Michael Lebowitz, donated by Steve Souders)

106. voting-records (Jeff Schlimmer)

107. water treatment plant data (donated by Javier Bejar and Ulises Cortes)

108-109. Waveform domain (taken from CART book)

110. Wine Recognition Database (donated by Stefan Aeberhard)

111. Zoological database (Richard Forsyth)
