Thomas G. Dietterich
Distinguished Professor and Director of Intelligent Systems
School of Electrical Engineering and Computer Science
Corvallis, Oregon 97331-5501
E-mail: tgd@cs.orst.edu
Phone: +1-541-737-5559
Office: KEC 2067
(Last updated June 5, 2015.)
Page Contents: Research, Prospective Students, Publications, Talks, CV, Software, Students and Staff, Course Materials, Bio Sketch, Conferences
"If you invent a breakthrough in artificial intelligence, so machines can learn," Mr. Gates responded, "that is worth 10 Microsofts." (Quoted in NY Times, Monday March 3, 2004)
The focus of my research is machine learning (and the associated areas of Data Science and Big Data). How can we make computer systems that adapt and learn from their experience? How can we combine machine learning with other advances in AI to build Integrated Intelligent Systems? How can we combine human knowledge with massive data sets to expand scientific knowledge and build more useful computer applications? My laboratory combines research on machine learning and AI fundamentals with applications to problems in science and engineering.
Scientific Projects
Ecosystem Informatics and Computational Sustainability: Oregon State University is a leader in combining computer science and the ecological sciences to build the new discipline of Ecosystem Informatics. Ecosystem Informatics studies methods for collecting, analyzing, and visualizing data on the structure and function of ecosystems. It is an instance of an important new direction in science: Data Exploration Science (see Jim Gray's 2003 KDD talk).
Oregon State is also part of the Institute for Computational Sustainability led by Cornell University. This effort seeks to develop novel computational methods to address problems in ecosystem science and sustainable management of the biosphere.
My group is involved in many Ecosystem Informatics and Computational Sustainability activities:
Machine Learning for Species Distribution. One of the central goals of ecology is to understand and predict the distribution of species (including the bugs that we are studying in the Insect Identification project). Given a data set that records observations of the presence (or absence) of multiple species at multiple locations, we wish to develop models that can predict their presence/absence elsewhere. We are interested not only in static distribution models, but also in process models that capture the temporal and spatial dynamics of species distributions (e.g., bird migration, flight times of moths, return of salmon, spread of invasive species, survival of endangered species, etc.). Our species distribution team includes faculty members (Matt Betts, myself, and Weng-Keen Wong), post-docs Rebecca Hutchinson and Selina Chu, and graduate students Arwen Lettkeman and Liping Liu. We collaborate very closely with the Cornell Laboratory of Ornithology and with the DataONE Datanet. In particular, we are studying methods for dealing with the many shortcomings of the citizen science data collected by the Lab of Ornithology in their eBird project, including (a) partial detection, (b) a wide range of birder expertise, and (c) highly biased spatial distribution of observations.
BirdCast. Another special case of species distribution modeling is understanding bird migration. With the Lab of Ornithology, we are developing methods for reconstructing and predicting bird migration across North America. Our goal is to understand what signals birds use to decide when to migrate and to provide daily forecasts of bird migration by combining eBird reports, weather radar, acoustic monitoring of flight calls, and weather forecasts. The project web site is available here.
Approximate Optimization for Bio-Economic Models. Many sustainability applications require solving large spatio-temporal optimization problems under uncertainty. We are collaborating with economists Jo Albers and Claire Montgomery on methods for approximate solution of spatio-temporal optimization problems involving land management for wildfire control and counter-measures for controlling invasive species.
Project TAHMO: Deployment, Cleaning, and Analysis of Sensor Network Data. We are part of Project TAHMO, which seeks to construct and deploy a network of 20,000 hydro-meteorological stations in Africa. We are developing algorithms for sensor placement, data cleaning, recovery from damaged sensors, and analysis of the resulting data. We are building on our previous work with Ethan Dereszynski on dynamic Bayesian network models for sensor data cleaning.
Arthropod Identification. Our current understanding of complex ecosystems is limited by a lack of data. One particularly useful kind of data is population counts of "bugs" (small arthropods that live in soils, lakes, streams, and the ocean). The BugID project seeks to develop devices for capturing, imaging, and sorting bugs combined with general image processing/machine learning/pattern recognition tools for counting and classifying them. We hope to transform the ability of scientists to measure the health of forests, streams, and estuaries. More generally, we are interested in developing a wide range of novel instruments for expanding the quality, quantity, and spatio-temporal resolution of ecologically-relevant data. Our research also contributes to computer vision and object recognition more generally.
NIPS 2012 Posner Lecture: Challenges for Machine Learning in Computational Sustainability.
ICML 2011 Tutorial on Machine Learning in Ecology and Ecosystem Management
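The "partial detection" problem mentioned above for eBird data can be made concrete with a small sketch. The code below uses a textbook-style occupancy-detection likelihood and is not necessarily the formulation used in this group's work. It assumes a single species, a constant occupancy probability psi, a constant per-visit detection probability p, and no false positives; the function names are hypothetical.

```python
def site_likelihood(detections, psi, p):
    """Likelihood of one site's detection history (a sequence of 0/1 visits).

    Assumptions (illustrative only): the site is occupied with probability
    psi; each visit to an occupied site detects the species independently
    with probability p; an unoccupied site never yields a detection.
    """
    # Case 1: the site is occupied and produced this exact history.
    occupied = psi
    for y in detections:
        occupied *= p if y else (1.0 - p)
    # Case 2: the site is unoccupied, possible only if nothing was detected.
    unoccupied = (1.0 - psi) if not any(detections) else 0.0
    return occupied + unoccupied


def posterior_occupied(detections, psi, p):
    """Probability that the site is occupied given its detection history."""
    occupied = psi
    for y in detections:
        occupied *= p if y else (1.0 - p)
    return occupied / site_likelihood(detections, psi, p)


# Two visits with no detections do not prove absence: with psi = 0.5 and
# p = 0.5, the site is still occupied with probability 0.2.
print(posterior_occupied([0, 0], psi=0.5, p=0.5))
```

The point of the sketch is that a reported absence carries two explanations (the species was absent, or it was present and missed), which is why detection has to be modeled separately from occupancy.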
Intelligent Desktop Assistants. We have been involved in two large efforts to develop intelligent assistants for the computer desktop.
TaskTracer. When you come into work in the morning, you don't want to say to your computer "I want to run Word", but rather, "I want to work on my CS534 homework". In other words, you would like a user interface that was organized around your projects and activities rather than around application programs, files, folders, etc. You would also like all of your information in one place rather than scattered across the local file system, network file systems, web sites, email folders, calendar, contacts, etc. TaskTracer extends the Windows UI to provide exactly this functionality. This research is supported by DARPA with previous support from Google, Intel, and the DARPA CALO project. OSU News Service story. Project Web Site.
CALO. The goal of the CALO project was to develop an AI personal assistant that can help you find relevant documents, prepare for meetings, keep track of what is going on during meetings, and autonomously execute tasks such as arranging travel, scheduling meetings, executing administrative workflows (e.g., purchasing and staffing), and so on. Our work on CALO focused on developing methods for integrating multiple, separately-engineered components into a single learning and reasoning system. We also prototyped a novel system that employs programming-by-demonstration to define new learning tasks for CALO to solve autonomously. We are currently editing a book describing the results of the CALO project.
Next Generation Phenomics. An important goal in biology is to reconstruct the tree of life. As part of the Project AVATOL team, we are developing computer vision and machine learning methods to automatically discover and score phenotype characters (features) from images of biological specimens. These scores can then be combined with other information (e.g., genetic sequences, functional measurements) to reconstruct phylogenetic trees. Phenomic information is particularly valuable for sets of closely-related species (where DNA differences may not reflect functional differences) and for extinct species known only through fossil specimens.
The computer science challenges involve learning to score known characters, which typically include shape, texture, color, and topological features of specimens, from weakly-labeled data and discovering new characters that are shared across some taxonomic groups but not others.
Fundamental Machine Learning and Artificial Intelligence Research
Machine Reading and Deep Reading. In collaboration with researchers at BBN, CMU, University of Washington, ISI, and UMass, we are studying methods for extracting knowledge from text to support inference. Our focus is on learning rules (e.g., Horn clauses) and scripts (e.g., logical hidden Markov models) from noisy and incomplete training data extracted from reading text. Funding provided by the DARPA Machine Reading and DEFT programs.
Anomaly Detection. An important capability for AI systems is to be able to detect when an input situation is unusual. For example, anomaly detection can allow machine learning systems to detect when an input case is very different from the training data and hence could lead to extrapolation and poor performance. Anomaly detection methods are also important for detecting novel failures in sensor networks and novel attacks on computer systems. We are developing a range of algorithms for anomaly detection under the DARPA ADAMS program.
We have also developed a new benchmarking methodology for comparing anomaly detection methods. Benchmark data sets and scripts are available for download.
Flexible Latent Variable Modeling. Many problems in machine learning require learning models of hidden ("latent") processes. Such latent variable models can be easily represented using graphical models. However, such models are typically expressed using parametric probability distributions, which limits their ability to adapt to the complexity of the process and the amount of data. Our research seeks to integrate flexible machine learning methods (such as boosted regression trees) into latent process models. Postdoc Rebecca Hutchinson and graduate student Liping Liu are developing an R package that integrates boosted regression trees into certain latent variable models common in species distribution modeling.
Learning Individual Models from Aggregate Data. Most data in ecology (and other fields) records information in aggregated form (e.g., population counts, census figures). Often, we wish to fit models of individual behavior using such aggregated data. One example is the problem of predicting bird migration from eBird counts. Former Postdoc Dan Sheldon has developed a new formalism, the Collective Graphical Model, that directly transforms individual models to aggregate models that can then be easily linked to aggregated data.
Sample-Efficient Algorithms for Solving Spatial Markov Decision Processes. We are developing algorithms for solving MDPs in which the state consists of a landscape of patches, and each patch has its own state. This means that the state space is enormous. At each time step, an action must be specified for each patch, so the action space is also enormous. In these problems, phenomena (such as fires, infections, species spread) propagate spatially, so we cannot treat the patches as independent. Our algorithms seek exact (or bounded approximate) solutions for small problem instances, and satisfactory solutions for real-sized problem instances. Our algorithms optimize both expected (discounted) cumulative reward and also risk-sensitive reward measures. The transition probabilities for these models are specified in the form of an expensive simulator which can be invoked with a given state and action to obtain a sample of the resulting state and the reward. An important goal is to minimize the number of calls to the simulator.
Software Engineering Methods and Tools for Machine Learning Systems. Creating and maintaining a successful deployed machine learning system is still largely an art that requires a Ph.D. The goal of this research is to develop software engineering methods and tools for creating and maintaining deployed machine learning systems.
Reviews, tutorials, and books. I have written several review articles and tutorials on machine learning.
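The spatial MDP setting described above (a landscape of patches, one action per patch, and an expensive simulator whose calls should be minimized) can be illustrated with a toy sketch. Everything here is invented for illustration: the CountingSimulator class, its fire-spread dynamics, the reward, and the two policies are assumptions, and the estimator is plain Monte Carlo policy evaluation of expected discounted return rather than any of the algorithms developed in this research.

```python
import random


class CountingSimulator:
    """Toy stand-in for an expensive simulator (dynamics are hypothetical)."""

    def __init__(self, spread_prob=0.3, seed=0):
        self.spread_prob = spread_prob
        self.rng = random.Random(seed)
        self.calls = 0  # number of simulator invocations so far

    def step(self, state, action):
        """Sample one transition.

        state:  tuple with 0 (healthy) or 1 (burning) for each patch in a row.
        action: tuple with 0 (do nothing) or 1 (suppress) for each patch.
        Returns (next_state, reward).
        """
        self.calls += 1
        n = len(state)
        nxt = []
        for i in range(n):
            if action[i] == 1:
                nxt.append(0)  # a suppressed patch is not burning next step
            elif state[i] == 1:
                nxt.append(1)  # an untreated fire keeps burning
            else:
                # Fire propagates spatially: patches are not independent.
                neighbor_burning = any(
                    state[j] == 1 for j in (i - 1, i + 1) if 0 <= j < n
                )
                catches = neighbor_burning and self.rng.random() < self.spread_prob
                nxt.append(1 if catches else 0)
        # Cost of 1 per burning patch and 0.5 per suppression action.
        reward = -float(sum(nxt)) - 0.5 * sum(action)
        return tuple(nxt), reward


def estimate_return(sim, start, policy, horizon=10, gamma=0.95, rollouts=20):
    """Monte Carlo estimate of the expected discounted return of a policy."""
    total = 0.0
    for _ in range(rollouts):
        state, discount = start, 1.0
        for _ in range(horizon):
            state, reward = sim.step(state, policy(state))
            total += discount * reward
            discount *= gamma
    return total / rollouts


def do_nothing(state):
    return tuple(0 for _ in state)


def suppress_burning(state):
    return tuple(state)  # suppress exactly the patches that are burning


sim = CountingSimulator(seed=0)
value = estimate_return(sim, (0, 1, 0, 0), suppress_burning, horizon=5, rollouts=4)
print(value, sim.calls)  # sim.calls is rollouts * horizon = 20
```

Even in this four-patch toy there are 2^4 states and 2^4 joint actions, and each value estimate costs rollouts times horizon simulator calls, which is the budget that sample-efficient methods try to reduce.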
My group meets Mondays 3-4pm in KEC 2057. Each week, one member gives a one-hour presentation of a paper relevant to their work. There are two other reading groups meeting this quarter. I'm participating in the group organized by Alan Fern on the topic of Monte Carlo Tree Search and MDPs. We meet Thursdays 3-4pm in KEC 2057. There is also an AI Seminar scheduled from time to time (contact Alan Fern for details).
Information for Prospective Students
If you are seeking a research career in machine learning, data mining, artificial intelligence and related areas, and you have a strong background in mathematics and programming, please read my Information for Prospective Students page.
If you are interested in robotics, I encourage you to visit the Robotics Team Pages to learn more about our excellent robotics program.
Publications, Curriculum Vita, Software, and Data
Downloadable Collection of Error Correcting Codes for use with the error-correcting output coding technique.
Release 1.0 of MAXQ Hierarchical Reinforcement Learning code.
RSW (Recurrent Sliding Window) Package for WEKA for sequential supervised learning.
Implementation of the PCBR region detector developed as part of the BugID project.
Implementation of TreeCRF system and supporting materials from our JMLR paper. Java implementation by Brad Block.
Starcraft Scouting dataset (for UAI 2012 paper)
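For readers unfamiliar with the error-correcting output coding technique mentioned in the first item above, here is a minimal decoding sketch. Each class is assigned a binary codeword, one binary classifier is trained per bit position, and a test example is assigned to the class whose codeword is nearest in Hamming distance to the predicted bits. The 4-class, 7-bit code matrix below is written out by hand for illustration and is not taken from the downloadable collection; the function names are hypothetical, and the per-bit classifiers are assumed to exist elsewhere.

```python
# Illustrative code matrix: one 7-bit codeword per class label.
CODEWORDS = {
    0: (1, 1, 1, 1, 1, 1, 1),
    1: (0, 0, 0, 0, 1, 1, 1),
    2: (0, 0, 1, 1, 0, 0, 1),
    3: (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits, codewords=CODEWORDS):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codewords, key=lambda label: hamming(bits, codewords[label]))


def min_distance(codewords=CODEWORDS):
    """Smallest Hamming distance between any two distinct codewords."""
    labels = list(codewords)
    return min(
        hamming(codewords[a], codewords[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    )


# Suppose the seven bit classifiers output class 2's codeword with the first
# bit flipped by mistake. Nearest-codeword decoding still recovers class 2.
predicted_bits = (1, 0, 1, 1, 0, 0, 1)
print(decode(predicted_bits))  # 2
print(min_distance())          # 4
```

A code with minimum distance d can correct up to floor((d - 1) / 2) bit errors, so this illustrative code tolerates one mistaken bit classifier per example.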
Professional Service, Journals, and Book Series
I am President of the Association for the Advancement of Artificial Intelligence.
I am a member of the Editorial Board for the Journal of Machine Learning Research, which is an electronic (and hardcopy) journal covering all areas of machine learning.
I edit the MIT Press book series on Adaptive Computation and Machine Learning.
I am the moderator for the machine learning (CS.LG) part of CoRR, which is the computer science sub-part of arXiv.
I am past President of the International Machine Learning Society. I currently serve on the IMLS Board.
Entrepreneurial Activities
I am a co-founder of Strands (formerly MyStrands; formerly MusicStrands), a recommendation company.
I am a co-founder of Smart Desktop. Smart Desktop is now part of Decho, Inc., which is a "cloud computing" effort of EMC. Decho is commercializing technology developed as part of the TaskTracer system.
I am a co-founder and Chief Scientist of BigML. The goal of this startup is to develop large scale cloud-based machine learning services.
Students and Staff
Andrew Emmott, Graduate Student.
Rebecca Hutchinson, SEES Postdoctoral Fellow (jointly mentored with Matt Betts).
Jesse Hostetler, Graduate Student (jointly advised with Alan Fern).
Jed Irvine, Senior Faculty Research Assistant (Software Engineer).
Michael Lam, Graduate Student (jointly advised with Sinisa Todorovic).
Liping Liu, Graduate Student.
Sean McGregor, Graduate Student.
Michael Slater, Graduate Student (On leave from Faculty Research Assistant).
Majid Alkaee Taleghan, Graduate Student.
Shahed Sorower, Graduate Student.
Pat Sullivan, Assistant and Grants Coordinator.
Tadesse Zemicheal, Graduate Student.
Former Students and Staff
Hussein Almuallim, Oil and Energy Professional, Calgary, Canada.
Eric Altendorf, Google.
Adam Ashenfelter, BigML, Inc., Corvallis, Oregon.
Ghulum Bakiri, President at MicroCenter, Bahrain.
Christian Baumberger. Software Engineer at Zuehlke Group
Xinlong Bao. Google Pittsburgh.
Brian Breck.
Waranun Bunjongsat.
Giuseppe Cerbone. Independent Information Services Professional, Milan, Italy.
Martha Chamberlin.
Hei Chan.
Richard Charon.
Eric Chown, Full Professor, Bowdoin College.
Selina Chu, JPL, Pasadena, CA.
Dan Corpron
Mark Crowley, Assistant Professor, Department of Electrical and Computer Engineering, University of Waterloo.
Diane Damon, Damon Consulting, Portland, OR.
Ethan Dereszynski, Research Scientist, WebTrends, Portland, OR.
Phuoc Do, Vida Lab.
Nicholas Flann, Associate Professor, Utah State University.
Greg Foltz.
Dan Forrest.
Tony Fountain, Director of the Cyberinfrastructure Lab for Environmental Observing Systems (CLEOS), UC San Diego.
Ashit Gandhi, Founder and Vice-President, Prism Gem, LLC - The Art of Diamond Coloring.
Colin Gerety, Fort Collins, CO.
Brandon Harvey, Symantec and Linn-Benton Community College.
Guohua Hao, Senior Data Scientist at iHeartRadio.
Hermann Hild, President, SMI Cognitive Software GmbH.
Saket Joshi, Member of Technical Staff at Cycorp.
Varad Joshi, Director of Engineering at Elemental Technologies.
Caroline Koff, Hewlett-Packard Corporation, Fort Collins, CO.
Victoria Keiser, Research Programmer, CMU. Masters Thesis (PDF).
Michael Kelm, Research Scientist, Siemens Healthcare.
Eun Bae Kong, Professor, Computer Science, Chungnam National University, South Korea
Bill Langford, Research Associate at RMIT, Melbourne, Australia.
Junyuan Lin, VMware, Seattle.
Dragos Margineantu, The Boeing Company.
Gonzalo Martinez, Assistant Professor, Autonomous University of Madrid.
Prafulla Mishra, Software Development Manager at eBay.
Avis Ng.
Soumya Ray, Assistant Professor, Case Western Reserve University.
Angelo Restificar, Principal Machine Learning Engineer, eBay, Seattle.
Ritchey Ruff, Senior SDET, Microsoft.
Dan Sheldon, Assistant Professor, University of Massachusetts, Amherst.
Jianqiang Shen. Research Scientist, PARC. Doctoral dissertation.
Rongkun Shen. Post-doc, Oregon Health and Science University, Portland.
Michael Shindler, Lecturer at the University of Southern California
Shriprakash Sinha. Ph.D. student TU Delft.
Simone Stumpf. Senior Lecturer, City University London.
Tao Sun, Graduate Student at UMass Amherst.
Dan Vega, Senior Software Engineer at Valley Inception, LLC.
Mark Vulfson. Microsoft Corporation.
Kiri Wagstaff, Principal Researcher at JPL.
Xin Wang, Senior Scientist at Inome (Intelius).
Dietrich Wettschereck. Consultant, Cologne, Germany.
Pengcheng Wu.
Michael Wynkoop, Qualcomm.
Qing Yao, College of Informatics and Electronics. Zhejiang Sci-Tech University. Hangzhou, China.
Wei Zhang, The Boeing Company.
Wei Zhang. Senior Software Engineer, Google. Doctoral Dissertation (PDF).
Valentina Zubek, Principal Statistician, Boehringer Ingelheim.
Previous Courses and Courseware
CS519/GEO599: Principles of Ecosystem Informatics, 2004-2005.
CS 534, Spring 2005, Machine Learning.
CS430, Fall 2003, Introduction to Artificial Intelligence
CS539, Fall 2003, Seminar: Probabilistic Relational Models
CS 533, Applied Artificial Intelligence for Engineers.
CS 539, Winter 2000, Selected Topics in Artificial Intelligence: Probabilistic Agents
CS 430/530, Fall 1999, Artificial Intelligence Programming Techniques.
CS 519, Fall 1996. Research Methods in Computer Science.
CS 450/550, Winter 1996, Introduction to Computer Graphics.
Machine Learning Resources
ML-News is an email list operated by the IMLS for conference announcements, job positions (including graduate student and postdoc offers), and other items of relevance to the machine learning community. To join, you must have a Google account. Log in to Google and then go to the ML-News main page and click on "Apply to join group". Please include your affiliation in your join request.
How to be a Graduate Student. A great web page with pointers to lots of good resources for graduate students.
Standard Proofreading Symbols that I use when marking corrections on papers.
Computing Research Repository (part of arXiv). I am the curator for machine learning.
Research Index at Penn State. An invaluable resource for finding online articles, citations, etc.
Journal of Machine Learning Research (JMLR). (Free electronic and hardcopy journal.)
Machine Learning. (Expensive hardcopy journal.)
Journal of Artificial Intelligence Research (JAIR). (Free electronic and hardcopy journal.)
The Machine Learning Database Repository at UC Irvine.
The Machine Learning Programs Repository at UC Irvine.
StatLib containing data, algorithms, and other information relevant to statistics.
Knowledge Discovery Mine containing information about knowledge discovery in databases.
CMU Reinforcement learning group.
Bibliographies on Artificial Intelligence
Computing Research Repository (CoRR) CS Preprint Service and Archive (part of arXiv).
The DBLP Computer Science Bibliography.
My Family's Musical Activities
yOya on MySpace and yOya home page. My son Noah writes songs and plays keyboards for this band.
Jubilate: The Women's Choir of Corvallis. My wife Carol sings in this choir.