MIT CSAIL Research Abstracts

Research Abstracts - 2007

Automated Computer Forensics

Simson L. Garfinkel

Today's computer forensic tools provide advanced visualization and search capabilities for trained forensic investigators. These tools do not scale to the massive amount of data generated by ongoing intelligence operations, nor are they usable by "mere mortal" computer users.

Developing automated tools is difficult because of the massive amounts of data necessary to develop and validate algorithms, and the difficulty of keeping up with the ever-expanding variety of file types and formats.

To make progress, we are engaged in a two-pronged research project. The first prong is designed to "prime the pump" by creating forensic corpora that can be used by current researchers with few, if any, restrictions. The second prong will pursue targeted developments in forensic file formats, knowledge representation, inference techniques, and the presentation of forensic results.

Corpora Creation

Computer forensics research has been severely hobbled by the lack of realistic data on which to develop and validate new techniques. Real data that is used by forensic tools is highly sensitive by its very nature. Researchers need large quantities of email messages, word processing files, disk images, and network traffic to build and test their tools. But in many cases this data simply cannot be collected due to privacy concerns.

In the absence of such corpora, researchers have made do with data primarily generated by self-experimentation. Disk forensic tools are developed using a few file systems from the developer's own computer system. Network forensic systems are based on packets monitored from the developer's own Internet connection. Documents are based on the range of Microsoft Word and Adobe Acrobat files that can be found with Google and freely downloaded over the Internet. Because these collections are not made available due to privacy concerns, researchers at different organizations must waste time and money amassing their own low-quality corpora. And because those corpora are not standardized, it is very difficult to compare published research.

We are investigating four approaches for creating data sets that are sound, exploitable, interesting and deep:

We are legally acquiring material on the secondary market that can be used for research. When computer equipment is sold, privacy rights in the data that the equipment includes are legally forfeit. We have collected more than a thousand hard drives since 1998 and hope to increase our collection to 3,000 before the end of 2008.[3]
We are developing anonymization tools to create anonymized corpora from actual data.
We are investigating the possibility of hiring consenting adults to have their computers monitored, and thereby create corpora. Such data would need to be scrubbed to eliminate private data from non-consenting parties.
We hope to develop improved models to create simulated data.
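
The anonymization approach in the list above can be illustrated with a minimal sketch. This is not the project's tool: it assumes that email addresses are the only identifiers to scrub, and it replaces each one with a keyed-hash (HMAC-SHA256) pseudonym so that the same address always maps to the same replacement, preserving the correlations a researcher would want to study. The names `pseudonymize` and `anonymize_text`, the regular expression, and the key are all hypothetical.

```python
import hashlib
import hmac
import re

# Simplified pattern for email addresses (an assumption; real addresses vary more).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(address: str, key: bytes) -> str:
    """Map an address to a stable pseudonym using a keyed hash."""
    digest = hmac.new(key, address.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user-{digest[:12]}@example.invalid"


def anonymize_text(text: str, key: bytes) -> str:
    """Replace every email address found in text with its pseudonym."""
    return EMAIL_RE.sub(lambda m: pseudonymize(m.group(0), key), text)


if __name__ == "__main__":
    key = b"corpus-secret-key"  # hypothetical key; it must be kept private
    sample = "From: alice@example.com To: bob@example.org Cc: Alice@Example.com"
    print(anonymize_text(sample, key))
```

Because the hash is keyed, someone without the key cannot confirm a guessed address by hashing it themselves. A real anonymizer would also have to handle names, phone numbers, and identifiers embedded in binary formats, which this sketch ignores.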

Targeted Forensic Research Development

Using the hard drive corpus, we are developing a series of technologies and techniques to enable automated forensic processes. These include:

The Advanced Forensic Format (AFF), a new file format for archiving disk images and associated metadata.[2]
Cross-Drive Analysis, a technique for automatically identifying and correlating pseudo-unique information across forensically relevant media. We have developed algorithms based on CDA that can automatically determine the owner of a hard drive, and automatically identify multiple drives in a collection that were used by the same organization.[1]
Carving Fragmented Files with Object Validation, a technique for reassembling files that have been split into multiple pieces in a disk image without reference to the file system metadata.
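
The idea behind Cross-Drive Analysis, as described in the list above, can be sketched in a few lines. The sketch assumes that email addresses serve as the pseudo-unique features, that each drive image fits in memory as a byte string, and that the most frequent address on a drive is a reasonable guess at its owner. These are illustrative assumptions, and the function names `extract_features`, `likely_owner`, and `correlate` are hypothetical rather than taken from the published tools.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Simplified byte pattern for email addresses (an assumption for this sketch).
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_features(image: bytes) -> Counter:
    """Count the email addresses found in the raw bytes of a disk image."""
    return Counter(m.lower() for m in EMAIL_RE.findall(image))


def likely_owner(features: Counter):
    """Guess the primary user as the most frequent address (a heuristic)."""
    return features.most_common(1)[0][0] if features else None


def correlate(drives: dict, max_drives: int = 2) -> Counter:
    """Score each pair of drives by the number of features they share.

    Features seen on more than max_drives drives are skipped, on the
    assumption that they are too common to be pseudo-unique.
    """
    index = defaultdict(set)  # feature -> names of drives containing it
    for name, features in drives.items():
        for feature in features:
            index[feature].add(name)

    scores = Counter()
    for names in index.values():
        if 2 <= len(names) <= max_drives:
            for pair in combinations(sorted(names), 2):
                scores[pair] += 1
    return scores


if __name__ == "__main__":
    images = {
        "drive1": b"alice@corp.example ... alice@corp.example bob@corp.example",
        "drive2": b"bob@corp.example carol@corp.example",
        "drive3": b"zed@other.example",
    }
    drives = {name: extract_features(data) for name, data in images.items()}
    print(likely_owner(drives["drive1"]))
    print(correlate(drives).most_common())
```

The inverted index keeps the cost proportional to the number of extracted features instead of comparing every pair of drives byte by byte, which matters once a collection reaches thousands of drives.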

References:

[1] Garfinkel, S., "Forensic Feature Extraction and Cross-Drive Analysis," The 6th Annual Digital Forensic Research Workshop, Lafayette, Indiana, August 14-16, 2006.

[2] Garfinkel, S., Malan, D., Dubec, K., Stevens, C., Pham, C., "Disk Imaging with the Advanced Forensics Format, Library and Tools," The Second Annual IFIP WG 11.9 International Conference on Digital Forensics, National Center for Forensic Science, Orlando, Florida, USA, January 29 - February 1, 2006.

[3] Garfinkel, S. and Shelat, A., "Remembrance of Data Passed: A Study of Disk Sanitization Practices," IEEE Security and Privacy, January/February 2003.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu