Monday, September 5, 2011

Introduction to Kindle Forensics

If you haven't figured it out already, this isn't a technical blog. This post is as technical as this blog will get in the foreseeable future. I've been meaning to post this for a while. I worked on this for a few weeks last fall, and I haven't done anything with it since I am moving in a different direction with my thesis. So, I thought I would put this out there for anyone to get a start in Kindle forensics since it's not published anywhere. For more technical forensics of the Kindle, check out Allyn Stott's blog.  

This method uses a jailbreaking process. It may no longer work with the newer Kindle updates. This installs software onto the Kindle and may not be forensically sound, but Cellebrite installs software on some phones in order to retrieve data. I am not familiar with that process (again I am not a technical guy), but if you care to explain it and let me know why it holds up in the courts, please write up a comment.

Future work should include the Kindle mobile app and other eReaders.

I also have a teardown video.

ABSTRACT
The Amazon Kindle is becoming an increasingly popular e-book reader.  This popularity will lead criminals to use the Kindle as an accessory to their crimes.  No publications exist at the time of this writing, but various blogs on the Internet attempt to scratch the surface of Kindle forensics.  For this research a populated Kindle was imaged with FTK and compared to the same Kindle after it was reset to factory defaults to determine evidence recoverability.  Analysis of the image exposed apparent inconsistencies in the naming conventions of items added to the Kindle.  Most of the deleted data was recovered from the reset Kindle image as picture files.  Another technique was used to gain access to the system partitions, which revealed user metadata; however, some statistics were not located due to the limitations of the technique.  For future work, other challenges of Kindle forensics are identified, and recommendations and considerations are given for the digital forensic science community.
Keywords
Amazon Kindle, digital forensics

1.     INTRODUCTION

Jeff Bezos reports that the Kindle is the bestselling, most wished for, and most gifted product on Amazon.com[9].  Bezos is, of course, the founder, president, CEO, and chairman of the board of Amazon.com.  He also reported that Kindle books outsold paper books for the first time on Christmas Day 2009[1].  Bezos says he will not release sales figures because they are a trade secret[13], but most estimates put Kindle sales in the millions[4][5][13][17].  In addition to its sales, 810,000 books are available to purchase, of which 610,000 are less than ten dollars[9].  This does not include newspapers, magazines, blogs, and another 1.8 million out-of-copyright e-books available for download.  Furthermore, a user can download in over 100 countries and territories and synchronize to other Kindle applications for phones and computers using Amazon’s Whispernet through 3G networks[9].  It seems the Kindle is well grounded and will not disappear from the consumer market any time soon.

The Kindle has extended functionality that one may not expect.  A user can use the Kindle to play music, play games, browse the web, and store about three gigabytes of data, and not necessarily e-books.  It supports conversion for .doc, .docx, .txt, .rtf, .html, .htm, .jpeg, .jpg, .gif, .png, .bmp, and .zip[2].  It also has native support for .pdf and can store any other file much like a flash drive.  Currently, the Kindle Development Kit (KDK) is in beta testing[10] to allow users to develop their own active content, potentially games, calendars, or photo galleries.  This will give the Kindle even more functionality in the future blurring the line between an e-reader and a PDA or iPod Touch.  These facts lead the author to believe it is a matter of time before Kindles are a means of criminal activity and become sources of evidence.
This examination of the Kindle is important for investigators who have seized a Kindle and wish to jump-start their analysis of its contents.  Books and other files contained on it can be considered associative evidence, which can give insight into a suspect, victim, or person of interest or can help build a case in conjunction with other evidence.  John Wayne Gacy, who raped and killed thirty-three male teenagers, possessed books in his home such as 21 Abnormal Sex Cases, The American Bi-Centennial Gay Guide, The Great White Swallow, Heads and Tails, Pederasty: Sex Between Men and Boys, The Rights of Gay People, and Tight Teenagers[16].  Justin Barber was convicted in 2006 for the murder of his wife[7].  He had downloaded the song “Used to Love Her” by Guns N’ Roses on the day of the murder[7].  Its lyrics include, “I used to love her, but had to kill her/ She bitched so much/ She drove me nuts/ And now I’m happier this way[15].”  Robert Ressler profiled a suspect in the 1980s who killed two boys in Nebraska.  In his profile Ressler wrote that the killer was likely to have read detective magazines because he cut away bite marks from his victims, showing knowledge of forensic practices[8].  Twenty-four detective magazines were found in the possession of the killer, John Joubert[8].  The Kindle can contain more than 3,500 books[9], hundreds of .mp3 songs, or other files.  It will give the investigator a large picture of an active user.

2.     RESEARCH

For this research, a populated latest-generation Kindle was imaged using FTK Imager 2.9.0.1385 with a Tableau forensic USB bridge.  This image included fifty-nine books, three games, forty-five converted .pdf files, sixteen Kindle screenshots, two audio books, two book samples, one blog subscription, one magazine subscription, and one newspaper subscription.  However, an unpopulated Kindle was not imaged because the research was performed as a result of Kindle use rather than the Kindle being purchased for research.  This research includes the same Kindle set to factory defaults to determine deleted evidence recoverability.  This paper gives a base for Kindle forensics and a general outline of items of interest.  More detailed analysis should be completed in future work.  The author assumes the reader is familiar with legal, crime scene, and other forensic considerations, and chose not to repeat most of the methods outlined in other digital evidence papers.



The file system of the Kindle is FAT32, and the operating system is based on Linux.  As shown in the Kindle image summary in Figure 1, the image is 3130 MB, but the Kindle is known to have 4 GB of storage.  The remaining 682 MB, which is inaccessible in FTK, is system storage consisting of three other partitions as shown in Figure 2.  These partitions were discovered by privilege escalation with the method described in the following section, which is not endorsed by Amazon and may void the warranty.  One caveat of this method is that it directly breaks rules of evidence by writing to the user partition.  The system storage was accessed using the method to determine whether it held valuable information describing the user, but one of the system partitions, mmcblk0p1, could not be fully analyzed.  During analysis, a telnet session was established between a computer and the Kindle to create an image using dd.  However, after a short time the telnet session would report that the connection to the host was lost, seemingly causing the imaging process to cease.  The author assumed the imaging process would continue despite the connection loss because the command wrote the image to the Kindle's own storage.  This was done to eliminate any networking issues and give proof of concept.  The author allowed time for the imaging process to complete, but a full image of mmcblk0p1 was never obtained.



3.     METHOD[6]

•      Downloaded kindle-jailbreak-0.4.N.zip[12]
•      Content was extracted
•      Connected Kindle to computer
•      Copied update_jailbreak_0.4.N_0.4.N_k3g_install.bin to the root directory of the Kindle
•      Ejected the Kindle from the computer
•      On the Kindle:
        o      Selected Menu | Settings | Menu | Update Your Kindle
        o      Selected Ok to confirm update
•      Downloaded kindle-usbnetwork-0.30.N.zip[12]
•      Copied update_usbnetwork_0.30.N_k3g_install.bin to the root directory of the Kindle
•      On the Kindle:
        o      Selected Menu | Settings | Menu | Update Your Kindle
        o      Selected Ok
        o      Typed any letter to open the search box
        o      Deleted the letter
        o      Typed “;debugOn” and pressed the center of the five-way directional pad
        o      Typed “~usbNetwork” and pressed the center of the five-way directional pad
•      Connected the Kindle to the computer
•      Navigated to Computer Management | Device Manager | Network Adapters
•      Right-clicked USB Ethernet/RNDIS Gadget and selected Update Driver Software…
•      Selected Browse my computer for driver software
•      Selected Let me pick from a list of device drivers on my computer
•      Selected Network adapters
•      Unchecked Show compatible hardware
•      Selected Microsoft Corporation as the Manufacturer and Remote NDIS based Internet Sharing Device as the Network Adapter
•      Navigated to Network and Sharing Center
•      Set the IP Address of the new adapter to “192.168.2.1” and the Subnet Mask to “255.255.255.0”
•      Navigated to Programs and Features
•      Selected Turn Windows features on or off
•      Checked Telnet Client
•      Navigated to Start | Run
•      Entered “Telnet 192.168.2.2”
•      In the telnet session, ran dd if=<source> of=<destination>

For this research the source was /dev/mmcblk0p<1,2,3, or 4>, and the destination was /mnt/base-us/mmcblk0p<1,2,3, or 4>.  The author assumed /mnt/base-us/ was the same as /mnt/us/ for the destination image creation.  Files appeared to be the same in both locations when navigating the file structure, so the author arbitrarily chose one to write to the user partition.
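Once an image has been copied off the device, its integrity is usually documented with a hash.  The following is a minimal Python sketch of that step, not part of the method used in this research; the function name hash_image and the example file name are hypothetical.

```python
import hashlib


def hash_image(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Reading in chunks keeps memory use constant even for multi-gigabyte images.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical file name: a partition image copied from the user partition.
    print(hash_image("mmcblk0p2"))
```

Hashing the image again later and comparing the digests shows whether the copy has changed since acquisition.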

4.     RESULTS

Table 1 outlines the file evidence the author sought and found, along with its location.  Some content type naming conventions or file extensions did not hold true for all Kindle files of that type.  For example, one of two sample books downloaded had a .tan extension associated with it, and some personal documents had the word “converted” within the file name even though all of the personal documents were converted through the same process.  Only one type of notice was on the device, notifying the user that documents must be downloaded over WIFI.  Other notices or types of notices are unknown to the author and may appear over time in other conditions.  The SHA-1 hashes in the file names were not successfully reverse engineered.  The relative path, full path, and document title strings were hashed, but resulted in no matches.  Other number associations in the file names were not identified.
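The hash comparison described above can be sketched in a few lines of Python.  This is an illustration of the approach, not the author's original procedure; the function names and the example strings are hypothetical, and UTF-8 encoding of the candidate strings is an assumption.

```python
import hashlib


def candidate_hashes(candidates):
    """Map each candidate string to its SHA-1 hex digest (UTF-8 encoded)."""
    return {text: hashlib.sha1(text.encode("utf-8")).hexdigest() for text in candidates}


def find_match(target_hex, candidates):
    """Return the candidate whose SHA-1 equals target_hex, or None if none match."""
    target_hex = target_hex.lower()
    for text, digest in candidate_hashes(candidates).items():
        if digest == target_hex:
            return text
    return None


if __name__ == "__main__":
    # Hypothetical candidates: relative path, full path, and document title.
    guesses = ["documents/Example.azw", "/mnt/us/documents/Example.azw", "Example"]
    print(find_match("0" * 40, guesses))  # placeholder hash, so no match is expected
```

Other candidates (different encodings, salts, or combinations of fields) can be tested by extending the list of guesses.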

Table 2 outlines the location of various Kindle statistics some of which are found in multiple locations.  Much of this incomplete table is unknown at this point, and some of this information is believed to be stored in the system partitions.  Other statistics can be viewed non-forensically in the Kindle from the 411, 611, and the 711 pages by entering the settings menu and typing Alt+R Alt+Q Alt+Q, Alt+Y Alt+Q Alt+Q or Alt+U Alt+Q Alt+Q, respectively[11].

After the Kindle was populated, imaged, and analyzed, it was set to factory defaults.  After data carving, thousands of artifacts were discovered pointing to what books and documents were once on the Kindle.  However, most of the files were images with numbers as the file name, so these must be viewed one at a time.  No traces of any user-created directories were found.  It should be noted that books cannot be permanently deleted from a user’s account from the Kindle itself, but only through Amazon.com as shown in Figure 3 in the Appendix.  If needed, many items of potential evidentiary value on the user’s Amazon.com account could be obtained through a subpoena.  Some items found in the “Manage Your Kindle” section of the user’s account are shown in the Appendix.
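To show what data carving means in practice, here is a deliberately naive Python sketch that looks for JPEG start and end markers in raw bytes.  It is not the carving performed in this research: JPEG is chosen only as an example format, the function name carve_jpegs is hypothetical, and real carving tools also handle fragmentation, embedded thumbnails, and many other file types.

```python
JPEG_START = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker
JPEG_END = b"\xff\xd9"        # end-of-image marker


def carve_jpegs(data):
    """Return (offset, bytes) for each header-to-footer JPEG candidate in data."""
    found = []
    position = 0
    while True:
        start = data.find(JPEG_START, position)
        if start == -1:
            break
        end = data.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break
        end += len(JPEG_END)
        found.append((start, data[start:end]))
        position = end
    return found


if __name__ == "__main__":
    sample = b"junk" + JPEG_START + b"\xe0abc" + JPEG_END + b"more"
    for offset, blob in carve_jpegs(sample):
        print(offset, len(blob))
```

Each carved candidate would still need to be opened and reviewed, which is why the numbered image files had to be viewed one at a time.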

Table 1: Kindle Files

Content
        Location

Active Content
        Kindle-FAT32\.active-content-data\<SHA-1>
Audio
        Kindle-FAT32\audible\<title>-asin__<Amazon Standard Identification Number>-type_AUDI-v_0.aax_<number>
        Kindle-FAT32\audible\<title>-asin__<Amazon Standard Identification Number>-type_AUDI-v_0.pos
Books Downloaded
        Kindle-FAT32\documents\<title>-asin_<Amazon Standard Identification Number>-type_EBOK-v_0.azw
        Kindle-FAT32\documents\<title>-asin_<Amazon Standard Identification Number>-type_EBOK-v_0.phl
Blogs
        Kindle-FAT32\documents\<title>-asin_<Amazon Standard Identification Number>-type_FEED-v_65746.azw
Magazines
        Kindle-FAT32\documents\<Magazine Title><Date>-asin_<Amazon Standard Identification Number>-type_MAGZ-v_2.azw
Newspapers
        Kindle-FAT32\documents\<Newspaper Title><Date>-asin_<Amazon Standard Identification Number>-type_NWPR-v_6.azw
        Kindle-FAT32\documents\<Newspaper Title><Date>-asin_<Amazon Standard Identification Number>-type_NWPR-v_6.mbp
Notice
        Kindle-FAT32\documents\<title> W-asin_<SHA-1>-<number>-<number>-DEVICE_WIFI-wifi-type_PDOC-v_0.azw
Non-converted PDF
        Kindle-FAT32\documents\<file name>.pdf
Personal Documents
        Kindle-FAT32\documents\<title>-asin_<SHA-1>-<0-8>-azw-type_PDOC-v_0.azw
        Kindle-FAT32\documents\<title>-asin_<SHA-1>-<0-8>-azw-type_PDOC-v_0.mbp
        Kindle-FAT32\documents\<title>-asin_<SHA-1>-<0-8>-converted-azw-type_PDOC-v_0.azw
        Kindle-FAT32\documents\<title>-asin_<SHA-1>-<0-8>-converted-azw-type_PDOC-v_0.mbp
        Kindle-FAT32\documents\<title> azw-asin_<SHA-1>-<0-8>-azw-type_PDOC-v_0.azw
        Kindle-FAT32\documents\<title> azw-asin_<SHA-1>-<0-8>-azw-type_PDOC-v_0.mbp
Sample Books Downloaded
        Kindle-FAT32\documents\<title>-asin_<Amazon Standard Identification Number>-type_EBSP-v_0.azw
        Kindle-FAT32\documents\<title>-asin_<Amazon Standard Identification Number>-type_EBSP-v_0.tan
Screen Saver Pictures
        mmcblk0p1\NONAME-ext3\opt\screen_saver
Screenshots
        Kindle-FAT32\documents\screen_shot-<number>.gif
Thank You Letter
        Kindle-FAT32\documents\Thank You Letter-asin_ThankYouLetter_ ATVPDKIKX0DER_A1VC38T7YXB528-type_PSNL-v_0.azw
        Kindle-FAT32\documents\Thank You Letter-asin_ThankYouLetter_ ATVPDKIKX0DER_A1VC38T7YXB528-type_PSNL-v_0.mbp
User Highlights and Notes
        Kindle-FAT32\documents\My Clippings.txt
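The naming patterns in Table 1 lend themselves to automated triage.  The Python sketch below splits a file name into the fields the table suggests; the regular expression is an assumption inferred from the table rather than a documented Amazon format, and the function name and example file names are hypothetical.

```python
import re

# Type codes and descriptions taken from Table 1.
TYPE_LABELS = {
    "AUDI": "Audio",
    "EBOK": "Books Downloaded",
    "EBSP": "Sample Books Downloaded",
    "FEED": "Blogs",
    "MAGZ": "Magazines",
    "NWPR": "Newspapers",
    "PDOC": "Personal Documents or Notice",
    "PSNL": "Thank You Letter",
}

# Assumed layout: <title>-asin_<identifier>-type_<TYPE>-v_<version>.<extension>
NAME_PATTERN = re.compile(
    r"^(?P<title>.+)-asin_+(?P<identifier>.+)-type_(?P<type>[A-Z]+)"
    r"-v_(?P<version>\d+)\.(?P<extension>.+)$"
)


def parse_kindle_name(file_name):
    """Return the fields of a Kindle content file name, or None if it does not fit."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["label"] = TYPE_LABELS.get(fields["type"], "Unknown")
    return fields


if __name__ == "__main__":
    print(parse_kindle_name("Example Title-asin_B000000000-type_EBOK-v_0.azw"))
    print(parse_kindle_name("My Clippings.txt"))
```

File names that return None, such as screenshots, non-converted PDFs, and My Clippings.txt, would need to be handled by separate rules.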



Table 2: Kindle Statistics

Statistic
        Location

3G/WIFI
        Kindle-FAT32\system\AudibleActivation.sys
        B006xxxxxxxxxxxx = 3G, B008xxxxxxxxxxxx = WIFI only, B00Axxxxxxxxxxxx = 3G Europe[3]
        mmcblk0p2\LocalVars-ext3\java\prefs\com.amazon.ebook.framework\Features
        mmcblk0p2\LocalVars-ext3\wan\info
Book Collections
        Kindle-FAT32\system\collections.json
Bookmarks
        mmcblk0p2\LocalVars-ext3\java\prefs\browser\bookmarks_wv
Browser Cookies
        mmcblk0p2\LocalVars-ext3\browser\cookies
Browser Settings
        mmcblk0p2\LocalVars-ext3\java\prefs\browser\settings_wv
Current Location in Last Book Read
        Kindle-FAT32\system\userannotlog
Device Email Address
        mmcblk0p2\LocalVars-ext3\java\prefs\com.amazon.ebook.reader\social-clipping\social-prefs
        mmcblk0p2\LocalVars-ext3\java\prefs\reginfo
Device Name
        mmcblk0p2\LocalVars-ext3\java\prefs\reginfo
Device Password/Hint
        mmcblk0p2\LocalVars-ext3\java\prefs\DevicePasswork.pw
Device Settings
        mmcblk0p2\LocalVars-ext3\java\prefs\com.amazon.ebook.framework\prefs
Firmware Version
        Kindle-FAT32\Update_<previous version>-<current version>.bin
Keywords Searched by User
        Kindle-FAT32\system\Searched Indexes (no meaningful information was found here, but this should be examined further)
Kindle Time
        Kindle-FAT32\system\com.amazon.ebook.booklet.reader\reader.pref
Last Book Read
        Kindle-FAT32\system\com.amazon.ebook.booklet.reader\reader.pref
Personal Info
        mmcblk0p2\LocalVars-ext3\java\prefs\com.amazon.ebook.booklet.home\com.amazon.ebook.booklet.home.prefs
Registered User
        mmcblk0p2\LocalVars-ext3\java\prefs\reginfo
Serial Number
        Kindle-FAT32\system\AudibleActivation.sys
Time Last Listened to Audio Book
        mmcblk0p2\LocalVars-ext3\java\prefs\audiofilecache
APs Accessed
        Unknown location
IMEI
        Unknown location
IP Address
        Unknown location
MAC Address
        Unknown location
Social Networks
        Unknown location
Web Browsing History
        Unknown location
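The prefixes listed under 3G/WIFI in Table 2 appear to be serial number prefixes, so a recovered serial number can be classified with a simple lookup.  The Python sketch below uses only the three prefixes given in the table; the function name and the example serial numbers are hypothetical, and other prefixes may exist.

```python
# Prefix-to-radio mapping as listed in Table 2; other prefixes are not covered here.
SERIAL_PREFIXES = {
    "B006": "3G",
    "B008": "WIFI only",
    "B00A": "3G Europe",
}


def radio_from_serial(serial):
    """Return the radio type implied by the first four characters of a serial number."""
    prefix = serial.strip().upper()[:4]
    return SERIAL_PREFIXES.get(prefix, "unknown")


if __name__ == "__main__":
    # Made-up serial numbers for illustration only.
    for serial in ("B006000000000000", "B008000000000000", "B00A000000000000"):
        print(serial, radio_from_serial(serial))
```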



5.     CONCLUSIONS

The unknown locations shown in Table 2 present a forensic challenge.  A telnet session was established between a computer and the Kindle using the privilege escalation method.  However, after a short time the telnet session would report that the connection to the host was lost seemingly causing the imaging process to cease.  Further research should be conducted to discover if the connection loss is a result of the programming in the Kindle updates used as described in this paper.  Other methods should be explored to gain root access to the Kindle because this method writes to the user partition and was not designed for forensic acquisition, or it should be determined if root access is needed at all for forensic analysis.  Is the information within the system partitions necessary?  It is the opinion of the author that much of the information found within these system partitions can add critical evidence to an investigation.

Another forensic challenge is the consistency of files.  Further research must be conducted in order to understand all file extensions within the Kindle and fully understand personal document conversion.  Emailing documents to the user’s Kindle email address was used for the personal document conversion process, but it yielded three different naming conventions.  Additionally, converting the same document through the same process produced different visible results on occasion.  This will be problematic if there is a future hash library of known good and bad files.  The conversion process may render alternate results and documents may evade detection by the hash library.  A fuzzy hash algorithm may eliminate this issue.
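The hash library problem can be illustrated with a short Python sketch.  A cryptographic hash changes completely when a document changes slightly, while a similarity measure stays high.  Here difflib.SequenceMatcher from the standard library stands in for a fuzzy hash purely for illustration; a dedicated fuzzy hashing tool such as ssdeep would compare compact signatures instead of the full content.  The sample byte strings are made up.

```python
import hashlib
from difflib import SequenceMatcher


def sha1_hex(data):
    """Return the SHA-1 hex digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def similarity(first, second):
    """Return a 0.0 to 1.0 similarity ratio between two byte strings."""
    return SequenceMatcher(None, first, second, autojunk=False).ratio()


if __name__ == "__main__":
    original = b"Converted personal document, page one of the example text."
    variant = original.replace(b"one", b"two")  # simulate a slightly different conversion
    print(sha1_hex(original) == sha1_hex(variant))  # exact hashes no longer match
    print(round(similarity(original, variant), 2))  # yet the content is nearly identical
```

A known-file library built on exact hashes would miss the variant, whereas a similarity threshold would still flag it for review.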

A future concern that must be researched is the Kindle Development Kit (KDK).  Unknown and undocumented content will enter the market when the KDK is released to the public.  The KDK may also produce active content of evidentiary value, such as calendars and photo galleries, but these items will need to be tested and researched.  Other obfuscation and security issues should be explored, specifically with files appearing as downloaded books and the security of Whispernet.

A final consideration is shielding.  The wireless capabilities can be turned off within the Kindle settings, but a Kindle should be shielded if its wireless state is unknown.  This research examined whether downloaded content could overwrite older content when disk space is full; this proved impossible in the Kindle's factory state, but future active content applications created by malicious users may be able to erase content, perhaps even from a remote location.  Only FTK was used for analysis in this research.  Other tools should be tested with the Kindle.  The challenges are many, but this paper has provided an introduction to the forensics of the Kindle, which the author hopes the reader finds useful.

6.     REFERENCES

[1]     Allen, K. (2009, December 28). Amazon e-book sales overtake print for first time. In guardian. Retrieved December 13, 2010, from http://www.guardian.co.uk/business/2009/dec/28/amazon-ebook-kindle-sales-surge
[2]     Amazon Kindle User’s Guide. Retrieved from http://kindle.s3.amazonaws.com/Kindle_User's_Guide_English.pdf
[3]     Amazon.com Help: Kindle Software Update Latest Generation. (n.d.) Retrieved from http://www.amazon.com/gp/help/customer/display.html/ref=hp_navbox_top_kindlelgi?nodeId=200529700
[4]     Arrington, M. (2010, January 29). 3 Million Amazon Kindles Sold, Apparently. In TechCrunch. Retrieved December 13, 2010, from http://techcrunch.com/2010/01/29/3-million-amazon-kindles-sold-apparently/
[5]     Baig, E. C. (2010, July 29). Amazon unveils 3rd-generation Kindle e-book reader. In USA Today. Retrieved December 13, 2010, from http://www.usatoday.com/tech/news/2010-07-29-amazon29_ST_N.htm
[6]     disi. (2010, October 20). Quick Kindle 3 root shell via USB [Msg 118]. Message posted to http://www.mobileread.com/forums/showthread.php?p=1172506#post1172506
[7]     Dowling, Paul. (Producer). (2011, February 13). Forensic Files [Television broadcast]. United States: truTV.
[8]     Gutzeit, Andreas. (Director). (2009, August 16). The Man Who Lives with Monsters [Television broadcast]. Australia: Crime and Investigation Network.
[9]     Kindle Wireless Reading Device, Wi-Fi, Graphite, 6" Display with New E Ink Pearl Technology. (n.d.) Retrieved from http://www.amazon.com/dp/B002Y27P3M/ref=btech_kindle_wifi
[10]  Kindle Development Kit for Active Content. (n.d.) Retrieved from http://www.amazon.com/kdk/
[11]  lstefek. (2010, September 15). Quick Kindle 3 root shell via USB [Msg 74]. Message posted to http://www.mobileread.com/forums/showpost.php?p=1110675&postcount=74
[12]  NiLuJe. (2010, June 22). Fonts & ScreenSavers Hack for Kindles [Msg 1]. Message posted to http://www.mobileread.com/forums/showthread.php?t=88004
[13]  Ratcliffe, M. (2009, December 26). Updating Kindles sold estimate: 1.49 million. In ZDNet. Retrieved December 13, 2010, from http://www.zdnet.com/blog/ratcliffe/updating-kindles-sold-estimate-149-million/486
[14]  Rose, C. (Interviewer) & Bezos, J. (Interviewee). (2010). Jeff Bezos, Founder & CEO, Amazon.com [Interview transcript]. Retrieved from Charlie Rose Web site: http://www.charlierose.com/view/interview/11138
[15]  Stradlin, Izzy & Hudson, Saul (1988). Used to Love Her [Recorded by Guns N’ Roses]. On G N’ R Lies [CD]. Los Angeles, California: Geffen.
[16]  Sullivan, T., & Maiken, P. T. (1983). Killer Clown: The John Wayne Gacy Murders (p. 33). New York, NY: Pinnacle.
[17]  Wilhelm, A. (2010, July 29). How many Kindles have been sold?. In The Next Web. Retrieved December 13, 2010, from http://thenextweb.com/us/2010/07/29/how-many-kindles-have-been-sold/



7.     APPENDIX









Sunday, July 17, 2011

Digital Forensics Resources

This is a post in response to current discussion on Google+ about where Google+ lies on the usefulness spectrum for digital forensics as well as mediums for information sharing.  Before we talk about how or where to share things we've learned, I want to take a step back and ask where are you learning it from?  Since I've been around this field for a year or so academically, I know I am not aware of all (or even most!) of the resources.  After a quick Google search I didn't see right off-hand what I'd like to do here. The goal here is not to create a comprehensive list, but a high-level list of your favorite sources for all things DFIR.  Where do you find the most valuable information?

Which blogs, papers, conferences, books, Twitter handles, listservs, podcasts, etc. are your favorite?
Where do you go for specific topics (Mac forensics, mobile devices, etc.)?
Do you use the forensics wiki or the custom DF Google search?

Ready?

List three!

Tuesday, June 7, 2011

Big Cell Phone Data

Last time I talked a bit about the "Big Data Problem." Guess who I ran into last weekend at the Indy 500?


We chatted briefly about the issue, which she called the metadata problem. Oh yeah, I'm in the marching band at Purdue. I play this.


She also said I should come work for them. I guess the DHS is now on my "apply to" list. Speaking of work, I gained an internship with Lockheed Martin for the summer. I am awaiting the background check and drug screen results. It should be an interesting summer in network security and forensics.

I got an email from my sister. Her blog is mostly about organizational projects. She had kind words about the blog, but said I should think about shortening them since they seemed like papers rather than short essays. Well, they are papers really, but I agree with her. So, I won't hinder the topic at hand any further.

I did some rough testing a few years ago comparing two forensic tools - WinMoFo and Device Seizure. I acquired data from four Windows Mobile phones with varying amounts of data and use, and I exported the reports into Microsoft Excel. The results are shown in Figure 1.



It would be very difficult for an investigator to interpret or find useful information on the fourth cell phone, which produced over 12,000 rows in Device Seizure. It would be difficult even with the more than 4,000 rows in WinMoFo.

Even simple data reduction and mining techniques have yet to become ubiquitous. It seems WinMoFo uses some data reduction techniques. It allows the user to choose what information s/he wants, such as text messages, call logs, and contacts. However, at the time of the project, Device Seizure allowed no data reduction, and the user was forced to complete a logical or physical image of the device.  For the project WinMoFo also captured all system files for a fair comparison. I don't know the current capabilities of Device Seizure or WinMoFo.

Rows generated correlate with acquisition time. As shown in Figure 2, WinMoFo has a great advantage over Device Seizure as far as time is concerned.


Earlier this year I decided to take a look at my phone with a couple of tools, Cellebrite and DataPilot. Each provided seemingly accurate results. I choose the word seemingly because of the length of the reports. I converted each report to a PDF document, which revealed that the Cellebrite report was 317 pages and the DataPilot report was 580 pages. Maybe this is fine for an in-depth analysis, but what about quick initial results? How is an investigator supposed to go through this information manually when I cannot (don't want to, really) verify my own phone report? There is too much information provided. My phone holds, and the reports listed, one thousand contacts and three thousand text messages among other information, but what law enforcement entity will read each of those? How does an investigator determine who is important to the phone user or what evidence is important to the crime? A keyword search is mostly what is available at this time. DataPilot does provide some frequency analysis in its svProbe module, but other items can be taken into consideration when determining what is important to the investigation, such as the number of times in a row someone called or texted the phone user, response time to missed calls or text messages, number of words in a text message, word length in text messages, number of contacts for one person, and synchronization to social networks. The potential for data mining the information within cell phones is far reaching and does not seem to be exploited at this time.
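To make the idea concrete, here is a minimal Python sketch of the kind of frequency analysis I have in mind: it ranks contacts by how many messages they exchanged with the phone user and reports the average number of words per message. The record format (a contact name paired with the message text) and the sample data are made up for illustration; a real report export would need its own parsing first.

```python
from collections import Counter, defaultdict


def rank_contacts(messages):
    """Rank contacts by message count.

    messages is an iterable of (contact, text) pairs.
    Returns a list of (contact, message_count, average_words) tuples,
    most active contact first.
    """
    counts = Counter()
    words = defaultdict(int)
    for contact, text in messages:
        counts[contact] += 1
        words[contact] += len(text.split())
    return [(contact, total, words[contact] / total)
            for contact, total in counts.most_common()]


if __name__ == "__main__":
    sample = [
        ("Alice", "hi there"),
        ("Bob", "ok"),
        ("Alice", "see you at noon"),
    ]
    for contact, total, average in rank_contacts(sample):
        print(contact, total, average)
```

Even a ranking this simple would tell an investigator where to start reading in a 580-page report.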

Another student and I are working on an analysis program that aims to take advantage of many aspects of the data. I hope to share more about it later and maybe even have some of you test it.

Tuesday, May 17, 2011

We have a Problem Part II

Let me apologize first for using different scales on my first four graphs in my last post. It's misleading, and my brother caught it. Thanks, bro.

Additionally, I failed to mention the second half of the problem in the last post. I think it's much more important than the first. Happy reading!

As discussed in my previous post, cybercrime is rapidly increasing, whether the computer is the target or a tool. Investigators have collected this evidence as they have become more educated on the importance of it. However, the criminals have an enormous amount of data, and law enforcement collects all of it from each of them. Simson Garfinkel says it this way, “[t]oday’s forensic examiners have become the victims of their own success. Digital storage devices such as hard drives and flash memory are such valuable sources of information that they are now routinely seized in many investigations."

It is difficult to examine and analyze so much evidence. It has been called the “Big Data Problem” by the Secretary of Homeland Security, Janet Napolitano. Because there is so much data, it is difficult to determine what is relevant to the investigation. It is harder to gain intelligence from the large datasets. She defines intelligence in a speech at the Massachusetts Institute of Technology. “…intelligence is not just a matter of having information — it is also about what one does with that information, and how one figures out what it really means." She goes on to say, “[Intelligence] is about discerning meaning and information from millions – billions – of data points. And when it comes to our security, this is one of our nation's most pressing science and engineering challenges.”

This certainly applies to digital investigations, and Napolitano affirms this connection with her statement, “Many of you probably deal with a version of this in your own work: your research brings in reams of data, but what is essential is the ability to glean insight, and discern patterns and trends from a mass of information.” Simson Garfinkel explicitly relates the big data problem to digital forensics. He writes, “The growing size of storage devices means that there is frequently insufficient time to create a forensic image of a subject device, or to process all of the data once it is found.” This problem is developing. In the past drives were much smaller and the amount of evidence collected was lower, but “the vast size of today’s storage devices means that time honored and court-approved techniques for conducting investigations are becoming slower and more expensive.”

Nance, Hay, and Bishop reiterate, "It is common for digital forensic investigations to be overwhelmed with massive volumes of data. Increasing numbers of devices hold potentially relevant information, and the data storage capacity on such devices is expanding rapidly. It is easy to find examples of digital media players with 160GB hard drives, inexpensive digital cameras that can store 8GB or more, cell phones that have 16GB of flash storage, inexpensive 8GB USB memory sticks, and consumer-grade terabyte hard disks costing no more than a few hundred dollars."  Napolitano concludes, “Very quickly, you can see that "Big Data" – more so than the lack of data – becomes the most pressing problem.”


[Mislan, Casey, & Kessler, 2010]:
This problem can result in negative consequences. On the surface, a casual observer can note that “many [digital forensic laboratories] have substantial backlogs.” When one analyzes the consequences further, it becomes clearer that “delays in processing evidence will inevitably slow down the criminal justice system, giving offenders time to commit additional crimes and causing immeasurable damage to falsely accused individuals.” The large amount of data and backlogs, “hinder investigations and negatively impact public safety.” Furthermore, “[i]n a military context, delays in extracting intelligence…can negatively impact troop and civilian safety as well as the overall mission.”

The vast amount of data collected in investigations makes it difficult to determine what is important to the case. The National Institute of Justice wrote in its report,

"Advances in technology will soon provide all Americans with access to a powerful, high-capacity network that will transport all their communications (including voice and video), deliver entertainment, allow access to information, and permit storage of large quantities of information most anywhere."

Now, ten years after the report was published, it can be argued the statements are reality. The report continues, “In such an environment, finding important evidence can be nearly impossible. Separating valuable information from irrelevant information, for either communications or stored data, requires extraordinary technical efforts. Determining the location where evidence is stored is also quite difficult.” A report on the High Cost of Not Finding Information says, “…[T]echnologies have improved access to information, but they have also created an information deluge that makes any one piece of information more difficult to find.” Napolitano also recognizes the issue. She continues in her speech,

"We therefore cannot overstate the need for software engineers and information systems designers. We need communications and data security experts. And we need this kind of talent working together to find new and faster ways to identify and separate relevant data."

Rao describes how people engage this problem:

"Workers usually don’t waste their time fighting systems or tasks that don’t pay off. People almost always move on when they can’t find useful information quickly. And they are unlikely to grind away at digesting poorly organized or apparently featureless piles of documents. Thus, organizations don’t draw on reservoirs of information that could influence a particular decision, task, or project. Ultimately, this leads to uninformed decisions, overlooked risks, and lost opportunities."

Beebe and Clark explain what this means for digital forensics:

"Digital investigations are also hindered by the limited processing capabilities of human analysts. As data sets increase in size, the amount of data required for examination and analysis increases. This obviates the digital investigators ability to meticulously review all keyword search ‘hits,’ files by file type, or all applicable system logs."

The International Data Corporation estimates that companies lose millions of dollars each year because they cannot find information when searching for it. What does this mean for the digital forensics community? It may mean that not all significant evidence is found within a case. Not all evidence necessarily needs to be found, but clues that point to additional crimes should be discovered, and the investigator is unable to accomplish this alone.

How should this problem be addressed? The budget of the United States will not allow for more personnel now or in the foreseeable future, especially for a complex problem lawmakers do not understand. As a result, laws have a very difficult time keeping pace with offenders, as shown in Figure 2.


To keep up with crime under the same budgetary constraints, efficiency must be increased, and there is plenty of room for it. Dedicating personnel to cyber investigations would certainly increase efficiency through specialization, but, again, budgetary limits will not allow for this change. One idea that may need exploring is collaboration with universities and students. According to the HTCIA, this was the least-tapped resource among collaborators, as shown in Figure 1. Students could be trained to give investigators much-needed help. The experience alone may be enough to attract students to an internship without pay. (That's my story!)
Figure 1
Another method is to improve the equipment used in triage or on scene. Often evidence is taken back to the lab or even sent to other labs for examination. Improved triage can reduce the amount of evidence sent to other agencies, which in turn reduces the backlog of cases that currently haunts the digital investigation process. Triage differs from analysis at a lab: with triage, obvious evidence is collected quickly to expedite the investigation, whereas analysis at the lab is much more in-depth and detail-oriented. Gathering and interpreting data quickly can give an advantage to law enforcement, because the likelihood of catching criminals decreases as time passes. If a triage tool can investigate data quickly on scene, this will help catch the perpetrator and reduce the number of open cases in an agency or department. This will lead to reduced backlog and, of course, overall increased efficiency.

In my opinion, the area with the most room to increase efficiency is digital forensics tool design. The digital forensics era is very young. As such, there are problems in many areas of the field, as shown in Figure 2 and Figure 3, but tool design may be the area whose improvement would benefit the law enforcement community the most.

Figure 2 and Figure 3
You may ask why. You may be thinking, "Tools are listed as the fifth concern in the survey."  It may be argued that the categories of education/training/certification, technology, and theory/research can be rolled into the funding category. Tools may not necessarily belong under funding because private companies often develop them and cost is not a prohibitive factor. All of a sudden, tools become a greater need. It appears that, in the context of this survey, tools are lacking for a specific purpose or are not advanced enough in their development to give the investigator the information he/she needs. However, this is contrary to the HTCIA survey, in which respondents rated the adequacy of their software and hardware tools. The results for investigation equipment and forensic equipment were tabulated on a scale of one to ten, as shown in Figure 4 and Figure 5. The results lean towards the ten side of the scale, but it is also apparent that many respondents are not satisfied with available equipment, particularly the investigation equipment.
Figure 4 and Figure 5
How do we address the equipment crisis? Current tools are evidence-oriented. The tools help the investigators look for evidence rather than help investigate. They often produce a large output that is difficult to understand and from which it is difficult to determine trends or other statistical data. The tools themselves provide little to no link analysis or automatic investigation. The tools are not working for the investigator, but are only working with the investigator. The way developers think about the equipment the investigator uses must change. Tools must give the user context and information rather than reports containing pages upon pages of data. They must accentuate common and important information. Investigators are trying to interpret these complicated reports to advance an investigation. Computers can correlate information much faster than any human can and give insight into the totality of evidence.
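To make the idea of link analysis concrete, here is a minimal sketch of correlating evidence items by the email addresses they have in common. Everything in it is an assumption made for illustration: the function name `link_items`, the simplified `EMAIL` pattern, and the idea that each evidence item has already been reduced to plain text. It is not how any existing forensic tool works.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified email pattern, used here for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def link_items(items):
    """Map each pair of evidence items to the email addresses they share.

    `items` is a hypothetical dict of {item name: extracted text}.
    """
    # First pass: record which items mention each address.
    seen_in = defaultdict(set)
    for name, text in items.items():
        for address in EMAIL.findall(text):
            seen_in[address.lower()].add(name)

    # Second pass: every address seen in two or more items links those items.
    links = defaultdict(set)
    for address, names in seen_in.items():
        for pair in combinations(sorted(names), 2):
            links[pair].add(address)
    return dict(links)


if __name__ == "__main__":
    evidence = {
        "phone.txt": "msg from bob@example.com",
        "laptop.txt": "To: Bob@Example.com and alice@example.org",
        "kindle.txt": "no addresses here",
    }
    for pair, shared in link_items(evidence).items():
        print(pair, sorted(shared))
```

The investigator gets a short list of connections between devices instead of three separate reports to read side by side. The same approach would apply to phone numbers, file hashes, or usernames.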

Beebe and Clark describe the current methods for investigation:

"Currently, digital investigation processes and tools underutilize computer processing power through continued reliance on simplistic data reduction and mining algorithms. In the past, when human labor was cheap and computers were expensive, the analytical burden was shifted to analysts. For quite some time, the roles have been reversed, yet the digital forensics field has continued to levy the preponderance of its analytical burden on the human analyst."

The tools should use data reduction, consolidation, and summarization techniques to further limit the amount of information the investigator receives in a generated report. Statistical methods can accomplish this by reporting statistically significant information rather than all information. Given the current increase in computer crime, we should use computers to our advantage and reduce the load on the investigator. Automatic searching and extraction of information, followed by data mining, may be part of the solution to the problem.
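As a rough sketch of what reporting only the statistically notable information could look like, the snippet below collapses raw keyword hits into per-file counts and reports only the files whose count sits well above the average. The function name `summarize_hits`, the `(file, keyword)` input format, and the z-score cut-off of 1.5 are all assumptions chosen for illustration; a real tool would need a carefully justified statistic.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_hits(hits, threshold=1.5):
    """Collapse raw (file, keyword) hits into per-file counts and flag outliers.

    Returns (file, hit count, z-score) rows for files whose hit count is at
    least `threshold` standard deviations above the mean, largest first.
    """
    counts = Counter(path for path, _keyword in hits)
    if not counts:
        return []

    mu = mean(counts.values())
    sigma = pstdev(counts.values())
    if sigma == 0:
        return []  # every file has the same count, so nothing stands out

    flagged = [
        (path, n, round((n - mu) / sigma, 2))
        for path, n in counts.items()
        if (n - mu) / sigma >= threshold
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    raw_hits = [("a.txt", "transfer")] * 10 + [
        ("b.txt", "transfer"),
        ("c.txt", "transfer"),
        ("d.txt", "transfer"),
        ("e.txt", "transfer"),
    ]
    for row in summarize_hits(raw_hits):
        print(row)
```

Fourteen raw hits become a single line pointing at the one file that deserves attention first. The full hit list is still available if the investigator wants to drill down.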

Rao describes the benefits of automated extraction of information:

"Beyond lifting the burden from personnel, automatic extraction generally lets these organizations better direct their attention. Analysts can move beyond the conventional search to look for specific facts or types of occurrences and contextualize these results against the background of the content collection. With extracted information stored as structured databases, analysts can explore facts and relationships directly to look for significant patterns, trends, or anomalies."

Beebe and Clark define data mining and how it applies to digital forensics:

"Data mining embodies a multi-disciplinary approach to finding and retrieving information, and relies on several reference disciplines that enjoy long, rich research streams, including mathematics, statistics, computer science, and information science…Data mining techniques are specifically designed for large data sets –attempting to find and retrieve data and otherwise hidden information amongst voluminous amounts of data. The data may or may not be structured, noisy or from the same source. In digital forensics, data sources are both structured and unstructured; noisy and not noisy; and from both homogeneous and heterogeneous sources-particularly in large data set cases…Content retrieval has clear and extensive applicability to digital investigations, such as mining large data sets for text documents containing specific content or involving particular individuals, or mining large data sets for contraband graphic images (e.g., child pornography, counterfeit currency). Taking a closer look at the former example, the goal of text (information) retrieval is usually to compare documents, rank importance or relevance of documents, or find patterns/trends across multiple documents. each of these goals is extensible to digital investigations –particularly the latter two. Ranking the importance or relevance of documents relative to investigative objectives, criminal allegations, or target content facilitates data extraction during the Data Analysis Phase and minimizes, as well as prioritizes, the ‘hits’ an investigator or analyst has to review. This is critical when dealing with large data sets. Finding patterns and trends across multiple documents assists an investigator in profiling users and uncovering evidence for which exact keywords are unknown."

They claim three benefits of data mining in digital investigations: “(i) reduced system and human processing time associated with data analysis; (ii) improved analytical effectiveness and information quality; and (iii) reduced monetary costs associated with digital investigations.” 
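The relevance ranking Beebe and Clark describe can be illustrated with a small TF-IDF scorer. This is only a sketch of one common text-retrieval technique, not the method from their paper. The function names, the smoothed IDF formula, and the sample documents are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(documents, query):
    """Return (name, score) pairs sorted by TF-IDF relevance to `query`.

    `documents` is a hypothetical dict of {document name: text}.
    """
    tokens = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(tokens)
    query_terms = set(tokenize(query))

    scores = {}
    for name, counts in tokens.items():
        total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in tokens.values() if term in c)
            if df == 0:
                continue
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
            score += (counts[term] / total) * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "memo.txt": "wire transfer to offshore account, wire transfer confirmed",
        "recipe.txt": "add flour and sugar then bake",
        "note.txt": "call me about the transfer",
    }
    for name, score in rank_documents(docs, "wire transfer"):
        print(f"{name}: {score:.3f}")
```

Sorting by score lets the analyst review the most relevant documents first rather than reading every keyword hit in the order the tool found it.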

Carrier and Spafford write about the process for finding evidence without automation:

"The target object is frequently virtual and defined only in the mind of the investigator. Some tools require a partial digital representation of the target object and the user must enter the target’s characteristics. For example, a user types in the keywords for a keyword search and therefore defines a digital representation of the target object."

[Mislan, Casey, & Kessler, 2010]:
This can be compared to the “thumb/scroll through” triage method of cell phones “when no automated forensic tool works…The primary shortcoming of this approach from a forensic perspective [is] the lack of consistency.” Automated tools can provide consistent results because of the use of repeated procedures in the programming.

Efficiency and contextual improvements are recommended to address the cybercrime aspect of cyber conflict. Automation of forensic tools for searching and information extraction may be an invaluable method in the future. This idea of data mining can solve the big data problem currently faced and bring much-needed help to the digital forensics community. If the field continues on its current course, investigators will find themselves in the midst of a digital forensics crisis in the near future. The criminal world will realize this and use it to their advantage, exacerbating the problem. The growth of technology will continue, and technology-assisted and technology-targeted crime will increase. Developers must make the proper response to empower law enforcement.

I'll share a couple of examples of big data next time.

Feel free to comment, criticize, and suggest!