Tuesday, May 17, 2011

We have a Problem Part II

Let me apologize first for using different scales on my first four graphs in my last post. It's misleading, and my brother caught it. Thanks, bro.

Additionally, I failed to mention the second half of the problem in the last post. I think it's much more important than the first. Happy reading!

As discussed in my previous post, cybercrime is rapidly increasing, whether the computer is the target or merely a tool. As investigators have become more educated on the importance of digital evidence, they have collected more of it. But criminals generate an enormous amount of data, and law enforcement collects all of it from each of them. Simson Garfinkel puts it this way: “[t]oday’s forensic examiners have become the victims of their own success. Digital storage devices such as hard drives and flash memory are such valuable sources of information that they are now routinely seized in many investigations."

It is difficult to examine and analyze so much evidence. The Secretary of Homeland Security, Janet Napolitano, has called this the “Big Data Problem.” Because there is so much data, it is difficult to determine what is relevant to an investigation, and harder still to gain intelligence from the large datasets. She defined intelligence in a speech at the Massachusetts Institute of Technology: “…intelligence is not just a matter of having information — it is also about what one does with that information, and how one figures out what it really means." She goes on to say, “[Intelligence] is about discerning meaning and information from millions – billions – of data points. And when it comes to our security, this is one of our nation's most pressing science and engineering challenges.”

This certainly applies to digital investigations, and Napolitano affirms the connection: “Many of you probably deal with a version of this in your own work: your research brings in reams of data, but what is essential is the ability to glean insight, and discern patterns and trends from a mass of information.” Simson Garfinkel explicitly relates the big data problem to digital forensics: “The growing size of storage devices means that there is frequently insufficient time to create a forensic image of a subject device, or to process all of the data once it is found.” This problem is growing. In the past, drives were much smaller and less evidence was collected, but “the vast size of today’s storage devices means that time honored and court-approved techniques for conducting investigations are becoming slower and more expensive.”

Nance, Hay, and Bishop reiterate, "It is common for digital forensic investigations to be overwhelmed with massive volumes of data. Increasing numbers of devices hold potentially relevant information, and the data storage capacity on such devices is expanding rapidly. It is easy to find examples of digital media players with 160GB hard drives, inexpensive digital cameras that can store 8GB or more, cell phones that have 16GB of flash storage, inexpensive 8GB USB memory sticks, and consumer-grade terabyte hard disks costing no more than a few hundred dollars."  Napolitano concludes, “Very quickly, you can see that "Big Data" – more so than the lack of data – becomes the most pressing problem.”

This problem can result in negative consequences. On the surface, a casual observer can note that “many [digital forensic laboratories] have substantial backlogs” (Mislan, Casey, & Kessler, 2010). When one analyzes the consequences further, it becomes clearer that “delays in processing evidence will inevitably slow down the criminal justice system, giving offenders time to commit additional crimes and causing immeasurable damage to falsely accused individuals.” The large amount of data and the backlogs “hinder investigations and negatively impact public safety.” Furthermore, “[i]n a military context, delays in extracting intelligence…can negatively impact troop and civilian safety as well as the overall mission.”

The vast amount of data collected in investigations makes it difficult to determine what is important to the case. The National Institute of Justice wrote in its report,

"Advances in technology will soon provide all Americans with access to a powerful, high-capacity network that will transport all their communications (including voice and video), deliver entertainment, allow access to information, and permit storage of large quantities of information most anywhere."

Now, ten years after the report was published, it can be argued that these statements are reality. The report continues, “In such an environment, finding important evidence can be nearly impossible. Separating valuable information from irrelevant information, for either communications or stored data, requires extraordinary technical efforts. Determining the location where evidence is stored is also quite difficult.” A report on the High Cost of Not Finding Information says, “…[T]echnologies have improved access to information, but they have also created an information deluge that makes any one piece of information more difficult to find.” Napolitano also recognizes the issue. She continues in her speech,

"We therefore cannot overstate the need for software engineers and information systems designers. We need communications and data security experts. And we need this kind of talent working together to find new and faster ways to identify and separate relevant data."

Rao describes how people engage this problem:

"Workers usually don’t waste their time fighting systems or tasks that don’t pay off. People almost always move on when they can’t find useful information quickly. And they are unlikely to grind away at digesting poorly organized or apparently featureless piles of documents. Thus, organizations don’t draw on reservoirs of information that could influence a particular decision, task, or project. Ultimately, this leads to uninformed decisions, overlooked risks, and lost opportunities."

Beebe and Clark explain what this means for digital forensics:

"Digital investigations are also hindered by the limited processing capabilities of human analysts. As data sets increase in size, the amount of data required for examination and analysis increases. This obviates the digital investigator's ability to meticulously review all keyword search ‘hits,’ files by file type, or all applicable system logs."

The International Data Corporation estimates that companies lose millions of dollars each year because they cannot find information when searching for it. What does this mean for the digital forensics community? It suggests that not all significant evidence is being found within a case. Not every piece of evidence necessarily needs to be found, but clues that point to additional crimes should be discovered, and the investigator is unable to accomplish this alone.

How should this problem be addressed? The budget of the United States will not allow for more personnel now or in the foreseeable future, especially for a complex problem that lawmakers do not understand. As a result, the law has a very difficult task keeping pace with offenders, as shown in Figure 2.

To keep up with the crime under the same budgetary constraints, efficiency must be increased – and there is plenty of room for it. Dedicating personnel to cyber investigations would certainly increase efficiency through specialization, but, again, budgetary limits will not allow for this change. One idea worth exploring is collaboration with universities and students. According to the HTCIA, this resource was the least tapped among potential collaborators, as shown in Figure 1. Students could be trained to give investigators much-needed help, and the experience alone may be enough to attract students to an unpaid internship (That's my story!)
Figure 1
Another method is to improve the equipment used in triage or on scene. Often evidence is taken back to the lab, or even sent to other labs, for examination. Improved triage can reduce the amount of evidence sent to other agencies, which in turn reduces the backlog of cases that currently haunts the digital investigation process. The triage process differs from analysis at a lab: in triage, obvious evidence is collected quickly to expedite the investigation, whereas lab analysis is much more in-depth and detail-oriented. Gathering and interpreting data quickly gives law enforcement an advantage, because the likelihood of catching criminals decreases as time passes. If a triage tool can examine data quickly on scene, it will help catch the perpetrator and reduce the number of open cases in an agency or department, leading to reduced backlog and, of course, increased overall efficiency.
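To make the idea concrete, a quick on-scene pass might flag only recently modified files and matches against a set of known-bad hashes, deferring everything else to the lab. The sketch below is a toy, not a forensic tool: the `quick_triage` function and its hash set are hypothetical, and a real triage tool would work from a write-blocked image rather than walking a live filesystem.

```python
import hashlib
import os
import time

def quick_triage(root, known_bad_hashes, recent_days=30):
    """Fast on-scene pass: flag recently modified files and
    known-bad hash matches; defer everything else to the lab."""
    cutoff = time.time() - recent_days * 86400
    hits = {"recent": [], "known_bad": []}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat = os.stat(path)
                if stat.st_mtime >= cutoff:
                    hits["recent"].append(path)
                # Hash only small files so the pass stays quick.
                if stat.st_size < 10 * 1024 * 1024:
                    with open(path, "rb") as f:
                        digest = hashlib.sha256(f.read()).hexdigest()
                    if digest in known_bad_hashes:
                        hits["known_bad"].append(path)
            except OSError:
                continue  # unreadable entry; note it and move on
    return hits
```

The design choice is speed over completeness: the pass skips large files and never parses file contents, which is exactly the trade-off that separates triage from lab analysis.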

In my opinion, the greatest opportunity to increase efficiency is digital forensics tool design. The digital forensics era is very young. As such, there are problems in many areas of the field, as shown in Figure 2 and Figure 3, but tool design may be the area whose improvement would benefit the law enforcement community the most.

Figure 2 and Figure 3
You may ask why. You may be thinking, "Tools are listed as the fifth concern in the survey."  It can be argued that the categories of education/training/certification, technology, and theory/research all roll up into the funding category. Tools, however, do not fold neatly into funding, because private companies often develop them and cost is not the prohibitive factor. Viewed this way, tools suddenly become a greater need. It appears that, in the context of this survey, tools are either lacking for a specific purpose or are not developed far enough to give investigators the information they need. This is somewhat contrary to the HTCIA survey, where respondents rated the adequacy of their software and hardware tools. The results were tabulated on a scale of one to ten for investigation equipment and forensic equipment, as shown in Figure 4 and Figure 5. The results lean toward the ten side of the scale, but it is also apparent that many respondents are not satisfied with available equipment, particularly the investigation equipment.
Figure 4 and Figure 5
How do we address the equipment crisis? Current tools are evidence-oriented: they help investigators look for evidence rather than help them investigate. They often produce a large output that is difficult to understand and from which trends or other statistical patterns are hard to discern. The tools themselves provide little to no link analysis or automatic investigation. The tools are not working for the investigator; they are only working with the investigator. The way developers think about the investigator's equipment must change. Tools must give the user context and information rather than reports containing pages upon pages of data. They must accentuate common and important information. Today, investigators must interpret these complicated reports themselves to advance an investigation, yet computers can correlate information much faster than any human and give insight into the totality of the evidence.

Beebe and Clark describe the current methods for investigation:

"Currently, digital investigation processes and tools underutilize computer processing power through continued reliance on simplistic data reduction and mining algorithms. In the past, when human labor was cheap and computers were expensive, the analytical burden was shifted to analysts. For quite some time, the roles have been reversed, yet the digital forensics field has continued to levy the preponderance of its analytical burden on the human analyst."

The tools should use data reduction, consolidation, and summarization techniques to further limit the amount of information the investigator receives in a generated report. Statistical methods can accomplish this by reporting statistically significant information rather than all information. Given the current climate of increasing computer crime, we should use computers to our advantage and reduce the investigator's load. Automatic searching, extraction of information, and subsequent data mining may be part of the solution to the problem.
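One simple statistical reduction, sketched below under the assumption that a keyword search has already produced per-file hit counts, is to report only the files whose counts sit well above the mean for the evidence set. The `significant_hits` helper is a hypothetical name, not a feature of any existing tool.

```python
from statistics import mean, stdev

def significant_hits(hit_counts, threshold_sigmas=2.0):
    """Reduce a full keyword-hit report to the statistically
    unusual files: those whose hit count sits well above the
    mean across the evidence set."""
    counts = list(hit_counts.values())
    if len(counts) < 2:
        return dict(hit_counts)  # too little data to summarize
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return {}  # every file looks the same; nothing stands out
    return {path: n for path, n in hit_counts.items()
            if n > mu + threshold_sigmas * sigma}
```

Instead of pages listing every hit, the investigator sees only the outliers; the threshold is a tunable trade-off between a shorter report and the risk of suppressing relevant files.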

Rao describes the benefits of automated extraction of information:

"Beyond lifting the burden from personnel, automatic extraction generally lets these organizations better direct their attention. Analysts can move beyond the conventional search to look for specific facts or types of occurrences and contextualize these results against the background of the content collection. With extracted information stored as structured databases, analysts can explore facts and relationships directly to look for significant patterns, trends, or anomalies."
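The kind of extraction Rao describes can be sketched in a few lines. The patterns and the `extract_facts` helper below are hypothetical and deliberately naive; the point is only that matches come back as structured records an analyst could load into a database, rather than as raw text.

```python
import re

# Hypothetical patterns for two common artifact types; a real
# extractor would cover many more and handle obfuscation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def extract_facts(doc_id, text):
    """Pull contact artifacts out of free text and return them
    as structured records keyed back to the source document."""
    records = []
    for match in EMAIL_RE.finditer(text):
        records.append({"doc": doc_id, "type": "email", "value": match.group()})
    for match in PHONE_RE.finditer(text):
        records.append({"doc": doc_id, "type": "phone", "value": match.group()})
    return records
```

Because each record carries its source document, the facts can be queried and cross-referenced directly, which is exactly the "explore facts and relationships" step Rao describes.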

Beebe and Clark define data mining and how it applies to digital forensics:

"Data mining embodies a multi-disciplinary approach to finding and retrieving information, and relies on several reference disciplines that enjoy long, rich research streams, including mathematics, statistics, computer science, and information science…Data mining techniques are specifically designed for large data sets – attempting to find and retrieve data and otherwise hidden information amongst voluminous amounts of data. The data may or may not be structured, noisy or from the same source. In digital forensics, data sources are both structured and unstructured; noisy and not noisy; and from both homogeneous and heterogeneous sources – particularly in large data set cases…Content retrieval has clear and extensive applicability to digital investigations, such as mining large data sets for text documents containing specific content or involving particular individuals, or mining large data sets for contraband graphic images (e.g., child pornography, counterfeit currency). Taking a closer look at the former example, the goal of text (information) retrieval is usually to compare documents, rank importance or relevance of documents, or find patterns/trends across multiple documents. Each of these goals is extensible to digital investigations – particularly the latter two. Ranking the importance or relevance of documents relative to investigative objectives, criminal allegations, or target content facilitates data extraction during the Data Analysis Phase and minimizes, as well as prioritizes, the ‘hits’ an investigator or analyst has to review. This is critical when dealing with large data sets. Finding patterns and trends across multiple documents assists an investigator in profiling users and uncovering evidence for which exact keywords are unknown."

They claim three benefits of data mining in digital investigations: “(i) reduced system and human processing time associated with data analysis; (ii) improved analytical effectiveness and information quality; and (iii) reduced monetary costs associated with digital investigations.” 
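The document-ranking goal Beebe and Clark describe is commonly met with TF-IDF scoring. The sketch below is a toy implementation under the assumption that documents are plain text; `rank_documents` is a hypothetical helper, not a tool from the literature.

```python
import math
from collections import Counter

def rank_documents(docs, query_terms):
    """Rank documents by a simple TF-IDF score against the
    investigator's query terms, so the most relevant items
    surface at the top of the review queue."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for words in tokenized.values() if t in words)
          for t in query_terms}
    scores = {}
    for doc_id, words in tokenized.items():
        if not words:
            scores[doc_id] = 0.0
            continue
        counts = Counter(words)
        score = 0.0
        for t in query_terms:
            if df[t]:
                tf = counts[t] / len(words)                 # term frequency
                idf = math.log((1 + n) / (1 + df[t])) + 1   # smoothed IDF
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Rare terms weigh more than common ones, so a document dense in investigation-specific vocabulary rises to the top while routine files sink, which is precisely the "minimizes, as well as prioritizes, the 'hits'" benefit from the quote.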

Carrier and Spafford write about the process for finding evidence without automation:

"The target object is frequently virtual and defined only in the mind of the investigator. Some tools require a partial digital representation of the target object and the user must enter the target’s characteristics. For example, a user types in the keywords for a keyword search and therefore defines a digital representation of the target object."

This can be compared to the “thumb/scroll through” triage method for cell phones, used “when no automated forensic tool works…The primary shortcoming of this approach from a forensic perspective [is] the lack of consistency” (Mislan, Casey, & Kessler, 2010). Automated tools can provide consistent results because they apply the same programmed procedures every time.

Efficiency and contextual improvements are recommended to address the cybercrime aspect of cyber conflict. Automation of forensic tools for searching and information extraction may be an invaluable method in the future. Data mining can help solve the big data problem we currently face and bring much-needed help to the digital forensics community. If the field continues on its current course, investigators will find themselves in the midst of a digital forensics crisis in the near future. The criminal world will realize this and use it to their advantage, exacerbating the problem. The growth of technology will continue, and technology-assisted or technology-targeted crime will increase with it. The proper response must be made by developers to empower law enforcement.

Next time I'll share a couple of examples of big data.

Feel free to comment, criticize, and suggest! 


  1. This problem is developing.

    In reality, we've always had this issue. I can remember when having a hard drive at all was a big deal. And you're absolutely correct...this "big data" problem is an issue, IF it is approached using traditional forensic analysis techniques.

    There are other issues regarding the "big data" problem than are really mentioned here. One is the disconnect between responders and analysts, and the communication of goals of the analysis. Having supported LE at one point in my career, I clearly remember sitting with data in front of me, but needing to speak to an agent or officer regarding the issue.

    One of the issues that really isn't discussed in much detail is that many of those committing cybercrimes are focused and dedicated, while their adversaries (IT staff, LE, etc.) do not have that luxury. The best weapon against knowledgeable specialists is...knowledgeable specialists.

  2. Thanks Keydet89 for taking time to read and post! I agree that goals are underutilized. I just started reading "Windows Registry Forensics" by Carvey, and there is a nice little section on goals. From what little experience I have, it seems that we often incorrectly take the Ancestry.com approach - "You really don't have to know what you're looking for. You just have to start looking."

  3. As a current student in the field of Security and CF at Stark State College, I can say there is little emphasis on triage; we are learning the traditional forensic approach. I appreciate your insight into the issue. No time is spent on tool development, only on what is available to possibly get the job done. So I see my future with the same issue: learning the tried and true tools and methodology while the field is moving on to try to keep up with the incidence rate.

  4. Not sure about Triage, AFAICT Triage is all about targeting evidential low-hanging fruit to prioritise the order that digital devices get forensicated. I prefer an "enhanced previewing" approach, whereby digital devices are processed natively: if no evidence is found, get rid of it; if evidence is found, image and forensicate it. Using this approach we got rid of a 2-year backlog of cases in our lab. Counter-intuitively, we were actually viewing more images (in IIOC cases) than when we were doing a full forensic examination. The benefit of this is that you only have to image those disks you KNOW have evidence on them, drastically reducing the amount of data you have to archive. You also get a "heads up" of where on the disk the evidence is located, e.g. "zip archives in unallocated space".

  5. Hi Marcus, I'm Daniel from Mauritius. I'm new to digital forensics, but I'm looking for a title for my MSc in Digital Forensics. I've read about the big data problem; it seems a very good subject. Can I get some help please?

  6. Hi, Marcus. Initial filtering before seizing the digital evidence surely will decrease the size that a forensic analyst needs to study, but will that be impractical, as most front-liners may not possess the skill for forensic tools? Also, as Keydet89 said, prior communication is always too luxurious for a time-limited investigation. Inevitably, the whole computer/server has to be seized and a full-size forensic image has to be acquired as a working copy.

    So the next question is: how to deal with the enormously increasing size of forensic images? Any better (and more economical) solution other than RAID or SAN?

  7. Thanks for the comment, BiLiBaLa.

    One of my peers wrote his thesis on Distributed Digital Forensics. This may be a solution: http://www.cerias.purdue.edu/apps/reports_and_papers/view/4700

    I just found his website. Looks like he created a survey about the project and there wasn't much interest. Though, the home page says he is rewriting it in Java. http://www.nielsensolutions.com/projects.html

  8. Thanks Marcus for the information on distributed searching. Actually, I am just thinking of any means to reduce the pressure of adding more and more storage for forensic images to cope with the increasing size of backlogs....

    By the way, does anyone have any suggestions on tools for initial triage on scene, or any good practices for initial triage? Thanks in advance.