Tuesday, June 7, 2011

Big Cell Phone Data

Last time I talked a bit about the "Big Data Problem." Guess who I ran into last weekend at the Indy 500?

We chatted briefly about the issue, which she called the metadata problem. Oh yeah, I'm in the marching band at Purdue. I play this.

She also said I should come work for them. I guess the DHS is now on my "apply to" list. Speaking of work, I landed an internship with Lockheed Martin for the summer. I am awaiting the results of the background check and drug screen. It should be an interesting summer in network security and forensics.

I got an email from my sister. Her blog is mostly about organizational projects. She had kind words about mine, but said I should think about shortening my posts since they read like papers rather than short essays. Well, they really are papers, but I agree with her. So, I won't delay the topic at hand any further.

I did some rough testing a few years ago comparing two forensic tools, WinMoFo and Device Seizure. I acquired data from four Windows Mobile phones with varying amounts of data and use, and I exported the reports into Microsoft Excel. The results are shown in Figure 1.

It would be very difficult for an investigator to interpret or find useful information in the fourth phone's Device Seizure report, which had over 12,000 rows. Even WinMoFo's report, at over 4,000 rows, would be hard to work through.

Even simple data reduction and mining techniques are still not ubiquitous. WinMoFo does use some data reduction: it lets the user choose which information to acquire, such as text messages, call logs, and contacts. At the time of the project, however, Device Seizure allowed no data reduction; the user was forced to make a full logical or physical image of the device. For a fair comparison, WinMoFo also captured all system files in the project. I don't know the current capabilities of Device Seizure or WinMoFo.

The number of rows generated correlates with acquisition time. As shown in Figure 2, WinMoFo has a great advantage over Device Seizure as far as time is concerned.

Earlier this year I decided to take a look at my phone with a couple of tools: Cellebrite and DataPilot. Each provided seemingly accurate results. I say seemingly because of the length of the reports. When I converted each report to a PDF document, the Cellebrite report came to 317 pages and the DataPilot report to 580 pages. Maybe that is fine for an in-depth analysis, but what about quick initial results? How is an investigator supposed to go through this information manually, when I cannot (don't want to, really) verify my own phone's report? There is too much information provided. My phone holds, and the reports included, one thousand contacts and three thousand text messages among other information, but what law enforcement entity will read each of those? How does an investigator determine who is important to the phone user, or which evidence is relevant to the crime?

A keyword search is mostly what is available at this time. DataPilot does provide some frequency analysis in its svProbe module, but other items could be considered when determining what matters to an investigation: the number of times in a row someone called or texted the phone user, response time to missed calls or text messages, the number of words in a text message, word length in text messages, the number of contact entries for one person, and synchronization with social networks. The potential for data mining the information within cell phones is far-reaching and does not seem to be exploited at this time.
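To show what I mean by a few of those metrics, here is a rough Python sketch. The record layout and field names are my own invention for illustration; a real report from a tool like Cellebrite or DataPilot would have to be parsed into something like this first.

```python
from datetime import datetime

# Hypothetical message records (invented layout, not a real tool export).
messages = [
    {"contact": "Alice", "direction": "in",  "time": datetime(2011, 6, 1, 9, 0),  "text": "lunch today?"},
    {"contact": "Alice", "direction": "in",  "time": datetime(2011, 6, 1, 9, 5),  "text": "hello??"},
    {"contact": "Alice", "direction": "out", "time": datetime(2011, 6, 1, 9, 20), "text": "sure, noon works"},
    {"contact": "Bob",   "direction": "in",  "time": datetime(2011, 6, 1, 10, 0), "text": "call me"},
]

def longest_incoming_streak(msgs, contact):
    """Longest run of consecutive incoming messages from one contact."""
    best = run = 0
    for m in sorted(msgs, key=lambda m: m["time"]):
        if m["contact"] == contact and m["direction"] == "in":
            run += 1
            best = max(best, run)
        elif m["contact"] == contact:
            run = 0  # the user replied, so the streak is broken
    return best

def response_time_minutes(msgs, contact):
    """Minutes from a contact's last incoming message to the user's reply."""
    last_in = None
    for m in sorted(msgs, key=lambda m: m["time"]):
        if m["contact"] != contact:
            continue
        if m["direction"] == "in":
            last_in = m["time"]
        elif last_in is not None:
            return (m["time"] - last_in).total_seconds() / 60
    return None

def avg_words(msgs, contact):
    """Average word count of messages exchanged with a contact."""
    counts = [len(m["text"].split()) for m in msgs if m["contact"] == contact]
    return sum(counts) / len(counts) if counts else 0
```

Even simple summaries like these could tell an investigator at a glance who mattered to the phone user, instead of asking them to read three thousand messages.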

Another student and I are working on an analysis program that aims to take advantage of many aspects of the data. I hope to share more about it later and maybe even have some of you test it.


  1. I don't understand why, after acquiring the data, you can't run it through a different program that would let you use "keywords" to narrow things down by time/date/contact stamps in the information?

  2. That's how a lot it seems to be done now. My question is why aren't computers carrying more of the burden for repeatable tasks? Many people have scripts they have made and can run for their individual needs, but aren't there common needs? I feel like investigators and companies think that current technologies address their needs (take a look at Figure 4 and 5 in my last post), but I say we've been thinking inside the box for far too long. My only reservation is that my perspective as a student is different than the reality of the situation.