June 23, 2015

The seven people you need on your data team

Congratulations! You just got the call – you’ve been asked to start a data team to extract valuable customer insights from your product usage, improve your company’s marketing effectiveness, or make your boss look all “data-savvy” (hopefully not just the last one of these). And even better, you’ve been given carte blanche to go hire the best people! But now the panic sets in – who do you hire? Here’s a handy guide to the seven people you absolutely have to have on your data team. Once you have these seven in place, you can decide whether to style yourself more on John Sturges or Akira Kurosawa.

Before we start, what kind of data team are we talking about here? The one I have in mind is a team that takes raw data from various sources (product telemetry, website data, campaign data, external data) and turns it into valuable insights that can be shared broadly across the organization. This team needs to understand both the technologies used to manage data, and the meaning of the data – a pretty challenging remit, and one that needs a pretty well-balanced team to execute.

1. The Handyman
The Handyman can take a couple of battered, three-year-old servers, a copy of MySQL, a bunch of Excel sheets and a roll of duct tape and whip up a basic BI system in a couple of weeks. His work isn’t always the prettiest, and you should expect to replace it as you build out more production-ready systems, but the Handyman is an invaluable help as you explore datasets and look to deliver value quickly (the key to successful data projects). Just make sure you don’t accidentally end up with a thousand people accessing the database he’s hosting under his desk every month for your month-end financial reporting (ahem).

Really good handymen are pretty hard to find, but you may find them lurking in the corporate IT department (look for the person everybody else mentions when you make random requests for stuff), or in unlikely-seeming places like Finance. He’ll be the person with the really messy cubicle with half a dozen servers stuffed under his desk.

The talents of the Handyman will only take you so far, however. If you want to run a quick and dirty analysis of the relationship between website usage, marketing campaign exposure, and product activations over the last couple of months, he’s your guy. But for the big stuff you’ll need the Open Source Guru.

2. The Open Source Guru
I was tempted to call this person “The Hadoop Guru”. Or “The Storm Guru”, or “The Cassandra Guru”, or “The Spark Guru”, or… well, you get the idea. As you build out infrastructure to manage the large-scale datasets you’re going to need to deliver your insights, you need someone to help you navigate the bewildering array of technologies that has sprung up in this space, and integrate them.

Open Source Gurus share many characteristics with that most beloved urban stereotype, the Hipster. They profess to be free of corrupting commercial influence and pride themselves on plowing their own furrow, but in fact they are subject to the whims of fashion just as much as anyone else. Exhibit A: The enormous fuss over the world-changing effects of Hadoop, followed by the enormous fuss over the world-changing effects of Spark. Exhibit B: Beards (on the men, anyway).

So be wary of Gurus who ascribe magical properties to a particular technology one day (“Impala’s, like, totally amazing”), only to drop it like ombre hair the next (“Impala? Don’t even talk to me about Impala. Sooooo embarrassing.”) Tell your Guru that she’ll need to live with her recommendations for at least two years. That’s the blink of an eye in traditional IT project timescales, but a lifetime in Internet/Open Source time, so it will focus her mind on whether she really thinks a technology has legs (vs. just wanting to play around with it to burnish her resumé).

3. The Data Modeler
While your Open Source Guru can identify the right technologies for you to use to manage your data, and hopefully manage a group of developers to build out the systems you need, deciding what to put in those shiny distributed databases is another matter. This is where the Data Modeler comes in.

The Data Modeler can take an understanding of the dynamics of a particular business, product, or process (such as marketing execution) and turn that into a set of data structures that can be used effectively to reflect and understand those dynamics.

Data modeling is one of the core skills of a Data Architect, which is a more identifiable job description (searching for “Data Architect” on LinkedIn generates about 20,000 results; “Data Modeler” only generates around 10,000). And indeed your Data Modeler may have other Data Architecture skills, such as database design or systems development (they may even be a bit of an Open Source Guru). But if you do hire a Data Architect, make sure you don’t get one with just those more technical skills, because you need datasets which are genuinely useful and descriptive more than you need datasets which are beautifully designed and have subsecond query response times (ideally, of course, you’d have both). And in my experience, the data modeling skills are the rarer skills; so when you’re interviewing candidates, be sure to give them a couple of real-world tests to see how they would actually structure the data that you’re working with.

4. The Deep Diver
Between the Handyman, the Open Source Guru, and the Data Modeler, you should have the skills on your team to build out some useful, scalable datasets and systems that you can start to interrogate for insights. But who is going to generate the insights? Enter the Deep Diver.

Deep Divers (often known as Data Scientists) love to spend time wallowing in data to uncover interesting patterns and relationships. A good one has the technical skills to be able to pull data from source systems, the analytical skills to use something like R to manipulate and transform the data, and the statistical skills to ensure that his conclusions are statistically valid (i.e. he doesn’t mix up correlation with causation, or make pronouncements on tiny sample sizes). As your team becomes more sophisticated, you may also look to your Deep Diver to provide Machine Learning (ML) capabilities, to help you build out predictive models and optimization algorithms.
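To make the small-sample trap concrete, here’s a quick sketch (in Python rather than R, and with entirely made-up numbers): a handful of data points can produce a seductively large correlation coefficient that is still very weak evidence, and in any case says nothing about causation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Five weeks of invented campaign-spend vs. product-activation numbers:
spend       = [10, 12, 9, 15, 11]
activations = [200, 230, 180, 260, 215]

print(round(pearson_r(spend, activations), 2))  # close to 1.0: looks great...
# ...but with n=5 the uncertainty is enormous, and even a genuine
# correlation wouldn't tell us that spend *caused* the activations.
```

A good Deep Diver reaches for a significance test (and a healthy dose of skepticism) before reporting a number like this.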

If your Deep Diver is good at these aspects of his job, then he may not turn out to be terribly good at taking direction, or communicating his findings. For the first of these, you need to find someone that your Deep Diver respects (this could be you), and use them to nudge his work in the right direction without being overly directive (because one of the magical properties of a really good Deep Diver is that he may take his analysis in an unexpected but valuable direction that no one had thought of before).

For the second problem – getting the Deep Diver’s insights out of his head – pair him with a Storyteller (see below).

5. The Storyteller
The Storyteller is the yin to the Deep Diver’s yang. Storytellers love explaining stuff to people. You could have built a great set of data systems, and be performing some really cutting-edge analysis, but without a Storyteller, you won’t be able to get these insights out to a broad audience.

Finding a good Storyteller is pretty challenging. You do want someone who understands data quite well, so that she can grasp the complexities and limitations of the material she’s working with; but it’s a rare person indeed who can be really deep in data skills and also have good instincts around communications.

The thing your Storyteller should prize above all else is clarity. It takes significant effort and talent to take a complex set of statistical conclusions and distil them into a simple message that people can take action on. Your Storyteller will need to balance the inherent uncertainty of the data with the ability to make concrete recommendations.

Another good skill for a Storyteller to have is data visualization. Some of the most light bulb-lighting moments I have seen with data have been where just the right visualization has been employed to bring the data to life. If your Storyteller can balance this skill (possibly even with some light visualization development capability, like using D3.js; at the very least, being a dab hand with Excel and PowerPoint or equivalent tools) with her narrative capabilities, you’ll have a really valuable player.

There’s no one place you need to go to find Storytellers – they can be lurking in all sorts of fields. You might find that one of your developers is actually really good at putting together presentations, or one of your marketing people is really into data. You may also find that there are people in places like Finance or Market Research who can spin a good yarn about a set of numbers – poach them.

6. The Snoop
These next two people – The Snoop and The Privacy Wonk – come as a pair. Let’s start with the Snoop. Many analysis projects are hampered by a lack of primary data – the product, or website, or marketing campaign isn’t instrumented, or you aren’t capturing certain information about your customers (such as age, or gender), or you don’t know what other products your customers are using, or what they think about them.

The Snoop hates this. He cannot understand why every last piece of data about your customers, their interests, opinions and behaviors, is not available for analysis, and he will push relentlessly to get this data. He doesn’t care about the privacy implications of all this – that’s the Privacy Wonk’s job.

If the Snoop sounds like an exhausting pain in the ass, then you’re right – this person is the one who has the team rolling their eyes as he outlines his latest plan to remotely activate people’s webcams so you can perform facial recognition and get a better Unique User metric. But he performs an invaluable service by constantly challenging the rest of the team (and other parts of the company that might supply data, such as product engineering) to be thinking about instrumentation and data collection, and getting better data to work with.

The good news is that you may not have to hire a dedicated Snoop – you may already have one hanging around. For example, your manager may be the perfect Snoop (though you should probably not tell him or her that this is how you refer to them). Or one of your major stakeholders can act in this capacity; or perhaps one of your Deep Divers. The important thing is not to shut the Snoop down out of hand, because it takes relentless determination to get better quality data, and the Snoop can quarterback that effort. And so long as you have a good Privacy Wonk for him to work with, things shouldn’t get too out of hand.

7. The Privacy Wonk
The Privacy Wonk is unlikely to be the most popular member of your team, either. It’s her job to constantly get on everyone’s nerves by identifying privacy issues related to the work you’re doing.

You need the Privacy Wonk, of course, to keep you out of trouble – with the authorities, but also with your customers. There’s a large gap between what is technically legal (which itself varies by jurisdiction) and what users will find acceptable, so it pays to have someone whose job it is to figure out the right balance between the two. But while you may dread the idea of having such a buzz-killing person around, I’ve actually found that people tend to make more conservative decisions around data use when they don’t have access to high-quality advice about what they can do, because they’re afraid of accidentally breaking some law or other. So the Wonk (much like Sadness) turns out to be a pretty essential member of the team, and may even come to be regarded with some affection.

Of course, if you do as I suggest, and make sure you have a Privacy Wonk and a Snoop on your team, then you are condemning both to an eternal feud in the style of the Corleones and Tattaglias (though hopefully without the actual bloodshed). But this is, as they euphemistically say, a “healthy tension” – with these two pulling against one another you will end up with the best compromise between maximizing your data-driven capabilities and respecting your users’ privacy.

Bonus eighth member: The Cat Herder (you!)
The one person we haven’t really covered is the person who needs to keep all of the other seven working effectively together: To stop the Open Source Guru from sneering at the Handyman’s handiwork; to ensure the Data Modeler and Deep Diver work together so that the right measures and dimensionality are exposed in the datasets you publish; and to referee the debates between the Snoop and the Privacy Wonk. This is you, of course – The Cat Herder. If you can assemble a team with at least one of each of the above people, plus probably a few developers for the Open Source Guru to boss about, you’ll be well on the way to unlocking a ton of value from the data in your organization.

Think I’ve missed an essential member of the perfect data team? Tell me in the comments.


December 14, 2009

I love it when a mail-merge comes together…

Would you buy BI services from a company that can’t successfully execute a mail-merge? Not to mention ones that send unsolicited e-mails to drum up business…



January 30, 2009

Internet Explorer 8 RC1: Porn mode gets a face-lift

Now if that headline doesn’t get me some search ranking juice, nothing will - though my contextual ads (left) are likely to be less impressed.

I was going to post this earlier in the week, but Eric Peterson’s swashbuckling defense of cookies (and my hand-wringing response)  intervened. As it turns out, though, that debate is very relevant to this post, which concerns the latest build of Internet Explorer 8 (still used by around 80% of the world’s web users, though not by you lot, who seem to favor Firefox by a nose), which hit the web this week.

I’ve already posted once about IE8, and talked about its new “InPrivate” features (also known as “porn mode”) that allow you to surf the web without leaving a trace (on the machine you’re using, of course – the websites you visit can still track your behavior). It’s worthy of another post because the specific feature that piqued my interest the last time – InPrivate Blocking – has a new name and somewhat different behavior now.

The new name for InPrivate Blocking is InPrivate Filtering, which is certainly a better name. You may recall that InPrivate Blocking was a feature that allowed the user to tell IE to block requests to third-party websites, either manually, or if content from those sites had been served in a third-party context more than 10 times. Examples of this kind of content? Web analytics tracking tag code; ads; widgets; embedded YouTube videos. The idea is to enable users to opt out of this kind of content because it enables third parties to track user behavior (with or without cookies) without them really knowing.

So what’s new in RC1, apart from a friendlier name? Well, a couple of things. The first is that InPrivate Filtering can be turned on even if you’re not browsing in “InPrivate mode”, via the Safety menu, or a handy little icon in the status bar.


Click it, and InPrivate Filtering is on. There’s no way to turn this on by default; you have to click the icon every time you start a new IE instance.

The other major change is that there’s more control over how third-party content is blocked. In the previous beta, content was automatically blocked if it turned up more than 10 times (i.e. on 10 different sites) as third-party content. That number is now tunable, to anywhere between 3 and 30.


The idea of InPrivate Filtering Subscriptions still exists – a user can import an appropriately formatted XML file (or click on a link on a site, such as this one) to subscribe to a list of blocked third-party content. I’ve not seen any public subscriptions pop up, however, in the time since IE8 beta 2 came out.
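The blocking heuristic described above – count the distinct first-party sites on which a given third-party host shows up, and block it once that count crosses the user-tunable threshold – is simple enough to sketch. This is a toy model of my own, not IE8’s actual implementation (all the names here are invented):

```python
from collections import defaultdict

class FilterSketch:
    """Toy model of the InPrivate Filtering heuristic: block a third-party
    host once it has been observed embedded on `threshold` or more
    distinct first-party sites."""

    def __init__(self, threshold=10):  # RC1 lets users tune this from 3 to 30
        self.threshold = threshold
        self.seen_on = defaultdict(set)  # third-party host -> first-party sites

    def observe(self, first_party, third_party):
        """Record that a page on `first_party` pulled content from `third_party`."""
        if third_party != first_party:
            self.seen_on[third_party].add(first_party)

    def is_blocked(self, third_party):
        return len(self.seen_on[third_party]) >= self.threshold

f = FilterSketch(threshold=3)
for site in ("news.example", "shop.example", "blog.example"):
    f.observe(site, "tracker.example")  # same analytics tag on three sites
print(f.is_blocked("tracker.example"))  # True
```

Note that repeat visits to the same first-party site don’t move the counter – only the number of distinct sites embedding the content matters.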

In my previous post on IE8, I wrote about how, as someone whose job depends on being able to track users, I am conflicted about this functionality. This revision makes it slightly easier for privacy hawks to block third-party content, and whilst I welcome it, my original prediction – that it will be relatively lightly used in practice – still stands.

Interestingly, since IE8 beta 2 was announced in August, other browser manufacturers have followed suit – most notably, Mozilla, which will be including InPrivate-style functionality in Firefox 3.1 – though without the third-party content blocking feature. Apple’s Safari browser has had similar functionality for some time.


January 28, 2009

Eric Peterson Rides Again

Eric Peterson has an impassioned post on his blog in which he defends the Obama Administration’s decision to use persistent cookies for tracking behavior on the Whitehouse.gov site. He directs particular ire at an article by Chris Soghoian at CNET from November which questioned whether it was a smart move for the (then) Obama Transition Team to be using embedded YouTube videos for streaming Obama’s weekly addresses on the Change.gov site.

Eric’s post is a follow-up to his post from November in which he called upon Barack Obama to relax the burdensome rules around the use of persistent cookies on Government websites. And let me say this: those rules suck. They ban the use of persistent cookies altogether, both first- and third-party. And I stand firmly behind Eric’s stance that those rules should be re-written – Government can’t be effective in providing services online if it can’t track the usage of those services.

But in his enthusiasm, Eric does actually conflate two somewhat separate issues - cookies on the one hand, and third-party content & tracking on the other. And third-party tracking & content deserves at least as much attention as cookies (if not more, in fact).

Obama's team's decision to use YouTube to stream videos and WebTrends for web analytics means that behavior data is being sent to a third party. The Whitehouse.gov site does a pretty good job of explaining about its use of cookies, but a less good job of detailing what data is being sent to third parties, who those third parties are, and how to prevent that information being shared.

Whilst it's no skin off my nose to send this data to Webtrends and Google, this is partly because a) I know and trust those organizations, and b) the content on the Whitehouse.gov site is pretty uncontentious. But what if I were looking at detailed information about entitlement programs, or applying online for some Government help with my mortgage? There is at least a valid question to be asked about how this kind of behavior data is shared with third-parties, separate from the cookie discussion.

My view? I don’t really think Government websites should be sending tracking data to third-parties, or retrieving content from third-party sites (other than other Government sites). There are plenty of options for first-party analytics solutions which offer just as much functionality as hosted solutions and would allow the Government to maintain control of this data and to be able to be definitive about how it is stored and used.

Stop wasting their time!

Eric also makes the point that, with everything else that’s going on right now, it’s borderline irresponsible to be chewing up the new administration’s time with pedantic questions about cookies or third-party tracking. But I don't think it's inappropriate at this stage to flag this to the Obama administration, because I imagine that at this moment (or very shortly) a variety of Federal agencies are looking at how they can put more information and services online.

Helping the administration to set sensible policies now will stop precious money being wasted if policies have to be changed later. And besides, wasn’t it Eric who called on Obama’s team to take the time to review the rules in the first place? Could they not churn out some websites with some simple log-based tracking now and then focus on E-government policy when the economy’s calmed down?


Another issue addressed in Chris’s original post is the wisdom of using YouTube (or indeed any third-party streaming service) for the videos on the Change.gov site (YouTube is also used on Whitehouse.gov). This raises a number of questions, such as how Google was chosen over, say, Vimeo, or Hulu, or MSN Video, and whether there are any SLAs in place to ensure this material is available on an ongoing basis.

Let me make it clear that I don't object to Obama's addresses being available on YouTube - they should be there, and on every other video streaming website. But for information published through the Whitehouse.gov website itself, I'm not sure that a third-party streaming site is the best choice. How confident can we be about the integrity of this information? After all, we wouldn't want Obama to be RickRolled, now would we?

Yeah, yeah, grumble, grumble… you done yet?

You’re probably thinking “Jeez, what a kill-joy” as you read this post. And it’s true that privacy wonks (which I would not fully consider myself to be) do have a rather Cassandra-ish quality, always looking for the bad. But this is an essential part of the dynamics of the debate on topics like this – which means that Eric’s robust post is also essential and welcome, I should add. But we did get into rather hot water with the previous administration’s disregard for privacy. So it only makes sense that the new guys should get to hear these concerns now.


October 01, 2008

‘Anonymous’ Netflix Prize data not so anonymous after all


Does entropic de-anonymization of sparse microdata set your pulse racing? If so, you’re gonna love this paper [PDF] by Arvind Narayanan and Vitaly Shmatikov of the University of Texas at Austin. Even if your stats math is as rusty as mine, however, the paper makes fascinating reading - and is surprisingly readable, if you skip over the algorithm-heavy bit in the middle.

For those of you who don’t have time to read an academic paper, here’s a summary. The paper  presents a method for taking an ‘anonymized’ data set – for example, the Netflix Prize data – and locating the record for a user about which you have a limited set of approximate data. If, for example, I know that you’re a fan of The Bourne Ultimatum, Minority Report and Delicatessen but that you absolutely hated Hitch, Music and Lyrics and Along Came Polly (can’t blame you for the last one, by the way), then there’s about an 80% chance I can find your entry in the Netflix Prize dataset (assuming it’s there – it’s only a 10% sample of Netflix’s total ratings data). And I can do this even if I don’t know anything else about you.

The reason this is possible is that the data is so-called ‘sparse’ data – each record (which represents a Netflix user) has many, many fields (each field represents a particular movie), of which only a tiny fraction are non-null (because even the most prolific Netflix user has only rated a tiny fraction of Netflix’s total library). So the chances of two or more users giving the same rating to the same set of movies is actually quite small.

A lot of the detail in the paper relates to the fact that the information you start with doesn’t even have to be 100% accurate – for example, even though I know that you loved Minority Report, I may not know if you gave it 4 or 5 stars on Netflix. The algorithms are surprisingly robust in this environment. If you know just a little bit more (specifically, when the ratings were entered, to within some tolerance of accuracy), it becomes even easier to locate a record based upon some starting data. Especially if the person is interested in less popular movies (the inclusion of Delicatessen in the list above would dramatically increase the chance of a match).
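The matching idea is easy to sketch. To be clear, this is a toy scoring function of my own, not Narayanan and Shmatikov’s actual algorithm (which handles noisy data far more carefully and measures the ‘eccentricity’ of the best match); it just shows why a few approximate ratings – especially for an obscure title – can single out one record:

```python
def match_score(known, record, tolerance=1):
    """Score how well partial, approximate knowledge fits one record.
    Both arguments map movie -> star rating; a rating within `tolerance`
    stars counts as a hit, anything else (absent or contradictory) against."""
    hits = misses = 0
    for movie, stars in known.items():
        if movie in record and abs(record[movie] - stars) <= tolerance:
            hits += 1
        else:
            misses += 1
    return hits - misses

def best_match(known, dataset):
    """Return the id of the record that best explains what we know."""
    return max(dataset, key=lambda uid: match_score(known, dataset[uid]))

# Tiny 'anonymized' dataset: user id -> {movie: star rating}
dataset = {
    "u1": {"Bourne": 5, "Minority Report": 4, "Delicatessen": 5, "Hitch": 1},
    "u2": {"Bourne": 5, "Hitch": 4, "Along Came Polly": 4},
    "u3": {"Minority Report": 2, "Music and Lyrics": 5},
}
# We only know approximate opinions, not exact star ratings:
known = {"Bourne": 4, "Delicatessen": 5, "Hitch": 1}
print(best_match(known, dataset))  # u1 -- the rare title does most of the work
```

In a real, sparse dataset of half a million users the same logic holds: very few records agree with the known ratings even approximately, so the best-scoring record stands out from the rest by a wide margin.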

Why is this interesting? Well, when Netflix released this data they confidently said that it had been shorn of all personally identifiable information – the implication being that you couldn’t link a specific record to an individual. But this paper gives the lie to that – it drives, if not a truck, then certainly a decent-sized minivan, through Netflix’s claims.

As the AOL Search data debacle in 2006 showed, simply removing identifiers from this kind of data is not enough to render it properly anonymous. And if you’re thinking that Netflix preference data is hardly sensitive data, then remember that media consumption has a long and inglorious history of being the basis for discrimination and persecution in society – and there are certain idiot politicians who even today still seem to think this kind of stuff is ok.

[Update, 10/3/08: One of the authors of the paper, Arvind Narayanan, has very kindly commented on this post, and points me to a blog that he has started to discuss this topic and its impact, which you can find at http://33bits.org. The blog has already helped me to understand eccentricity better, so go take a look.]


September 16, 2008

Phorm gets the all-clear from the UK Government (kinda)

[Update 10/1/08: BT has announced that it will commence a new trial with Phorm to start September 30 in the UK. The trial, in accordance with the conditions below, is opt-in]


Beleaguered behavioral targeting outfit Phorm appears finally to have caught a bit of a lucky break - the UK Government has (belatedly) responded to the EU's queries about Phorm's business practices by saying that Phorm does not break EU data collection/retention laws. But the Department for Business, Enterprise and Regulatory Reform (BERR) - the Government department tasked with assessing Phorm's business and responding to the EU - has placed the following conditions on its approval (from an excerpt of the full letter sent to the EU which is reproduced on The Register - my highlighting added):

  • The user profiling occurs with the knowledge and agreement of the customer.
  • The profile is based on a unique ID allocated at random which means that there is no need to know the identity of the individual users.
  • Phorm does not keep a record of the actual sites visited.
  • Search terms used by the user and the advertising categories exclude certain sensitive terms and have been widely drawn so as not to reveal the identity of the user.
  • Phorm does not have nor want information which would enable it to link a user ID and profile to a living individual.
  • Users will be presented with an unavoidable statement about the product and asked to exercise a choice about whether to be involved.
  • Users will be able to easily access information on how to change their mind at any point and are free to opt in or out of the scheme.

The two key bullets here are the last two - Phorm will be required to operate this service as an opt-in service only, with clear language and functionality enabling even opted-in users to opt out at any time. And  BERR states that it will be keeping a close eye on Phorm to ensure that it continues to comply with these conditions.

The news may do a little to shore up Phorm's deflating stock price, which has lost about 80% of its value since the heady days of March. But it's hard to imagine Phorm building much of a sustainable business on the back of an opt-in only system - it's going to be an incredibly hard sell for the ISPs that Phorm partners with (BT, TalkTalk and Virgin Media being the only ones mentioned so far). The only model I can think of is that the ISPs offer reduced rates in exchange for opting into the targeting system; but that negates the very purpose of implementing the system in the first place - to shore up sagging ISP revenues in the wake of the last few years' broadband price wars. I fear that Phorm is not out of the woods yet - especially if the recent happenings at its competitor NebuAd are anything to go by.


September 11, 2008

Yahoo updates IndexTools terms & conditions

Yahoo is not letting the grass grow under its feet with its integration of IndexTools. Today IndexTools partners received an e-mail from Yahoo informing them of a change to the terms & conditions of the service, which need to be agreed to by October 15 in order to retain access to IndexTools.

The e-mail calls out a change to the Ts & Cs which require IndexTools partner customers (i.e. the site owners themselves) to place the following (or equivalent) language on their websites (my highlighting):

“Third-Party Web Beacons: We use third-party web beacons from Yahoo! to help analyze where visitors go and what they do while visiting our website. Yahoo! may also use anonymous information about your visits to this and other websites in order to improve its products and services and provide advertisements about goods and services of interest to you. If you would like more information about this practice and to know your choices about not having this information used by Yahoo!, click here.”

Yahoo goes on to say that it will be auditing client sites and will disable accounts where this verbiage has not been included on the site (I wonder how effective this will be in practice - it may just be sabre-rattling).  Partners and client sites have until October 15 to comply.

The comment from the IndexTools partner who forwarded on this information was that it would be a challenge for their clients to implement this - from a logistical perspective, if nothing else. But I can understand Yahoo's move here - part of the benefit of a company like Yahoo (or Microsoft, or Google) offering a web analytics service is the secondary use of the resulting data for ad targeting purposes (something that Yahoo is very good at).

For comparison, here is (a shortened version of) the paragraph that Google requests its customers insert onto their sites:

“[...]  Google Analytics uses “cookies”, which are text files placed on your computer, to help the website analyze how users use the site. [...] Google will use this information for the purpose of evaluating your use of the website, compiling reports on website activity for website operators and providing other services relating to website activity and Internet usage.  Google may also transfer this information to third parties where required to do so by law, or where such third parties process the information on Google's behalf. Google will not associate your IP address with any other data held by Google. [...]  By using this website, you consent to the processing of data about you by Google in the manner and for the purposes set out above.”

This wording does not seem to imply that Google will reuse the data for other purposes, including ad targeting (IANAL, however); though Google did introduce some reuse of data (and some options for controlling it) with their data sharing feature that they launched back in March.

The corresponding paragraph from adCenter Analytics is:

Microsoft may retain and use user data subject to the terms of the Microsoft privacy statement and publish in aggregate or average form such information in combination with information collected from others’ use of adCenter Analytics except that Microsoft will not disclose to any third parties any user data collected by adCenter Analytics from your websites in a manner that (i) contains or reveals any personally-identifiable information or (ii) is specifically attributable to you or your websites.

The Microsoft privacy statement does say that we may use the information we collect to deliver services, "including personalized content and advertising".

So Yahoo is not doing anything here that hasn't been done before; and, as I've said several times before, you can't expect a company to provide a free web analytics service of the quality of IndexTools and not attempt to monetize it in some way. What is a little different about Yahoo's approach, though, is that it's taking a sterner line on actual implementation of the data reuse language, and actually threatening to disable accounts where the wording hasn't been added. This implies that Yahoo anticipates that it may need to defend its usage of this data (at least from a PR perspective), and wants to ensure that it can point to this wording on any site that uses IndexTools, so that users can't complain that their behavior data is being reused without their consent.

[Update 9/11/08: Added a reference to Google data sharing]

[Update 9/12/08: Corrected IndexTools' name - duh]


September 02, 2008

Internet Explorer 8 beta 2: Privacy vs Monetizability

Once upon a time, when I was a young turk, I would assiduously download every last doodad that my employer created as soon as it shipped - or often long before, happily reaching for the pile of floppy disks as I rebuilt my computer for the umpteenth time following the latest toxic combination of untested software.

Age (and a need to still be able to work on my computer) has slowed me down. So I passed over IE8 beta 1, preferring to read about others' experiences of the new "standards mode" that is the default rendering mode for the new browser.

But last week, only hours after its public availability, I downloaded and installed IE8 beta 2. Why? Because it contains a raft of new features for protecting user privacy. I've blogged previously about the eternal tension between user privacy on the web, and the measurement and tracking that is so essential to many websites' business models. Put simply, if users' behavior could not be measured online, a lot of online businesses would go out of business.


What's new?

So how does IE8 contribute to the debate? Well, there are a number of minor features to protect users, and one major one. The minor ones include a nice feature in the address bar to highlight the actual domain of the site you're looking at:


This makes it much easier to spot phishing attacks, since many phishing sites try to confuse users by including familiar looking domains as subdomains of the real site, e.g.:


Another nice feature, related to phishing, is the "SmartScreen Filter". This allows the user to check the current website against a known list of bad sites. It's essentially a UI onto the automatic phishing filter that was built into IE7 - but it allows users to report sites as well as check them, adding a Cloudmark-like element of user contribution to the process of spotting evil sites.

This feature is rolled up under a new Safety menu, which also contains options to view the privacy policy info for a site (which shows all the cookies that were served and/or blocked, per IE7), and the security report for a site (any problems with the site's SSL certificate etc). Neither of these features is new, but it's nice to see them called out in their own menu.

The other small enhancement worth noting is that the "browsing history deletion" feature has become smarter - you can elect to delete the cookies etc. for all sites except those in your favorites list. This is a step forward, but it still mystifies me that IE has no easy way for browsing the cookies (and their content) on your computer, and selectively deleting them (as Firefox has had since v2, it pains me to say).


InPrivate Browsing & Blocking

The big new security/privacy feature in IE8 is called InPrivate Browsing (others have dubbed it "porn mode", but I am above such lewdness). InPrivate Browsing allows the user to browse without storing any cookies or browsing history, or locally cached files. It's good for when you're borrowing someone else's computer, or if you share a computer and don't want the other people who use the computer to know what you've been up to (now you are starting to understand where the "porn mode" nickname comes from).

The naming of the InPrivate functionality is somewhat confusing. Once you turn on InPrivate Browsing (either from the Safety menu or using Ctrl+Shift+P), something called InPrivate Blocking is also activated. InPrivate Blocking prevents your browser from sending requests for third-party content that it thinks are principally for the purpose of tracking your behavior. The big difference here is that this isn't just blocking third-party cookies - it's third-party content. That's tracking pixels, third-party JS calls, and yes, ads.

InPrivate Blocking will block third-party requests if either of the following conditions has been met:

  • The request URL has been made in a third-party context on more than 10 other domains
  • You have specifically added the request URL through an InPrivate Blocking Subscription
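The first condition - "seen in a third-party context on more than 10 other domains" - can be sketched as a simple frequency count. This is a hypothetical reconstruction of the heuristic as described above, not Microsoft's actual implementation; the class name, threshold constant, and URLs are all illustrative.

```python
# Sketch of IE8's InPrivate Blocking heuristic: a third-party URL is
# auto-blocked once it has been observed in a third-party context on
# more than 10 distinct first-party domains, or if it appears on a
# subscribed blocking list. (Hypothetical reconstruction, for illustration.)
from collections import defaultdict

THRESHOLD = 10  # distinct first-party domains, per the behavior described above

class InPrivateBlocker:
    def __init__(self, subscribed_urls=()):
        # third-party URL -> set of first-party domains it was seen on
        self.seen_on = defaultdict(set)
        self.subscribed = set(subscribed_urls)

    def observe(self, third_party_url, first_party_domain):
        """Record a third-party request made while rendering first_party_domain."""
        self.seen_on[third_party_url].add(first_party_domain)

    def is_blocked(self, third_party_url):
        return (third_party_url in self.subscribed
                or len(self.seen_on[third_party_url]) > THRESHOLD)

blocker = InPrivateBlocker()
for i in range(11):
    blocker.observe("http://www.google-analytics.com/ga.js", f"site{i}.example")
print(blocker.is_blocked("http://www.google-analytics.com/ga.js"))  # True: seen on 11 domains
```

Note that the count is over distinct first-party domains, not raw request volume - seeing the same tracking script twenty times on one site would not trip the threshold.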

To understand the first condition, take a look at the screenshot below, which is the dialog that comes up if you select InPrivate Blocking from the Safety menu when InPrivate Browsing is active:


You'll notice that there are some third-party request URLs that come up, well, a lot. googlesyndication.com is the domain that Google AdSense ads are served from; and you will doubtless know what comes from google-analytics.com. In the dialog above, the four URLs across these two domains have each been requested at least 20 times in a third-party context, and I've only been using IE8 for a few days. With the default settings ("Automatically block"), these URLs are blocked when I am in InPrivate mode.

The other way of adding a URL to the blocked list is to subscribe to an InPrivate Blocking list. This is an RSS or Atom feed of URLs that IE8 should block in InPrivate mode. I have created a subscription list which blocks third-party requests to analytics.live.com - the domain for adCenter Analytics's tracking JS and pixel. You can try it out by clicking here.

The power of the feed-based approach to InPrivate Blocking is that privacy advocacy sites can post a single link to a feed XML file which users subscribe to; if that file changes, the users' blocking lists change. So you can expect to find "click here to block ALL tracking pixels and ads" links on such sites in the not-too-distant future. You can take a look at your InPrivate Subscriptions through the Manage Add-ons option in the Tools menu:



"Aargh! This sucks!"/"Great!" [Delete as applicable]

Whether news of this functionality sends a shiver down your spine or warms the cockles of your heart depends on whether your business depends on online advertising or web analytics. Popular third-party analytics systems like Google Analytics, or third-party ad servers like Atlas Enterprise, will lose data on users who enable InPrivate Browsing; and even a less popular service that might not normally be blocked automatically could end up on common "opt-out" feeds and have its tracking blocked, especially if it had a poor reputation for privacy.

I must admit that when I first read of this functionality, I was - ahem - a little apprehensive, for the reasons above. And in truth, only time will tell what proportion of users will enable InPrivate Browsing (although, given the nature of the functionality, we'll not be gathering this data). But my gut feel is that, whilst this capability is a welcome addition to the privacy and security arsenal of Internet Explorer, actual take-up of the feature will be low. It needs to be invoked explicitly, of course, and the blocking of persistent cookies means that some desirable features of websites (such as being able to remember you from visit to visit) will be disabled. So I imagine it will be used sparingly by the vast majority of users.

Even so, this feature could easily add another 1 - 2% to the existing disparity between different measurement systems (such as an in-house web analytics system and a third-party ad server). Though there are techniques that vendors could use to work around the automatic blocking - the best example being the use of CNAME DNS entries to make the third-party tracking URLs look like first-party URLs - these techniques will add complexity to the implementation of such systems; so it might be easier for us all to live with a little less certainty.
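The CNAME technique mentioned above can be illustrated with a hypothetical DNS zone fragment (the domain names here are invented for illustration):

```
; Hypothetical zone fragment for example-publisher.com: the publisher
; delegates a subdomain to the analytics vendor, so tracking requests
; appear first-party to the browser.
metrics.example-publisher.com.  IN  CNAME  tracker.analytics-vendor.net.
```

The publisher's pages then load the tracking pixel or script from metrics.example-publisher.com; because that hostname matches the site the user is visiting, domain-based third-party blocking (and third-party cookie restrictions) no longer apply, even though the request ultimately resolves to the vendor's servers.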


If you'd like to read more about the new features in IE8, there's a ton of stuff over at the IE blog. And, with my Microsoft hat firmly on my head, I should say that the IE team has done an outstanding job with this beta, which is performing really well for me, and rendering most sites flawlessly, with just a few slight layout differences cropping up here and there. Well done, guys.


July 06, 2008

Online advertising's dirty secret: Malvertising

dodgy_spyware_ad There's been a lot of chatter recently about the "dark side" of online advertising, in particular, the activities of companies like NebuAd and Phorm using somewhat shady techniques to gather behavioral data about users and using this data to target ads. I've even blogged about it myself. And click fraud remains a significant challenge to confidence in online advertising.

But whilst the term "click fraud" generates about 25 million results on the world's best search engine, the term "malvertising" generates only 2,170. Since you may not be familiar with the term, I'll offer you the definition I found on urbandictionary.com (sadly, there's no Wikipedia entry for Malvertising):

An Internet-based criminal method for the installation of unwanted or malicious software through the use of Internet advertising media networks and exchanges.

So Malvertising = malware + advertising. See? Clever (if ugly). But despite its goofy name and low profile, malvertising arguably represents a greater threat to the online advertising industry than either unscrupulous behavioral targeting or click fraud.

Malvertising can take a number of forms, typically along the following lines:

  • Ads that try to trick you into going to a site, where malware is installed (e.g. those "Your PC is infected! Click here to install our anti-virus software NOW!" ads)
  • Hijacking legitimate ad clicks and redirecting users to sites which encourage them to install malware
  • Malware disguised as ads, that exploit security vulnerabilities in web client software (such as this one in Adobe Flash), either to install further malware, or to scrape PII from the browser

The enormous reach of modern ad networks, plus the ability to place malicious code on thousands of otherwise innocent sites, makes distributing malware via advertising networks a very attractive proposition.

The malware itself is usually focused on stealing users' personal data (e.g. login details for broker accounts), taking control of the user's machine for distributed denial-of-service attacks (turning it into a zombie), or convincing the user to spend their own money buying malware "removal" software after they have been "infected".

But it's not just the end user that suffers. The publisher who has unwittingly hosted the malvertising can find themselves besieged by angry users demanding to know why they've been served malware from their site. If the ad was served via an ad network, the publisher will possibly cancel their contract, depriving the ad network of their business (ESPN has already ditched ad networks altogether, although not ostensibly for this reason). And advertisers who want to use increasingly sophisticated ads with high levels of interaction may find that they are unable to because these ads are some of the ones most likely to contain malware, and so are blocked by the ad networks and publishers the advertiser wants to deal with.

Furthermore, if end users lose confidence in the ads they're being shown, either in terms of where a click will lead, or whether the ad itself is malicious, this will drive down ad clicks and drive up the installation of ad blocking software - both of which will have a disastrous effect on the industry.


What can be done?

The malvertising problem is not insoluble, but it will demand a concerted effort from all industry participants to fix (or, at least, contain) it. I'll blog about these topics again in more detail, but the main areas of attention will need to be:

Creative/URL scanning: Ad networks and third-party ad servers will need to start scanning creatives and destination URLs as a matter of course. The technical challenge of scanning Flash or Silverlight-based creatives is considerable, since malicious ads will take steps to cover their tracks, such as obfuscating code, and behaving normally if they detect they're being scanned. Ultimately, the co-operation of Adobe and Microsoft may be required to put in place more robust systems for determining an ad's provenance.

URL scanning is a more manageable problem - all ad networks should ensure that ad click destinations do not lead to sites which are known to host malware.
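A minimal sketch of that destination-URL check might look as follows. The hostnames and blocklist here are invented for illustration; a real ad network would check against a maintained malware-domain feed, and would also resolve any redirect hops it knows about in the click chain.

```python
# Sketch of destination-URL scanning: before accepting an ad, check its
# click-through URL (and any known redirect hops) against a list of
# known-malware hosts. All hostnames below are hypothetical.
from urllib.parse import urlparse

KNOWN_MALWARE_HOSTS = {"evil-codec-download.example", "fake-antivirus.example"}

def is_safe_destination(url, redirect_hops=()):
    """Reject an ad if any hop in its click chain lands on a known-bad host."""
    for u in (url, *redirect_hops):
        host = urlparse(u).hostname or ""
        if host in KNOWN_MALWARE_HOSTS:
            return False
    return True

print(is_safe_destination("http://advertiser.example/landing"))  # True
print(is_safe_destination(
    "http://advertiser.example/landing",
    redirect_hops=["http://evil-codec-download.example/install"]))  # False
```

The hard part in practice is not the lookup but keeping the blocklist current and discovering the redirect hops in the first place, which is exactly why the redirect-based tracking discussed below is such a liability.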

Creative template quality: Malware has been known to sneak into ads through sloppy management of creative templates - if an agency uses an infected template, then of course all ads created using that template will be infected. This problem will grow as larger numbers of smaller advertisers start to use online services which provide Flash templates that are customized to order - the advertisers will not have the technical sophistication to determine whether the resulting ads are safe or not. Some kind of 'quality seal' may be required for these services, though that will not stop bogus ones springing up.

Outlawing redirect-based tracking: At the moment, many ad networks use redirects to track ad clicks, meaning that a single ad click can be passed around many ad networks before the user is finally deposited at the advertiser site. This system is open to abuse via "click hijacking", where a bogus network sends some clicks for legitimate ads to malware sites. Publishers should inform ad networks that redirects for tracking are unacceptable, which will mitigate this problem.

Ad isolation: At the moment, an ad which is served with a page (rather than via an iframe) has access to that page's DOM, which means that if the ad is malicious, it can crawl the DOM, looking for user PII (such as usernames and passwords for the site the ad is on, or credit card details). Microsoft is working on some technology to isolate ads that are served on its network, so that even if they're served in a first-party context (i.e. not via an iframe or redirect), they are unable to access the page DOM. Other publishers & networks should consider doing the same.
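The iframe half of that isolation story relies on the browser's same-origin policy: a frame served from a different origin cannot read the embedding page's DOM. A hypothetical publisher-page fragment (domain names invented for illustration):

```
<!-- The ad is served from a separate origin inside an iframe, so the
     browser's same-origin policy prevents any script in the ad from
     reading the publisher page's DOM (form fields, cookies scoped to
     the publisher's domain, and so on). -->
<div id="ad-slot">
  <iframe src="http://ads.ad-network.example/serve?slot=leaderboard"
          width="728" height="90" scrolling="no" frameborder="0">
  </iframe>
</div>
```

The problem Microsoft's work addresses is the remaining case: rich ads that are deliberately served inline in a first-party context (for interactivity or layout reasons) and therefore get full DOM access by default.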

Industry co-operation: Currently, very little specific information about malware is shared within the industry, partly for noble reasons (it can be difficult to be specific about a malware instance without revealing user PII) but mostly for ignoble ones (no ad network wants to advertise the fact that they've been subject to a malware attack). This must change - the industry needs to find a way to share this kind of data without an individual network or publisher having to step into the firing line.


As I said, I'll return to this subject with some more thoughts on some of the above issues. In the meantime, a great resource for information on malvertising is Spyware Sucks, a blog run by Microsoft MVP Sandi Hardmeier, who tirelessly chronicles various malvertising outbreaks. It makes for sobering reading.


March 12, 2008

Phorm over function


There's been plenty of buzz (more of the angry hornet variety than the just-inhaled-a-lungful-of-dope variety) about Phorm of late, precipitated by a press release that the company put out on Feb 14 in the UK, announcing partnerships with three major UK ISPs to provide a system "...which ensures fewer irrelevant adverts and additional protection against malicious websites". Critics of the system (led by noted UK cage-rattler, The Register) claim that the technology is little more than spyware by another name. The negative press around Phorm's announcement has caused at least one of their ISP partners to back away from the deal, and caused their stock to plummet by more than 30%. It looks like this could be the latest in an increasingly long line of bungled targeting announcements from the industry (Beacon, anyone?). But what went wrong?

What is Phorm?

Phorm as a company is the new name for 121Media, a UK AIM-listed company which started out producing a browser toolbar that tracked your page usage to provide a social media environment, connecting you with other people who were looking at the same page. Ad-funded, the toolbar quickly picked up a reputation for being spyware (even though I agree with Phorm's protestations that it was really adware, which is better, but still tarred with the same brush), so it was dropped and the company renamed itself Phorm.

The new service Phorm has launched is called Webwise (not to be confused with the BBC site of the same name). Essentially it is technology that ISPs install at their data centers which analyzes the URL and textual content of web pages being served and uses this information to place users into interest categories so that they can be served behaviorally-targeted ads. The technology does this by intercepting the page request and sending a copy of it to a "Profiling" server which extracts keywords and uses this information to assign users to interest groups:
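The keyword-to-channel step described above can be sketched as follows. The channel definitions and matching rule here are invented for illustration - Phorm's actual categories and matching logic are not public.

```python
# Hedged sketch of the profiling step: extract keywords from an
# intercepted page and map the user to an interest "channel".
# Channels and keywords below are hypothetical.
CHANNELS = {
    "travel":   {"flight", "hotel", "holiday", "airline"},
    "finance":  {"mortgage", "loan", "broker", "shares"},
    "motoring": {"car", "dealership", "mpg", "hatchback"},
}

def assign_channel(page_text):
    """Return the channel whose keyword set best overlaps the page, or None."""
    words = set(page_text.lower().split())
    best = max(CHANNELS, key=lambda c: len(CHANNELS[c] & words))
    # Per the description above, once the channel is assigned, the raw
    # keywords and page content are discarded - only the channel is kept.
    return best if CHANNELS[best] & words else None

print(assign_channel("Cheap flight and hotel deals for your summer holiday"))  # travel
```

The privacy-relevant design point is in the last step: only the coarse channel label survives; the URL, page text, and extracted keywords are all thrown away.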




The same technology has a function to alert the user to phishing web sites; since the URL and content are being examined, phishing sites can be spotted and blocked. This functionality forms a core part of Webwise's value proposition to users.

The other part of the alleged value to users is that this profiling process does not permit the ISP to associate a user's profile with their IP address; that means that the ISP (and any government agency who subpoenaed the ISP's records) could not re-associate the Phorm data with a customer record (ISPs can tell which IP address was assigned to which customer at a particular time). The Phorm system also does not store any of the page information or extracted keywords; once the interest "channel" has been arrived at, all the rest of the data is deleted.

So Phorm claims that its system is a real step forward for user privacy on the Internet, whilst at the same time enabling advertisers to reach their audience more effectively. But the industry (and the public) haven't really seen it like this.


Why all the fuss?

Phorm's announcement was always bound to generate a certain amount of controversy, because it's in the sensitive area of behavioral profiling & targeting.  But there has been a particularly virulent reaction in the UK, which, whilst started by sites like the Register, has now spread to the "mainstream" media.

Some of the reasons for the fuss are (comparatively) silly things - for example, the renaming of the company from 121Media, which has just made people nervous, especially given the previous company's adware history, or the fact that the company operates out of serviced offices in the UK and doesn't really have a physical address in the US.

A more serious blunder on Phorm's part is their failure to anticipate the scrutiny that this kind of system would be placed under. In this kind of environment, given the firm's history, absolute transparency is essential, and Phorm hasn't provided this. There are still unanswered technical questions about Phorm's system, such as how it manages the opt-out (does data still get collected, or not?), and there have been inconsistencies in the claims that Phorm has made about third-party privacy audits of their software.

Phorm has also made the mistake of launching prematurely, with many of their partnerships still only half-baked. At the moment there is no benefit to users being delivered, because none of the systems that Phorm has announced are actually live within ISPs, and so all the focus is on the downside. Phorm would have done much better to wait until the service was fully baked with at least one of their partners and they had some real users onboard who could testify to the increased relevance of ads and how comfortable they were with their privacy with Phorm, before making a big splash. The press release looks like the product of an over-zealous PR agency looking to ensure their monthly coverage targets were being hit. Well, they've certainly done that.


What can we learn?

The main problem here is a poorly thought-out balance of benefits for 'costs' in this offer. Phorm have claimed that this system protects user privacy, but it doesn't really; it's just an ad targeting system with a better-than-average approach to protecting privacy. Users who are opted into Phorm will still receive cookies and targeted ads from other ad networks, and their behavior will still be tracked by those other networks.

Apart from the phishing protection (which is already baked into IE7 and Firefox anyway, and turned on by default), there's nothing in the Phorm system which provides users with protection of their personal data across the Internet. The only way that Phorm's entry into this market can elevate user privacy overall is if other providers of targeted ads who are storing more data decide to pack up and go home - which I doubt will happen.

The furore also highlights the challenges of partnering with ISPs for this kind of service. Because ISPs are the gatekeepers of the Internet (and because, for many people, switching ISPs is a pain in the a**), users are very sensitive to any perceived exploitation of this relationship by the ISPs. In the UK, ISPs are some of the best-known Internet brands, but also some of the least liked. Ironically, the cause of this dislike (poor customer service) is a direct result of the price war that precipitated ISPs' interest in this kind of service in the first place - they receive a cut of the ad revenues, of course.

Ultimately the tale makes clear how careful any company has to be in launching a service like this - the balance of benefits has to be clearly stacked in favor of the user. As Chris Williams of The Register said during an interview with Phorm's CEO, Kent Ertegrul:

"a big difference I see between what you're doing and what Google does is that people feel that they're getting a service from Google. I don't think people feel they'll be getting a service from you"

It will be interesting to see how the Phorm saga plays out. Perhaps one day it'll find its way onto an online marketing MBA module syllabus.


