Understanding Click Fraud

November 13th, 2009

What is click fraud?

Let’s start out by defining some key terms that are important in understanding click fraud.

  • Advertiser – The entity that pays money to get traffic to their site in the way of bidding on keywords or topical categories (bid auctions).
  • Publisher – Any entity which displays advertiser ads on their web site or in some other publicly viewable medium.
  • Visitor – A legitimate user who clicked on something to get to the appropriate target web site.
  • Click – A visitor to the advertiser’s site that came by route of one or more publishers.
  • PPC (Pay Per Click) – An internet advertising model where the amount the advertiser pays is dictated on a per click basis for the terms (keyword or categories) being bid on.
  • CPC (Cost Per Click) – The amount (bid price) paid by the advertiser to receive one visitor for a particular term. The amount is paid only if the visit occurs.
  • CPA (Cost Per Action) – An advertising cost associated to a particular desired visitor action, i.e. purchased a product or service, filled out a survey, or signed up for a newsletter.
  • Conversion – A completion of the advertiser’s desired action under a CPA advertising model.
  • Click Stream – The route the click traffic takes from the time the click is made through the time the web user arrives on the advertiser’s targeted URL. There can often be URL redirects and several publishers (usually tracked by cookies or ID’s in the URL) that receive information for each click, completely transparent to the user.
  • Rev Share – A single publisher’s fraction of the revenue generated by specific click and conversion sources. For example, a smaller publisher might arrange to send click traffic into a larger publisher’s click stream, providing the larger publisher with more traffic and retaining a 5% rev share of the total per click amount for the smaller publisher. Rev Share can be seen as a multi-tier sales commission.
  • Ad Feed – ad listings/data provided by an n-tier publisher by request to display to users on another publisher’s web site or application.
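The rev share definition above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming each publisher in the click stream is paid a fixed fraction of the total per click amount and the remainder stays with the ad provider. The function name, the use of whole cents, and the 30% figure are hypothetical; only the 5% example comes from the definition above.

```python
from fractions import Fraction


def split_click_revenue(cpc_cents, shares):
    """Split one click's CPC among publishers by fraction of the total.

    `shares` maps a publisher name to its fraction of the total per click
    amount. Whatever is left over stays with the ad provider.
    """
    total_share = sum(shares.values())
    if total_share > 1:
        raise ValueError("rev shares add up to more than 100%")
    payouts = {name: cpc_cents * share for name, share in shares.items()}
    payouts["ad_provider"] = cpc_cents * (1 - total_share)
    return payouts


# A $1.00 click: the smaller publisher keeps 5%, a (hypothetical) larger
# publisher keeps 30%, and the rest stays with the ad provider.
payouts = split_click_revenue(
    100,
    {"small_publisher": Fraction(5, 100), "large_publisher": Fraction(30, 100)},
)
```

Exact fractions are used instead of floats so that the payouts always add back up to the amount the advertiser paid.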

Click fraud, generally speaking, occurs when something (a person, a web bot, etc.) posing as a legitimate internet user follows (or clicks) a paid advertisement URL to the advertiser’s web site, generating money for some entity other than the advertiser.

Valid User

An advertiser pays good money for advertising, expecting that a portion of the traffic received in return will generate revenue in some fashion. Non-legitimate visitors produce bad clicks which in effect spend advertiser dollars with no hope of a return for the advertiser. This expense is instead divvied up between the layers (rev share) of publishers that are likely to be present in the click stream. Publishers, especially the ones on the end of the chain, often have the most to gain from this practice and will devise all sorts of innovative ways to game the system. Larger publishers in the click stream will often ignore or downplay this activity, knowing that it lines their pockets in the process.

In short, advertisers are being robbed of their advertising dollars in inflated term-bidding marketplaces by traffic that poses as real, live, interested web site visitors. It is theft akin to diverting fractions of a penny from financial transactions to a private account.

What is a valid visitor?

A non-valid visitor is essentially anything that can pose as a user but has no ability or intention of producing revenue for the advertiser. There are endless ways this can happen. A fraudulent click bot (any automated application used for the purpose of methodically sifting through sites in order to perform a specific operation) can target specific sites, following advertiser links and sending traffic header information which falsely identifies the bot as a user from a real browser. These bots can get very creative by spoofing IP’s, using proxies and randomizing the browser agent and other key data elements.

This can get somewhat fuzzy, as there are also legitimate scripts and bots that, although not valid users, are not considered fraudulent. Their purpose is to scan the internet, spidering through sites to map them out, extract data for search engines, and many other reasons. While no fraud is intended, they create traffic that will create “clicks” by following a URL and ultimately charge the advertiser for the traffic, unless some kind of safeguard is put in place to filter this traffic.

A valid visitor is a real person using an actual web browser (or application where ads may appear) that clicks on the ad URLs of their own decision and has an interest in what the advertiser is providing, therefore potentially resulting in a conversion for the advertiser. A valid user is always a person, but a person isn’t always a valid user. Even a real person can be considered a fraudulent user (more on this later).

Location, location, location

Publishers provide a means to create clicks that lead to advertisers’ web sites. There are many ways this can happen.

A publisher can have an existing site which provides something of value to a demographic and therefore gets a lot of “organic” search traffic. The publisher would like to further monetize their website, so they tap into an ad feed, parse the data and place the ads on their website. When an interested user clicks on one of the ad links, a click is registered through one or more publishers resulting in a visitor to the advertiser’s site. The web site publisher then gets paid for the click. This same process can be done for applications which display ads. Banners function in a similar manner.

Many publishers are not simply successful web site owners trying to maximize online real estate. Many will funnel traffic through their system without involving any user-facing applications or web sites. A publisher can get ads from another publisher’s ad feed, determine their own pricing model and associated key words, then take those ads along with others and insert them into Google’s or Yahoo’s ad listing via an API. This is one way in which a publisher becomes an advertiser. They play a publishing/distribution role with their ad providers. However, they play an advertising role with their ad distributors, as they purchase traffic at newly negotiated rates, often based on quantity or key word variations.

Publishers will often create a combination between monetizing online real estate and utilizing ad feeds. They can do this by creating dynamic, topic-based web sites with a search engine or topic directory facade. Sometimes they are hand created with relatively decent designs and basic content. Often they are purely dynamic, totally created on-the-fly based on the user’s search term. The goal of these sites is to do nothing more than gain interest from the user in order to get them to click on an ad link which will bring money to the site owner. These sites are often never ending webs of topic pages where the user never really finds what they are looking for. Fraud is common where the link text does not actually represent the target advertiser URL. This is fraudulent because a user may click on a link that says “pet supplies” when the actual URL takes the user to an advertiser who sells “college text books”. Assuming that “college text books” is a more costly search term, when the advertiser pays for the click, the publisher gets paid the higher price. The user, who still hasn’t found what they are looking for, will usually continue to click on links hoping to find what they are looking for, costing many advertisers money while making the publisher wealthy.

Arbitrage has been a recent problem in the industry. The idea is to sell something back to the ad provider at a higher price than what was originally paid for it. Using this method, an entity can buy click traffic (AdWords) for really inexpensive keywords in a particular category, then direct the traffic to a simple web site that shows nothing but ads (Google’s AdSense) that are in the same or similar category but cost more per click. Nothing of value is provided to the user, except more ads. When a user does click on one of the ads, the entity may get paid $0.25 for an ad that only cost them $0.15. This entity essentially plays the role of both advertiser and publisher and provides virtually no technology or service other than automatically reselling the click traffic. This creates a superfluous, middle-man tier that directly drains money from the original ad provider (Google).
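The arbitrage margin can be worked through with the $0.15 and $0.25 figures above. This is only a sketch: it assumes the entity pays for every visitor it buys but only earns when a visitor goes on to click one of the displayed ads, so the fraction of visitors who click (called `click_through_rate` here, a hypothetical parameter not taken from any real system) decides whether the scheme pays off.

```python
from fractions import Fraction


def arbitrage_profit_cents(visitors, click_through_rate,
                           cost_per_visitor_cents=15,
                           payout_per_click_cents=25):
    """Profit from buying `visitors` cheap clicks and reselling the traffic.

    `click_through_rate` is the assumed fraction of bought visitors who
    click one of the higher priced ads on the arbitrage page.
    """
    spend = visitors * cost_per_visitor_cents
    income = visitors * click_through_rate * payout_per_click_cents
    return income - spend


def break_even_rate(cost_per_visitor_cents=15, payout_per_click_cents=25):
    """Click-through rate at which income exactly covers spend."""
    return Fraction(cost_per_visitor_cents, payout_per_click_cents)


# With these prices the page must turn 3 out of every 5 visitors into an
# ad click before the scheme makes any money.
needed = break_even_rate()
profit = arbitrage_profit_cents(1000, Fraction(8, 10))
```

Under these assumptions the page has every incentive to show nothing but ads, since anything else on the page lowers the click-through rate.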

A web of deceit

While a good many publishers are running a perfectly clean business, this picture gets muddied up by the web of relationships and interaction involved in the online advertising sphere. There are often complex relationships involved in bringing a user from their click to the final destination. As previously described, you can see how advertisers can also be publishers and vice versa, creating an interesting matrix of profitability and fuzzy responsibility. For all intents and purposes, most players in the PPC industry can be considered both, including most of the innocent small website owners who not only buy traffic but also show ads on their site. So, for general purposes, consider the relationship diagram below.

Advertiser Publisher Relationship

If a fraudulent click occurs, and there are 4 publishers paid in the click stream, where do you go for answers? Much of the click-related data may be faked, and the sheer quantity of it makes extracting anything useful a daunting task.

Perhaps you start with the publisher at the start of the click stream where the click happened. Perhaps they are the most likely to falsify the click. Do you collect the money back from all the publishers even though 3 out of 4 were operating legitimately? What if the second publisher in the click stream knows that 80% of their traffic comes from publisher 1, so they set up a click bot that hits specific sites provided by publisher 1? This would cause fraudulent clicks in the first publisher’s traffic that they weren’t even aware of.

The art of war

There are several different methods being used to address the click fraud problem. Some of the larger ad providers (such as Yahoo) provide publishers with a traffic quality score (based on conversions and internal statistics) which determines how much traffic, and of what quality, they continue to receive. The greater the effort a publisher makes to clean up their traffic, the more likely it is they will stay profitable.

Those trying to adequately deal with click fraud will often combine their own internal custom methods with a 3rd party service to help determine which traffic is fraudulent. The methods used will often look at the same data in multiple ways and work in conjunction to derive something useful. Internet traffic is highly irregular and unreliable as a basis for accurate conclusions, making the challenge a moving target. With this in mind, a great portion of potentially fraudulent traffic cannot be deemed 100% fraudulent, so many methods rely on some sort of fraud probability scale to determine whether or not the affected clicks generate any income for the publishers.

Server-side and client-side validation

Both the web browser (or other ad display client) and the web server can provide a lot of information about a user and their activities. However, anything useful needs to be derived from a series of time stamps, IP addresses, generic browser information and any ad/click related data that can be gathered. There are many arguments regarding using server-side and client-side data for this type of validation. For the most part, server-side data (database records, web server logs, error logs, headers, etc.) can be considered more reliable than client-side data. Server data can provide much in regards to the click, the ads, user location and environment, but can only be gathered in spurts AFTER actions have been taken, leaving lots of room for assumptions regarding the overall user experience. Client-side code, while often likely to be tampered with, can provide a plethora of data regarding user events, intentions, order of events and environment that is not available to the server. Properly obfuscated and used within a limited scope of reliability, client-side validation can filter out a large percentage of fraudulent traffic, leading a handful of click protection companies to rely on it heavily in their products.

Forensic and symptomatic analysis

Many click fraud detection systems attempt to focus more on factual information and on tracking/gathering more data from the involved symptoms. While this forensic approach can provide a good idea of the concrete data available, falsified or lacking data still leaves a large margin of error in any analysis. As in any adequate security measure, the more the behavior in question is understood, the better it can be targeted and stopped. With click traffic, several valid behavioral assumptions can be made if certain data points exist. So, while the data may not always be completely reliable or correct, using a more symptomatic approach to analysis can provide many clear behavioral probabilities which, tracked over short periods of time, can isolate bad traffic sources.

Realtime and post analysis

Server-side analysis is currently the most prevalent among custom and third party detection solutions. This is due in part to the consistency and availability of server logs that already track many of the needed data points. Pure server-side analysis also does not require front end integration, making 3rd party integration relatively simple. A major downside to this method is that all relevant data is analyzed long after the events have happened. This only makes it possible to avoid certain traffic sources in the future, and it requires a feedback loop which tells the analyzed system to undo charges for clicks hours or days after the events have happened. This not only becomes an accounting nightmare, but also misses the opportunity to catch fraudulent behavior as it’s happening. Client-side data coupled with server-side data in a realtime system can identify patterns and known behavior models as they are happening, making it possible to stop fraud in its tracks before large sums of money are wasted on fraudulent clicks and before upstream publishers and advertisers are affected by the same bad traffic.

In-stream and out-of-stream validation

If the click fraud detection system is realtime and it is done internally or is fully integrated with a third party system, this is considered in-stream detection. Many publishers, especially the small ones with little or no technical resources, are unable to meet the requirements necessary for a full integration, so an out-of-stream option may be provided by a third party. This requires very little integration and allows the provider to capture data (using some kind of tracker pixel or JavaScript ping back) sometime before the click and again sometime after. This methodology is a hybrid between realtime client-side analysis and pre and post server-side analysis, but it does not offer all the data points and reliability of a full, in-stream integration.

In a model where clicks are rated on a fraud probability scale, any combination of the above methods can be used, within adequate constraints, to provide insight and restriction on a click’s validity.
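One way to picture a fraud probability scale is as a score built from whichever signals fired for a click. The sketch below is an illustration only: the signal names, the weights and the thresholds are all hypothetical, and the combining rule (treating the signals as independent and multiplying the chances that each one is innocent) is an assumption, not a description of how any particular detection product works.

```python
# Hypothetical weights: the assumed chance that a click is fraudulent
# when only this one signal is present.
SIGNAL_WEIGHTS = {
    "known_bot_agent": 0.9,
    "no_mouse_activity": 0.6,
    "no_javascript": 0.5,
    "no_cookie": 0.3,
    "missing_referrer": 0.2,
}


def fraud_probability(signals):
    """Combine the fired signals into a single score between 0 and 1."""
    chance_clean = 1.0
    for name in signals:
        chance_clean *= 1.0 - SIGNAL_WEIGHTS[name]
    return 1.0 - chance_clean


def classify(probability, reject_at=0.8, review_at=0.4):
    """Map a score to an action: pay for the click, review it, or reject it."""
    if probability >= reject_at:
        return "reject"
    if probability >= review_at:
        return "review"
    return "pay"


action = classify(fraud_probability(["no_javascript", "no_cookie"]))
```

The middle "review" band reflects the point made above: much suspicious traffic cannot be deemed 100% fraudulent, so a single cut-off would either pay for too much bad traffic or reject too much good traffic.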

Smoke and mirrors

One of the greatest challenges facing click fraud security is the rate of change. Browsers are constantly changing. New plugins and their capabilities are constantly making waves in website development. Developers are constantly finding new ways to leverage these tools to their advantage while fraudulent parties are doing the same. Even applications such as email clients are posing great threats to valid click traffic in ways that were not expected.

Key data points

There are many data points available for use in click validation, and they can be combined and used in many, many ways. This article cannot cover all the ways these data can be used and analyzed. Below are some of the more common data points.

Server-side
  • IP address. Known bad IP blocks can be ignored and patterns can be determined from recurring IP’s. Geographical location can also be roughly determined.
  • Proxy, if used. Proxies can be blocked or at least treated with more caution.
  • X-Forwarded-For. This is a header value that is often present when a proxy is used, meant to show the “actual” IP address.
  • Browser Agent. This includes the browser name, version, operating system, etc…
  • Referrer, if available, which includes the referring domain. This is the web URL that the click came from. This may be non-existent or easily faked.
  • Session data. This identifies a particular user’s set of interactions, if the session data isn’t being purged from the user’s end.
  • Cookie data.
  • Time of initial impression (page view).
  • Time of click.
  • Other data specific to the click traffic, i.e. click ID’s, advertiser ID’s, publisher ID’s, impression data, etc…
  • Conversion. Did a conversion occur after the click?
Client-side
  • JavaScript enabled? Sure, there are potentially valid users out there that don’t allow JavaScript, however, if the user’s agent cannot process JavaScript, then the likelihood of them being able to complete the click process and follow through to a conversion becomes negligible.
  • Cookie data. Can it be set and read?
  • Mouse interactivity. Helps to validate actual presence of a user, versus a bot. Mouse rollovers can also be tracked and sent to the server in real time to help determine mouse movement patterns (see PTR below). Was there actual mouseover activity on the ad link that registered a click?
  • Parent window domain. Useful when ad links may be present inside of an iframe.
  • Window size. If the window is too small, it is likely that the publisher is trying to mask links so that the user doesn’t know what they are actually clicking on.
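As an example of working with one of the server-side data points, the proxy header listed above usually arrives as a comma-separated list of addresses, with the address the proxy claims for the original client first. The following is a rough sketch of reading it; the function name and return shape are hypothetical, and because the header is supplied by the client or proxy it should be treated as a claim to weigh, not as proof.

```python
import ipaddress


def claimed_client_ip(remote_addr, x_forwarded_for=None):
    """Return (ip, via_proxy) for a request.

    `remote_addr` is the address the connection actually came from.
    `x_forwarded_for` is the raw header value, if one was sent. The first
    entry that parses as an IP address is taken as the claimed client.
    """
    via_proxy = bool(x_forwarded_for)
    if x_forwarded_for:
        for part in x_forwarded_for.split(","):
            candidate = part.strip()
            try:
                ipaddress.ip_address(candidate)
            except ValueError:
                continue  # skip junk such as "unknown"
            return candidate, via_proxy
    return remote_addr, via_proxy


ip, via_proxy = claimed_client_ip("10.0.0.1", "203.0.113.7, 10.0.0.5")
```

Keeping both the connecting address and the claimed address lets later checks treat proxied traffic with more caution instead of simply trusting or discarding it.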

The key to using this data and filtering out bad click traffic is to understand what the fraudulent party is trying to accomplish, identify patterns or oddities, then create methods of validating and deflecting the behavior. While the landscape constantly looks different, there are some methods that fraudulent parties will use that seem pretty consistent.

Methods

Standard bots

Problem: Recognized bots such as the ones run by Google, Yahoo! (and hundreds of other search engines) will create lots of non-user traffic which will often inadvertently create click traffic in the process. This can account for a relatively high percentage of actual site traffic, causing a publisher to make a lot more money than they are entitled to. It is not uncommon for a publisher to implement a click fraud filtering system, only to find out that more than half of their traffic was non-user traffic, effectively slashing their profits. While this traffic isn’t considered fraudulent, it should be filtered out so that advertisers don’t get charged for invalid clicks.

Solution: As a primary filter, don’t allow any known bots that are correctly sending their agent string to register a click. Then put other limits in place regarding the number of clicks allowed by certain IP’s within a time period. Could a cookie be set? Can the requesting agent handle JavaScript? This should catch the bulk of the harmless ones.
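A primary filter along these lines might look like the sketch below. The marker list is a tiny illustrative sample, not a complete list of crawlers, and the function names are hypothetical. Note that this only catches bots that identify themselves honestly in their agent string.

```python
# A few substrings that well-known crawlers include in their agent string.
# Illustrative only; a real list would be far longer and kept up to date.
KNOWN_BOT_MARKERS = ("googlebot", "slurp", "msnbot", "crawler", "spider")


def is_known_bot(user_agent):
    """True if the agent string announces itself as a recognized bot."""
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in KNOWN_BOT_MARKERS)


def passes_primary_filter(user_agent, cookie_set, javascript_ran):
    """Decide whether a request is allowed to register a billable click."""
    if is_known_bot(user_agent):
        return False
    # Harmless scripts usually cannot hold a cookie or run JavaScript.
    return cookie_set and javascript_ran


billable = passes_primary_filter(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    cookie_set=False,
    javascript_ran=False,
)
```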

Click bots

Problem: One of the primary ways fraudulent parties attempt click fraud is through automated bots. These bots will target certain sites where the target ads are known to be and very efficiently simulate clicks at a very fast rate. New bots are consistently smarter and attempt to simulate user behavior, environment and even run JavaScript.

Solution: Additionally, do some consistency checks for valid browser agents and filter out the blatantly obvious ones. Put in place click restriction limits from a single IP. Even IP’s with large offices behind them should not generate hundreds of clicks on the same ad within a few minutes.
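The click restriction limit can be sketched as a sliding time window per IP and ad. The class name and the default limits below are hypothetical, chosen only for illustration; a real system would tune them and would likely keep the counters in shared storage instead of in memory.

```python
from collections import defaultdict, deque


class ClickRateLimiter:
    """Allow at most `max_clicks` per (IP, ad) within `window_seconds`."""

    def __init__(self, max_clicks=5, window_seconds=300):
        self.max_clicks = max_clicks
        self.window_seconds = window_seconds
        self._clicks = defaultdict(deque)  # (ip, ad_id) -> accepted timestamps

    def allow(self, ip, ad_id, timestamp):
        """Return True if this click may be registered, False if over the limit."""
        recent = self._clicks[(ip, ad_id)]
        # Forget accepted clicks that have fallen out of the window.
        while recent and timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_clicks:
            return False  # rejected clicks are not recorded
        recent.append(timestamp)
        return True


limiter = ClickRateLimiter(max_clicks=2, window_seconds=60)
results = [limiter.allow("192.0.2.10", "ad-1", t) for t in (0, 10, 20)]
```

Keying on the ad as well as the IP leaves room for a large office behind one address to click on different ads while still stopping hundreds of clicks on the same one.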

Spoofing IP’s and user agents

Problem: Many click bots and similar systems will take the next obvious step and attempt to make each click look like a unique and valid user by randomizing fake IP’s and incorrect User Agents.

Solution: This is where client-side checking can come into play. If JavaScript is enabled, assume the agent is valid and proceed with mouse event and page/browser property checks. Does a cookie check reveal that the user already clicked this ad recently?
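The cookie check mentioned here could be sketched as follows. The cookie format (`ad_id:unix_timestamp` entries joined by `|`) is invented purely for this example, as is the function name. Since a bot can simply discard cookies, a miss proves nothing; only a hit is informative, so this is one signal among several.

```python
def clicked_recently(cookie_value, ad_id, now, window_seconds=3600):
    """True if the cookie records a click on `ad_id` within the window.

    `cookie_value` is assumed to look like "ad42:1258070400|ad7:1258070000",
    a hypothetical format for this sketch. Malformed entries are ignored.
    """
    for entry in cookie_value.split("|"):
        seen_ad, separator, seen_at = entry.partition(":")
        if not separator or seen_ad != ad_id or not seen_at.isdigit():
            continue
        if now - int(seen_at) < window_seconds:
            return True
    return False


repeat = clicked_recently("ad42:1258070400|ad7:1258070000", "ad42", now=1258070900)
```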

Falsified ad text and diversion

Problem: Ads are displayed with text that is more likely to get the user to click on the link instead of text that accurately describes the ad target. This fools users into clicking on multiple ads that they are not interested in while they look for one that is legit. Often they will be offered links to similar topics that attempt to keep them looking for what they want until they click on an ad. Sometimes, this textual falsification may simply be to show a more common topic in the ad text but have the ad link go to a much higher paid PPC ad. For example, the ad may be for a $0.20 PPC “ring tone download” ad but will go to a $20 “mesothelioma” ad.

Solution: Other than random manual validation of publisher websites, some log analysis can be done on the higher payout PPC traffic to see if any particular referrers have unexpected amounts of traffic to particular keywords and if the referring site content matches the keyword.
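The log analysis described here can be approximated by counting, for each high payout keyword, how much of its click traffic comes from each referrer. The sketch below assumes the logs have already been reduced to (referrer domain, keyword) pairs; the function name and both thresholds are hypothetical. It only produces candidates for the manual content check, it does not decide anything on its own.

```python
from collections import Counter


def suspicious_referrers(clicks, min_clicks=20, max_share=0.5):
    """Flag referrers that supply an outsized share of a keyword's clicks.

    `clicks` is an iterable of (referrer_domain, keyword) pairs taken from
    the logs for higher payout keywords. Returns a sorted list of
    (referrer, keyword, click_count, share_of_keyword_traffic).
    """
    per_keyword = Counter()
    per_pair = Counter()
    for referrer, keyword in clicks:
        per_keyword[keyword] += 1
        per_pair[(referrer, keyword)] += 1

    flagged = []
    for (referrer, keyword), count in per_pair.items():
        share = count / per_keyword[keyword]
        if count >= min_clicks and share > max_share:
            flagged.append((referrer, keyword, count, share))
    return sorted(flagged)


log = [("a.example", "mesothelioma")] * 30 + [("b.example", "mesothelioma")] * 10
to_review = suspicious_referrers(log)
```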

Hidden browser layer

Problem: The publishing web developer will blatantly put a transparent layer over the page so that regardless of where a user clicks, they click on a link that takes them to an ad. It is possible to get away with this on a small scale for some time before being caught.

Solution: Because this is a real user, it is up to the ad provider to notice the problem (which can be flagged by lack of conversions from the site referrer) and then go to the site to manually verify the problem.

Hidden clickable iframe

Problem: This is similar to the hidden layer problem, although the web developer puts a transparent 1 pixel by 1 pixel iframe under the mouse pointer which follows the mouse everywhere it goes on the page. Wherever the user clicks, they generate an ad click without knowing it. This can also be tricky to notice and track down.

Solution: This may require some manual checking, but a check for mouseover events on the other ads in the feed may reveal that the user is not able to view the other ads.

Small-scale manual clicking

Problem: Some people looking to make a few quick bucks may simply manually click on the ad links. They may get their friends and associates to do the same, especially for higher priced PPC terms. This is very unsophisticated, but it is easy to stay under the radar until the PPC costs or number of total clicks and conversion ratios becomes noticeable.

Solution: Check for recurring traffic patterns and IP similarities. Also check for conversion ratios on curious traffic.
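A conversion ratio check can be as simple as the sketch below, which assumes clicks and conversions have already been totaled per traffic source. The function name and the thresholds are hypothetical and would need tuning per advertiser, since normal conversion rates vary widely.

```python
def low_conversion_sources(stats, min_clicks=100, min_rate=0.005):
    """Return sources with enough clicks to judge and too few conversions.

    `stats` maps a traffic source (an IP block, a referrer, a publisher ID)
    to a (clicks, conversions) pair.
    """
    flagged = []
    for source, (clicks, conversions) in stats.items():
        if clicks < min_clicks:
            continue  # too little traffic to say anything yet
        if conversions / clicks < min_rate:
            flagged.append(source)
    return sorted(flagged)


curious = low_conversion_sources({
    "publisher-a": (1000, 0),
    "publisher-b": (1000, 20),
    "publisher-c": (10, 0),
})
```

The minimum click count matters for small-scale manual clicking in particular: a handful of clicks with no conversions is normal, while a steady stream with none is not.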

Paid to read (PTR)

Problem: This is a more organized version of the small-scale manual clicking method and often starts as such. In order to keep from being detected, a fraudulent party needs to find ways to make the traffic look like valid and interested users. The users need to have varying IP addresses and need to click on a variety of ads to keep from generating any obvious patterns. Publishers will pay users, many foreign, to do nothing else but click on ads in return for a fraction of the rev share. These users will often click on links from their email clients which won’t send a referrer.

Solution: In this situation the user will show a mouse presence, but may continually click on the same link without “browsing” and rolling over any other links. If the rollover to click ratio is near 1:1, this can be a flag of PTR traffic.
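The rollover to click ratio can be computed per user or per traffic source from the mouse events sent back by the client. In this sketch the function names, the minimum click count and the cut-off ratio are hypothetical; the only idea taken from the text is that a ratio near 1:1 is a warning sign.

```python
def rollover_to_click_ratio(rollovers, clicks):
    """Ad link rollovers per click, or None if there were no clicks."""
    if clicks == 0:
        return None
    return rollovers / clicks


def looks_like_ptr(rollovers, clicks, min_clicks=10, max_ratio=1.2):
    """Flag traffic where nearly every rollover ends in a click.

    A browsing user rolls over many links per click. A ratio close to 1:1
    over a meaningful number of clicks suggests someone clicking without
    browsing.
    """
    ratio = rollover_to_click_ratio(rollovers, clicks)
    if ratio is None or clicks < min_clicks:
        return False
    return ratio <= max_ratio


flag = looks_like_ptr(rollovers=11, clicks=10)
```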

Improper Traffic Purchasing

Problem: Many web site owners will attempt to purchase traffic that is outside the contextual topic of the website intent in order to increase traffic. For example they might purchase traffic for the term “hot rod photos” but the site topic may be “ring tone downloads”. This unqualified user traffic will often click on some of the displayed ads since they didn’t find what they were looking for.

Solution: The big ad providers will have someone manually verify that a company’s website content matches the terms they are purchasing, which stifles much of the problem. Google’s AdSense automatically generates the appropriate ads based on the site content. However, if an ad feed is being given to other publishers, there is no chance for interaction with something like AdSense. Conversion ratios and overall traffic quality can be good measures to raise flags when something looks unusual.

How big is the problem?

Estimations as to how many dollars are wasted through click fraud are all over the place. Realistically, no one really knows for sure. Who gets blamed in the tightly woven web of PPC advertising?

When I was working with a click fraud detection company, the publishers and advertisers we deployed our system for would often become very disillusioned with the quality of their traffic. Most of them would go through an initial phase of denial, swearing that our system was broken. We would then go through a phase of traffic validation to substantiate the click fraud claims. After a few weeks of traffic adjustments and analysis, some clients would realize that 60% to 90% of their traffic was completely bogus. Even though they had some incentive to clean up their traffic, they would often drop the click fraud detection service because cleaning it up would mean steep cuts to their revenue. At up to 90% fraudulent traffic, this shows that entire companies are thriving on a bubble of almost pure fraud.

Continuing problems…

Click fraud inflates the PPC market, causing continuous challenges for advertisers and the market as a whole. Inflated bidding competition drives term prices up and the likelihood of a conversion down, making PPC advertising a questionable long term strategy. But, as with the demand for oil, soaring PPC advertising costs are unlikely to diminish the PPC market because of advertisers’ dependence on online advertising.

Can click fraud be stopped? A higher standard must be set, however, this is a moving target. Click fraud methods are constantly changing and becoming more complex and those players that are deeply embedded in click fraud are constantly ahead of the game.

Perhaps the question that really needs to be asked is whether or not click fraud can be controlled within reasonable limits, allowing the PPC market to continue to thrive. Like any other type of security or validation, keeping up with and adjusting for the dynamics of common methods and best practices is a good place to start. Ad providers like Google and Yahoo! need to be more stringent on traffic source quality and create tighter restrictions for those whose traffic is questionable. Safeguarding against click fraud needs to become a standard expectation for operating as a publisher in the PPC market. Publishers need to consider taking a pessimistic approach to their own direct traffic and their applications should reflect this, stopping bad traffic at the root of the problem. Advertisers and upstream publishers should consider taking an optimistic approach to the traffic that passes through them (since they don’t have direct access to the original click context and environment), but still develop the tools necessary to analyze and reject blatantly fraudulent sources.

Do your part!

The best way to get started with validating clicks is to take the first step of filtering out known web bots. Since this accounts for much of the non-convertible traffic which advertisers pay for, it can make a huge dent in the problem. Second, start putting the more advanced validations in place as described above. Third, if necessary, start working with a 3rd party traffic validation company like Click Forensics, Anchor Intelligence or ValidClick.

fudnik Development




