Online Profiling from a Consumer's Perspective
Russ Smith, Consumer.net
SUMMARY: The dramatic increase in the use of the Internet combined with the technological nature of the web has led to a large amount of confusion about the collection of personal information via the Internet. The information below is a result of numerous questions and inquiries made to the Consumer.net group of web sites as well as the knowledge gained in the operation of these sites.
The bulk of the questions and comments involve two main points: (1) knowledge of what information is collected, how it is used, and how it is distributed and (2) control that the consumer has over the information. If these two conditions are met, consumers usually see profiling as advantageous. Issues such as IP addresses, web site log files, cookies, proxies, firewalls, e-mail marketing, and tracking are discussed. Some examples of "opt-out" procedures and privacy complaints are also provided. There is also a discussion of the conflict of interest created by government employees taking positions in the private sector.
Ways Information is Collected and Compiled at a Web Site
IP Addresses. Each web surfer needs an Internet protocol (IP) address to connect to the Internet and communicate with other computers. This is similar to a street address in the 'real' world. In the same way that a letter must have an address to be delivered, each 'packet' of information sent over the Internet must have a destination IP address.
From the IP address it is often possible to obtain the domain name. This domain name is usually associated with the user's Internet Service Provider (ISP) or the company or university where they are connected. Most Internet providers use "dynamic" allocation of IP addresses for their users. For dial-up accounts the reason is that ISPs have a "pool" of addresses that are used only by the users who are actually connected. ISPs generally have 5-10 users per IP address since most users are not connected at any given time. Home cable modem providers also use dynamic IP addresses in most cases, for different reasons. The dynamic address makes it more difficult for users to operate web sites from their home accounts, which can slow the system for their entire neighborhood since they share a common connection. It also improves security because a potential intruder would have unlimited time to try to break into someone's system if it had a fixed address and was on most of the time.
A new IP addressing standard, IPv6, has been proposed for a long time. This new system adds information to the IP packets that identifies the specific computer being used by including the Media Access Control (MAC) address. Each computer has a network card with a serial number, or MAC address, that is normally accessible only over a local network. The protocol has been proposed for years, but the privacy concerns have reached the press only recently because the main concerns had been network troubleshooting and security rather than privacy.
Microsoft has already used a system where the MAC number is recorded and embedded in documents created with Microsoft software (such as Microsoft Word). It has been reported in the press that this method was used by law enforcement to help trace the author of the "Melissa Virus." The IPv6 issue is also very similar to the issue involving the Intel processor serial number. Both involve linking the user's hardware identifier with Internet communications for purposes of identification and verification, whatever the stated reason.
There has also been much discussion of mapping IP addresses to geographical location. This has not yet been done to a large extent, but it is a matter of time before such a mapping is completed. Mapping IP addresses to a city or state would be a tedious but not impossible task. I already use a crude method at my web site: I look at the first field of the IP address to determine whether the address is allocated to the Americas, European, or Asia-Pacific IP registry (see http://privacy.net/analyze). The analysis does an IP registry lookup, and I use the European and Asia-Pacific queries only when necessary to reduce the load on their servers.
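As a rough illustration of the first-field approach described above, the sketch below guesses a registry from the first number of a dotted IP address. The table is a small, hypothetical sample of first-field allocations as they stood around 1999, and the function name and the fallback to the Americas registry are assumptions made for this example; it is not the code used at privacy.net.

```python
# Illustrative sample only: a few first-field allocations from around 1999.
# This table is an assumption for the sketch, not a complete or current map.
FIRST_FIELD_REGISTRY = {
    61: "APNIC", 62: "RIPE",
    193: "RIPE", 194: "RIPE", 195: "RIPE",
    202: "APNIC", 203: "APNIC",
    204: "ARIN", 205: "ARIN", 206: "ARIN", 207: "ARIN", 208: "ARIN", 209: "ARIN",
    210: "APNIC", 211: "APNIC",
    212: "RIPE", 216: "ARIN",
}


def guess_registry(ip_address):
    """Guess the IP registry from the first field of a dotted IPv4 address."""
    first_field = int(ip_address.split(".")[0])
    # Assumption: anything not listed is tried at the Americas registry first,
    # so the European and Asia-Pacific registries are queried only when needed.
    return FIRST_FIELD_REGISTRY.get(first_field, "ARIN")


print(guess_registry("194.1.2.3"))    # European registry in this sample table
print(guess_registry("202.12.29.1"))  # Asia-Pacific registry in this sample table
```

A guess of this kind only selects which registry to query; the registry lookup itself is still needed to find the organization holding the address.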
Web Site Log Files: The web site log files contain the IP address of the user, date, time, pages and images downloaded, the Referrer (the last page, where the user clicked a link to the current page), data entered by the user, and Internet cookies. The user's web browser sends much of this information in the browser "header." The full browser header can be seen at http://privacy.net/analyze. (Note: to see the referrer field, go to privacy.net and click the analysis link so the referrer will have a value.) By analyzing these log files a site can obtain information about where a consumer linked from (including search terms used to find the site), the path taken through the site, the length of time spent on any given page, items ordered, and return visits if cookies are used. An archive for a major Internet advertising mailing list is found at http://www.internetadvertising.org.
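To make the log analysis described above concrete, here is a minimal sketch that pulls the IP address, requested page, referrer, and search terms out of one log line. It assumes the log is written in the widely used "combined" log format and that the search engine passes its terms in a parameter named "q"; the sample line, host names, and function name are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# One entry in the "combined" log format: IP, identity, user, [time],
# "request", status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Assumption: the referring search engine puts the search terms in "q".
    query = parse_qs(urlsplit(entry["referrer"]).query)
    entry["search_terms"] = query.get("q", [""])[0]
    return entry


# Hypothetical log line made up for this example.
SAMPLE_LINE = (
    '192.0.2.10 - - [08/Nov/1999:10:15:32 -0500] '
    '"GET /cookies/ HTTP/1.0" 200 5120 '
    '"http://search.example.com/find?q=internet+privacy" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)

entry = parse_log_line(SAMPLE_LINE)
print(entry["ip"], entry["request"], entry["search_terms"])
```

Grouping such entries by IP address (or by cookie value) and sorting them by time gives the path taken through the site and the time spent on each page.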
Internet Cookies: An Internet cookie is a small text file placed on the user's system by a web site. The cookie specification indicates that only the site that placed the cookie can retrieve it. The cookie contains as much information as is put into it. It could range from a random number to identify repeat visitors (anonymous profile) to a code identifying a specific customer in the web site's database. A demonstration of how cookies can be used to provide personalized content is found at http://privacy.net/cookies/.
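The "random number to identify repeat visitors" case can be sketched with Python's standard http.cookies module. The cookie name visitor_id, the one-year lifetime, and the function names are assumptions chosen for this example, not details of any particular site.

```python
import secrets
from http.cookies import SimpleCookie


def make_visitor_cookie():
    """Build a cookie holding a random identifier (an anonymous profile key)."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = secrets.token_hex(8)       # 16 random hex characters
    cookie["visitor_id"]["path"] = "/"
    cookie["visitor_id"]["max-age"] = 60 * 60 * 24 * 365  # assumed: one year
    return cookie


def read_visitor_id(cookie_header):
    """Extract the identifier from the Cookie header a browser sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None


new_cookie = make_visitor_cookie()
print(new_cookie.output())  # the Set-Cookie header line sent to the browser
returned_id = read_visitor_id("visitor_id=" + new_cookie["visitor_id"].value)
```

If the site stores a customer number instead of a random value, the same mechanism ties each visit to a specific customer record.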
A cookie bug has been detected that allows sites to download cookies placed by other web sites. At consumer.net this occurs for about one out of every few thousand visitors. The bug is not associated with any specific hardware or operating system. It appears as if the cookie text file is somehow corrupted and confuses the web browser into thinking the cookie came from the wrong site. In a small number of cases the cookie contains personal information, or information that allows another user to reproduce the cookie on their system so they could "masquerade" as that user at another web site, for example to obtain access to their web-based e-mail account or stock portfolio. While the possibility of this happening is remote, it does highlight the fact that cookies are not secure.
Collecting Data Input by the Site Visitor: Data is often collected by the web site through a web-based input form. The methods used to do this are "Get" and "Post." For "Get" the information is contained in the URL, such as www.example.com?data1=a&data2=b. This may create a security problem: when a user clicks on a link, the information in the URL is passed to the linked web site as the "referrer." An example of this is Hotmail, where the user name is part of the data in the URL. When a user clicks on a link within their e-mail, the Hotmail user ID is transferred to that web site.
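The referrer leak described above can be sketched in a few lines: the first function builds the kind of URL a "Get" form produces, and the second shows what a linked site can read back out of the referrer it receives. The function names are hypothetical; the field names data1 and data2 come from the example URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_get_url(base_url, form_data):
    """Build the URL a browser requests when a form is submitted with "Get"."""
    return base_url + "?" + urlencode(form_data)


def fields_leaked_by_referrer(referrer):
    """Return the form fields a linked site can read from the referrer URL."""
    query = parse_qs(urlsplit(referrer).query)
    return {name: values[0] for name, values in query.items()}


# The page the user is viewing; its full URL becomes the referrer on the next click.
page_url = build_get_url("http://www.example.com/", {"data1": "a", "data2": "b"})
leaked = fields_leaked_by_referrer(page_url)
print(page_url)
print(leaked)
```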
The "Post" method transfers the data 'behind the scenes' without having it appear as part of the URL. This has led to another security problem, which has been the subject of some media attention. The data sent via a "Post" command is often saved in a file (this could be personal information, credit card information, etc.). The 'hidden' instructions used on the web page to send the data are available to the user by looking at the HTML source with the "view source" option in the web browser. This information often includes the name of the file where the data is stored. If this file is not placed in a 'protected' directory (one which is not available by browsing the web site) then the information can be accessible to anyone.
E-mail Marketing: Responses to e-mail campaigns are usually tracked by having the user respond to a specific URL or by adding some type of code to the URL to identify the respondent.
Another tracking system that is coming into widespread use is the detection of the user's type of e-mail client, with the intent of presenting the e-mail ad in a format that corresponds to the e-mail program being used by the customer. The advertiser sends the user a message formatted in HTML. When the user opens the HTML e-mail, a request is made to the advertiser's web site for an image to display. When this is done the advertiser can see when the e-mail was opened and what type of e-mail program is being used (the information is sent in the hidden header when the request is made to the web site). Most discussions by advertisers about this issue revolve around hiding the fact that users are being tracked rather than explaining to consumers the reasons for the tracking (see http://www.internetadvertising.org).
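A minimal sketch of the image-request technique described above: the HTML message embeds an image whose URL carries a per-recipient code, and the advertiser's server reads the code and the mail program from the resulting request. The host ads.example.com, the parameter name id, and the function names are hypothetical.

```python
from email.mime.text import MIMEText
from urllib.parse import urlencode, urlsplit, parse_qs


def build_tracked_email(recipient, recipient_code):
    """Build an HTML e-mail whose image URL identifies the recipient."""
    # Hypothetical advertiser host and parameter name.
    image_url = "http://ads.example.com/open.gif?" + urlencode({"id": recipient_code})
    html = (
        "<html><body><p>Special offer inside.</p>"
        '<img src="%s" alt=""></body></html>' % image_url
    )
    message = MIMEText(html, "html")
    message["To"] = recipient
    message["Subject"] = "Special offer"
    return message, image_url


def describe_open(request_url, user_agent):
    """What the advertiser learns when the mail program fetches the image."""
    code = parse_qs(urlsplit(request_url).query)["id"][0]
    return {"recipient_code": code, "mail_client": user_agent}


message, image_url = build_tracked_email("user@example.com", "abc123")
opened = describe_open(image_url, "ExampleMail/1.0")
print(opened)
```

The time of the request, recorded in the server log, tells the advertiser when the message was opened.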
Collection of Credit Card Information: Some consumers are still apprehensive about using their credit card over the Internet, but this is waning. Most consumers lose their objections when they realize the risks of how their credit card is used in the 'real world,' where their information is distributed to sales people, restaurants, over the phone, etc. For most small and some medium-sized businesses the most practical way to process these orders is via a third-party Internet credit card processing service. These services maintain a secure server, process the credit card transaction, store the data, and deposit the funds directly into the business' bank account. The system I use costs $20/month and $0.10 per transaction. To reproduce this system myself would cost significantly more.
Collection of Information During Software Installation: A common concern of users is the passing of information from the user's computer to a software vendor during the installation of software. The installation of most software involves running an executable program (ending in '.exe'). Often this installation involves some type of registration process where the information is transferred via the Internet. Since the user runs an executable, that program has the capability of collecting information from the user's computer (such as what other software is installed) and transferring it to the software company.
Domain Registration (WHOIS) Database: The database was initially developed for a small number of government and academic organizations and a few commercial companies. The database was made freely available and contained technical contacts of these organizations. The database for .com, .net, and .org now contains more than 6 million entries and includes many entries that include personal information on individuals. A complaint was made to the National Science Foundation in 1997, and the US government claimed the database was not covered because a private company collected the information under a cooperative agreement (see http://www.cavebear.com/nsf-dns/). However, since then numerous officials of the Dept. of Commerce (DOC) have stated emphatically, in public statements and congressional testimony, that they have control over the data. The Internet Corporation for Assigned Names and Numbers (ICANN) has been asked to look into the issue, but they currently have no plans to do so and simply say it has always been that way. The main reason given to continue the current system is to track down trademark and copyright infringement while, at the same time, the Intellectual Property community complains that many of the entries in the database are fictitious. There already exists a verified database of all IP addresses and the ISP to which each is assigned. Any web site, whether it uses a domain name or not, can be traced in this manner without access to the database of domain registrations. The database is now being sold for mailing lists in a product called the "Dot Com Directory." For information covered by the Privacy Act of 1974, the sale of mailing lists is prohibited if the list includes information about individuals. The DOC is also contemplating allowing the companies administering the list to see the entire database for no more than $10,000.
Consumer Responses to Profiling and Tracking
Falsification of Data: Many users have resorted to falsifying data in order to obtain access to a web site. Some common examples are Microsoft for their technical support site, the New York Times, and Real Networks.
Media reports have indicated that the percentage of falsified entries is as high as 70% at some sites (such as the New York Times web site; NYT has refused to verify this number). This situation was readily apparent to me when I configured the Privacy.net domain for e-mail. Within minutes I began receiving numerous e-mails from many different companies claiming I had subscribed to their newsletter or registered their software. When Real Networks did a mailing I used to receive more than 50 copies, each claiming I had registered at their site. In several cases it took many complaints and several contacts with the legal departments of the web sites involved to stop the e-mail. I eventually set up a special e-mail address people can use in registrations of this sort, email@example.com. When e-mail is sent to that address an autoresponse is sent back to the company requesting removal and the e-mail is deleted. Most companies ignore these removal responses. Many companies also ignore e-mail 'bounces' and continue sending e-mail to dead addresses.
While most Internet providers have an acceptable use policy (AUP), and such activity would be grounds for immediate termination for a smaller customer, the ISPs are unwilling to enforce the policies in the cases of these large corporations. This eventually led to Real Networks being placed on the Real-time Blackhole List (RBL). This is a system where ISPs can subscribe to the RBL and obtain a list of "blackholed" IP addresses. The ISPs then block all traffic to and from these IP addresses so the users cannot send e-mail, download web pages, etc. See http://maps.vix.com/rbl/.
The above problem could be solved rather simply by verifying the registrations: an e-mail is sent to the user, and the user's affirmative response is necessary to continue the subscription. However, Real Networks readily admits that very few people would agree to accept the e-mail and that verification would cause their business model to fail.
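One way the verification step described above could work is sketched below: the confirmation e-mail carries a token derived from the address, and the address is added to the list only when that token comes back. The secret key, function names, and use of an HMAC are assumptions made for this sketch, not a description of any company's system.

```python
import hashlib
import hmac

# Hypothetical secret known only to the mailing-list operator.
SECRET_KEY = b"replace-with-a-private-key"


def confirmation_token(email_address):
    """Token placed in the confirmation e-mail sent to the registered address."""
    normalized = email_address.lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def confirm_subscription(email_address, token, subscribers):
    """Subscribe the address only if the returned token matches.

    The token is expected to be plain ASCII text, as produced above.
    """
    if hmac.compare_digest(token, confirmation_token(email_address)):
        subscribers.add(email_address.lower())
        return True
    return False


subscribers = set()
rejected = confirm_subscription("user@example.com", "made-up-token", subscribers)
accepted = confirm_subscription(
    "user@example.com", confirmation_token("user@example.com"), subscribers
)
print(rejected, accepted, subscribers)
```

Because only the holder of the mailbox ever sees the token, an address typed in by someone else is never subscribed.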
Proxies and Firewalls: Proxies and firewalls are essentially a wall between a computer and the Internet. Communications are only allowed under certain circumstances and certain types of communications can be blocked entirely. The two major classes of these are third party proxies and software loaded on the user's computer.
A third party proxy is a service such as "The Anonymizer." Under this scenario the user's computer logs into the proxy computer. The proxy then goes out to the Internet to get the requested information. The outside Internet only 'sees' the proxy and not the individual computer. The proxy computer can also be set up to block certain types of communications (firewall) such as cookies, junk e-mail, Java, ad banners, other types of communications used by intruders attempting to hack into computers, etc. Some Internet providers are using proxies for the caching capabilities. When a user requests a page, the content is saved in the cache of the proxy. The next person on that system who requests the page retrieves it from the proxy cache rather than reloading it from the Internet. The cable modem companies use this as part of the equation when they advertise the bandwidth of their system. This sometimes introduces a problem when web site contents change: the cached version is read rather than the current version.
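The stale-page problem described above can be shown with a toy cache. The class and the in-memory "origin" dictionary are hypothetical stand-ins for a real proxy and a real web site; real proxies also apply expiry rules that this sketch leaves out.

```python
class CachingProxy:
    """Toy proxy: fetch a page once, then serve every later request from cache."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that retrieves a page from the origin
        self._cache = {}

    def get(self, url):
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


# Hypothetical origin site, represented as a dictionary of URL -> content.
origin = {"http://www.example.com/": "version 1"}
proxy = CachingProxy(lambda url: origin[url])

first = proxy.get("http://www.example.com/")   # fetched from the origin
origin["http://www.example.com/"] = "version 2"  # the site changes its page
second = proxy.get("http://www.example.com/")  # still served from the cache
print(first, "/", second)
```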
Several new software products allow users to set up a personal firewall to block certain types of communications, such as cookies, junk e-mail, Java, ad banners, and other types of communications used by intruders attempting to hack into computers. The user can set up "rules" such as blocking all cookies from a certain domain, accepting communications from the user's e-mail server, etc. Some programs just block cookies or pop-up windows. The full firewall would be somewhat difficult for novice users as it requires some knowledge of cookies, how banner ad systems work, and "ports" as used in Internet communications.(2)
The opt-out examples I can provide involve trying to opt out invalid e-mail addresses, which should be a very simple process. Some e-mails offer a link to opt out the specific address (using a code associated with the account). This system usually works very well. Other systems tell you to respond to the e-mail to get off the list. A few systems put a code in the return e-mail address to specifically identify the account, but most systems depend on the return e-mail address. Under this circumstance it is not possible to remove invalid addresses, and it sometimes takes weeks of complaining to remove them. In some cases, such as the New York Times, the e-mail system forges the e-mail headers and I cannot tell what address the e-mail was sent to without obtaining the e-mail server logs. The companies that fail to provide sufficient e-mail opt-out mechanisms also include Internet companies such as CNET (addresses not verified and opt-out system completely fails) and Sun-Netscape Alliance (no way to opt out another address, and e-mail bounces are not removed).
Generally consumers see profiling relating to Internet shopping as beneficial if there is knowledge of what information is being collected and control over how it is used. The general example I often use is the benefit gained when a consumer who is interested in skiing gets a discount offer on ski equipment they want. However, a skydiver may not be as happy if his insurance company finds out. It is difficult to collect specific information about how profiling is conducted, as essentially all profiling activities are kept secret to protect proprietary information and prevent consumer backlash.
There is general concern over profiling where there is no consent and where it was not expected. One service offered by Acxiom involves a reverse phone number lookup directory that provides address information for telephone numbers, including addresses of unlisted phone numbers. A NJ consumer was able to obtain a demo account to access the Acxiom database by calling an 800 number. Within minutes he was able to obtain the home addresses for several unlisted telephone numbers and for telephone numbers listed in the directory without a specific address. The salesperson boasted about a loophole in the law that allows them to obtain this information directly from the local phone companies.
"Self Regulatory" Efforts
The only viable self-regulatory mechanism is consumer backlash in response to perceptions that may or may not be correct. The current seal programs such as TRUSTe and BBBOnline are hyped by lobbyists and spin doctors, but there is little confidence by consumers in these programs. Some consumer reaction can be found by searching newsgroups through services such as Deja News (http://www.deja.com). The companies supposedly being regulated are funding these programs. The staff of these programs makes a substantial effort to block complaints rather than protect privacy.
I will give one example involving AOL. TRUSTe claims AOL.COM is covered by the seal program. However, when a complaint was filed about AOL distributing personal information to third parties, a loophole was invoked. The TRUSTe seal only covers "www.aol.com" and NOT "members.aol.com." This means that if you visit www.aol.com (which is covered by the seal) and you decide to join, you are sent to members.aol.com, which is NOT covered by the TRUSTe seal. AOL has different privacy policies posted at http://legal.web.aol.com/policy/aolpol/privpol.html and http://www.aol.com/info/privacy.html. In addition AOL claims in their TRUSTe policy the statement:
This statement is false. The information described above is provided to a third-party telemarketing company, Dial America, Inc. This company claims it is not acting as an agent on behalf of AOL and, therefore, does not have to comply with requests under the AOL do-not-call policy pursuant to 47 CFR § 64.1200(e)(2)(i).
Conflict of Interest with Government Employees
TRUSTe has refused to pursue this complaint, and the only other option is to file a complaint with the FTC. However, there is no coordinated system available for processing such complaints at the FTC, and I have never received a response to any complaint that I have filed. In addition, several FTC and NTIA employees have left government positions and taken positions within these organizations. A few weeks after the last FTC workshop Commissioner Varney left to head up the Privacy Alliance. Another FTC Commissioner took a position with the Direct Marketing Association. Currently a former FTC employee heads up the BBBOnline complaint mechanism and a recently departed DOC NTIA employee heads up the TRUSTe complaint system. Under these circumstances it is not realistic to expect complaints to government agencies to be pursued vigorously.
Consumer.net is developing, and will continue to develop, the Privacy.net web site which provides information and demonstrations about cookies, junk e-mail, distribution of driver's license data, etc. The privacy analysis pages currently run more than 10,000 analyses per day on average.
1. Consumer.net | Privacy.net | Network-Tools.com | Santa.Claus.net | ChristmasTrees.com | Santas-List.com | Mummers.com | TranslateFree.com | Alcatraz.San-Francisco.ca.us | GrandparentsDay.com | Native-Americans.com | Post-Office.org are all part of the Consumer.net group of sites.
2. A "port" in Internet communications is not a physical port but a notation in the packets of information being transferred over the Internet. Some standard ports are "80" for web pages, "25" for e-mail, and "21" for FTP. By typing ":80" after a web page request (such as consumer.net:80) you will get to the same web page. Unlike cable TV, where additional channels correspond to additional frequencies and additional bandwidth, the use of additional Internet ports does not increase the total bandwidth. The Internet port is merely a notation in the header used as a sorting mechanism.
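The point that consumer.net:80 reaches the same page can be illustrated by working out which port a URL actually refers to. This is a small sketch using Python's standard URL parser; the function name and the table of defaults (limited here to web and FTP addresses) are assumptions for the example.

```python
from urllib.parse import urlsplit

# Standard ports for the two URL types used in this sketch.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def effective_port(url):
    """Return the port a request to this URL is addressed to."""
    parts = urlsplit(url)
    # An explicit ":80" and an omitted port lead to the same destination.
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]


print(effective_port("http://consumer.net:80/"))
print(effective_port("http://consumer.net/"))
print(effective_port("ftp://ftp.example.com/"))
```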