Friday, May 23, 2014

The Ultimate Guide to the Invisible Web

Search engines are, in a sense, the heartbeat of the internet; “googling” has become a part of everyday speech and is even recognized by Merriam-Webster as a verb. It’s a common misconception, however, that googling a search term will reveal every site out there that addresses your search. In fact, typical search engines like Google, Yahoo, or Bing actually access only a tiny fraction – estimated at 0.03% – of the internet. The sites that traditional searches yield are part of what’s known as the Surface Web, which consists of indexed pages that a search engine’s web crawlers are programmed to retrieve.

So where’s the rest? The vast majority of the Internet lies in the Deep Web, sometimes referred to as the Invisible Web. The actual size of the Deep Web is impossible to measure, but many experts estimate it is about 500 times the size of the web as we know it.
Deep Web pages operate just like any other site online, but they are constructed so that their existence is invisible to Web crawlers. While recent news, such as the bust of the infamous Silk Road drug-dealing site and Edward Snowden’s NSA revelations, has spotlighted the Deep Web’s existence, it’s still largely misunderstood.

Search Engines and the Surface Web

Understanding how surface Web pages are indexed by search engines can help you understand what the Deep Web is all about. In the early days, computing power and storage space were at such a premium that search engines indexed a minimal number of pages, often storing only partial content. The methodology behind searching reflected users’ intentions; early Internet users generally sought research material, so the first search engines were built to answer the simple queries that students or other researchers were likely to make. Search results consisted of actual content that a search engine had stored.

Over time, advancing technology made it profitable for search engines to do a more thorough job of indexing site content. Today’s Web crawlers, or spiders, use sophisticated algorithms to collect page data from hyperlinked pages. These robots maneuver their way through all linked data on the Internet, earning their spidery nickname. Every surface site is indexed by metadata that crawlers collect. This metadata, consisting of elements such as page title, page location (URL) and repeated keywords used in text, takes up much less space than actual page content. Instead of the cached content dump of old, today’s search engines speedily and efficiently direct users to websites that are relevant to their queries.
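To make this concrete, here is a minimal sketch of what a spider does: fetch a page, keep lightweight metadata (title, outgoing links) rather than a full content dump, and queue every hyperlink it finds for a later visit. It uses only Python’s standard library; the starting URL and the two-page limit are placeholders for illustration, not anything a production crawler would use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class MetadataParser(HTMLParser):
    """Collects the page title and outgoing hyperlinks -- the kind of
    lightweight metadata a spider stores instead of full page content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, max_pages=2):
    """Breadth-first crawl that follows hyperlinks and records metadata."""
    queue, seen, index = [start_url], set(), {}
    while queue and len(index) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = MetadataParser()
        parser.feed(html)
        index[url] = {"title": parser.title.strip(), "links": parser.links}
        # Follow every hyperlink we saw -- anything not linked stays invisible.
        queue.extend(urljoin(url, link) for link in parser.links)
    return index


if __name__ == "__main__":
    # example.com is just a stand-in starting point
    print(crawl("https://example.com"))
```

Even this toy version shows why unlinked pages stay hidden: if nothing points to a URL, it never enters the queue.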
To get a sense of how search engines have improved over time, Google’s interactive breakdown “How Search Works” details all the factors at play in every Google search. In a similar vein, Moz.com’s timeline of Google’s search engine algorithm will give you an idea of how nonstop the efforts have been to refine searches. How these efforts impact the Deep Web is not exactly clear. But it’s reasonable to assume that if major search engines keep improving, ordinary web users will be less likely to seek out arcane Deep Web searches.

How is the Deep Web Invisible to Search Engines?

Search engines like Google are extremely powerful and effective at distilling up-to-the-moment Web content. What they lack, however, is the ability to index the vast amount of data that isn’t hyperlinked and therefore isn’t immediately accessible to a Web crawler. This may or may not be intentional; for example, content behind a paywall and a blog post that’s written but not yet published both technically reside in the Deep Web. (A short sketch after the list below shows why a hyperlink-following crawler never reaches this kind of content.)
Some examples of other Deep Web content include:
  • Data that needs to be accessed by a search interface
  • Results of database queries
  • Subscription-only information and other password-protected data
  • Pages that are not linked to by any other page
  • Technically limited content, such as that requiring CAPTCHA technology
  • Text content that exists outside of conventional http:// or https:// protocols
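To illustrate the first two items above, the toy snippet below runs a simple link extractor – essentially the only machinery a hyperlink-following spider has – over a page that serves its real content through a search form. The extractor finds the page’s one anchor tag but sees nothing of the database behind the form; the HTML and URLs are invented for the example.

```python
from html.parser import HTMLParser

# A made-up page whose real content sits behind a query form.
PAGE = """
<html><body>
  <a href="/about.html">About us</a>
  <form action="/search" method="post">
    <input name="flight_number">
    <input type="submit" value="Look up flight">
  </form>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects href attributes -- the only trail a spider can follow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


extractor = LinkExtractor()
extractor.feed(PAGE)
print(extractor.links)  # ['/about.html'] -- the query results have no link to follow
```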

While the scale and diversity of the Deep Web are staggering, its notoriety – and appeal – comes from the fact that users are anonymous on the Deep Web, and so are their Deep Web activities. Because of this, it’s been an important tool for governments; the U.S. Naval Research Laboratory first launched intelligence tools for Deep Web use in 2003.

Unfortunately, this anonymity has created a breeding ground for criminal elements who take advantage of the opportunity to hide illegal activities. Illegal pornography, drugs, weapons and passports are just a few of the items available for purchase on the Deep Web. However, the existence of sites like these doesn’t mean that the Deep Web is inherently evil; anonymity has its value, and many users prefer to operate within an untraceable system on principle.

Just as Deep Web content can’t be traced by Web crawlers, it can’t be accessed by conventional means. The same Naval research group that developed intelligence-gathering tools created The Onion Router Project, now known by its acronym, TOR. Onion routing wraps Internet communications in multiple layers of encryption; each relay along the route removes one layer, similar to peeling back the layers of an onion. TOR users’ identities and network activities are concealed by this software. TOR, and other software like it, offers an anonymous connection to the Deep Web – it is, in effect, your gateway in.
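The layering idea is easy to see in toy form. The sketch below wraps a message in three layers of symmetric encryption and peels them off one at a time, the way each relay in an onion-routed circuit removes only the layer addressed to it. It uses the third-party cryptography package purely for convenience; Tor’s actual protocol is very different, so treat this strictly as a conceptual illustration.

```python
# pip install cryptography  -- toy illustration only, not Tor's actual protocol
from cryptography.fernet import Fernet

# One key per relay in our imaginary three-hop circuit.
relay_keys = [Fernet.generate_key() for _ in range(3)]

message = b"hello from the sender"

# The sender wraps the message: the innermost layer is for the last relay,
# the outermost layer for the first relay the packet will reach.
onion = message
for key in reversed(relay_keys):
    onion = Fernet(key).encrypt(onion)

# Each relay peels off exactly one layer and passes the rest along.
packet = onion
for key in relay_keys:
    packet = Fernet(key).decrypt(packet)

print(packet)  # b'hello from the sender' -- visible only after the final layer is removed
```

The point of the layering is that no single relay sees both who sent the message and what it contains.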

But in spite of its back-alley reputation, there are plenty of legitimate reasons to use TOR. For one, TOR lets users avoid “traffic analysis,” the monitoring that commercial sites use to determine web users’ locations and the networks they are connecting through. These businesses can then use that information to adjust pricing, or even which products and services they make available.
According to the Tor Project site, the program also allows people to “[...] set up a website where people publish material without worrying about censorship.” While this is by no means a clearly good or bad thing, the tension between censorship and free speech is felt the world over. The Deep Web furthers that debate by demonstrating what people can and will do to overcome political and social censorship.

Reasons a Page is Invisible

When an ordinary search engine query comes back with no results, that doesn’t necessarily mean there is nothing to be found. An “invisible” page isn’t necessarily inaccessible; it’s simply not indexed by a search engine. There are several reasons why a page may be invisible. Keep in mind that some pages are only temporarily invisible, possibly slated to be indexed at a later date.
  • Search engines have traditionally ignored Web pages whose URLs contain long strings of parameters, equal signs and question marks, on the chance that the page duplicates content already in their database or – worse – that the spider will get caught going around in circles. This content is sometimes known as the “Shallow Web,” and a number of workarounds have been developed to help you access it.
  • Form-controlled entry that’s not password-protected. In this case, page content is only displayed after a human performs a set of actions, usually entering data into a form (specific query information, such as job criteria on a job search engine). This typically includes databases that generate pages on demand. Applicable content includes travel industry data (flight info, hotel availability), job listings, product databases, patents, publicly accessible government information, dictionary definitions, laws, stock market data, phone books and professional directories.
  • Password-protected access, whether by subscription or not. This includes VPNs (virtual private networks) and any website where pages require a username and password; access may or may not require a paid subscription. Applicable content includes academic and corporate databases, newspaper or journal content, and academic library subscriptions.
  • Timed access. On some sites – major news sources like the New York Times, for example – free content becomes inaccessible after a certain number of pageviews. Search engines retain the URL, but the page generates a sign-up form, and the content is moved to a new URL that requires a password.
  • Robots exclusion. The robots.txt file, which usually lives in the main directory of a site, tells search robots which files and directories should not be indexed – hence the name “robots exclusion file.” If this file is set up, it blocks certain pages from being indexed, and those pages remain invisible to searchers. Blog platforms commonly offer this feature. (A short example of checking a robots.txt rule appears after this list.)
  • Hidden pages. There is simply no sequence of hyperlink clicks that could take you to such a page. The pages are accessible, but only to people who know of their existence.
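As a quick illustration of the robots exclusion item above, the snippet below uses Python’s standard-library robots.txt parser to test whether a crawler may fetch two URLs under a made-up policy; the rules and URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everything under /private/ is off limits to all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/index.html"))          # True
print(parser.can_fetch("*", "https://example.com/private/notes.html"))  # False
```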

Ways to Make Content More Visible

We have discussed what types of content are invisible and where such information might be found. Conversely, the drive to make content more visible spawned the Search Engine Optimization (SEO) industry. Some ways to improve your search optimization include:
  • Categorize your database. If you have a database of products, you could publish select information to static category and overview pages, thereby making content available without form-based or query-generated access. This works best for information that does not become outdated, like job postings.
  • Build links within your website, interlinking between your own pages. Each hyperlink will be indexed by spiders, making your site more visible.
  • Publish a sitemap. It is crucial to publish a current, comprehensively linked sitemap for your site. It’s no longer considered a best practice to publicize it to your viewers, but publish it and keep it up to date so that spiders can make the best assessment of your site’s content (a minimal example of generating one follows this list).
  • Write about it elsewhere. One of the easiest forms of SEO is to find ways to publish links to your site on other webpages. This will help make your site more visible.
  • Use social media to promote your site. Link to your site on Twitter, Instagram, Facebook or any other social media platform that suits you. You’ll drive traffic to your site and increase the number of links on the Internet.
  • Remove access restrictions. Avoid login or time-limit requirements unless you are soliciting subscriptions.
  • Write clean code. Even if you use a pre-packaged website template without customizing the code, validate your site’s code so that spiders can navigate it easily.
  • Match your site’s page titles and link names to other text within the site, and pay attention to keywords that are relevant to your content.
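As noted in the sitemap item above, here is a minimal sketch of generating a sitemap in the standard sitemaps.org XML format from a hand-written list of page URLs; the URLs are placeholders, and a real site would build the list from its own page inventory.

```python
import xml.etree.ElementTree as ET

# Placeholder URLs -- a real site would pull these from its page inventory.
pages = [
    "https://example.com/",
    "https://example.com/about.html",
    "https://example.com/products/widgets.html",
]

# The namespace defined by the sitemaps.org protocol.
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
print(open("sitemap.xml").read())
```

Once the file is in place, keeping it regenerated whenever pages are added or removed is what lets spiders see your site’s full structure.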

How to Access and Search for Invisible Content

If a site is inaccessible by conventional means, there are still ways to access the content, if not the actual pages. Aside from software like TOR, a number of entities, such as universities and research facilities, make it possible to view Deep Web content. For invisible content that cannot or should not be visible, there are still a number of ways to get access:
  • Join a professional or research association that provides access to records, research and peer-reviewed journals.
  • Access a virtual private network via an employer.
  • Request access; this could be as simple as a free registration.
  • Pay for a subscription.
  • Use a suitable resource, such as an invisible Web directory, portal or specialized search engine like Google Book Search, Librarian’s Internet Index, or BrightPlanet’s Complete Planet.

Invisible Web Search Tools

Here is a small sampling of invisible web search tools (directories, portals, engines) to help you find invisible content. To see more like these, please look at our Research Beyond Google article.
