What is Indexing?
Put simply, an index is just a list of things. When you open a book, the very first thing you see after all the acknowledgements is a list of everything that is in the book (strictly speaking that is the table of contents; the index proper sits at the back and tells you which page mentions each topic). Either way, the idea is the same: a list that tells you where to find things. In a similar fashion, an index is used for managing large amounts of data, and this is called database indexing. But even more data is present on the internet, and it takes a much larger index to compile all this information into a list. This is what we know as web indexing. It is, in effect, a list of the webpages on the internet, and as millions of new webpages are created every day, they get indexed too.
Is it that simple?
No. Alas, it takes a lot more than just crawling the web to compile an accurate list of webpages on the internet. First of all, the internet is constantly expanding at a massive rate, and it takes a significant amount of crawling just to get all the information in one place. Then this information needs to be sorted into a list that is easy to access and stays as up to date as possible. That means going back to pages that have already been indexed to see whether anything has changed. Obviously, recrawling the whole internet over and over again would be wasteful and would never scale. So how do you make this process efficient without compromising on the quality of the index?
Fine-tuning the Process
This is where the sophistication in web indexing comes into play. Over time, search engine servers notice which webpages are rarely updated and slowly increase the gap between crawls of those pages. This gradually means less time is wasted crawling webpages that never change. Google crawls the internet routinely and has about a trillion webpages in its index, but most of these webpages are dead and no longer active. Crawling them as often as webpages that are constantly being updated is just not effective.
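To make the idea of stretching the gap between crawls concrete, here is a minimal sketch. It is not how any real search engine schedules its crawlers; the function name `next_interval`, the doubling and halving rule, and the one hour and thirty day limits are all assumptions chosen for illustration.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the wait (in hours) before the next crawl of a page.

    Hypothetical rule: halve the wait when the page changed since the
    last visit, double it when it did not, and stay within the limits.
    """
    if changed:
        # The page was updated: come back sooner next time.
        proposed = current_hours / 2
    else:
        # Nothing new: back off and wait longer next time.
        proposed = current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# A page that starts on a daily crawl and stays unchanged for three
# visits drifts towards less frequent crawling, then snaps back a
# little once a change is finally seen.
interval = 24.0
for changed in [False, False, False, True]:
    interval = next_interval(interval, changed)
    print(interval)
```

A page that never changes ends up being visited only about once a month under these made-up limits, while a busy page keeps getting frequent visits.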
The meta tags on your website can be used to give crawlers instructions about what they are allowed to do when indexing your content and the digital identity you have created. When a page has no robots meta tag, the default is “index, follow”: “index” tells the spider it may add the page to the index, and “follow” tells it that it may follow the links on the page. There are a number of other commands, such as “noindex” and “nofollow”, that can be given to the spiders in order to manage the indexing of your pages according to the parameters you consider important. At the same time, different search engines carry out indexing in different ways, partly because they do not all respond to these commands in the same manner.
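As a rough illustration of how a crawler might read these instructions, the sketch below pulls the robots meta tag out of a page using Python's standard `html.parser` module. The class name `RobotsMetaParser` and the helper `robots_policy` are made up for this example, and it only handles the “noindex”, “nofollow” and “none” values; real crawlers understand more directives than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_policy(html):
    """Return (may_index, may_follow); both are True when nothing forbids them."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = "noindex" not in found and "none" not in found
    may_follow = "nofollow" not in found and "none" not in found
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))
```

For the sample page, the crawler would be told not to list the page itself but would still be free to follow its links.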
Let’s Address the Elephant in the Room
I know, it is all getting complicated by this point. Let’s try to picture indexing in our heads. Imagine the internet as a network of websites. Websites are the sole source of information on this network, and they contain webpages within them. No information is flying around in strands outside of the websites. These websites are floating in digital space, isolated from one another. Google’s crawler bot goes to each of these websites and takes an inventory of every single word. It then feeds this information to list-making software, which sorts through it using keywords and meta tags, the information the website creator provides to the Indexer. For example, if you had a website about grooming for dogs, you would not want it to be put next to a website that is about cats.
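The “inventory of every single word” can be pictured as what is usually called an inverted index: a list that maps each word to the pages containing it. The following is a toy sketch only. The page addresses and text are invented, and the helper names `tokenize` and `build_index` are assumptions for this example rather than anything Google uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Two made-up pages standing in for the crawler's inventory.
pages = {
    "dogs.example/grooming": "Grooming tips for dogs and puppies",
    "cats.example/care": "Care tips for cats and kittens",
}

index = build_index(pages)
print(sorted(index["tips"]))  # both pages mention "tips"
print(sorted(index["dogs"]))  # only the dog grooming page mentions "dogs"
```

Looking up a word is now a single step, which is why the dog grooming site never has to be compared against the cat site when someone asks about dogs.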
When it comes to storing and retrieving information from an index, there are many ways in which crawlers and search engines can go about it efficiently. The fundamental idea is to assign values to pieces of information, or to disparate data sets, so that they can be separated on the basis of their relevance to a particular search query. There are no fixed values that place one data set above another. The valuation of data, which is just a fancy way of saying the “labeling” of data (similar to how the quotation marks add a label of emphasis to the word “labeling”), is based on how relevant it is to a particular data set. Value and weight only come into play once a search query is introduced, because a search engine needs a set of instructions, or an address, that lets it find one web page amidst an ocean of web pages. In other words, imagine trying to find a needle in a haystack, the proverbial impossible task. The only thing that lets you recognise the needle and pull it out of the haystack is the set of qualities the needle has: its make, its colour and everything else that makes it different from the hay around it. The point is that the characteristic values of an object, within a set of similar yet different objects, are what allow it to be told apart, whether that means interpreting it or simply recognising it.
Food for Thought: From the points above, one thing that stands out is that indexing is largely an unbiased process, free of human prejudice. So how can such a benign process be an extension of one that is deeply tied to competition, an inherent and primal quality of human nature? Find the answer in the next blog.
Through Your Eyes
Now look at it from the opposite end. When you, as a user, type a query about dogs into Google, the Google servers do not go out and search the live web; they go straight to the index and look for the websites whose information is as close as possible to the words you typed in. This is where your website about dogs comes in. Depending on how the computation turns out, you might or might not make it into the first 10 search results, which matters because most people just click on one of the websites on the first page, preferably the top 3 or 4. To work this out, the indexer will have taken the information you gave to the web crawler and placed it in a list of all websites with the same or similar information. It is after this step that the process of valuation begins on the front end of the internet, which is where the race for search engine dominance takes place. Once the indexer has ranked the pages, it serves you the websites that it thinks are best matched to your needs.
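A very rough way to picture “as close as possible to the words you typed in” is to count how often the query words appear on each page and sort by that count. The sketch below does exactly that and nothing more. Real search engines weigh far more signals, and the function name `rank`, the scoring rule and the sample pages are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, pages, top_n=10):
    """Return up to top_n page addresses, best match first.

    Hypothetical score: how many times the query words occur on the page.
    Pages that contain none of the query words are left out.
    """
    query_terms = tokenize(query)
    scored = []
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken alphabetically by address.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored[:top_n]]


# Made-up pages for illustration.
pages = {
    "dogs.example/grooming": "Dog grooming guide: grooming brushes for dogs",
    "pets.example/home": "Food and toys for dogs and cats",
    "cats.example/care": "Caring for cats",
}

print(rank("dogs grooming", pages))
```

Here the grooming page comes out on top because it matches both query words, the general pet page follows with a single match, and the cat page does not appear at all.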
When we say “Indexer”, what we actually mean is a vast network of Google servers that are constantly listing and curating all the information that has been uploaded to the internet so far. That is over a trillion web pages! This network of machines works through all of this information together and makes it useful for the common person who wants to know, well, anything really. Next time you search for something on Google, see how many seconds it took to produce that many results for you. It tells you that right at the top, as a way of showing off its immense speed. Anyway, that is the power of web indexing. A list to contain all lists. “One search engine to find them all, and in the index bind them”. And this search engine is a behemoth of so many machines, all working in tandem with each other, churning through information on a scale that is hard to imagine. All of humanity, in one place, so to speak. Constantly. Never-ending chatter. A transcript of it all. That’s the Google Index.