How Search Engines Work | Crawling, Indexing, and Ranking

A hundred petabytes or so is a reasonable starting point for the amount of storage you need, and maybe an exabyte to allow for intermediate processing. On paper that's only a thousand to ten thousand hard drives, but in practice it will be a lot more than that, because a thousand drives would be far too slow for processing the data.
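A quick back-of-envelope calculation shows why a thousand-odd drives would be too slow. The figures below (corpus size, drive capacity, sequential read speed) are illustrative assumptions, not anyone's real numbers:

```python
# Back-of-envelope sizing: how many drives, and how long does one full scan take?
# All constants are illustrative assumptions.
CORPUS_BYTES = 100e15    # ~100 PB of crawl data
DRIVE_BYTES = 20e12      # assume 20 TB per hard drive
DRIVE_READ_BPS = 200e6   # assume ~200 MB/s sequential read per drive

drives = CORPUS_BYTES / DRIVE_BYTES
scan_seconds = CORPUS_BYTES / (drives * DRIVE_READ_BPS)

print(f"{drives:.0f} drives just to hold the data")          # 5000 drives
print(f"{scan_seconds / 3600:.0f} hours per full scan")      # ~28 hours
```

A full day per pass over the corpus is hopeless when you want to reprocess it regularly, which is why the real fleet ends up much larger than the minimum needed for capacity alone.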

Processing that data is hard: you need huge clusters of computers and clever code.
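The "clever code" is typically organised as a map-reduce-style pipeline: each worker emits (word, document) pairs from its slice of the crawl, and a reduce step groups them into posting lists. This is only a single-process sketch of the idea, with invented page names and contents:

```python
from collections import defaultdict

# Toy crawl data split across two "workers" (all contents invented).
shards = [
    {"page1": "the cat sat", "page2": "the dog ran"},
    {"page3": "cat and dog"},
]

def map_shard(shard):
    # Map step: one worker emits (word, doc_id) pairs from its slice.
    for doc_id, text in shard.items():
        for word in text.split():
            yield word, doc_id

def reduce_pairs(pairs):
    # Reduce step: group doc ids by word into posting lists.
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

pairs = [p for shard in shards for p in map_shard(shard)]
index = reduce_pairs(pairs)
print(sorted(index["cat"]))  # ['page1', 'page3']
```

In a real cluster the map and reduce steps run on thousands of machines, with a shuffle phase in between to route all pairs for a given word to the same reducer.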

The crawl data is then distilled down into an index that can be loaded onto a few thousand machines and held in memory, so that queries can be answered very quickly. That takes rather a lot of RAM, but thousands of large servers can do it. Google runs many, many sets of these around the world, so there is one fairly close to every user.
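Spreading one index across a few thousand machines usually means sharding: each term is routed to one machine, so a query only has to ask the shards that hold its words. A minimal sketch of hash-based term routing, with a made-up shard count and contents (real systems use stable hashing and replication, which this ignores):

```python
# Sharded in-memory index sketch: each list in `shards` stands in for
# the RAM of one machine. Shard count and contents are invented.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(term):
    # Route a term to a shard by hash. A real system would use a
    # stable hash so the routing survives restarts and resharding.
    return hash(term) % NUM_SHARDS

def put(term, postings):
    shards[shard_for(term)][term] = postings

def get(term):
    return shards[shard_for(term)].get(term, [])

put("cat", ["page1", "page3"])
print(get("cat"))  # ['page1', 'page3']
print(get("dog"))  # [] — unknown term
```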

The actual front end of the search engine takes the query, looks at each word in it, asks the index for the relevant results for each word, and then intersects those to get the results relevant to the whole query. Or something like that: every implementation gets modified fairly regularly, and I'm sure Bing works differently to Google. (Note: I never really knew the details of how Google search works, just a few overall concepts.)
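The "look up each word, then intersect" step is worth seeing concretely. With sorted posting lists, the intersection is a classic two-pointer merge; the index contents and doc ids below are invented for illustration:

```python
# Toy front end: per-word posting lists (sorted doc ids), intersected
# to answer a multi-word query. Index contents are made up.
index = {
    "cheap": [1, 4, 7, 9],
    "flights": [2, 4, 9, 12],
}

def intersect(a, b):
    # Two-pointer merge of two sorted posting lists.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(query):
    lists = [index.get(w, []) for w in query.split()]
    if not lists:
        return []
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result

print(search("cheap flights"))  # [4, 9]
```

The surviving doc ids would then go through ranking, which this sketch leaves out entirely.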