Google In 2000
Presentation by Jim Reese in October 2000
- Google PageRank looks at link structure
- Spam proof
- Started with a couple of PCs running Linux in an office at Stanford: "Google Alpha"
- A professor handed them a cheque in Sept 1998 and said this is too good, it has to become a company
- 300 PCs, 500K searches in 1999
- $25M of VC funds in June 1999; bought more computers
- 6K PCs, 50M searches, Oct 2000. 1000 searches/second
- "Sex" and "MP3" are the number 1 searches, except the day after the Academy Awards, when the #1 search was "Jennifer Lopez dress"
- In 1999 the internet had 150M users, estimated to grow to 320M within a few years
- 500M web pages in 1998, an estimated 3-8 billion by 2002
- the deep web has more, possibly 2 billion in 2000
- in 1999, 100M searches a day (across all search engines)
- estimated 500M a day in 2002
- Serving search requires massive:
- Download
- Index processing
- Storage
- Redundancy and speed
- this requires a whole lot of computers
- Goal is to answer requests in <0.5 sec
- currently ~0.35 sec mean
- 75 full time engineers
- Page rank uses link structure analysis
- Google has a billion-page index
- PageRank is an objective measure of importance
- a form of popularity contest
- each link is a vote
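As a rough illustration of the "each link is a vote" idea, here is a minimal PageRank power-iteration sketch. The toy link graph, damping factor (0.85), and iteration count are assumptions for illustration, not Google's actual parameters.

```python
# Minimal PageRank sketch: each link is a "vote" whose weight depends on
# the importance of the page casting it. Damping factor and iteration
# count are illustrative assumptions.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)  # split this page's vote
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph: C is linked to by both A and B, so it ends up ranked highest.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```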
- Hardware load balancing (paired)
- Global load balancing
- tuned daily
- GWS load balancing
- balances between index, doc, and ad servers
- Many index servers, each with a shard of the DB
- + massively redundant
- 1 query uses a dozen index + doc servers
- uses the fastest response
- TCP+UDP
- UDP for low bandwidth, low data
- Try again on fail
- Enhances query speed
- reduces bandwidth
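The fastest-response trick above (fan a query out to redundant index/doc servers over UDP, take whichever replies first, retry on timeout) can be sketched roughly as below. The replica addresses, timeout, and wire format are assumptions for illustration, not the real GWS protocol.

```python
# Sketch of "ask several replicas, take the fastest answer" over UDP.
import socket

REPLICAS = [("10.0.0.1", 7000), ("10.0.0.2", 7000), ("10.0.0.3", 7000)]  # hypothetical

def query_replicas(query, timeout=0.1, retries=1):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(retries + 1):
            # Fan the same query out to every replica of this shard.
            for addr in REPLICAS:
                sock.sendto(query.encode(), addr)
            try:
                # Whichever replica answers first wins; slower answers are ignored.
                data, _addr = sock.recvfrom(64 * 1024)
                return data
            except socket.timeout:
                continue  # "try again on fail"
        return None  # every replica timed out on every attempt
    finally:
        sock.close()
```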
- Reliability and fault tolerance very important
- require 100% uptime
- Use 1000s of "grey box" PCs
- Are unreliable, but are cheap and fast
- Split data across servers
- replicate across clusters
- replicate across datacentres
- KISS
- Hardware + software
- debugging 6K PCs across the world
- Request path: pipe, router, router, load balancer, server, back-end servers, apps
- TCP between GWS and the back end
- If a back end server dies
- hardware or software problems: old software, something missing, software dies
- Periodically retries every few minutes
- Round robin / least connections balancing
- Different method for different servers
- Index read only
- instead of "1 search / k docs" you do "k searches / n/k docs" by sharding
- 3 data centres
- overseas datacentres not online yet
- private OC-12 for redundancy
- 2Gb lines for replication
- every network device paired, links paired
- 80-PC racks, 100Mb switches, Gb uplink
- supplied by rackable.com
- 400-800MHz Celeron or Pentium CPUs
- 256MB RAM (2GB for some special applications)
- 40-80GB IDE, all local disk
- 1 IDE drive per channel, patched to use both channels simultaneously (70-80% of dual throughput)
- cluster is 7ft tall, 3ft deep, 2ft wide
- a normal rack may give you 10-15 servers (e.g. Sun) + extras (SAN)
- a cabinet produces 5kW of power
- 2 Fast Ethernet switches per cabinet
- Gb uplink
Old equipment:
- 2U rack
- 4 motherboards (4 reset switches)
- 8 HDDs
- 8 NICs
- 8 RAM modules
- 1 PSU
- cabinet door has 62 fans
New:
- 4 top-mounted fans create a vacuum
- under-floor cooling
- 1U machines
- switch in the middle
(diagram: two back-to-back rows of PCs with the switch in the middle; air is drawn in through the outer faces of the PCs, exhausted into the central channel, and pulled upward by the top fans)
- 1000 searches/sec
- 6K PCs
- 500TB storage
- 10GB/sec sustained I/O
- with 6K computers, 1 Google day = 16.5 machine-years
- PC hardware is inherently unreliable
- average uptime of 6 months = 33 deaths a day
- datacentre problems
- unreliable bandwidth
- unreliable cooling
- record of 93F ambient
- also as low as 62F
- 65-70F acceptable
- kW in = heat out
- the lm_sensors package reads the temperature of every machine every minute
- record of 100C
- stop working at about 80-85C
- 35-40C acceptable
- 50C+: bit errors
- 60C+: major problems
- beyond that, the server dies
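A rough sketch of that per-minute temperature check, using the thresholds quoted above. read_cpu_temp() is a hypothetical stand-in for whatever lm_sensors exposes on the box, and the "warm" band is my own label for the gap between the quoted thresholds.

```python
# Per-minute temperature monitoring sketch with the talk's thresholds
# (35-40C fine, 50C+ bit errors, 60C+ major problems).
import random
import time

def read_cpu_temp():
    # Placeholder: in reality this would parse lm_sensors output.
    return random.uniform(30.0, 70.0)

def classify(temp_c):
    if temp_c < 40:
        return "ok"
    if temp_c < 50:
        return "warm"
    if temp_c < 60:
        return "danger: bit errors likely"
    return "critical: major problems, pull from service"

def monitor(poll_seconds=60, cycles=3):
    for _ in range(cycles):          # bounded loop so the sketch terminates
        temp = read_cpu_temp()
        print(f"{temp:.1f}C -> {classify(temp)}")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor(poll_seconds=1, cycles=3)  # short interval just for the demo
```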
- Network problems
- software dies
- cable dies, or is flaky
- NIC dies
- drivers
- backhoe
- stupid pro problems
- DNS (multiple servers, load balanced)
- routing misconfiguration
- loops
- security
- SYN flood
- firewalls
- ACLs
- multiple centres
- multiple bandwidth providers
- hardware problems
- HDD, MB, NIC, PSU, Heat, Random
- 4-7% deaths post burn in
- 80 machines - 24H burn in test
- up to 10% deaths
- replace and reburn
- typically 4-7% die post burn-in, in the first 30-60 days
- Software
- code
- Linux bugs
- NFS
- SSH
- monitoring (see the sketch after this list)
- temp, disks, RAM, apps
- if a server fails, restart it; who cares about 1 of 6K dying
- performance in real time
- if fault found, take it out of service immediately and diagnose
- with 33 dead machines a day, what happens when they come back up?
- old/incompatible code
- old security
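A rough sketch of the monitoring policy described above: failed or faulty machines are pulled from rotation immediately, and machines that come back up with old code are kept out until refreshed. The Server objects, version string, and health probe are assumptions for illustration, not Google's actual tooling.

```python
# "Restart / pull from rotation / re-check on return" sketch.
CURRENT_VERSION = "index-pkg-42"   # hypothetical current software package

class Server:
    def __init__(self, name, version, healthy=True):
        self.name = name
        self.version = version
        self.healthy = healthy

    def probe(self):
        """Stand-in health probe; real checks cover temp, disks, RAM, apps."""
        return self.healthy

def sweep(pool, in_rotation):
    """Decide, for each server, whether it may serve traffic right now."""
    for server in pool:
        if not server.probe():
            # Fault found: take it out of service immediately, diagnose later.
            in_rotation.discard(server.name)
        elif server.version != CURRENT_VERSION:
            # A machine that came back up with old/incompatible code stays
            # out of rotation until it is re-imaged with the current package.
            in_rotation.discard(server.name)
        else:
            in_rotation.add(server.name)

pool = [Server("idx1", "index-pkg-42"),
        Server("idx2", "index-pkg-41"),                 # stale software after a reboot
        Server("idx3", "index-pkg-42", healthy=False)]  # failed its probe
rotation = {"idx1", "idx2", "idx3"}
sweep(pool, rotation)
print(rotation)   # only idx1 is left serving traffic
```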
- scripts to monitor datacentres + other datacentres (redundancy)
- stripped-down Red Hat
- every machine runs identical software
- packages for web, index, doc, and ad servers
- make and revert systems for different uses (a sketch of one possible scheme follows)
- can be changed dynamically
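One possible (assumed) shape for that make-and-revert role scheme: every box runs the same base image, and its role (web, index, doc, ad) is just a switchable, revertible package set. The package names and mechanism are illustrative guesses, not Google's actual tooling.

```python
# Sketch of "make and revert" role assignment on identical base machines.
ROLE_PACKAGES = {
    "web":   ["gws"],
    "index": ["index-server", "shard-loader"],
    "doc":   ["doc-server"],
    "ad":    ["ad-server"],
}

class Machine:
    def __init__(self, name):
        self.name = name
        self.role = None
        self.previous_role = None

    def make(self, role):
        """Switch this machine to a new role (would install that package set)."""
        self.previous_role = self.role
        self.role = role
        print(f"{self.name}: installing {ROLE_PACKAGES[role]}")

    def revert(self):
        """Roll back to whatever the machine was doing before."""
        if self.previous_role is not None:
            self.make(self.previous_role)

m = Machine("rack3-pc17")
m.make("index")
m.make("doc")   # role changed dynamically
m.revert()      # back to the index package set
```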