Google In 2000
Presentation by Jim Reese in October 2000
- Google PageRank looks at link structure
- Spam proof
- Started with a couple of PCs running Linux in an office at Stanford: "Google Alpha"
- A professor handed them a cheque in Sept 1998 and said this is too good, it has to become a company
- 300 PCs, 500K searches in 1999
- $25M in June 1999 of VC funds, bought more computers
- 6K PCs, 50M searches, Oct 2000. 1000 searches/second
- Sex and MP3 are the #1 searches, except the day after the Academy Awards, when the #1 search was "Jennifer Lopez dress"
- in 1999, internet had 150M users, estimated to increase to 320M in a few years
- 500M pages in 1998, 3-8Billion in 2002
- deep web has more, possibly 2 billion in 2000
- 1999, 100M searches a day (for all search engines)
- estimated 500M in 2002
- Search serving requires massive:
- Download
- Index Processing
- Storage
- Redundancy and speed
- this requires a whole lot of computers
- Goal: answer requests in <0.5 sec
- currently ~0.35sec mean
- 75 full time engineers
- PageRank uses link structure analysis
- Google has a billion-page index
- PageRank is an objective measure of importance
- a form of popularity contest
- each link is a vote
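
The notes stop at "each link is a vote", so here is a minimal sketch of the standard PageRank power iteration that phrase describes. The 0.85 damping factor and the toy four-page graph are assumptions for illustration, not figures from the talk.

<pre>
# Minimal PageRank sketch: each inbound link is a "vote" weighted by the
# linking page's own rank, spread over its outgoing links.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
                continue
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Toy link graph (assumed); B ends up ranked highest because its voters are themselves well-ranked.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["A"]}))
</pre>
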
- Hardware load balancing (paired)
- Global load balancing
- tuned daily
- GWS load balancing
- balances between index, doc, and ad servers
- Many index servers, each with a shard of the DB
- massively redundant
- 1 query uses a dozen index + doc servers
- Uses the fastest response
- TCP+UDP
- UDP for low bandwidth, low data
- Try again on fail
- Enhances query speed
- reduces bandwidth
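
A rough sketch of the query fan-out described above: fire the query at every replica of each index shard over UDP, keep whichever replica answers first, and try again on timeout. The shard addresses, port, and wire format are invented for illustration; the TCP leg between GWS and the back ends is not shown, and none of this is the real GWS protocol.

<pre>
import socket

# Assumed replica addresses for two index shards; real hostnames/ports are not in the notes.
SHARDS = {
    "shard-0": [("10.0.0.11", 7000), ("10.0.0.12", 7000)],
    "shard-1": [("10.0.1.11", 7000), ("10.0.1.12", 7000)],
}

def query_shard(query, replicas, timeout=0.05, retries=1):
    """Send the query to every replica over UDP and keep the fastest answer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(retries + 1):
            for addr in replicas:
                sock.sendto(query.encode(), addr)     # same query to all replicas
            try:
                data, _ = sock.recvfrom(65535)        # first reply wins
                return data.decode()
            except socket.timeout:
                continue                              # try again on fail
        return None                                   # shard unavailable
    finally:
        sock.close()

def search(query):
    # One user query fans out across all shards; each shard only covers n/k docs.
    return {name: query_shard(query, replicas) for name, replicas in SHARDS.items()}
</pre>
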
- Reliability and fault tolerance very important
- require 100% uptime
- Use 1000s of "grey box" PCs
- Are unreliable, but are cheap and fast
- Split data across servers
- replicate across clusters
- replicate across datacentres
- KISS
- Hardware + software
- debugging 6K PCs across the world
- Pipe, router, router, load balancing, server, back end server, apps
- TCP for GWS - back end
- If a back end server dies
- Hardware, software, old software, missing, SW dies
- Periodically retries every few minutes
- Round robin / least connections balancing
- Different method for different servers
- Index read only
- instead of 1 search over n docs, you do k searches over n/k docs each, by sharding
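
The "round robin / least connections" and "retries every few minutes" bullets above might look roughly like the sketch below on the GWS side: pick the live backend with the fewest open connections, drop a backend from the pool when it fails, and give it another chance after a few minutes. The class and method names and the 3-minute retry interval are assumptions.

<pre>
import time

class BackendPool:
    """Least-connections backend selection with periodic retry of dead backends (illustrative)."""

    RETRY_INTERVAL = 180          # seconds before retrying a dead backend (assumed value)

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}    # backend -> open connection count
        self.dead = {}                            # backend -> time it was marked dead

    def pick(self):
        self._revive_due()
        if not self.active:
            raise RuntimeError("no live backends")
        return min(self.active, key=self.active.get)   # least connections; round robin also works

    def mark_dead(self, backend):
        self.active.pop(backend, None)
        self.dead[backend] = time.time()

    def _revive_due(self):
        now = time.time()
        for backend, died_at in list(self.dead.items()):
            if now - died_at > self.RETRY_INTERVAL:
                del self.dead[backend]
                self.active[backend] = 0          # give it another chance
</pre>
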
- 3 data centres
- overseas datacentres not online yet
- private OC-12 for redundancy
- 2Gb lines for replication
- every NW device paired, links paired
- 80-PC racks, 100Mb switches, Gb uplink
- supplied by rackable.com
- 400-800MHz Celeron or Pentium
- 256MB RAM (2GB for some special applications)
- 40-80GB IDE, all local disk
- 1 IDE per channel, patched to use both channels simultaneously (70-80% dual throughput)
- cluster 7ft tall, 3ft deep, 2ft wide
- a normal rack may give you 10-15 servers (e.g. Sun) + extras (SAN)
- cabinet produces 5KW of power
- 2 Fast Ethernet switches per cabinet
- GB uplink
Old equipment:
- 2U rack
- 4 motherboards (4 reset switches)
- 8 HDDs
- 8 NICs
- 8 RAM sticks
- 1 PSU
- cabinet door has 62 fans
New:
- 4 top-mounted fans create a vacuum
- under floor cooling
- 1U machines
- Switch in the middle
- (airflow diagram: air drawn in at both sides, up past the PCs on either side of the switch, and out through the top-mounted fans)
- 1000 searches/sec
- 6K pcs
- 500TB storage
- 10 GB/sec sustained I/O
- with 6K computers, 1 Google day = 16.5 machine-years
- PC hardware is inherently unreliable
- average uptime of 6 months = 33 deaths a day (quick arithmetic check below)
- datacentre problems
- unreliable bandwidth
- unreliable cooling
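
Quick arithmetic check on the machine-years and deaths-per-day figures above; the machine count and 6-month average uptime are from the notes, and the result matches the quoted ~16.5 machine-years and ~33 deaths a day.

<pre>
machines = 6000
machine_years_per_day = machines / 365        # one "Google day" of compute
deaths_per_day = machines / (6 * 30)          # average uptime ~6 months (~180 days)
print(round(machine_years_per_day, 1), round(deaths_per_day, 1))   # 16.4 33.3
</pre>
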
- record of 93F ambient
- also 62F
- 65-70F acceptable
- kW in = heat out
- the lm_sensors package reads the temp of every machine every minute (monitoring sketch below)
- record of 100C
- stop working at about 80-85C
- 35-40C acceptable
- 50C+: bit errors
- 60C+: major problems
- beyond that, the server dies
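
A sketch of the once-a-minute temperature sweep, using the thresholds from the notes. How readings are collected here (ssh plus the `sensors` command from lm_sensors) and the host list are assumptions.

<pre>
import re
import subprocess

HOSTS = ["rack1-pc01", "rack1-pc02"]   # assumed hostnames

# Thresholds from the notes (degrees C).
BIT_ERRORS = 50      # 50+: bit errors
MAJOR      = 60      # 60+: major problems
DEAD       = 80      # ~80-85: machine stops working

def read_temp(host):
    """Highest temperature reported by lm_sensors on a host (illustrative probe)."""
    out = subprocess.run(["ssh", host, "sensors"], capture_output=True, text=True).stdout
    temps = [float(t) for t in re.findall(r"\+([0-9]+(?:\.[0-9]+)?)\s*°?C", out)]
    return max(temps) if temps else None

def sweep():
    """Run once a minute (e.g. from cron); flag anything approaching the danger thresholds."""
    for host in HOSTS:
        temp = read_temp(host)
        if temp is None:
            print(f"{host}: no reading - machine down?")
        elif temp >= DEAD:
            print(f"{host}: {temp}C - pull it out of service")
        elif temp >= MAJOR:
            print(f"{host}: {temp}C - major problems likely")
        elif temp >= BIT_ERRORS:
            print(f"{host}: {temp}C - expect bit errors")
</pre>
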
- Network problems
- switch dies
- cable dies, or is flaky
- NIC dies
- drivers
- backhoe
- stupid pro problems
- DNS (multiple servers, load balanced)
- routing misconfiguration
- loops
- security
- syn flood
- firewalls
- ACLs
- multiple centres
- multiple bandwidth providers
- hardware problems
- HDD, MB, NIC, PSU, Heat, Random
- 4-7% deaths post burn-in
- 80 machines, 24-hour burn-in test
- up to 10% deaths
- replace and reburn
- typically 4-7% die post burn-in, in the first 30-60 days
- SW
- code
- Linux bugs
- NFS
- SSH
- monitoring
- temp, disks, ram, apps
- if a server fails, restart it; who cares about 1 in 6K dying
- performance in real time
- if fault found, take it out of service immediately and diagnose
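
The three bullets above amount to a simple recover-or-eject policy. A sketch, with the probe commands, process name, and pool file all invented for illustration:

<pre>
import subprocess
import time

POOL_FILE = "/etc/serving_pool"   # assumed: newline-separated list of hosts taking traffic

def healthy(host):
    """Probe temperature, disk, and the serving process over ssh (commands illustrative)."""
    probes = [
        ["ssh", host, "sensors"],                      # temperature readable?
        ["ssh", host, "smartctl", "-H", "/dev/hda"],   # IDE disk still healthy?
        ["ssh", host, "pgrep", "gws"],                 # assumed server process name
    ]
    return all(subprocess.run(p, capture_output=True).returncode == 0 for p in probes)

def remove_from_pool(host):
    with open(POOL_FILE) as f:
        hosts = [h for h in f.read().split() if h != host]
    with open(POOL_FILE, "w") as f:
        f.write("\n".join(hosts) + "\n")

def add_to_pool(host):
    with open(POOL_FILE, "a") as f:
        f.write(host + "\n")

def handle(host):
    if healthy(host):
        return
    remove_from_pool(host)                    # fault found: pull it from service immediately
    subprocess.run(["ssh", host, "reboot"])   # with 6K machines, "just restart it" is fine
    time.sleep(300)                           # assumed reboot window
    if healthy(host):
        add_to_pool(host)                     # recovered: back into rotation
    # otherwise it stays out of the pool for a human to diagnose
</pre>
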
- with 33 dead machines a day, what happens when they come back up?
- old/incompatible code
- old security
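
One way to handle machines that come back with stale code or security patches is to refuse rejoining until their packages match the cluster's current manifest. The manifest contents and the rpm query are assumptions:

<pre>
import subprocess

MANIFEST = {"gws": "2.4.1", "index-server": "1.9.0"}   # assumed current cluster versions

def installed_version(host, package):
    """Ask the host what version of a package it has (rpm query; illustrative)."""
    out = subprocess.run(
        ["ssh", host, "rpm", "-q", "--qf", "%{VERSION}", package],
        capture_output=True, text=True,
    )
    return out.stdout.strip() if out.returncode == 0 else None

def may_rejoin(host):
    """A machine that died and came back only rejoins once its code matches the manifest."""
    return all(installed_version(host, pkg) == ver for pkg, ver in MANIFEST.items())
</pre>
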
- scripts to monitor datacentres + other datacentres (redundancy)
- stripped-down Red Hat
- every machine runs identical SW
- packages for web, index, doc, ad servers
- make and revert systems for different uses
- Dynamically change
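
A sketch of the "identical base system + role packages" idea in the last few bullets: every box runs the same stripped-down install, and its role (web, index, doc, ad) is just a package set that can be installed or removed to re-purpose it dynamically. The package names, paths, and rpm-based mechanism are assumptions.

<pre>
import subprocess

# Assumed role -> package mapping; every machine shares the same base install.
ROLES = {
    "web":   ["gws"],
    "index": ["index-server"],
    "doc":   ["doc-server"],
    "ad":    ["ad-server"],
}

def make_role(host, role):
    """Turn a machine into the given server type by installing that role's packages."""
    for pkg in ROLES[role]:
        subprocess.run(["ssh", host, "rpm", "-Uvh", f"/packages/{pkg}.rpm"], check=True)

def revert_role(host, role):
    """Remove the role's packages, leaving the identical base system behind."""
    for pkg in ROLES[role]:
        subprocess.run(["ssh", host, "rpm", "-e", pkg], check=True)

# e.g. shift capacity: revert_role("rack3-pc17", "doc"); make_role("rack3-pc17", "index")
</pre>
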