Wikipedia network ideas

The first steps for Squid integration are done now; please test them in the test wiki running at wiki.aulinx.de.

High(er) availability

  • More Hardware (RAID,...)
The principle is simple: always have two of each component, configured so that broken components are detected immediately. The network connection in the current colo seems to be pretty stable and redundant; IMO this isn't the real problem.

It may look like this:
          Internet
              |
        Load Balancer (two machines/heartbeat?)
   ___________|__________
  |           |          |
Apache      Apache     Apache
  |___________|__________|
        |          |
      MySQL       MySQL

Better performance with caching

The current setup, as explained by Brion Vibber:
Every page view comes in through wiki.phtml as the entry point. This runs some setup code, defines functions/classes etc, connects to the database, normalizes the page name that's been given, and checks if a login session is active, loading user data if so.

Then the database is queried to see if the page exists and whether it's a redirect, and to get the last-touched timestamp.

If the client sent an If-Modified-Since header, we compare the given time against the last-touched timestamp (which is updated for cases where link rendering would change as well as direct edits). If it hasn't changed, we return a '304 Not Modified' code. This covers about 10% of page views.

If it's not a redirect, we're not looking at an old revision, diff, or "printable view", and we're not logged in, the file cache kicks in. This covers some 60% of page views. If saved HTML output is found for this page, its date is checked. If it's still valid, the file is dumped out and the script exits.
---------
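To make that flow concrete, here is a rough PHP sketch of the 304 / file-cache decision Brion describes. This is not MediaWiki's actual code; the function parameters stand in for values the real wiki.phtml derives from the request and the database (last-touched timestamp, cache file path, redirect/login state).

<?php
// Schematic only, not MediaWiki's real wiki.phtml. $lastTouched is a Unix
// timestamp from the page row, $cacheFile the saved HTML output for this page.
function tryCachedResponse($lastTouched, $cacheFile, $isRedirect, $loggedIn) {
    // ~10% of views: the client already has a fresh copy, answer 304 and stop
    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
        $clientTime = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
        if ($clientTime !== false && $clientTime >= $lastTouched) {
            header('HTTP/1.0 304 Not Modified');
            return true;
        }
    }
    // ~60% of views: anonymous request for the plain current revision,
    // so the saved HTML can be dumped out as-is if it is still current
    if (!$isRedirect && !$loggedIn && file_exists($cacheFile)
        && filemtime($cacheFile) >= $lastTouched) {
        readfile($cacheFile);
        return true;
    }
    return false;   // fall through to normal rendering from the database
}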

So there's a lot of work involved even for a cached page, which explains the results below:
I've redone the load test with siege (because ab wouldn't work with the cache headers I'm using), hammering the front page for 4 seconds with a concurrency of 4 from this webserver in Düsseldorf, Germany:
~# siege  -t4S -b -c4 -H 'Accept-Encoding: gzip' -u http://en2.wikipedia.org/
** Siege 2.55
** Preparing 4 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.
Transactions:                     29 hits
Availability:                 100.00 %
Elapsed time:                   3.91 secs
Data transferred:              82341 bytes
Response time:                  0.47 secs
Transaction rate:               7.42 trans/sec
Throughput:                 21059.08 bytes/sec
Concurrency:                    3.47
Successful transactions:          29
Failed transactions:               0

~# siege  -t4S -b -c4 -H 'Accept-Encoding: gzip' -u http://www.aulinx.de/
** Siege 2.55
** Preparing 4 concurrent users for battle.
The server is now under siege...
Lifting the server siege..      done.
Transactions:                   3487 hits
Availability:                 100.00 %
Elapsed time:                   4.47 secs
Data transferred:           13641144 bytes
Response time:                  0.00 secs
Transaction rate:             780.09 trans/sec
Throughput:               3051710.21 bytes/sec
Concurrency:                    3.83
Successful transactions:        3487
Failed transactions:               0

The most interesting number here is 7.42 transactions/second from en2.wikipedia.org. My server (a 2 GHz Celeron, the www.aulinx.de run above) is running Squid and does 780 transactions/second. I have to correct my previous claim about 10% CPU use: it's averaging 54%, with Siege taking up the rest. It would be interesting to do a similar run from wikipedia to aulinx; then Siege wouldn't take up CPU. For some reason Siege seems to return lower numbers than ab.

Anyway, there's a lot of speed to gain.

Some Squid features:

  • cache hierarchies (many squid servers working together)
  • ICP, HTCP, CARP, Cache Digests
  • WCCP (Squid v2.3 and above)
  • extensive access controls
  • HTTP server acceleration (this is what we need)
  • highly configurable memory & disk cache (sizes, maximum object sizes, replacement algorithm once full, etc.)

artist's impression:
          Internet
              |
        Load Balancer (two machines/heartbeat?)
       _______|______
      |              |
    Squid          Squid                       Squid Tier1
   ___|______________|___
  |           |          |
Apache      Apache     Apache
  |___________|__________|
        |          |
      MySQL       MySQL

I've simplified the setup a bit by cutting out the load balancers between the Squids and the Apaches. All Apaches would be configured as parent peers to the Squids and queried round-robin. If an Apache doesn't respond to the Squid's TCP request within peer_connect_timeout (I'd propose 2 seconds), the Squid moves on to the other Apaches without throwing an error. If more than half of the requests to one of the Apaches fail within a short time, that Apache gets marked as dead by the Squid and no user requests are forwarded to it anymore. Squid will still try to establish TCP connections from time to time to see if the host is back up; if it responds again, it's back in rotation. See dead_peer_timeout.
The peers can have different priorities to account for different Apache performance; higher-weighted peers are picked more often in the round-robin. More sophisticated load balancing based on the current load of the servers would be possible with an ICP daemon that delays answers depending on the load, but that's really advanced. With a linuxvirtualserver.org director this feature isn't available out of the box either, but there's feedbackd for this. Probably YAGNI.
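For illustration, a rough squid.conf sketch of such an accelerator setup with the Apaches as round-robin parents. Directive names follow the Squid 2.5 documentation of the era; the hostnames, weights and timeouts are placeholders to be adapted.

# Sketch only: Squid as HTTP accelerator in front of the Apaches
http_port 80
httpd_accel_host virtual
httpd_accel_port 80
httpd_accel_uses_host_header on

# All Apaches as parent peers, picked round-robin; no-query skips ICP to the
# parents, weight= biases the rotation towards the faster boxes
cache_peer apache1.example.org parent 80 0 no-query round-robin weight=2
cache_peer apache2.example.org parent 80 0 no-query round-robin weight=1
cache_peer apache3.example.org parent 80 0 no-query round-robin weight=1
never_direct allow all          # always go through a parent, never direct

# Give up quickly on an unresponsive Apache and retry dead ones later
peer_connect_timeout 2 seconds
dead_peer_timeout 10 seconds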
Session data seems to be stored in MySQL, so the Apache serving a session can change (you can hop around between en and en2 and stay in the same session).

The MySQL servers should be coupled with Heartbeat. There needs to be a primary server because replication (AFAIK) only works one way; the second one could still be put to use, for searching for example. In case the primary dies, the secondary needs to take over the primary's IP (in addition to its own) and probably run a script that changes the MySQL replication status, and the same the other way around.
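A minimal sketch of what the Heartbeat side of this could look like, assuming the classic /etc/ha.d/haresources style of configuration; the node name, service IP and the 'mysql-promote' resource script are made up for this example.

# /etc/ha.d/haresources sketch: db1 normally owns the service IP and the
# promote script; if db1 dies, Heartbeat moves both to the secondary.
db1 IPaddr::10.0.0.10 mysql-promote

# mysql-promote (hypothetical) would roughly run, on the box taking over:
#   STOP SLAVE; RESET MASTER;                       -- act as the new master
# and, once the old primary is back, re-point it with something like:
#   CHANGE MASTER TO MASTER_HOST='10.0.0.10'; START SLAVE;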

Further simplification

We can further simplify the setup by using a BIND DNS round robin (just multiple A records, one for each Squid) as the load balancer. We should connect the Squids with Heartbeat to make sure they take over each other's IP in case one goes down. This would save some latency and two machines without sacrificing HA. Performance-wise a single Squid would easily do. A modest 2 GHz Celeron could handle more than ten times the current load, although I would recommend a 64-bit processor that can make use of more than 4 GB of memory in the future. DNS round-robin is not a very fine-grained load balancer, but in this case it doesn't really matter: each of the Squids will (and should) always be able to handle the full load.
          Internet
              |----------- Bind DNS round robin
       _______|______      (multiple A records)
      |              |
    Squid          Squid               Squid Tier1 (connected with heartbeat)
   ___|______________|___
  |           |          |
Apache      Apache     Apache
  |___________|__________|
        |          |
      MySQL       MySQL
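A minimal BIND zone fragment for this round robin, with placeholder addresses; BIND hands the A records back in rotating order, and Heartbeat takes care of moving a failed Squid's address to the survivor.

; wikipedia.org zone fragment, sketch only (addresses are placeholders)
www   300   IN  A   10.0.0.1   ; squid 1
www   300   IN  A   10.0.0.2   ; squid 2, takes over 10.0.0.1 via Heartbeat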

Cache Setup

The caching policy is controlled with HTTP headers (sent from PHP). We're interested in these two headers:
  • 'Vary: Accept-Encoding, Cookie'
    This tells Squid to maintain separate cache items for compressed/uncompressed pages (Accept-Encoding) and different cookies [users], all for the same URL. Currently we're sending Vary: User-Agent; I have no idea why. It defeats caching.
  • if anonymous:
    'Cache-control: must-revalidate, max-age=0, s-maxage=600'
    must-revalidate and max-age=0 force the browser to check back for updates, and s-maxage tells Squid to expire this page after 10 minutes.
  • if logged in:
    'Cache-control: must-revalidate, max-age=0, no-cache'
    No caching when logged in; this is the same as what is sent now.
More information about caching in this caching tutorial.
Currently a cookie is set even for anonymous users; this should be turned off (it interferes with Vary: Cookie). It's also not good for some search engines. The possibility to send messages to anonymous users is affected by this; Brion proposed to start a session on editing, so after a user clicks 'edit this page' his browsing isn't cached anymore and he'll receive messages as he does now.
After adjusting these settings, Squid will start to cache anonymous requests. Any logged-in browsing will just pass through Squid.
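A minimal PHP sketch of sending these headers; the $loggedIn flag stands in for MediaWiki's real session check.

<?php
// Sketch: one cache entry per encoding and cookie set, never per User-Agent
function sendCacheHeaders($loggedIn) {
    header('Vary: Accept-Encoding, Cookie');
    if ($loggedIn) {
        // logged-in views pass through Squid uncached, same as today
        header('Cache-Control: must-revalidate, max-age=0, no-cache');
    } else {
        // browsers revalidate on every view, Squid may serve it for 10 minutes
        header('Cache-Control: must-revalidate, max-age=0, s-maxage=600');
    }
}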

However, to use the full potential of Squid it would be nicer to refresh pages only when they are changed. This is easy to do; we just have to send a
PURGE http://en.wikipedia.org/wiki/This_page_is_edited HTTP/1.0
request to each Squid server. This can be done from the command line with squidclient, but it's nicer for us to do it in PHP. Using a library like Yahc we can set up a small function that takes the URL as an argument and purges a list of Squid servers. This function gets called on each save, and voilà, the pages are immediately updated for any user. I've just written a small purge tool in Python for Zope/CMF; this is not difficult.
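As a sketch of that function, here is a plain-socket version in PHP instead of one based on Yahc; the server list and hostnames are placeholders, and the Squids would additionally need an ACL that permits the PURGE method from the Apaches.

<?php
// Hypothetical list of "host:port" entries for all Squid tiers
$squidServers = array('squid1.example.org:80', 'squid2.example.org:80');

// Send a PURGE request for $url to every Squid in $squids
function purgeSquids($url, $squids) {
    foreach ($squids as $squid) {
        list($host, $port) = explode(':', $squid);
        $fp = @fsockopen($host, (int)$port, $errno, $errstr, 2);
        if (!$fp) {
            continue;          // an unreachable Squid shouldn't block the save
        }
        // Same request line squidclient would send, see the example above
        fwrite($fp, "PURGE $url HTTP/1.0\r\n" .
                    "Accept: */*\r\n" .
                    "Connection: close\r\n\r\n");
        fgets($fp);            // read the status line, e.g. "HTTP/1.0 200 OK"
        fclose($fp);
    }
}

// e.g. on save: purgeSquids('http://en.wikipedia.org/wiki/Some_page', $squidServers);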

Currently the modification times of pages referring to a newly created page are also updated (their link rendering changes). The purge function should be hooked in here as well, so those pages are also invalidated immediately.
With nothing left depending on an expiry time, we can configure Squid to cache the pages for a very long time, more or less like a static server.
We could use the s-maxage value in the Cache-Control header, but squid.conf is better because the setting won't get picked up by other downstream caches. s-maxage should be set to 0 in this case, so we force any downstream transparent caches to always check back for updates at our Squids.

If we use at least two Squids, there's no SPOF; the load balancer would detect a hosed Squid and move all the traffic to the running one(s). The two Squids query each other via ICP (or better, HTCP) about objects; this way they 'share' one cache, and pages are only rendered once after changes.
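Sketched in squid.conf terms (hostname and lifetimes are placeholders), the long cache lifetime and the sibling relationship might look roughly like this:

# Sketch: keep rendered pages for days, since edits are purged explicitly
refresh_pattern .   10080  100%  43200     # minutes: min, percent, max

# The other Squid as a sibling, asked via ICP before falling back to a parent;
# proxy-only avoids storing a second copy of what the sibling already holds
cache_peer squid2.example.org sibling 80 3130 proxy-only
icp_port 3130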

Taking over the world: our own Akamai (local mirrors)

There are offers from universities and ISPs to provide mirror sites. This would be nice for several reasons: the bandwidth burden would be distributed across many shoulders, closer servers mean less latency, and so on.

Setting up separate server systems with a separate database, possibly some form of replication, backup, HA hardware and software, and so on would be very hard. It would mean duplicating admin and update work as well as hardware resources. Problems with sign-on, editing conflicts, the user database etc. would arise.
The minimum hardware requirements for a mirror would be very high.

These problems can be avoided with simple distributed squid servers acting as mirrors:
          Internet
    /     /   |   \    \  \________________ Dents DNS Server /
   /     /    |    \    \                   Super Sparrow load balancer
  /     /     |     \    \
  |     |     |     |     |
Squid Squid Squid Squid Squid
 US    DE    FR    JP    AU                 Squid Tier2
  |____  \    |    / ____|                  (globally distributed 'mirrors')
        \ \   |   / /
          Internet
       ___/      \___
      |              |
    Squid          Squid                    Squid Tier1
   ___|______________|___
  |           |          |
Apache      Apache     Apache
  |___________|__________|
        |          |
      MySQL       MySQL
Each Tier2 Squid server has all Tier1 servers as parents; if some Tier1 server fails, it will automatically use the others. All Tier2 Squid servers will also get the purge messages after edits (by adding them to the list of servers in the PHP purge function). The changed content will roll out immediately.
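On a Tier2 box the relevant squid.conf lines might look roughly like this (hostnames are placeholders):

# Sketch: both Tier1 Squids as parents; if one fails, the other is used
cache_peer squid-t1-a.example.org parent 80 3130 round-robin
cache_peer squid-t1-b.example.org parent 80 3130 round-robin
never_direct allow all      # a mirror never talks to the Apaches directly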
Now one could do a DNS round robin (multiple A records), or give each Squid its own hostname and ask the user to pick a close one. Or this could be done with a redirect based on the user's Accept-Language setting.

However, there's a very nice way to do this with complete transparency and good redundancy, very much like Akamai does:

Super Sparrow

http://www.supersparrow.org/

Super Sparrow load balances traffic between geographically separated points of presence (read: Squids) by finding the site network-wise closest to clients. This is done by accessing BGP routing information, the information that determines the path that traffic will take on the internet.
The BGP information is provided by GNU Zebra running on each Tier2 Squid server. This is very fast; just try it at vergenet.net, where you can see the server picked for you in the Servers table (for my request from Hannover, Germany it's the Amsterdam server). The expert and author of Super Sparrow is Horms (Simon Horman).

Slightly modified example from the SuperSparrow website:

Figure 3: Redirecting Connections Using DNS

Typically DNS servers are set up to statically map a given query to a reply or list of replies. In the case of a hostname lookup, an IP address or list of IP addresses will be returned. Generally, the result changes infrequently, if at all. It is, however, possible to have a DNS server that returns results based on the output of some algorithm. This allows results to be determined dynamically. In this way the results of DNS lookups may be used to communicate the results of a load-balancing algorithm to clients. DNS is a fair choice for this application as the DNS protocol is designed with some measure of redundancy. A domain may have multiple DNS servers, and if one fails others may handle requests without the client being notified of any problems.

As an example, suppose that www.wikipedia.org is mirrored between POP X and Y and DNS is being used to distribute traffic between these two POPs as shown in figure 3.

  1. Client Makes DNS Request to local DNS Server in Network C for
    www.wikipedia.org

  2. The DNS Server makes a recursive query on behalf of the Client. In doing so it queries POP X for the IP address of www.wikipedia.org. Both POP X and Y are authoritative for the wikipedia.org domain; the Network C DNS Server happens to query POP X this time around. POP Y would do equally well. (Having a single primary DNS server would do as well, provided the secondary and/or tertiary DNS servers do a simple round-robin as a fallback; this is just the example from the Super Sparrow website.)

  3. The DNS server in POP X is able to determine the best POP for a given connection. Note that the best POP for the Network C DNS Server is queried and not the best route to the Client, as the IP address of the Client is not known to the DNS server in POP X. This assumes that Clients use DNS servers that are network-wise close to them. The IP address of the web server or farm in POP Y is returned in response to the DNS query by the Network C DNS Server.

    If POP X had been down then the Network C DNS Server would have queried POP Y and returned the IP address of the web server or farm within itself, thus the result would be the same.

    If POP Y was down then the query to the route server in POP X would have shown that POP X was the closest POP to the Network C DNS Server and the IP address of the web server or farm in POP X would be returned.

    If both POPs were down then there would be no result and the DNS lookup would fail.

  4. Network C DNS Server responds to the Client's DNS request with the answer obtained from POP X.

  5. The client has the IP address of a server in POP Y as the IP address of www.wikipedia.org and makes an HTTP request to this server.

  6. The server responds to the Client's HTTP request.

The Squid server

  • immediately returns any cached page without contacting the main server or the Tier1 Squids, if the user is not logged in, or
  • requests the page from the Tier1 Squids.
For most (anonymous) users the Squid server will act as a very fast static webserver. All editing (logged in) will just pass through the Squid without being cached. Latency will increase marginally (by maybe 10 ms) due to the extra hops, but I doubt anybody would notice this ;-)

Setup, hardware and maintenance of the Tier2 servers are very simple. They only run Squid and Zebra. Any server with 512 MB of RAM and a 40 GB hard disk will do. This makes it easy to find a lot of mirrors.

Future possibilities: ESI

Most traffic is currently anonymous, probably around 70%. Caching this will take most of the load off the main servers. The remaining 30% should be mostly logged-in browsing; only very little is actual previewing/editing. This can be cached as well with Edge Side Includes (ESI), an open standard pioneered by Akamai and others. Read more about it at esi.org. ESI is implemented in Squid 3. Another open source project pushing ESI is Zope in its upcoming version 3.
A related subject is the skin system: widely different skins can be done based on a single tableless XHTML file and different CSS styles. This site is an example of this (based on Plone), but the real CSS Zen is found at csszengarden.com. ESI could insert the chosen stylesheet link into the standard XHTML page based on a cookie value without contacting the main server.
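A hypothetical ESI fragment for that, assuming the surrogate (Squid 3) supports ESI variables; the 'skin' cookie name and the fragment paths are made up for this example.

<!-- sketch: the cached XHTML page carries this instead of a fixed <link> tag -->
<esi:include src="/style-fragments/$(HTTP_COOKIE{skin})/head.html"
             alt="/style-fragments/default/head.html" />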
Another useful feature found in Squid 3 is HTTPS acceleration.

Summary

Advantages

  • step by step, no major changes, relatively little work
  • High Availability
  • proven technology
  • no duplication of admin or update effort, no additional administration hassle
  • scales very well with average (cheap) machines; easy to add more 'mirrors' provided by universities or ISPs
  • minimal code change
  • single URL, completely transparent to users and search engines
  • single sign on possible
  • no additional replication traffic, most traffic is shifted to mirrors
  • no stale content with purging
  • no database passwords over the net
  • very little load on the main servers (mainly editing and searching)
  • ESI offers possibilities to further reduce the main servers' load to just one-time rendering, edit previews/commits and searches

Disadvantages:

  • Would suffer from a complete network breakdown at Verio (unlikely, they have multiple redundant uplinks) or a plane crash (al-Qaeda targeting Wikipedia ;-)
  • ~140 ms latency plus additional hops (~10 ms) for uncached requests; that's 0.15 seconds. ESI could solve most of this
  • No messages for anonymous users before they click 'edit this page', as pointed out by JeLuF. Another possible ESI application.
  • ?
Gabriel Wicke - looking forward to comments.
You're free to use this under the GFDL
Created by gw
Last modified 2004-01-24 12:05 AM

"You've got new messages"

Posted by Anonymous User at 2004-01-04 12:06 AM
When caching pages for anonymous users, message notification won't work for anons any longer. My proposal: use the Squid box as an additional web server and use simple layer 2/3 load balancing à la http://www.virtualserver.org/ .

MediaWiki is already delivering a cached copy from disk when there are no messages for the user and the cached version is recent. That's currently two DB queries, but two very fast ones (accessing one row using an indexed column). -- [[en:JeLuF]]

Session storage, pipelined requests

Posted by Anonymous User at 2004-05-26 04:17 PM
The article mentions that Wikipedia stores sessions in a database. I think storing the sessions as files on an NFS volume should be considered, because I assume that it takes fewer resources to retrieve a file off NFS than to query a database.

Another thing: when I fetch URLs from Wikipedia, I don't seem to receive Content-Length headers on non-static URLs (pages). This probably prevents pipelined requests; see http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.2.1
I suggest that PHP's output buffering system be used, so that it may be queried for content length.
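A minimal PHP sketch of that suggestion; renderPage() is a placeholder for the real rendering.

<?php
ob_start();                                     // buffer everything printed below
echo renderPage();                              // placeholder for the real rendering
header('Content-Length: ' . ob_get_length());   // size of the buffered body is now known
ob_end_flush();                                 // send headers plus the buffered output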

/Troels Arvin

Distribution by Multicast

Posted by Anonymous User at 2004-06-17 03:21 PM
I'm still not convinced that keeping all editing in one central place is the
_right_ way.. it would be much more the Internet way to have each document
on a potentially different home wiki server.

But given that a central database is the only short-term viable solution, let's
look into distribution: I have the impression a lot of institutions would
like to run their own copy of Wikipedia, or parts thereof.

It will become a problem if this is done by unicast ICP: 15 servers in
South Africa will want 15 notifications from the Wikipedia root server.
I was pointed to Squid's ability to accept multicast ICP, which would
solve this aspect.

Next problem: ICP may not be able to deliver all packets to their destination,
so some pages on some servers may go out of sync. One could think
that's not so dramatic, since hitting the EDIT button will always summon
the current version from the root server. But what if weeks pass and nobody
ever does?

Now you could try an rsync-type of approach, but that's really not a very
clean way to go about it. Instead it would be best to put all the servers
on a notification infrastructure which can do *both* multicasting and
store & forward for nodes which were temporarily unreachable.

There are technologies out there that can help solve this problem; for
instance, pilhuhn has written a very interesting NNTP server with the builtin
ability to distribute via the multicast IP backbone. Considering that
NNTP also comes with a lot of database-style features that may be useful
to the issue at hand, it may be a very smart and internettish approach
to solve the problem using NNTP. http://mcntp.sourceforge.net/ is the
site for the multicast-capable news server.

The other thing I can come up with is the protocol we are working on
ourselves, a messaging and chat protocol called PSYC. It can carry
arbitrary messages and do hybrid multicast delivery - that means using
IRC-like TCP routing in combination with multicast and unicast UDP. So
if a server is not multicast-capable, you can at least reach it via a
spanning-tree type of network. The content of the messages is arbitrary
and submitting updates to the network is rather trivial to do. We need
to work on some parts of this whole scheme, but it's essentially there.
Have a look at http://psyc.pages.de or download from http://muve.pages.de

I wasn't able to create a login on this server, so here goes my .sig:

--
symlynX » psyc://lynX@symlynX.com » irc://ve.symlynX.com/PSYC
network chat technology since 1988 » http://psyc.pages.de
for a truly private chat » https://ve.symlynX.com:34443/LynX/