Harvard CS75 Web Dev

Lecture by David Malan, 2012

Lecture 0: HTTP

IP address

  • identifies a server/computer on the internet. Number of the form w.x.y.z (each part is a number from 0 to 255, i.e. 4 × 8 = 32 bits)
  • Therefore there are 2^32 ≈ 4 billion possible IP addresses for version 4. Version 6 has 128-bit IP addresses.
  • local computers have private IP addresses: 192.168.x.x / 172.16.y.y / 10.z.z.z (class A)
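
A minimal sketch of the arithmetic above, using Python's standard `ipaddress` module. The helper names `to_int` and `is_private` are made up for illustration and are not from the lecture.

```python
import ipaddress

# Four parts of 8 bits each give a 32-bit address space.
BITS_V4 = 4 * 8
TOTAL_V4 = 2 ** BITS_V4  # 4294967296, roughly 4 billion


def to_int(dotted: str) -> int:
    """Pack a dotted address w.x.y.z into a single 32-bit integer."""
    w, x, y, z = (int(part) for part in dotted.split("."))
    return (w << 24) | (x << 16) | (y << 8) | z


def is_private(dotted: str) -> bool:
    """True if the address lies in a private (non-routable) range."""
    return ipaddress.ip_address(dotted).is_private
```

For example, `is_private("10.1.2.3")` is true while `is_private("8.8.8.8")` is false.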

DNS

  • Domain Name System, used for domain name lookup (converts domain names to IP addresses and vice versa)
  • Internet Service Providers (ISP) have their own DNS servers.
  • There is a hierarchy of DNS servers. Root servers sit at the top and know which DNS server is responsible for which website. You go up the hierarchy to the root servers and down to the DNS server responsible.
  • DNS lookup on mac: nslookup google.com
  • /etc/hosts is a text file that hard codes IP addresses to domain names (used for development purposes). OS looks at this file before querying DNS servers.
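
The lookup order described above (hosts file first, DNS second) can be sketched as follows. This is an illustration under assumptions: `parse_hosts` and `resolve` are hypothetical helpers, and the real OS resolver has more steps than this.

```python
import socket


def parse_hosts(text: str) -> dict:
    """Map each host name in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # the first entry for a name wins
    return table


def resolve(name: str, hosts: dict) -> str:
    """Check the hard-coded table first, then fall back to a DNS query."""
    if name in hosts:
        return hosts[name]
    return socket.gethostbyname(name)  # needs network access
```

A development entry such as `127.0.0.1 dev.local` is answered from the table without any DNS traffic.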

Visit an IP address

  • computer creates a packet with a GET request: GET / HTTP/1.0 (the first slash is the root of the web server, a synonym for index.php or index.html, and 1.0 is the HTTP version number; the browser can send more HTTP headers to the server) and sends it via a network of routers to the destination node: the IP address
  • the server at that IP address sends the HTML back to the sender
  • http:// (protocol or scheme) www. (subdomain) google (domain name) .com (top-level domain) / (path)
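
As a rough sketch, the URL parts and the request line above can be produced with Python's standard `urllib.parse`. The function `build_request` is a hypothetical helper, and sending a Host header alongside an HTTP/1.0 request line is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def build_request(url: str) -> str:
    """Build the text of a minimal GET request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the root of the server
    return f"GET {path} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"


parts = urlsplit("http://www.google.com/")
# parts.scheme is the protocol, parts.hostname the full host name,
# parts.path the path after the host.
```

The blank line (`\r\n\r\n`) at the end marks the end of the headers.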

TCP

  • Transmission Control Protocol

  • Used in conjunction with IP: TCP/IP protocol

  • communication standard to send packets across the internet (browsing, email, instant messaging). Both client and server have to be connected before data gets sent.

  • Different port numbers are used to specify types of network services:

    • HTTP is listening on port 80.

    • HTTPS (secure version of HTTP): 443

    • Email (SMTP): port 25.

  • port forwarding. Example: configure your home router to forward HTTP (TCP 80) requests that arrive at your public IP address (given by the ISP) to your computer's private IP address. (The ISP would probably prevent you from hosting a website though)
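
Port forwarding boils down to a lookup table on the router. A minimal sketch, assuming a made-up private address `192.168.1.10` for the machine that hosts the website:

```python
# Well-known TCP ports named in the notes.
WELL_KNOWN = {"http": 80, "https": 443, "smtp": 25}

# Hypothetical forwarding table on a home router:
# public TCP port -> (private IP, private port).
FORWARDS = {
    WELL_KNOWN["http"]: ("192.168.1.10", 80),
}


def forward(public_port: int):
    """Return the private (IP, port) target, or None if nothing is forwarded."""
    return FORWARDS.get(public_port)
```

With this table a request to the public IP on port 80 reaches `192.168.1.10`, while port 25 is not forwarded at all.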

Domain names

  • buy a domain name and tell the registrar what the IP addresses of the domain name's DNS servers will be
  • each domain name has primary and secondary DNS servers (back up). Web-hosting services offer storage space (to upload HTML, CSS, PHP, JavaScript files) on a server, which is referenced in their primary/secondary DNS servers. I then tell the registrar the IP addresses of those DNS servers.

Shared Hosting

  • multiple websites can sit on the same physical server (=1 IP address)
  • browser specifies the host name the user typed in as an HTTP header (Host), so the server knows which website is being requested
  • downsides: competing for compute resources; if the server goes down (because of, say, a denial of service (DoS) attack on one website), every website goes down; no administrative access to the server (e.g. stuck with an old version of PHP)
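
How one IP address can serve several websites can be sketched as a lookup on the Host header. The site names and directories below are invented for illustration, and real web servers configure this differently.

```python
from typing import Optional

# Hypothetical sites sharing one physical server (one IP address).
SITES = {
    "example.com": "/var/www/example",
    "blog.example.org": "/var/www/blog",
}


def host_header(raw_request: str) -> Optional[str]:
    """Extract the host name from the Host header of a raw HTTP request."""
    for line in raw_request.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().split(":")[0].lower()  # drop any port
    return None


def document_root(raw_request: str) -> Optional[str]:
    """Pick the directory of the site that the browser asked for."""
    return SITES.get(host_header(raw_request))
```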

Virtual Private Servers

  • run virtual machine on server and get your own (virtual) dedicated server (with root access, independent of other websites, but still dividing resources among VMs)
  • think about adversarial attacks. Example: somebody uploading/downloading the same video again and again could use up your whole monthly allotment of bandwidth (e.g. 20 GB)

Protocols

  • SSH (Secure SHell): secure way of connecting to remote server and executing commands
  • SFTP (secure file transfer protocol)

Leaking cookies

  • session hijacking: if a website uses HTTP instead of HTTPS, an adversarial user could intercept session cookies and hijack the user's session
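
One common mitigation, not covered in these notes, is to mark the session cookie with the Secure attribute so that the browser only sends it over HTTPS. A minimal sketch with Python's standard `http.cookies`; the cookie name `session` and the helper name are assumptions:

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str) -> str:
    """Build a Set-Cookie header for a session cookie restricted to HTTPS."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["secure"] = True    # only sent over HTTPS
    cookie["session"]["httponly"] = True  # not readable from JavaScript
    return cookie.output()
```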

Lecture 9: Scalability

Web hosts features

  • block access to certain IP ranges
  • SFTP (S=secure because traffic is encrypted) in contrast to FTP
  • shared web host vs VPS (virtual private server, you get your own copy of the operating system): is that still relevant with Docker?
  • Amazon EC2 (Elastic Compute Cloud): self-service VPS

Vertical Scaling

One server with lots of resources:

  • CPU: cores, L2 cache, ...
  • Hard disk drive: mechanical drives (metric is revolutions per minute, RPM) or SSDs (solid state drives), which are much faster
  • RAM

Horizontal Scaling

Multiple cheaper servers

Server has IP address with domain name

Clients make HTTP requests

Load balancer distributes requests across servers.

Load balancer has public IP address with domain name.

Backend servers have private IP addresses (e.g. 192.168.x.x or 172.16.y.y; no client can address them directly)

Version 4 IP addresses (32 bits) have been running out; with a load balancer you only need one public IP

load balancing based on:

  • round robin: the load balancer acts like a DNS server and hands requests to servers 1, 2, ..., n in turn (request k goes to server k modulo n). Downside: one server might get one or more heavyweight users (e.g. if DNS returns server IPs directly, a client could keep hitting the same server because of caching)
  • load (CPU cycles of servers), all servers need to be identical
  • based on the Host HTTP header, to send requests to dedicated servers (images, videos, etc.). However, no redundancy and potentially disproportionate traffic
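
The first two strategies above can be sketched in a few lines. The server names and load figures are placeholders, and this is an illustration rather than how any particular load balancer is implemented.

```python
from itertools import cycle


class RoundRobin:
    """Hand out servers in a fixed rotation: 1, 2, ..., n, 1, 2, ..."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


def least_loaded(loads: dict) -> str:
    """Pick the server that currently reports the lowest load."""
    return min(loads, key=loads.get)
```

Round robin ignores how expensive each request is, which is how one server can end up with the heavyweight users; `least_loaded` needs every server to report its load.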

Some load balancers: Amazon Elastic Load Balancer, Nginx

Sticky sessions

Load balancing breaks PHP sessions, because each session lives on a single server. E.g. you have to log in again and again until you have a session on every server

Solutions:

  • share state in one central HDD instead of individual servers' HDD or put sessions on the load balancer. Cons: no redundancy in the database/session states

  • Better solution, RAID (Redundant Array of Independent Disks):

    • striping (slicing and writing data in parallel to multiple disks)

    • mirroring the data

    • a combination of both across multiple drives

  • Shared storage (file server): MySQL, AWS S3 (Simple Storage Service), etc...

  • Cookies (stored in the user's browser): you can store a key that maps to the IP of the server to which you send the user. Cookies raise privacy issues, though: first-party cookies are stored by the domain you are visiting; third-party cookies are used by advertising providers, triggered in the background, for cross-site tracking, retargeting and ad serving
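
The cookie approach can be sketched as follows: the load balancer gives each new client a random key and remembers which backend that key maps to, so the private IP itself never reaches the browser. The class name and backend addresses are hypothetical.

```python
import secrets


class StickyBalancer:
    """Send a returning client to the backend it was given the first time."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._by_key = {}  # cookie key -> backend address
        self._count = 0

    def route(self, cookie_key=None):
        """Return (backend, key); the key goes back to the browser as a cookie."""
        if cookie_key in self._by_key:
            return self._by_key[cookie_key], cookie_key
        # New client: choose a backend round robin and issue a fresh key.
        backend = self._backends[self._count % len(self._backends)]
        self._count += 1
        key = secrets.token_hex(8)
        self._by_key[key] = backend
        return backend, key
```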

PHP acceleration

  • PHP is slower than a compiled language like C or C++
  • PHP accelerators cache compiled versions of scripts and execute those. Python does something similar with .pyc files

Caching (key-value store)

  • storing static .html file on disk instead of generating page dynamically (downside: redundancy in template code, can't change template without regenerating all pages)
  • MySQL query cache
  • Memcached: runs on a server, stores anything in RAM. Only stores strings.
  • Redis: remote in-memory data structure store, persists data on the hard disk, master-slave replication. RAM can be accessed hundreds of times faster than disk.
  • LRU (least recently used) caching to free up space
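
A minimal sketch of LRU eviction for a key-value cache, built on `collections.OrderedDict`. It only illustrates the idea and is not how Memcached or Redis implement it.

```python
from collections import OrderedDict


class LRUCache:
    """Key-value store that evicts the least recently used entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used
```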

Replication

Master-slave

  • master gets copied to slaves
  • good topology for a website that is very read-heavy and less write-heavy. You write to the master and balance read requests across the slaves. Con: if the master fails, a slave has to be promoted, with possible loss of data in the meantime
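
Routing in a master-slave setup can be sketched like this. Treating every statement that starts with SELECT as a read is a simplifying assumption, and the node names are placeholders.

```python
from itertools import cycle


class ReplicatedDB:
    """Send writes to the master and spread reads over the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = cycle(replicas)

    def route(self, sql: str):
        """Return the node that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.master
```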

Master-master (for write-heavy)

  • applications can read from both masters
  • distributes write loads across both master nodes (data is propagated across both masters). Can distribute based on location e.g. east and west.
  • Load balancers often come as a pair and operate in active-active (= master-master) mode to prevent a single point of failure. They check each other's heartbeats (periodically sent packets), so that, if one of them dies, the other can take the requests.
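
The heartbeat check can be sketched as a table of last-seen times. Timestamps are passed in explicitly to keep the example deterministic; the timeout value and node names are assumptions.

```python
class HeartbeatMonitor:
    """Track when each peer was last heard from."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._last_seen = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat packet from a node at time `now` (seconds)."""
        self._last_seen[node] = now

    def alive(self, now: float) -> list:
        """Nodes whose last heartbeat is no older than the timeout."""
        return [n for n, t in self._last_seen.items() if now - t <= self.timeout]
```

When a peer drops out of `alive`, the surviving load balancer takes over its requests.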

Note that you can have partitioning within each server too.

Data-center replication

  • Load balancing at the DNS level: geography based load balancing
  • where does each part sit? Do we replicate the entire network (load balancer, servers, databases) ?

Partitioning

Horizontal: split database based on index (rows), within the same server.

Vertical: split by columns.
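
A toy illustration of the two kinds of split, with a table held as a list of rows. The column names and data are invented.

```python
ROWS = [
    {"id": 1, "name": "alice", "bio": "likes C"},
    {"id": 2, "name": "bob", "bio": "likes PHP"},
    {"id": 3, "name": "carol", "bio": "likes SQL"},
    {"id": 4, "name": "dave", "bio": "likes DNS"},
]


def horizontal(rows, boundary):
    """Split by rows: ids below the boundary, and the rest."""
    low = [r for r in rows if r["id"] < boundary]
    high = [r for r in rows if r["id"] >= boundary]
    return low, high


def vertical(rows, columns):
    """Split by columns: keep the key plus the named columns."""
    return [{k: r[k] for k in ("id", *columns)} for r in rows]
```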

Sharding

  • Same as partitioning but each partition sits on a different server. Shards are different database instances. Not ACID compliant (why?)
  • Two load balancers, distributing read requests on different databases based on user information.
  • Rebalancing data requires downtime when a shard outgrows others.
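
Choosing a shard from user information can be sketched with a split on the first letter of the last name. The shard names and the A to M / N to Z boundary are assumptions made for illustration.

```python
from collections import Counter


def shard_for(last_name: str) -> str:
    """Pick the database instance that holds this user's data."""
    return "db_a_to_m" if last_name[:1].upper() <= "M" else "db_n_to_z"


def shard_sizes(last_names) -> Counter:
    """Count users per shard, to spot a shard that outgrows the others."""
    return Counter(shard_for(name) for name in last_names)
```

An uneven count from `shard_sizes` is the signal that the data needs rebalancing.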

Security: firewalling

  • allow TCP 80 (HTTP) and TCP 443 (HTTPS, i.e. HTTP over SSL/TLS) on the way in
  • allow SSH (secure shell, cryptographic network protocol) to connect to your data center
  • from load balancers to web servers: can keep everything unencrypted over TCP 80
  • from web server to databases: SQL query ports (TCP too)
  • the value of limiting connections beyond the load balancer shows in the case of an intrusion. You don't want a compromised web server to be able to SSH into the other ones, for example.
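
The rules above amount to an allow-list per hop. A sketch under assumptions: the tier names are made up, SSH is taken to be on its default port 22 and to terminate at the load balancer tier, and 3306 is MySQL's default port standing in for "SQL query ports".

```python
# (source tier, destination tier) -> TCP ports that may pass.
ALLOWED = {
    ("internet", "load_balancer"): {22, 80, 443},
    ("load_balancer", "web"): {80},
    ("web", "database"): {3306},
}


def permitted(src: str, dst: str, port: int) -> bool:
    """True only if a rule explicitly allows this hop; all else is dropped."""
    return port in ALLOWED.get((src, dst), set())
```

Because there is no `("web", "web")` rule, one web server cannot SSH into another.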

Sidenotes

  • how does a request work: browser or OS sends request to DNS server (Domain Name System, translates host names to IPs and vice-versa). DNS sends IP address (with TTL (=Time To Live) until new request has to be sent). Browser/OS can cache DNS responses in order to prevent sending same requests every time.
  • difference between Router and DNS server:
  • CPU cycle = time required for the execution of one simple processor operation (e.g. addition)
  • RPC (Remote Procedure Call) is a higher level protocol that uses TCP/IP as its transfer protocol and is used to execute code on another computer as if it were a local call.
  • Apache: most widely used HTTP (web) server software
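
The DNS caching with TTL described in the first sidenote can be sketched as follows. The `lookup` callable stands in for a real DNS query, and time is passed in explicitly so the behaviour is easy to follow; both are assumptions for illustration.

```python
class DnsCache:
    """Remember DNS answers until their TTL runs out."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable: name -> (ip, ttl_seconds)
        self._entries = {}     # name -> (ip, expiry time)

    def resolve(self, name: str, now: float) -> str:
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]  # still fresh: no request is sent
        ip, ttl = self._lookup(name)  # expired or unknown: ask again
        self._entries[name] = (ip, now + ttl)
        return ip
```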