Just how big IS a terabyte?

December 21, 2016

Being an old guy in computing terms, I still remember the occasion when one of my colleagues told me that they’d bought a new PC with a gigabyte hard drive.  That’s 1 GB.  1,000 MB.  It was, I believe, back in 1994.  They’d got themselves a new home machine with a 60 MHz Pentium (built on an 800 nm process) and a 1 GB hard disk.  How they gloated.  How envious was I, stuck with my 500 MB disk and a 486DX2-66 processor?

These days, we have 512 GB SD cards, which are merely the size of a postage stamp, available for only GBP 314.  A terabyte version has recently been announced by SanDisk.

The size of today’s storage devices has led to people just not appreciating just how vast those stores are!

Here comes the science part: An SI terabyte, which is a usual measure of persistent storage, is 1,000,000,000,000 (1012) bytes.  Each byte is 8 bits, and can encode a single character in ASCII or EBCDIC schemes – let’s ignore internationalisation and multi-byte Unicode for now as that just complicates things.

Note that this is distinct to a CS tebibyte which is 1,099,511,627,776 (240) bytes and is used as a more usual measurement for memory.

So, to put these numbers into some perspective, just how big IS a terabyte?

It’s quite common to talk about it in terms of hours of music or video but that is really kind of misleading… sure you can imagine a stack of 213 Blu-ray movie disks, two and half years of continual MP3 music or nearly 2 weeks of worldwide tweets but what does that actually mean in more visceral terms?

Let’s work it out in terms of space and time using, whilst they still exist as physical artefacts, traditional paperback books.  For this example, let’s use Tolstoy’s War and Peace to generate the numbers. We’ll do time first.

War and Peace has a word count of 587,285.  It’ll depend on the particular translation, I guess, but let’s use that as a working number for now.

The average typing speed for a typical person is around 40 words per minute (but obviously more for a professional typist).  That means that it would take someone, maybe you, around 245 hours to type a copy of War and Peace.  We can convert time into money too – the minimum wage is  around GBP 7 per hour, so the opportunity cost of the time taken for that typing would be around GBP 1,715.  That represents about 6 weeks of work effort.

The English language has around 4.7 letters per word on average but don’t forget the spaces and punctuation too – so make it 5.7 characters per word; 5.7 x 587,295 = 3,400,380 bytes.  A Project Gutenberg copy of War and Peace is 3,359,550 bytes so that’s quite a good correlation.  Let’s just use 3.4 MB to make the numbers easier.

A terabyte is 1,000,000 MB so it could hold 294,118 copies of War and Peace.  As calculated above for a single copy and multiplied up, that represents around  72,058,910 hours of human effort and so a labour value of about GBP 504,412,370.  (You can see why photocopiers took off…)

Those 294,118 copies would hold about 368,529,854 pages (see below), each with maybe 476 words on it on average.  Every adult in the UK could have 7 pages dedicated to them.

A terabyte is worth over 500 million pounds sterling of minimum wage workers’ typing effort and, at 8 hours a day for 260 days a year, would take 34,644 years to complete.

Best get some help to finish that before lunch.

The adult population of the UK is about 52 million so, if they all mucked in, they should get the job done in just under an hour and a half by typing some 3,325 words each, which just so happens to be around 7 pages of content. How convenient.

It’s also worth noting that, it would take 34.6 years of work for that average typist to fill a single GB, which holds 294 copies of the book.  This is why, when everything was held as plain ASCII text files, even 500 MB felt cavernous to most people!

People read, on average, at about 250 words per minute so it’d take around 5.6 years to just read a GB of text, and 5,623 years to read a TB. (Assuming a 9-5 job, 5 days a week.)

So that’s time, and by extrapolation, money.  Now for space.

Copies of War and Peace differ in physical size and page count depending on extraneous supporting content and typeface used.  Here are three off of Amazon, we’ll take an average of them as our working numbers.

ISBN-13 Pages Dimensions Volume
978-1853260629 1,024 12.2 x 5.6 x 19.8 cm 1,352.74 cm3
978-0140447934 1,440 13.0 x 6.1 x 19.8 cm 1,570.14 cm3
978-0099512246 1296 13.1 x 5.5 x 21.6 cm 1,556.28 cm3
 Averaged 1,253 12.8 x 5.7 x 20.4 cm 1,493.19 cm3

We’ll just assume that the extraneous text is not material to the argument and stick with the 3.4 MB data per edition.

So the printed information density of a War and Peace paperback book is around 2,277 bytes per cubic centimetre, or 2.28 KB/cm3.  This equates to about 2.28 GB/m3.  (You can see why electronic storage is so popular…).

There are 1,000 GB in 1 TB, so a TB in such a printed form would take up around 439 cubic metres and neatly contain those 294,118 copies of War and Peace.  That’s a cube 7.6 metres (or about 25 feet) on each side.

It’s still a little abstract so let’s make it more concrete.

A community swimming pool is generally about 25m long and 13m wide. It obviously various depending on the number of lanes but this is roughly what you’d expect.  Depth varies too but, for this, imagine a 2m depth throughout – no shallow end or diving area!

That means that a terabyte of books would fill such a community pool to a depth of 1.35 metres or nearly 4.5 feet.  Each pool could hold the equivalent of around 1.5 TB in printed form.

A 6 TB disk, now commonly available for under GBP 200, holds enough data that, if printed in a paperback novel form, would fill 4 standard swimming pools, or a 50m Olympic pool to a depth of 2.1 metres.

Hopefully that gives you a feel for just how much data a terabyte, or even a mere gigabyte, actually is, when it represents textual information rather than high-definition video or music content.

As an aside (and this is highly speculative and could be utterly wrong in so many, many ways), just how small a volume can a TB be stored in? Well, it’s looking like a holographic universe (if that’s an actual thing) stores one bit in an area one Planck length per side, or 2.56E-70 m2.  If we want to store 8E12 bits we’d need 20.48E-58 mor a sphere at least 2.55E-29 metres in diameter surrounding the information.  That’s the size of the black hole formed by a mass of 8.9 g packed into a singularity, or something like that anyway. Of course, we’d likely also need sufficient information to be held to fully describe the substrate containing the data which would push out the size very substantially.  As the man said, “there’s plenty of room at the bottom“.

Project Euler Problem #60

November 20, 2016

Read the details of the problem here

Summary

Prime pair sets.

Solution

This problem took a little bit of thinking about, primarily to get it to perform adequately – on the basis that I’m trying to get this to run in under 5 seconds. When I first solved this problem I simply couldn’t get the computational performance out of Groovy to do the prime validation quickly enough and ended up calling out to a native Java method. With the introduction of the @CompileStatic AST transformation this has made it possible to do. After my next laptop upgrade, I’m going to be having to aim for maybe 2 seconds at the most on these questions!

The solution uses a fairly naive implementation of an isPrime() method which is still substantially faster than using BigInteger#isProbablePrime() from Groovy when using @CompileStatic – at least for the range of numbers being used here. I made an assumption that 4-digit primes would be sufficient to solve this question and, luckily, it worked out to be correct.

The program starts by constructing a list of primes mapped to a list of other primes that conform to the question’s constraints of both orders of concatenation also being prime. Using lists of strings rather than integers saves a lot of overhead in the subsequent checks. The lists only contain primes that are greater than the key. So, for example, 3’s list contains 7 (as both 37 and 73 are prime), but that of 7 doesn’t contain 3. Any numbers that have fewer than 4 entries in their list are dropped as a set of 5 primes are being sought. The lowest number in that set will have the full membership in it’s list.

Having generated this list of prime-pairs, a set of primes are grown to find a five members set that fulfil the criteria. At each recursion the entries of the list of the most recently added member are checked to make sure that they meet the concatenation criteria of each of the other member of the list (by making sure that it’s on their respective lists of potential pairs) and that it wouldn’t exceed the current minimum value found. For each of the values that fulfil this criteria, a recursion is called with this appended to the current list. The minimum value found for that particular path is returned. In this respect it’s like creating a directed graph with the nodes (of which there are only 1,227) being the primes and the edges (of which there are 17,845) being the fact that two nodes fulfil the concatenation criteria.

There are probably some algorithmic optimisations that could be made here but, as it is, the code is quite simple to understand.

import groovy.transform.CompileStatic

@CompileStatic
boolean isPrime(final int num) {

  if (num <= 1) return false
  if (num <= 3) return true
  if ((num % 2 == 0) || (num % 3 == 0)) return false

  int r = (int)Math.sqrt(num)
  int f = 5

  while (f <= r) {
    if (num % f == 0) return false
    if (num % (f + 2) == 0) return false
    f += 6
  }

  true
}

def getPrimePairs(maxnum) {

  def primes = ((2..maxnum).findAll { isPrime(it) })*.toString()
  def primeCount = primes.size()
  def pairs = [:]

  for (ip1 in 0..<primeCount - 1) {
    def sp1 = primes[ip1]
    pairs[sp1] = [] as Set
    for (ip2 in (ip1 + 1)..<primeCount) {
      def sp2 = primes[ip2]
      if (isPrime((sp1 + sp2).toInteger()) &&
          isPrime((sp2 + sp1).toInteger())) {
        pairs[sp1] << sp2
      }
    }
    if (pairs[sp1].size() < 4) pairs.remove(sp1)
  }
  pairs
}

def find5(list, min, pairs) {

  def sum = list*.toInteger().sum()
  if (list.size() == 5) return sum

  pairs[list[-1]].findAll { p -> ((sum + p.toInteger()) < min) &&
                                  (list.every { pairs[it].contains(p) }) }
                 .each { min = [ min, find5(list + it, min, pairs) ].min() }
  min
}

def (answer, primePairs) = [ Integer.MAX_VALUE, getPrimePairs(10000) ]

for (p in primePairs.keySet()) {
  answer = [ answer, find5([p], answer, primePairs) ].min()
}

This executes in Groovy 2.4.7 under Java 8u112 in around 2.9 seconds, so within my 5 second cut-off. The build of the list of primes takes about 115 ms, with the construction of the prime pair map taking another 1.5 seconds. The isPrime() method is called around 850,000 times and takes around 960 ms of the total runtime.

Conclusion

Quite surprisingly, Groovy was perfectly good for this problem as long as @CompileStatic is used to optimise the isPrime() method. Left as dynamic, the runtime pushes out to just under 30 seconds. The Groovy listops made it easy to construct the prime-pair lists and for quite an expressive statement for the recursive descent conditions.

Project Euler Problem #54

November 18, 2016

Read the details of the problem here

Summary

Poker hands.

Solution

This was a straightforward programming problem rather than anything taxing from a mathematical perspective. I used Groovy listops to process the input file into a sorted list of [ count, face_value ] pairs that could then just be inspected in order of priority (along with some flags to indicate whether there was a monotonic sequence and a only a single suit in the hand) to rank the hand.

There are only 5 combinations of counts available (as given below). It would have been possible to simply check against these to provide a basic classification of the hand but with the need to have the flush and straight processing it just seemed easier and clearer to have the ranking done explicitly.

Count Permutations: [ 1, 1, 1, 1, 1 ], [ 2, 1, 1, 1 ], [2, 2, 1 ], [ 3, 2 ], [4, 1]

Ties were broken by using a base-13 (tridecimal) system with the rank being the most-significant digit and then using the first two group face values as the differentiators so a simple score was created for each hand. As it happened there were no clashes that weren’t broken by this method – it would have to be extended to subsequent cards and then suits (which were ignored in this question) otherwise, so you’d end up with a base-52 (duopentagecimal) system where each card had a unique value but were grouped unless there were clashes.

It didn’t take a lot of coding to get this done. It could be done more tersely, perhaps, but it would be more cryptic!

class Hand {

  private final _cards
  private final _isStraight
  private final _isFlush

  Hand(cardList) {
    def vals = cardList.collect { ['face': '23456789TJQKA'.indexOf(it[0]), 'suit': it[1]] }

    _isFlush = (vals.groupBy { it.suit }.size() == 1 )
    _cards   = vals.groupBy { it.face }
                   .collect { [ (it.value.size), it.key ] }
                   .sort { -(it[0] * 13 + it[1]) }
    _isStraight = (_cards.size() == 5) && (_cards[0][1] - _cards[-1][1] == 4)
  }

  def score() {
    rank() * 169 + _cards[0][1] * 13 + _cards[1][1]
  }

  def rank() {
    if (_isStraight && _isFlush && _cards[0][1] == 12) return 9
    if (_isStraight && _isFlush)                       return 8
    if (_cards[0][0] == 4 )                            return 7
    if (_cards[0][0] == 3 && _cards[1][0] == 2 )       return 6
    if (_isFlush)                                      return 5
    if (_isStraight)                                   return 4
    if (_cards[0][0] == 3 )                            return 3
    if (_cards[0][0] == 2 && _cards[1][0] == 2 )       return 2
    if (_cards[0][0] == 2)                             return 1
    0
  }

  def beats(opponent) {
    def (s1, s2) = [ this.score(), opponent.score() ]
    if (s1 > s2) return true
    assert (s1 != s2)
  }
}

def answer = new File('pe054_poker.txt').readLines().count {
  def cards = it.split(" ")
  new Hand(cards[0..4]).beats(new Hand(cards[5..-1]))
}

This executes in Groovy 2.4.7 under Java 8u112 in around 0.25 seconds, so well within my 5 second cut-off.

Conclusion

Groovy was perfectly good for this problem with the simple file handling and listops allowing the input to be prepared for processing very easily.

Read the details of the problem here

Summary

Investigating multiple reflections of a laser beam.

Solution

A nice little problem and really helped by Euler providing the slope tangent for points on the ellipse. It’s a matter of finding the intersection point between the vector denoting the beam’s path and the ellipse equation and then reflecting it around the normal (at right angles to the tangent) of the mirror’s slope at that point (using the dot and cross products). This just repeated until the beam intersection is within a suitable tolerance of the entry/exit gap at the top.

def ( answer, epsilon ) = [ 0, 0.01d ]

def p0 = [ x:0.0d, y:10.1d ]
def p1 = [ x:1.4d, y:-9.6d ]

while ((answer < 1) ||
       !((Math.abs(p1.x) <= epsilon && (p1.y > 0.0)))) {

  answer++

  def a = [ x:p0.x - p1.x, y:p0.y - p1.y ]
  def b = [ x:-4.0 * p1.x, y:-p1.y]

  def ( dotp, crossp ) = [  a.x * b.x + a.y * b.y,
                            a.x * b.y - b.x * a.y ]

  def c = [ x: dotp * b.x - crossp * b.y,
            y: crossp * b.x + dotp * b.y ]
  def s = (-2.0 * (4.0 * c.x * p1.x + c.y * p1.y)) /
          (4.0 * (c.x * c.x) + (c.y * c.y))
  def p2 = [ x:s * c.x + p1.x, y:s * c.y + p1.y ]

  ( p0, p1 ) = [ p1, p2 ]
}



This executes in Groovy 1.8.8 under Java 7u17 in around 0.12 seconds, so well within my 5 second cut-off.

Conclusion

Groovy made for quite a terse solution to this problem. The ability to use tuples with named members as lightweight point and vector classes was quite convenient and made for more readable code. The multiple assignment capability kept the LoC count down too. It’s important to include the ’d’ type signifiers on the starting values to force the use of doubles throughout the calculations. This is one of those Gr-r-r-oovy annoyances that I’d prefer not having to worry about though.

Rolling the Groovy Dice

January 6, 2013

Bit of a throw away post but it might save someone 5 minutes…  Nothing to do with Project Euler

I needed to have a quick way of rolling an arbitrary number of polyhedral dice including the ability to take a number of the highest from all those rolled.

I did this in Groovy by adding an overloaded roll() instance method to Integer along with a helper isPolyhedralDie() method (but only to catch me making some sort of basic typo in the calling code).

Integer.metaClass.isPolyhedralDie() {
  (delegate as Integer) in [ 3, 4, 6, 8, 10, 12, 20, 100 ]
}

Integer.metaClass.roll() {
   delegate.roll(1)
}

Integer.metaClass.roll() { dice, best = dice ->;
   assert delegate.isPolyhedralDie() && (best > 0) && (dice >= best)
   def ( sides, rnd ) = [ delegate, new Random() ]
   (1..dice).collect { rnd.nextInt(sides) + 1 }
            .sort().reverse()[0..<best].sum()
}

// Generate 4x 3/4d6 character attribute sets

4.times {
  def r = (1..6).collect { 6.roll(4,3) }
  r.countBy { it }.sort { a, b -> b.key <=> a.key  }
   .each { k, v -> print "${v}x $k, " }
  println "Sum = ${r.sum()}"
}

Usage:

Integer#roll()                   // 6.roll() = roll 1d6
Integer#roll(noOfDice)           // 6.roll(3) = roll 3d6
Integer#roll(noOfDice, noOfBest) // 6.roll(4,3) = roll 4d6 take 3

Groovin’ in Java

March 7, 2012

I’ve just been watching an interesting presentation on using Groovy as an additional toolkit in Java.  Groovy excels in handling XML, File I/O and JDBC, as well as offering good support for Collections and Closures (as regular readers of this blog already know!).  The type of things you typically need to do day-in-day-out for your regular job.

If you want to learn more about what you can do with Groovy in the real world take a look at the presentation on InfoQ – Improve Your Java with Groovy

The presenter, Ken Kousen, also has a good blog with some very useful information on using Groovy for doing regular & proper programming tasks – the sort of thing it’s really intended for – not what I do here!

He’s got a book in the works, Making Java Groovy, which should be coming out in late 2012.

Read the details of the problem here

Summary

How many continued fractions for N ≤ 10000 have an odd period?

Solution

It took a little while for me to understand how this was working so, to make it clearer this is essentially what’s required.

  1. Take the square root of the original number then…
  2. The integer portion becomes the next number in the sequence,
  3. Take the reciprocal of the remaining decimal portion,
  4. Take this number as input into Step 2.

Obviously, at step 3, if the decimal portion has been seen before, that sequence would just repeat from then on, so a cycle would have been found. In that respect it’s very similar to the solution for Problem #26.

As an example, here’s the trace for 23, with it’s square root – 4.79583.

n Digit Remainder
4.79583 4 0.79583
1.25655 1 0.25655
3.89792 3 0.89792
1.11369 1 0.11369
8.79583 8 0.79583
1.25655 1 0.25655

Whilst this works as an implemented algorithm for numbers with short sequences it breaks down on the longer ones, some of which are over 200 elements long, due to problems with precision. So I expanded out the algorithm to better reflect the way that Euler describes in the question and coded that up. I made the assumption that the sequence would start from the first position rather than an arbitrary one.

def cf(n) {
  def ( i, j, c, k ) = [ 0, 1, 0, null ]
  def sqrt_n = Math.sqrt(n)
  while (true) {
    j = (n - (i * i)).intdiv(j)
    i = -(i - ((int)(sqrt_n + i)).intdiv(j) * j)
    if ([i, j] == k) return c - 1
    if (c == 1) k = [i, j]
    c++
  }
}

def answer = (2..10000).count {
  (((int)Math.sqrt(it)) ** 2 != it) && (cf(it) % 2)
}

This executes in Groovy 1.8.6 under Java 7u3 on my Samsung 700G7A i7-2670QM 2.2 GHz machine in around 0.45 seconds, so well within my 5 second cut-off (reduced from the old 15 seconds on my old machine). This runs quite fast as it only ever uses integers – apart from the initial sqrt calculation.

Conclusion

Didn’t use many Groovy features for this solution in the main apart from keeping the function initialization quite terse and the count listop on the number range. Performance was fairly good – though an equivalent Java solution does run in 55 ms – around 8 times faster.