Astro 585 discussion 8


24 February 2014

Bottom lines

  • Write your parallel code as if the machine does not have shared memory (i.e., as if it were distributed); then, if shared memory is available, spend time optimizing for it.

  • If there is any doubt in your mind that you need a sync statement, use it! Also, help yourself (and potentially others) out by commenting on how confident you are about each sync statement.

  • "Optimize later" is a good motto to keep in mind when parallelizing code.

Interactive vs batch machines

Interactive

Good for rapidly debugging your code, but they usually have very short walltimes (maybe something like 5 minutes!).

Examples of such clusters here at Penn State: Tesla (has GPUs), and Hammer

Batch

More queue-oriented: there is a scheduler and a load balancer.

Infiniband vs Gigabit

Infiniband has much lower latency than Gigabit Ethernet, and it also has more bandwidth.

Scheduling jobs on a cluster

When you are starting out it can be good to have the cluster send you an email when your program starts, ends, and/or aborts. Jobs can be scheduled per-core or per-node.
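On clusters with a PBS/Torque-style scheduler (the exact flags depend on your site), the email request goes in the job script's header. A hedged sketch, with placeholder email address and script name:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=4              # request one node with four cores (per-node/per-core)
#PBS -l walltime=00:30:00          # thirty minutes of walltime
#PBS -m abe                        # mail on abort (a), begin (b), and end (e)
#PBS -M your_address@example.edu   # placeholder address
cd $PBS_O_WORKDIR
julia my_sim.jl                    # hypothetical script name
```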

False sharing

Too-frequent cache updates between different processors, caused by cores writing to memory locations that sit on the same cache line. This can severely hurt application performance.

Common strategies to reduce this traffic

You can intentionally group all your writes together, e.g. accumulate results locally on each worker and write them back in a single batch.
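A minimal Julia sketch of this idea (using modern 1.x syntax, which postdates these notes): rather than having every iteration update a shared result, `@distributed` with a reduction operator lets each worker accumulate a private partial sum and combine the results only once at the end.

```julia
using Distributed
addprocs(2)                      # two local worker processes

# Each worker sums its own range of the loop locally; the partial
# sums are combined with (+) only once, so there is no shared
# location being updated inside the hot loop.
total = @distributed (+) for i in 1:1_000_000
    1 / i^2
end
# total is close to pi^2 / 6
```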

Three Julia data structures that are good for parallel code

1. Normal arrays

Good!

2. Distributed arrays

Can be spread across many processes: each process owns a chunk of the array, and other processes can ask for chunks they do not own, though that costs communication time. Slow as the network is, it is still faster than a hard disk, so if you really need a lot of memory, distributed arrays might be a good choice.
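A hedged sketch of the idea. In current Julia, distributed arrays live in the external DistributedArrays.jl package (in the 2014-era Julia of these notes they were part of Base), so the names below are the modern ones:

```julia
using Distributed
addprocs(4)
@everywhere using DistributedArrays  # external package: DistributedArrays.jl

d = dzeros(100)        # distributed array of zeros; each worker owns one chunk
localpart(d)           # a worker's own chunk: fast, no communication
d[1]                   # indexing into another worker's chunk works,
                       # but costs a network round trip
```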

3. Shared arrays (warning: these are labeled as experimental)

This type of array is shared between multiple cores on a workstation. With a shared array there is one copy of the array in shared memory; each core works on its own part, but all of them can access the parts handled by the other cores (that access is slower than accessing the local part).

Downside: you have to be careful if multiple processes are reading and writing to the shared array simultaneously. You can use sync or other types of barriers to prevent running into unwanted trouble, but if you need a lot of synchronization statements, the parallelization might not work too well for you.

Advice: if there is any doubt in your mind that you need a sync statement, use it! Also, help yourself (and others) out by commenting on how confident you are about each sync statement.
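A minimal shared-array sketch in modern Julia (1.x) syntax; the stdlib module names postdate these notes. Here `@sync` blocks until all the distributed writes have finished before the array is read:

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

a = SharedArray{Float64}(10)   # one copy in shared memory, visible to all
                               # worker processes on this machine
@sync @distributed for i in 1:10
    a[i] = i^2                 # each index is written by exactly one worker,
end                            # so no two processes race on the same element
sum(a)                         # safe to read: @sync waited for the writes
                               # -> 385.0
```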

Virtualizing

Very good for debugging parallel code and stress-testing it. People who work in cloud computing really love this.

