CSC-AA News archive

 

29-Jul-2010
CSCAA moves to a temporary location.
The planned renovation of the CSCAA-computerroom will start soon, and will last 4-5 months. Before the renovation starts, the Grendel- and Huge-clusters will be moved to a temporary location at the AU Campus. It will take at least one week to move the clusters; during that time the systems will be unavailable. We expect to start moving the clusters in week no. 31 or 32. More info follows.
 
 
28-Jul-2010
The Archive is closed. It will be moved to a different location. We expect The Archive to be available again in a few days.
 
 
10-Jul-2010
Several q8n-nodes crashed last night, probably due to overheating. Please check your jobs.
 
 
3-Jul-2010 8:00
A disk serving the Grendel cluster is Full!
Users belonging to the group headed by Frank Jensen should immediately check to see if it is possible to move/remove some data. This applies only to users on filesystem /chome2 (see the 'hd' command).
 
 
30-Jun-2010
CSCAA moves to a temporary location.
A planned renovation of the CSCAA-computerroom will start within a month. During the renovation, which will last ca. 4 months, the Grendel- and Huge-clusters will be hosted at a temporary location at the AU Campus. It will take ca. one week to move the systems, which most likely will be done in week 30 or 31. More information follows.
 
 
18-Jun-2010 21:00
Fileserver problem on Grendel and Huge -clusters.
 
 
5-Jun-2010
A fileserver attached to Grendel failed yesterday at 17:30. The fileserver was restarted and everything seems ok. Please check your jobs if you are residing on fileserver fs3 (use the 'hd' command).
 
 
21-May-2010
Yesterday one of the fileservers for Grendel (and some users on Huge) started to "misbehave". After the fileserver was coldstarted, everything seems to work again, but jobs may have suffered or even crashed.
 
 
14-May-2010 17:55
A node in the IBM Power6-cluster, Huge2, was hanging this afternoon due to memory oversubscription. To solve the problem it was necessary to reboot the system, causing all running jobs on Huge2 to fail.
 
 
12-May-2010
We are struggling with some network problems between Aarhus Universitet and CSC-AA.
 
 
4-Apr-2010
The SGI Altix system, Freke, crashed this morning, probably due to a memory fault. The system has been restarted but may be unstable.
 
 
30-Jan-2010 17:30
The Archive is unavailable due to severe hardware problems. Currently we do not have any prognosis for when it'll be online again.
 
 
25-Jan-2010
The power to the computerroom has been restored. All systems are now available again except the main part of the HP-nodes in Grendel.
 
 
24-Jan-2010 13:30
The CSCAA computerroom suffers from a major power outage!!!
One of the main-power distribution units (el-tavle) is completely down, which causes all the new Grendel nodes (HP and GPU), Rack8 (Dell nodes) and fileserver2 to be unavailable. The Huge-cluster and the Farfar-cluster are also unavailable.
The problem is expected to be fixed tomorrow.
 
 
22-Dec-2009
Please notice the Call for Proposals from DCSC. Closing date 1. April 2010
Contact CSCAA for more information.
 
 
20-Oct-2009
CSCAA has finished the procurement, and has chosen HP as supplier of the Grendel expansion: 196 nodes with Infiniband QDR interconnect. Furthermore, CSCAA has ordered 20 nodes from Atea/SuperMicro, each equipped with two Nvidia Tesla M1060 GPU-cards and Infiniband. All 216 nodes have two quadcore Intel Nehalem 2.6 GHz CPUs, 24 GB memory and 1 TB disk.
The new hardware has just arrived and is expected to become operational during January 2010.
 
 
1-Dec-2009
Status for the systems after the power outage last week:
Grendel: Available
Huge: Available
Sleipner: Sleipner+Fenris (64 CPUs) available
Freke: Available
Karlsen: Available
Stepstone computer: Available
Farfar cluster: Available
 
 
23-Nov-2009 8:20
A fileserver in Grendel has been misbehaving tonight. We expect the problem to be solved very soon.
 
 
5-Nov-2009
Two new members to CSC-AA's board
Associate professor Søren Frandsen has after nearly 8 years left CSC-AA's board. He will be replaced by two new members: Professor Steen Hannestad, IFA and Professor Christian Storm Pedersen, BRICS.
See the full organisation of CSC-AA.
 
 
3-Nov-2009
All data have been restored after the diskcrash last week, so all users again have access to Grendel.
 
 
29-Oct-2009 9:00
A fileserver in Grendel crashed yesterday morning. Most nodes ended up in a bad state and were rebooted, so most of the running jobs were lost. The fileserver must be restored from backup, which will take a long time (days). Users not belonging to this fileserver can use Grendel as normal.
Users belonging to the affected filesystems are currently unable to log on to Grendel.
 
 
22-Oct-2009 10:55
The IBM power6 system, HUGE, is available again after the crash yesterday. More paging space has been added.
 
 
6-Oct-2009 - updated 8-Nov-2009
The power-supply to buildings in Ny Munkegade will be interrupted.
Due to insufficient power to part of the University-buildings in Ny Munkegade, the energy distribution company, NRGi, will replace a main transformer Wednesday 25th November 2009. Since we also have to do some preventive maintenance on most of the computers, all systems run by CSCAA will be closed from 24-Nov-2009 18:00 to 26-Nov-2009 18:00.
 
 
8-Sep-2009 15:00
The Grendel cluster is available again.
 
 
24-Aug-2009
Due to preventive maintenance of the cooling system it will be necessary to close the Grendel cluster Tuesday 8th September 2009 6:00-12:00
All running jobs will be terminated.
 
 
18-Aug-2009
The /scratch -filesystem in Freke is available again.
 
 
11-Aug-2009
The SGI Altix system, Freke, suffers from a defect /scratch filesystem. We expect the problem fixed by the end of this week.
 
 
24-Jul-2009 16:15
The old part of Grendel and the IBM power6 cluster, Huge, have been hit by a power outage in the computerroom.
 
 
6-Jul-2009 20:00
The filesystems have been repaired, and Grendel is available again.
 
 
6-Jul-2009 8:00
We still suffer from severe problems with a central fileserver. The filesystems must be repaired, which takes a long time. Therefore we don't expect Grendel available before tomorrow.
A few users on the IBM clusters, Huge and Sleipner, are also affected by this.
 
 
3-Jul-2009 22:14
The /chome2 -filesystem on Grendel crashed again.
 
 
2-Jul-2009 23:45
The fileserver has been restarted and all filesystems are now fully available again. It remains to clean some of the nodes for "Stale NFS handles" though.
 
 
2-Jul-2009
We have severe problems with one of the userfilesystems. Therefore, it has been necessary to reboot one of the fileservers, which will cause a lot of trouble for users with files on it.
 
 
4-Jun-2009
The Stepstone computer has been reinstalled. All passwords are reset.
 
 
11-May-2009
DCSC has distributed ~24 Mill kr. See who received the grants here.
 
 
7-May-2009
The power supply to the computerroom has been extended so now all nodes in the Grendel cluster are available again.
 
 
5-May-2009 10:30
The powersupply to the computerroom has failed again. All jobs running on the new SUN x2200 nodes were lost.
 
 
21-Apr-2009
The SGI Altix, Freke, is available again w. 60 CPUs and 240 GB memory.
 
 
20-Apr-2009
The power is back in the computerroom, but due to capacity problems, only ca. 80% of the Grendel-nodes have been started.
 
 
17-Apr-2009 16:30
The powersupply to the computerroom has failed again. Only the old part of the Grendel-cluster is available. We expect the system to be available again Monday before noon.
 
 
17-Apr-2009 14:30
Due to a major power outage in the computerroom, all the x2200 nodes in the Grendel-cluster crashed. The power has just been reestablished, and most of the cluster is available again.
 
 
17-Apr-2009 3:00
Once again the SGI-Altix system, Freke, crashed. The system won't be available until SGI technicians have solved the problem.
 
 
16-Apr-2009 8:40
There are currently some network problems at the University, which may prevent connections to the CSC-AA computers.
 
 
10-Apr-2009
Once again the SGI-Altix system, Freke, crashed. All jobs were lost.
The system has been rebooted but should be regarded as unstable until a solution has been found. Please check /scratch/save for old scratchfiles.
 
 
7-Apr-2009
The SGI Altix system, Freke, crashed again this morning. All jobs were lost.
The reason is currently not known - the system has been restarted and is now running normally again.
 
 
23-Mar-2009
The SGI Altix system, Freke, crashed this afternoon. All jobs were lost.
The reason is currently not known - the system has been restarted and is now running normally again.
 
 
16-Feb-2009
The old Grendel-nodes have joined the new Grendel2-system. Grendel and Grendel2 are now synonyms.

Please notice the Call for Proposals from DCSC. Closing date 1. March 2009
 
 
12-Feb-2009 6:00
Major problem with the cooling system!
The cooling system broke down causing a VERY high temperature in the computerroom. We disconnected the power to all computers to prevent further damage.
At this moment we are able neither to assess the damage caused by this accident nor to estimate when the systems will be back in duty.
 
 
30-Jan-2009
We have some problems with the new filesystems, /chome1 and /chome2. In the start of next week we will try to fix this. It may cause some inconveniences for users using these filesystems.
 
 
28-Jan-2009
Grendel2 will be generally opened Monday 2-Feb-2009
 
 
28-Jan-2009
All systems except Freke are now available again after the power-outage. We do however have some problems with the new filesystem, /chome1.
Freke is suffering from a defective system controller. We do not have any prognosis for when it will be back in duty.
 
 
19-Jan-2009
Please notice that due to rearrangements in the power supply to the AU-building housing the computerroom, it will be necessary to close down all systems operated by CSCAA from
Tuesday 27. Jan 2009 18:00 to Wednesday 28. Jan 2009 18:00.
The systems affected by this are: Huge-cluster, Grendel-cluster, Sleipner-cluster, Freke, Karlsen, The Archive, Loke-cluster, Snehvide, Farfar-cluster, UNI-C's unix-service
 
 
18-Dec-2008
Grendel to be expanded!
Grendel2 - the new 318-node part of Grendel - is now in a test phase. Some HW parts still remain to be installed, and much SW also needs to be reinstalled. At the start of the new year the old part of Grendel will be merged into Grendel2.
 
 
22-Oct-2008
Grendel to be expanded!
On 26th November 2008, 335 new Grendel-nodes will arrive at CSCAA. In total the new nodes have 2680 2.3 GHz AMD "Barcelona" cores and more than 5.7 TB memory. The interconnect will be 1 Gb ethernet. The new nodes are expected to be in duty before Christmas this year.
 
 
30-Oct-2008
The Archive is currently not available due to some problems in the Tivoli software. We don't have any prognosis for when it will be back in duty.
 
 
3-Oct-2008
Due to reconstruction of the computerroom, The Archive is closed. As the work still goes on, The Archive will not be available until next week.
 
 
15-Sep-2008
The queueing system setup on Grendel has changed in order to reflect a wish from the user-community. Now the resource-request "-l nodes=N:ppn=1" asks for N CPUcores wherever they are in the system (and no longer requiring the cores to be on N different nodes!)
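The difference between the old and the new meaning of the request can be sketched in a few lines of Python. This is only an illustration of the two placement rules, not the queueing system's own code; the function names and the table of free cores are hypothetical.

```python
def place_on_distinct_nodes(free_cores, n):
    """Old meaning: N cores on N different nodes (one core per node)."""
    nodes = [node for node, free in free_cores.items() if free >= 1]
    if len(nodes) < n:
        return None  # not enough distinct nodes with a free core
    return {node: 1 for node in nodes[:n]}


def place_anywhere(free_cores, n):
    """New meaning: N cores wherever they are free, possibly sharing nodes."""
    placement = {}
    remaining = n
    for node, free in free_cores.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement if remaining == 0 else None


# Hypothetical cluster state: node name -> number of free cores
free = {"node1": 4, "node2": 1}
print(place_on_distinct_nodes(free, 4))  # None
print(place_anywhere(free, 4))           # {'node1': 4}
```

With only two nodes holding free cores, a request for four cores cannot be met under the old rule, while the new rule places all four on one node.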
 
 
15-Sep-2008
Please notice, that the SGI Altix system, Freke, will be closed Friday, 19th September 2008 at 8:00. The system will be moved to a different room. We expect the system up and running later that day.
 
 
24-Jun-2008 18:45
The SGI Altix, Freke, is available again.
 
 
23-Jun-2008 9:00
The SGI Altix system, Freke, crashed last night, due to a defect system disk.
A new disk has been ordered, and the system is expected available again later this week.
 
 
15-May-2008
All main systems operated by CSCAA are now back in duty.
 
 
14-May-2008
All systems operated by CSCAA except the IBM Sleipner cluster and the power6 system, Huge, are now back in duty after the power outage this morning. Sleipner suffers from some problems in the console unit administrating the whole cluster. We currently have no prognosis for when it will be available. Huge is closed because it mounts its userfilesystem from Sleipner.
 
 
5-Apr-2008
All systems operated by CSC-AA will be closed down 14-May-2008 4:00-16:00 due to a power outage in the machineroom.
 
 
10-Apr-2008 4:00
Due to a breakdown in the cooling system the following systems have been shut down: Huge, Grendel, Freke, Farfar-cluster. After the cooling system was repaired the systems were started again.
 
 
2-May-2008
The Farfar cluster is currently unavailable due to technical problems. The system is expected (partly) available again 10. May 2008.
 
 
8-Apr-2008 10:00
Sleipner has been repaired and is now available again. All jobs on Sleipner were terminated. Please find scratchfiles in /scratch/save.
 
 
2-Apr-2008
Tuesday 8th April 2008 9-16 the IBM Power4 system, Sleipner et al. will be closed for replacing broken hardware. All jobs will be terminated.
 
 
19-Mar-2008
The Farfar cluster is available again.
 
 
18-Mar-2008 15:45
The Farfar/Farmor -cluster is currently unavailable due to severe problems with the userfilesystem.
Dell technicians have fixed the problem, and we are now rebuilding and expanding the filesystem. If everything goes smoothly we expect the system to be available again Wednesday 19th March at ca. 12:00
 
 
3-Mar-2008
Sleipner will be closed tomorrow at 12:00 for fixing a problem with a fibrechannel adapter, which affects the userfilesystem. The whole IBM-cluster and Huge will be closed and all jobs terminated. We expect the systems available again ca. 18:00.
 
 
22-Feb-2008
The scratchfilesystem on Freke is bad due to a diskcrash. A new disk will not be available before Tuesday 26th. The queueing system will not be available in that time.
 
 
19-Feb-2008
Karlsen is expected to be available tomorrow at 12:00.
Karlsen now has only 14 CPUs and 16 GB memory, but the userfilesystem is expanded to ca. 950 GB.
 
 
12-Feb-2008
All systems run by CSC-AA are now back in duty. Karlsen though stays down for reconfiguration.
 
 
11-Feb-2008
The cooling system is partly back in duty. Therefore we only started the SUN-part of the Grendel cluster. Freke, Huge and the Farfar/farmor cluster are up and running. Karlsen will stay down for reconfiguration. We hope the cooling system will be 100% up later this week.
 
 
7-Feb-2008
Today we launch our new WEB pages. The graphical layout conforms with the WEB pages at DCSC's central office and the other DCSC regional centers. The content of our WEB pages is basically the same as previously.
 
 
17-Jan-2008
Due to renovation of the cooling system all computers operated by CSC-AA will be closed and powered off from Monday 4-Feb-2008 to Tuesday 12-Feb-2008.
This is necessary as some of the pipes in the cooling system are more than 30 years old and much too small for the demand of today.
 
 
6-Nov-2007 12:00
The Grendel- and Farfar/farmor -clusters are back in duty. Grendel has lost 3 nodes; on Farfar/farmor all nodes survived.
 
 
5-Nov-2007 15:00
The Dell-parts of the Grendel-cluster and the Farfar-cluster are back in duty after the breakdown of the cooling system last night.
 
 
5-Nov-2007 3:00
Due to a complete breakdown in the cooling system it has been necessary to close these systems to avoid further damage:
The Grendel-cluster and The Farfar-cluster
All running jobs on these systems have been terminated.
We currently do not have any prognosis for when the systems will be back in duty.
 
 
4-Nov-2007 17:22
The Farfar/farmor cluster is currently not available due to a diskcrash in the /home -filesystem.
 
 
22-Oct-2007
Due to a diskcrash in the gpfsscratch -filesystem, Hugin and Munin in the IBM-cluster have been disabled. We expect the nodes to be available again later this week.
 
 
20-Oct-2007 18:00
The cooling system has been unstable this afternoon. This has caused the temperature to go too high in the computerroom and forced the Farfar/farmor -cluster to shut down. We may have fixed the problem with the cooling system, but there will be no attempt to start the Farfar-cluster until tomorrow.
 
 
17-Aug-2007
The Grendel cluster will be closed Tuesday 21. August at 8:00 until Wednesday 22. August 8:00, for software installations. All jobs will be terminated!
 
 
16-Aug-2007
The Grendel cluster has been expanded with 95 Dell sc1435 nodes, each with 2 DualCore 2.6 GHz AMD/Opteron CPUs and 8 GB memory.
 
 
1-Aug-2007
Freke will be closed 2-Aug-2007 8:30 - 12:00 for implementing some fixes that will prevent the system from crashing under intensive ethernet traffic. All jobs will be terminated.
 
 
17-Jul-2007 9:15
Freke is currently unstable due to HW problems in the PCI subsystem. We hope that SGI will come back with a solution soon.
The system is ready for use, but avoid copying large amounts of data via the network to/from Freke.
 
 
16-Jul-2007 16:15
Freke is currently unavailable due to severe HW and SW problems.
 
 
14-Jul-2007 16:30
Freke crashed July 9th due to a persistent problem in the PCI subsystem. The system has been rebooted and put into duty, but the problem isn't solved yet!
 
 
18-Jun-2007 09:15
Due to severe problems with the cooling system, it has been necessary to close one of the IBM-Regattas, Fenris, this weekend. All running jobs on Fenris were lost!!! The cooling system has now been fixed, and Fenris is available again.
 
 
15-Jun-2007 21:50
Due to severe problems with both the main-cooling system and the backup-cooling system in the SMP-computerroom, it has been necessary to close down Fenris in the IBM system to avoid damage to the hardware. We hope the cooling systems will be fixed Monday next week.
 
 
4-Jun-2007 7:45
The queueing system on Sleipner crashed tonight, and all jobs on the node Sleipner were terminated. The reason is currently unknown.
 
 
9-May-2007 8:30
The tests tonight revealed a bad HW-component in Freke. SGI technicians will come onsite today to replace the faulty part.
We estimate that Freke will be available again 9-May-2007 20:00
 
 
17-Apr-2007
Freke will be closed 8-May-2007 11:00 for updating some firmware and replacing powersupplies. The system is expected available again about 18:00 that day.
 
 
12-Apr-2007 13:00
Freke crashed due to a problem in the PCI subsystem. The system became available again at 14:00. All jobs were lost.
 
 
25-Jan-2007
System close down!
The IBM-cluster will be closed Wednesday 14. February 2007 from 8:00 to 16:00 for reg. maintenance. All running jobs will be terminated.
 
 
30-Jan-2007 22:00
The queueing system on the IBM-cluster restarted due to a full filesystem. Several jobs were aborted and later restarted.
 
 
18-Jan-2007
Systems close down!
A thermographic inspection of the power cables around the new UPS has revealed a potential risk of fire! To fix the problem, it is necessary to close down the Grendel- and Farfar- clusters Tuesday 30th January 2007 8:00-12:00. All running jobs on the affected systems will be terminated.
 
 
21-Jan-2007
A problem with the GPFS filesystem /gpfshome in the IBM-cluster (Sleipner et al.) has caused some problems in the weekend. Several jobs were disrupted.
 
 
8-Jan-2007 10:45
The SGI Altix system, Freke, is available again after it was booted on a new root-disk. All jobs were lost, but the scratchfiles are saved in /scratch/save
 
 
6-Jan-2007 10:00
The SGI Altix system, Freke, went down Friday afternoon due to a defect systemdisk. The system cannot be started until we get a new disk from SGI.
 
 
1-Dec-2006
Freke will be closed Monday 11-Dec-2006 for replacing a bad disk in the scratch area. All files under /scratch will be lost!
 
 
20-Nov-2006
IMPORTANT: Backup system exhausted!!
The capacity of our backupsystem is exhausted. No new backups can be made right now! We expect to expand the backupsystem within 1-2 weeks from now. Until then NO backups can be made of any of our systems (Karlsen, IBM-cluster, Freke, Grendel)!
 
 
29-Aug-2006
In order to extend the power- and cooling capacity in the computerroom all systems operated by CSCAA will be closed Monday 9th October 2006 18:00. The systems will be available again 11th Oct. 2006 in the evening.
 
 
15-Sep-2006 16:00
Freke had a crash, probably due to a SW error in the disk firmware.
 
 
15-Aug-2006
Gaussian 03 Rev D.01 has been installed on the IBM-cluster, the SGI Altix system Freke, and the SUN/Opteron cluster Grendel. This version is default when using the subg03 utility.
 
 
8-Aug-2006 17:00
Freke is available again w. 64 CPUs.
 
 
3-Aug-2006 7:30
Freke crashed again this night, and has been rebooted.
Tuesday 8th August 2006 Freke will be closed for repair.
 
 
24-Jul-2006
Freke is available again w. 56 CPUs.
The system is currently unstable, but progress has been made in solving the problems, which seem to stem from the memory subsystem. We hope all problems will be solved soon.
 
 
19-Jul-2006
Freke is currently not available due to severe HW problems. SGI technicians have booted the system, but it will not be opened for batchjobs. We hope for a solution soon.
 
 
6-Jun-2006 9:30
Freke crashed again this night. SGI technicians are investigating the problems.
For the moment we must regard Freke as being highly unstable.
 
 
30-Jun-2006
Freke is available again, now with 60 CPU's.
 
 
29-Jun-2006 9:00
Freke is still unavailable. Today SGI technicians will be onsite to replace some hardware parts, which have turned out to be faulty.
 
 
28-Jun-2006 8:30
Freke has crashed again. SGI technicians are investigating the problems.
For the moment we must regard Freke as being highly unstable.
 
 
27-Jun-2006 10:40
Freke is available again w. 56 CPU's after the crash tonight. Apparently a C-brick went bad.
 
 
21-Jun-2006
The SUN-Opteron cluster, Grendel, is now ready for the first pilot-tests.
To get access, send an e-mail to staff at cscaa dot dk
 
 
27-Jun-2006 8:40
Freke has crashed. We are contacting SGI technicians to solve the problems. No prognosis yet.
 
 
7-Jun-2006
Freke will be closed Monday 26th June 2006 in order to replace HW components.
 
 
24-May-2006 16:00
Freke crashed 00:30 today due to HW problems in its internal 5 Volt powersupplies. All jobs were lost!
Freke is now up and running again, but it is necessary to close the system 15. and/or 16. of June 2006 in order to replace the powersupplies.
 
 
17-May-2006
All systems are now back in duty after the blackout Saturday.
Karlsen suffered most from the accident as several disks went bad, but it was possible to shuffle the good disks around (there are more than 70 disks in Karlsen!) and get all important filesystems online again.
 
 
13-May-2006
Major problem in the power-supply to the machineroom.
All our systems suffered from the power blackout this morning at 6:30. An interfacecard in a 40 kVA UPS went bad and caused a short but fatal loss of power.
Technicians from the UPS-vendor came onsite Saturday afternoon, and were able to repair the UPS and get power back to the computerroom. The IBM-cluster, Freke, the Farmor-cluster and The Archive have been restarted. The systems still unavailable are Karlsen, Snehvide, the Loke-cluster and the UNI-C PD-server. These systems will not be available until Monday.
 
 
2-May-2006
CSCAA acquires Opteron cluster from SUN.
The system will be installed in June 2006, see more information here.
 
 
31-Mar-2006 10:00
Freke was hanging since yesterday evening. It was necessary to reboot the system to bring it alive again. The problem seems to be related to what is going to be fixed 4th April anyway, when the system will be closed for maintenance.
The system is available again, but all jobs were lost. Also please remember, that it will be closed again in 4 days!
 
 
10-Mar-2006
Freke will be closed Tuesday 4th April and maybe also Wednesday 5th April for regular HW-maintenance.
Some I/O-cards need to be shuffled and some powersupplies need to be replaced.
 
 
2-Mar-2006 15:00
Freke was hanging, due to excessive use of memory. Again we experienced Out Of Memory Kills causing the queuing system to restart. Most of the jobs therefore restarted at 16:20 overwriting their files!
As this has happened several times we will enforce use of memory requirement specifications upon job-submit. More information on that will follow.
 
 
16-Feb-2006
The 8 CPU node in the IBM-cluster, Munin, has crashed due to an error in the power supply. Munin will not be available again before next week.
 
 
26-Jan-2006 11:00
Freke crashed due to a kernel panic in the I/O-system. All jobs were lost!
The system is available again, the scratchfiles from the running jobs are all saved in /scratch/save.
 
 
24-Dec-2005 00:40
Freke is currently very unstable. We tried to reboot the system half an hour ago but it didn't help; the system froze shortly after the reboot. The system might be unavailable during this Christmas.
 
 
15-Dec-2005
The SGI Altix system, Freke, will be closed tomorrow at 11:00. The system will be moved to a new computerroom. We expect the system available again tomorrow at 18:00.
 
 
24-Nov-2005
The Archive is available again after the buildings work in the computerroom.
 
 
21-Nov-2005 15:40
Due to a hang, the IBM-node Hugin has been restarted. All jobs running on Hugin were lost, look in /scratch/save and /localscratch/save for old scratchfiles.
 
 
21-Nov-2005
A full filesystem on the IBM clusternode Fenris caused the LoadLeveler queueing system to close down, and terminate all the batchjobs on that node.
The problem has been fixed, and Fenris is now available again.
 
 
16-Nov-2005 13:00
The SGI Altix system has been closed for maintenance this morning. It is now available again.
 
 
8-Nov-2005
The SGI Altix system, Freke, will be closed Wednesday 16th Nov. at 11:00 for preventive SW and HW maintenance. All running jobs will be terminated.
The system is expected available again at 18:00 that day.
 
 
2-Nov-2005 10:00
Freke crashed tonight at 2:00 due to a software bug in the kernel's memory management subsystem. SGI will investigate the problem further.
All jobs were lost, but, as usual, you may find your scratch-files in /scratch/save/...
 
 
27-Oct-2005 14:45
Freke crashed again. SGI is investigating why it has become so unstable. We expect the system available again at 16:00 today.
 
 
20-Oct-2005 13:45
The SGI Altix system, Freke, has been repaired and is now ready for use. Please find your old scratchfiles in /scratch/save
 
 
19-Oct-2005
The SGI Altix system, Freke, suffers from a severe hardware error in the I/O system. We will try to reboot the system this morning (Wednesday), but the system will be closed again tomorrow at noon for replacing the defect parts.
 
 
3-Oct-2005
The Archive temporarily closed.
Due to building works in the "small" computerroom, the Archive system will be closed from 5-Oct-2005 until 28-Nov-2005.
 
 
29-Sep-2005
Freke had a hang this morning, presumably due to a temporary loss of the connection to the I/O-system. The system has been restarted, but all running jobs were lost. Check /scratch/save for scratchfiles, and clean up.
 
 
8-Sep-2005 14:30
The Freke system has been repaired and all 64 CPUs and 256 GB memory are now available again. Please check /scratch/save and /scratch/save.old for old scratch-files. They will be removed from these directories in one week as we need the space for the running jobs.
 
 
7-Sep-2005 15:30
Major power outage in the Aarhus area
A major power outage in the Aarhus area this morning caused several of our computers to become unavailable.
Freke lost 4 CPUs due to the power outage. These CPUs will be replaced tomorrow afternoon, and therefore the system will be closed at that time!
 
 
22-Aug-2005
The Archive: Due to building works in the "small" computerroom, the Archive system will be closed for a longer period during this autumn. We don't yet know the exact schedule for the works, but you'll be informed here as soon as we know the plans.
 
 
14-Jul-2005
IBM/Sleipner: A bad disk was causing some trouble this morning. All jobs on Sleipner were aborted and no scratchfiles could be saved. The disk has been replaced, and the system is back in duty.
 
 
27-Jun-2005 16:30
Freke crashed again.
The system has been rebooted and jobs are restarted. Please find your old scratchfiles in /scratch/save*
 
 
23-Jun-2005 6:00
The SGI Altix system, Freke, crashed tonight - reason currently unknown.
The system has been rebooted and jobs are restarted. Please find your old scratchfiles in /scratch/save*
 
 
14-Jun-2005
The bad disk in Freke was replaced, and the system is now fully available again.
 
 
11-Jun-2005
Freke, the SGI Altix, crashed tonight. All jobs were lost!
The system has been restarted and is ready for use, but the scratcharea is very limited due to a bad disk. The system will be closed Tuesday 14th June for replacement of the faulty disk.
 
 
9-Jun-2005
Due to a bad disk the /gpfsscratch filesystem was lost this morning. The filesystem holds scratch-data for the two small nodes Hugin and Munin.
 
 
30-Mar-2005
The batchjobs running on Hugin, Munin and Fenris (but not Sleipner) didn't survive the network interruption this morning. Please resubmit your jobs.
 
 
11-Mar-2005
Freke, the SGI Altix 3700 system, is now available for use.
Contact staff to get a username on the system.
The system will be closed Friday 18th March for installing additional memory and disks.
 
 
28-Feb-2005
Freke, the SGI Altix 3700 system, has arrived.
During the next couple of days SGI technicians will install the system, and run some tests. We will keep you informed here.
 
 
18-Jan-2005
All our machines are available again.
Please clean up as we desperately need the diskspace.
 
 
2-Dec-2004
All computers run by CSCAA will be closed Friday 7th January 2005 at 12:00 due to renovation of the power supply in the building where the computerroom is. At the same time some building work in the computerroom will be done too. The systems will stay closed for (at least!) 10 days.
The affected computers are: The IBM-cluster (Sleipner, Fenris, etc), the SGI-Origin 2000 system (Karlsen), the LOKE cluster, the SNEHVIDE cluster, The Archive and the Public Domain Software server of UNI-C (pdsrv4.uni-c.dk).
The systems will be available again from the 17th January.
 
 
23-Sep-2004
The previously announced close down of our computers at Saturday 9th October has been canceled. The Buildings Department at The University of Aarhus has decided to postpone the work with the powersupplies and the buildings to the start of the new year.
 
 
10-Aug-2004
On Hugin and Munin, a new fast scratchfilesystem has been set up to speed up the execution of especially Dalton jobs.
 
 
5-Aug-2004
A new data archiving system - The Archive - has been put into duty, and is ready to use for all users of CSCAA systems, and users from the Department of Physics and Astronomy, University of Aarhus. More information here.
 
 
5-Aug-2004
Gaussian 03 Rev. B05 has been installed on the IBM cluster, and is available for all users affiliated with The University of Aarhus.
The easiest way to use G03 is via the new jobsubmission utility, subg03. Just type subg03 inputfile (where inputfile is the Gaussian command file). The script will automatically create a batchjob - serial or parallel according to the %nproc directive in the inputfile - and submit the job to the system.
See subg03 --help for further instructions on how to use the subg03 utility.
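As a rough illustration of how a submission script could choose between a serial and a parallel job from the %nproc directive, here is a minimal Python sketch. It is not the subg03 source; the function names are hypothetical, and it assumes the directive appears on its own line in the form %nproc=N.

```python
import re


def requested_procs(gaussian_input):
    """Return the CPU count from a %nproc line, or 1 if none is present."""
    for line in gaussian_input.splitlines():
        # Assumed form: "%nproc=N", case-insensitive, optional spaces
        match = re.match(r"\s*%nproc\s*=\s*(\d+)", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return 1


def job_type(gaussian_input):
    """Classify the job as 'parallel' if more than one CPU is requested."""
    return "parallel" if requested_procs(gaussian_input) > 1 else "serial"


example = "%nproc=4\n# HF/STO-3G\n"
print(job_type(example))  # parallel
```

An input file without a %nproc line is treated as a serial job in this sketch.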

Also, GaussView has been installed on Sleipner (the interactive node). To run GaussView, just type rungv. As it is a graphical application it requires access to your X-server, so please ensure that you connect to Sleipner with X11-forwarding enabled (ssh -X sleipner.cscaa.dk).

 
 
10-Jun-2004
A gigabit datamanagement backbonenet has been installed. NFS crossmounts, backups and access to the archive are now done via a new gigabit network, separating these tasks from the normal interactive work done by users.
 
 
9-Feb-2004
Matlab 6.5 has been installed on the IBM interactive node, Sleipner. Users from The Faculty of Science at The University of Aarhus are entitled to use the Matlab software.
 
 
25-Feb-2004
All nodes in the IBM system will be closed Tuesday, 24-Feb-2004 8:30 for installation of new hardware: The memory in Fenris will be doubled to 128 GB, and the userfilesystem /gpfshome will be doubled to ~2 TB. Please observe that no scratchfiles from jobs on Hugin and Munin will be kept!
 
 
22-Jan-2004 20:30
A disk in the /scratch -filesystem on Sleipner crashed this morning. All jobs on Sleipner were aborted, and all files on the filesystem were lost.
The bad disk has been replaced, and the system is now available again.
 
 
1-Dec-2003
CSCAA's nameserver, webserver and mailserver have been unavailable this weekend. Users may have experienced problems reaching the SGI and IBM systems due to lack of nameservice.
 
 
17-Nov-2003
A faulty CPU on Karlsen, which has caused two crashes recently, has been identified and disabled. We hope the system will now become stable again.
 
 
15-Nov-2003 13:15
Karlsen crashed again - the system seems to be highly unstable.
We will try to restart the system Monday morning.
 
 
14-Nov-2003
Karlsen crashed tonight.
The system has been restarted, and is available again.
 
 
29-Oct-2003 11:20
Karlsen crashed due to a CPU-fault.
The system was able to start, but only with 63 CPUs available.
Please check your files in /scratch/saveaftercrash
 
 
26-Sep-2003 16:30
Sleipner crashed last night due to a bug in a SCSI-device driver. The crash was triggered when an IBM technician ran a monitoring tool, which exploited the bug.
Sleipner is now available again.
 
 
16-Sep-2003
The queues (classes) q4 and q8 on the IBM-cluster have been phased out. Instead use the queue qpar. qpar is for jobs requiring 2 to 8 CPUs.
 
 
10-Sep-2003
Statement from IBM to the users of CSCAA regarding the instability of the IBM-cluster in the last 7 months:

IBM is continuing to investigate any further JFS2 problems and we understand the need for system stability. The laboratory has not experienced these problems during any extended testing. At this time IBM's field data does not indicate these problems are occurring in other installations. IBM has sent skilled laboratory resources from Austin on site at the University to make the migration from JFS2 to JFS. The tuning performed on site should enable the environment to run on JFS without any noticeable performance degradation. Preliminary results indicate that this was the correct decision.

 
 
4-Sep-2003 11:00
The defective disk in Fenris has been replaced and Fenris is now available again.
 
 
3-Sep-2003
After the reconfiguration of the scratch filesystems yesterday, a small hardware error on one of the disks in Fenris has shown up. We decided not to risk a real disk crash by ignoring this error; instead the disk will be replaced tomorrow morning at 8:00. Fenris will be closed during the operation and running jobs on Fenris will be terminated. The rest of the IBM-cluster will not be affected by this.
Until tomorrow it is not possible to start new jobs on Fenris.
 
 
2-Sep-2003 17:15
Fenris and Sleipner have been reconfigured and are now available again.
 
 
1-Sep-2003
Eight months after delivery, the IBM-system has not yet proven to be stable. The main reason for this is that the kernel code for the JFS2-filesystem has a bug. IBM claims that it is a very complex problem which cannot be solved "shortly".
We have told IBM that this is a precarious situation and that a workaround must be implemented immediately. We therefore agreed that the /scratch-filesystem on Sleipner and Fenris will be converted from JFS2 to JFS. This apparently small change hides the fact that JFS is a substantially older and well-tested filesystem, but with limitations in filesize (max 64 GB), filesystemsize (max 1 TB) and (maybe) performance.
The change will be done Monday+Tuesday 1st + 2nd September 2003. The batchsystem on Sleipner and Fenris will be closed during the operation, but jobs on Hugin and Munin will continue unaffected.
With this we hope the system will become stable. We emphasize that going back to JFS is temporary and that IBM must demonstrate a stable system now.
 
 
31-Aug-2003
The IBM-system is seriously defective.
Next week we will decide what to do with the system.
 
 
29-Aug-2003
The system certainly isn't stable - Fenris crashed at 4:00 last night...
All running jobs on Fenris were lost.
 
 
28-Aug-2003
It turns out that the problem we have had since Monday 25th August is far more complex than first assumed. IBM has not come up with a kernel that solves the problem, and as we cannot get any estimate of when a new kernel will be ready, we decided to start production on the old (erroneous/unstable) kernel.
The batchsystem on Sleipner+Fenris has been started again.
 
 
27-Aug-2003 10:10
The IBM lab in Austin might have found another bug in the new (Efix-) kernel which we installed two months ago. We will presumably get a new kernel which solves this bug soon (within 1 day). Until then the batchsystem on Sleipner+Fenris only accepts short jobs in qexp (wallclock limit = 1 hour). These jobs may be terminated when we get the new kernel. Hugin+Munin are available for jobs as normal.
 
 
25-Aug-2003 22:00
Sleipner crashed at 8:20 this morning.
Also, last night Fenris got a corrupted /scratch-filesystem.
Fenris and Sleipner have both been rebooted, causing all jobs to be lost!
Technicians from IBM are trying to find the reason for the problems; you are therefore able to log in to the system, but you are not able to run jobs on Sleipner and Fenris. We have no prognosis for when these machines will be available again.
Hugin and Munin run unaffected by all this.
 
 
18-Aug-2003
Karlsen has been moved behind the Firewall at CSCAA.
Connection to Karlsen from outside CSCAA's LAN can only be established with 'ssh'. We do not support unencrypted protocols such as ftp, telnet, rsh and rlogin through the firewall.
 
 
17-Aug-2003
Fenris had a hang last night which caused the GPFS filesystems to hang. Unfortunately it was not possible to determine what caused the hang. All jobs on Fenris were lost.
 
 
12-Aug-2003 11:10
The system parameters on Sleipner and Fenris have been adjusted and the IBM cluster is now available again. Please note that it was necessary to terminate all running jobs on Sleipner and Fenris. The scratchfiles from these jobs can be found in /scratch/save
 
 
8-Aug-2003
The IBM-cluster will be closed Tuesday 12th August 10:00-16:00 for adjustment of system parameters. All running jobs on Fenris and Sleipner will be terminated. The queues on Sleipner and Fenris have been stopped so that new jobs don't start.
 
 
4-Aug-2003
IBM claims that the problems we had with our system are solved!
The new kernel we got from IBM's lab a month ago to solve the fatal JFS2 filesystem bug introduced new semantics and functionality with respect to the "system parameters". This new behavior wasn't documented along with the new kernel, and caused the system to "hang" several times. IBM has now corrected the settings of the parameters, and believes that the system will be stable now.
 
 
1-Aug-2003
Once again IBM claims that the stability problem with our system has been identified and a fix has been implemented.
 
 
31-Jul-2003
The IBM system is still unstable. The problems are not solved!
We are very sad and disappointed that IBM cannot solve the problems.
 
 
30-Jul-2003 16:00
IBM claims that the problems with our system are solved!!
The system is open again and ready for jobs.
 
 
29-Jul-2003 6:30
Sleipner crashed again yesterday morning. The system is highly unstable.
It is not open for users right now.
 
 
27-Jul-2003 22:00
Sleipner and Fenris crashed today. All running jobs were terminated.
IBM is trying to solve the problems, but we don't know yet what caused the problems. The system has been restarted and is open for users again.
 
 
26-Jul-2003 21:00
Sleipner crashed at 7:15 this morning. The rest of the cluster stalled until Sleipner was rebooted at 14:00.
 
 
19-Jul-2003 14:15
Sleipner had a hang this morning. The machine has been rebooted and jobs which were running on Sleipner were aborted/restarted. At this moment we do not know if this is a new error. Sleipner was running on the new kernel which was supposed to fix the problems we have had over the last 7 months.
Jobs on the other nodes seem to be unaffected.
 
 
8-Jul-2003 11:15
Sleipner crashed this morning, probably due to the JFS2 filesystem bug. We have now rebooted Sleipner on the new kernel which fixes this bug.
All jobs on Sleipner were aborted; please check /scratch/save for the scratchfiles. Jobs on the other nodes seem to run unaffected.
 
 
1-Jul-2003
IBM's lab found the reason for the last crash. It has been fixed and Fenris has been opened again. Let's hope this time...
 
 
28-Jun-2003 15:00
Fenris crashed again. It has blocked the filesystems, making it impossible to log in. Tonight I will try to reboot Fenris in order to release the filesystem lock.
 
 
27-Jun-2003
The /scratch -filesystem on Fenris is corrupt, although Fenris got a fixed kernel just two days ago.
IBM has now realized that the problems are caused by TWO bugs in the JFS2-filesystem code. The kernel we got the other day only fixed one of the bugs.
IBM today made a new kernel, which they claim fixes both bugs. Instead of waiting for Fenris to crash (it will, it has a corrupted filesystem!), we decided to install the new kernel right away. We will therefore reboot Fenris at 14:00. All running jobs on Fenris will be aborted.
 
 
26-Jun-2003
IBM claims that they might have solved the problems we have had over the last half year with the instability of the IBM machines. To implement the fix, it is necessary to replace the kernel on each machine, which so far has only been done on Fenris (it requires a reboot).
Fenris has been opened as execution node again, but please observe that it will be closed 5-Aug-2003 for replacing some memory-blocks.
 
 
23-Jun-2003 11:40
After the crash 22-Jun-2003 1:50 Fenris has been opened again, but only for jobs submitted to queue 'qfe'. Please observe, that Fenris is unstable and may crash again. For short jobs, it should be OK to use the machine.
 
 
18-Jun-2003
Fenris will be closed today at 10:00 for installation of a new kernel.
All running jobs on Fenris will be terminated. All other jobs will be unaffected.
 
 
12-Jun-2003
Hugin had a hang 10-Jun-2003. The machine has been repaired, but it was not possible to start LoadLeveler. IBM is working on a solution.
 
 
7-Jun-2003 8:15
Fenris crashed 5-Jun-2003 due to the same problem as before: a bug in the JFS2 driver. We received a new kernel from IBM but it didn't work.
Fenris is up and running now (on the old kernel) but as we expect a new kernel to be installed shortly, we decided ONLY to start the queue 'qfe' on Fenris. You can use this queue for both serial and parallel jobs BUT BE WARNED: as soon as we get the new kernel, Fenris will be rebooted to load that kernel (and all the jobs on Fenris will be terminated). For short jobs it should be ok to use Fenris, though.
 
 
4-Jun-2003 9:30
Sleipner crashed 2-Jun-2003 18:30. The reason was an error in the JFS2 filesystem driver - the same problem that we have experienced at least two times before. In our efforts to solve the problem, the OS has been upgraded to the newest maintenance level (Level 4) on all nodes. A reboot was required, causing all jobs on all nodes to be terminated. We are sorry about that, but we have to improve the stability of the IBM-systems.
It turned out that one of the disks in the systemdisk-mirror of Munin was defective. A spare disk will be installed this morning. Munin will be available when the disk has been installed.
 
 
13-May-2003 19:30
After a logical error in the /scratch filesystem on Sleipner was eliminated this morning, IBM's lab just gave us a "go" to start the queues on Sleipner again. All 80 CPUs in the IBM-cluster are now available.
 
 
12-May-2003
User meeting at CSC-AA May 30th 2003, 14:00 in room 520-732 at Institut for Fysik og Astronomi, Ny Munkegade bygn. 520, 8000 Aarhus C.
 
 
12-May-2003 10:30
Sleipner has been opened for users again. You can access your files as usual, and you can submit jobs. However only Fenris, Hugin and Munin will execute the jobs.
It may be necessary to close Sleipner at short notice if IBM needs it in their efforts to find a solution to the stability problems of last week.
 
 
11-May-2003 9:00
Sleipner crashed again Saturday evening ca. 23:30!!!
The system is currently unavailable to users. Jobs on Fenris, Hugin and Munin seem to run unaffected.
21:30
Sleipner is still closed (you will get "Permission denied, please try again." when you try to log in). IBM technicians in the USA are working on the problems this evening.
12-May-2003 1:30
IBM Austin, TX, reported that our crashdump has revealed an error in IBM's JFS2 filesystem. The lab will try to fix the problem ASAP. The system will stay closed, at least until we get further information.
 
 
9-May-2003 23:05
Sleipner crashed today at 12:15. IBM ran diagnostics on the system but found no evidence of HW errors. The problem might be in SW but it was not possible to determine the exact reason for the crash. Please observe that jobs running on Fenris, Hugin and Munin were unaffected by this. Only jobs running on Sleipner were terminated.
The system is now available again. Please check the scratchfiles in /scratch/save
 
 
5-May-2003
After a week with severe problems, we hope the system is stable now. IBM technicians from the USA, Germany and Denmark have debugged and checked the system over the weekend. Sunday morning we got the report from IBM telling us that the system is OK now. We are sorry for the inconvenience you may have experienced last week.
 
 
30-Apr-2003
The IBM-cluster is now available again. Fenris (32 CPUs) has been merged together with the existing cluster (48 CPUs). Please notice the following:
  • To connect, use: ssh sleipner.cscaa.dk
    If you connect to fenris.cscaa.dk you will be redirected to Sleipner.
  • All interactive work is now done on Sleipner, Fenris only acts as a compute-node.
  • Your files on Fenris have been copied to the directory-tree $HOME/FENRIS on the cluster.
  • The compiler has been upgraded to version 8.1. The old version 7.1 compiler is not available anymore.
  • Fenris will not be put into duty before tomorrow morning - we need to do some cleaning up.
 
 
27-Apr-2003 19:50
Due to a technical problem all jobs on Sleipner have vanished. At this very early stage, the problem seems to be caused by user-processes requesting too much memory, leaving no memory to system processes! This seems to have killed the master daemon in the queueing system and eventually all the jobs. We will investigate this problem further tomorrow, but as the system will be closed Tuesday 29th April, we will not start the queues before that.
 
 
15-Apr-2003
-- Updated 24-Apr-2003 --
Fenris AND the IBM-cluster (Sleipner, Hugin and Munin) will be closed Tuesday April 29th 2003 8:00 for merging Fenris together with the IBM-cluster AND for some rearrangements of the power supply in the computerroom. All jobs on the IBM-systems will be terminated. The total IBM system is expected to be available again Wednesday 30th April 24:00.
The SGI O2000 system, Karlsen, may be affected by this too. In this case all jobs on Karlsen will be terminated as well.
 
 
24-Mar-2003
The new IBM cluster has been opened for users. Connect with
ssh sleipner.cscaa.dk
(use same username & password as on Fenris).
Please read the messages that appear carefully.
 
 
20-Mar-2003
The main power supply to several buildings at the University Campus will be interrupted from 11-Jul-2003 16:00 until 14-Jul-2003 18:00 for renovation of the installations. Systems operated by CSCAA will not be affected, but irregularities in network access to the computers may occur during the period.
 
 
13-Mar-2003
IBM now reports all hardware problems solved. The remaining tasks are the configuration of the disk subsystem and establishing a GPFS filesystem between the 3 new machines. These things may take some time and users should continue to use Fenris as it will NOT be closed in the next couple of weeks.
 
 
10-Mar-2003
IBM has severe problems getting the global parallel filesystem between the 4 IBM computers to work properly. New hardware errors are emerging over and over again. We really hope that IBM will soon be able to solve the problems! Right now we don't have any prognosis.
 
 
24-Feb-2003
Finally we are ready to merge Fenris into the new IBM-cluster. Therefore Fenris will be closed Wednesday 26th February 2003 from 9:00 to 24:00. All jobs will be terminated.
 
 
13-Feb-2003
During the installation of the new IBM systems (one p690 with 32 CPUs/128 GB memory and two p655's each with 8 CPUs and 32 GB memory) we have encountered a lot of problems. IBM technicians are working hard to solve the problems, and as we prefer to present a fully functional environment to our users we will not open the new systems before all problems have been solved. We don't even have any prognosis at this moment.
 
 
29-Jan-2003
We have some problems with the queueing system (LoadLeveler). We expect the problems to be solved this morning; until then Fenris will stay closed.
 
 
27-Jan-2003
Fenris will be closed Monday 27th January from 8:00 to 24:00 for merging it into the new computer environment consisting of two IBM p690 and two IBM p655. All running jobs will be terminated. The SGI Origin 2000 system, Karlsen, will not be affected.
Please remember that all systems are now available only under the domain name "cscaa.dk", e.g. "ssh karlsen.cscaa.dk" or "ssh fenris.cscaa.dk".
 
 
14-Jan-2003
In order to prepare the power supply in the computerroom for the new machines arriving next week, Karlsen and Fenris will be closed Tuesday 14-Jan-2003 8:00-24:00.
 
 
12-Dec-2002
Karlsen hangs. A reboot will be performed this morning. All jobs will be terminated.
 
 
3-Dec-2002
Fenris will be closed Tuesday 3-Dec-2002 12:00-14:00 for installing new firmware. All jobs will be aborted.
 
 
23-Sep-2002
User meeting in Aarhus.
 
 
17-19-Sep-2002
Training seminar in Aarhus on "Optimization and Tuning on IBM Power4 based systems". See the Agenda for the seminar here (in PDF format).
 
 
5-Sep-2002
Opening of the Centre for Scientific Computing in Aarhus, CSC-AA. See the invitation to the opening celebration here (in PDF format).
 
 
8-Aug-2002
Fenris will be closed Monday 12-Aug-2002 9:00-12:00 for implementing new features in the queueing system. All jobs will be aborted.
 
 
8-Jul-2002
In July 2002 Kasper Hald will be acting system manager. Please direct all questions to him on phone 26821302 or e-mail staff@cscaa.dk
 
 
3-Jul-2002
Friday 5th of July 2002 at 10:00 both systems (Fenris/IBM and Karlsen/SGI) will be closed for installation of a "transient protector" in the power supply. The operation is expected to take about 4 hours. All jobs will be terminated.
 
 
1-Jul-2002
CSC-AA, Centre for Scientific Computing in Aarhus, has been established.