mcgi test 0007




fsanz
Project administrator
Project developer
Volunteer developer
Project scientist
Joined: 23 Sep 09
Posts: 84
Credit: 1,071,256
RAC: 0
Message 686 - Posted: 28 Jan 2014 | 16:52:57 UTC

We are testing automatic job submission for the mcgi application (submission from the researcher to our servers).

We have launched a job that contains 2000 WUs. You can check its status at http://registro.ibercivis.es/jobs.php and at http://registro.ibercivis.es/subprojects.php by selecting mcgi.

Greetings to all

[PUGLIA] Riccardo
Joined: 16 Apr 10
Posts: 5
Credit: 15,382
RAC: 0
Message 710 - Posted: 30 Jan 2014 | 18:06:58 UTC - in response to Message 686.

Hi,
a few minutes ago I had to abort the last 2 WUs of "mcgi, beta app v0.06" that I had been assigned.

All the WUs I received in the last few days made the PC lag (5-6 seconds of total freezing, mouse included), and the load on the CPU cores kept going up and down continuously.
It was surely an Ibercivis problem, because when I suspended the project the CPU went back to working normally :(

Also: the 4 WUs I was crunching this morning ended in error when I restarted BOINC (the client too), which I needed to do to install optimized SETI apps.


I hope this helps, but it is a work PC and I cannot let them keep running :(
Riccardo.

fsanz
Message 714 - Posted: 30 Jan 2014 | 19:16:29 UTC - in response to Message 710.
Last modified: 30 Jan 2014 | 19:28:24 UTC

Hello Riccardo, thanks for your feedback.

Just one question: can you confirm that it happened at the beginning of the workunit? I ask because at the very beginning the mcgi app needs to write to the hard disk several times, and it seems we may need to optimize that.

Regards

Francisco

[PUGLIA] Riccardo
Message 718 - Posted: 31 Jan 2014 | 9:02:00 UTC - in response to Message 714.

I aborted them after 1 hour of work (with almost 3 hours remaining), but the progress bar was still under 1%.
I tried to let them run all day, but the PC slowed down every time they were running.

And yes: the heavy writing to the hard disk could have been the cause. Opening folders in my editor, watching VLC/YouTube and loading complex web pages was almost impossible :(

Bye, R!

BobCat13
Joined: 28 May 08
Posts: 1
Credit: 129,639
RAC: 0
Message 726 - Posted: 2 Feb 2014 | 16:00:14 UTC - in response to Message 686.

I'm not sure all of the results you are receiving are correct. It looks like some Windows machines are failing after a few seconds, but their results are declared Valid.

http://registro.ibercivis.es/result.php?resultid=376336294
http://registro.ibercivis.es/result.php?resultid=376243570
http://registro.ibercivis.es/result.php?resultid=376240258

All three of those have the following two lines:

    BOINC client no longer exists - exiting
    timer handler: client dead, exiting


Those look like the science application stopped well short of a complete run, but they were listed as Valid. This machine in particular has over 200 of these short-run Valids:

http://registro.ibercivis.es/results.php?hostid=119325&offset=0&show_names=0&state=0&appid=178

fsanz
Message 742 - Posted: 3 Feb 2014 | 16:17:57 UTC - in response to Message 726.

You're right, BobCat13: some of the results we are receiving are not correct. As you say, and as the links you provided show, these WUs did not complete (for example, one of them ran only 15600 of 100000000 steps).

It seems that these WUs did not restart properly; I need to look into it in depth. Why are they marked as valid? For the mcgi app we are currently using the sample trivial validator, which means that every WU whose execution finished without error (exit status = 0) is considered valid. We will certainly have to modify this validator before going to production (we are still in beta).

Thank you very much for your feedback.

Francisco
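
For readers following the validator discussion, here is a minimal sketch of what a stricter check than "exit status = 0" might look like. It is not the project's validator and does not use the BOINC validator framework; the function names and the idea that the completed and expected step counts can be read from each result are assumptions made for illustration. The two stderr markers are the lines BobCat13 quoted.

```python
# Hypothetical sketch: names and inputs are illustrative, not the mcgi validator.

# Lines quoted in this thread from the stderr of the short-run results.
CLIENT_LOSS_MARKERS = (
    "BOINC client no longer exists - exiting",
    "timer handler: client dead, exiting",
)


def is_result_valid(exit_status, steps_done, steps_expected):
    """Accept a result only if it exited cleanly AND ran every step.

    The trivial rule discussed above checks only exit_status == 0, so a run
    that stopped at 15600 of 100000000 steps would still pass it.
    """
    return exit_status == 0 and steps_done == steps_expected


def had_client_loss(stderr_text):
    """Report whether stderr shows the app lost contact with the client."""
    return any(marker in stderr_text for marker in CLIENT_LOSS_MARKERS)
```

In this sketch the step count is the deciding criterion. The stderr markers are only reported, not used to reject, since a workunit that lost the client once and later resumed correctly could presumably still finish all its steps.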

fsanz
Message 766 - Posted: 6 Feb 2014 | 15:15:44 UTC - in response to Message 742.

Hello

We are working on a new version of the mcgi application that does not use the wrapper. Using the BOINC API directly will let us define critical "atomic sections", in the sense that each one is either executed completely or not at all, never partially. This will improve the checkpointing.

Regards
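
To illustrate the "executed completely or not at all" idea for checkpoints, here is a minimal sketch of one common technique: write the new checkpoint to a temporary file and then rename it over the old one. This is an assumption about how such a section could be built, not the mcgi implementation. It is written in Python for brevity, whereas a native BOINC application would use the C/C++ API (for example `boinc_begin_critical_section()` and `boinc_end_critical_section()`). The file name and the JSON state layout are hypothetical.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Replace the checkpoint at `path` so readers see the old or the new
    version, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swaps the file in a single step
    except BaseException:
        # On any failure, drop the partial temp file and keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path, default):
    """Load the last complete checkpoint, or `default` if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

If the process is killed before the rename, the previous checkpoint is still intact, so a restart resumes from the last complete state instead of from a truncated file.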

