Thursday, September 24, 2015

Using a system installed conda to manage python environments

Motivation

I want to provide a common scipy stack across platforms, and possibly other python environments.  Anaconda provides binary packages that can be installed into a separate environment.  However, it is normally geared toward being installed and managed by a single user, while I want to be able to centrally manage the configuration.  With 4.1.6, I can basically use the conda upstream as is, with a couple of minor modifications.  I have a PR filed here with my changes.

Installation

At the root is the conda package manager.  I want to be able to install this from rpms, so I created a conda COPR.  This provides common conda and conda-env rpms for Fedora and EPEL7.  I've made conda-activate optional, as it installs /usr/bin/{activate,deactivate}, which are very generic names, although it makes loading the environments much simpler.
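For reference, installing from the COPR looks roughly like this on Fedora; the COPR project name below is a placeholder, and on EPEL7 you would drop a .repo file instead (as the Ansible snippet further down does):

dnf copr enable <owner>/conda       # placeholder project name
dnf install conda conda-env
dnf install conda-activate          # optional /usr/bin/{activate,deactivate} wrappers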

Configuration

The system conda reads /usr/.condarc as its config file.  This is an unfortunate location, but it's what the current code looks for.  I'd like to change this to /etc/condarc in the future.  The COPR conda package ships this default:
envs_dirs:
 - /opt/anaconda/envs
 - ~/conda/envs
pkgs_dirs:
 - /var/cache/conda/pkgs

So we:
  • Point to our local environments, installed in /opt/anaconda/envs.
  • Put the package cache in /var/cache.  This requires a patch to conda - https://github.com/conda/conda/pull/1637

Locally, I also point conda at our InstantMirror cache by setting "channels" and "channel_alias".
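For example, these are just additional keys in /usr/.condarc; the mirror hostname and channel name here are illustrative:

channel_alias: http://instantmirror.example.com/anaconda/
channels:
  - free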

Ansible


Configure everything and install the basic scipy env in ansible:

- name: Configure conda repo
  template: src=orion-conda.repo.j2 dest=/etc/yum.repos.d/orion-conda.repo

- name: Install conda
  package: name={{ item }} state=present
  with_items:
   - conda
   - conda-activate 
   - conda-env

- name: Configure conda
  copy: src=condarc dest=/usr/.condarc 
 
Then I have a conda_env.yml task to install and manage the environments:
 
- stat: path=/opt/anaconda/envs/{{ env_name }}
  register: conda_env_dir

- name: Create conda {{ env_name }} env
  command: /usr/bin/conda create -y -n {{ env_name }} {{ env_req }} {{ env_pkgs | join(" ") }}
  when: not (conda_env_dir.stat.isdir is defined and conda_env_dir.stat.isdir)

- name: Update conda {{ env_name }} env
  conda: name={{ item }} extra_args="-n {{ env_name }}" state=latest
  with_items: "{{ env_pkgs }}"
  when: conda_env_dir.stat.isdir is defined and conda_env_dir.stat.isdir

- name: Install conda/{{ env_name }} module
  copy: src=modulefile dest=/etc/modulefiles/conda/{{ env_name }}

Which is called like:
 
- include: conda_env.yml env_name={{ conda_env }} env_req={{ conda_envs[conda_env] }} env_pkgs={{ conda_pkgs }}
  with_items: "{{ conda_envs }}"
  loop_control:
    loop_var: conda_env
  tags:
  - conda

With defaults/main.yml defining the environments:

conda_envs:
  scipy: python=2
  scipy3: python=3
  scipy34: python=3.4

conda_pkgs:
- astropy
- basemap
- ipython-notebook
- jupyter
- matplotlib
- netcdf4
- pandas
- scikit-learn
- scipy
- seaborn

This uses the ansible conda module to manage the created conda environments.

Friday, September 4, 2015

Automatically (and hopefully securely) configure ansible-pull with a secure ssh git repository

    We are just starting to play around with using ansible to configure our systems.  Since we have a lot of laptops and other machines that shut themselves down when idle, we need to use ansible-pull to configure them.  I also use cobbler to provision our systems and wanted to be able to configure ansible-pull automatically as part of the install process.  The complicating factor is that since we do not want to have our playbooks public, we are using a ssh deployment key to get access to the git repository that ansible-pull will use.  So we needed a way to distribute the ansible private ssh key to the new systems.  Here is what I came up with:

* Create an ssh key pair for cobbler to use:

 
ssh-keygen -N '' -f ~/.ssh/id_rsa_cobbler


* Create a cobbler trigger to copy the ansible deployment key over to the newly installed system, in /var/lib/cobbler/triggers/install/post/ansible_key:

#!/bin/bash
[ "$1" = system ] &&
  /usr/bin/scp -i /root/.ssh/id_rsa_cobbler -o "StrictHostKeyChecking no" -p /root/.ssh/id_rsa_ansible ${2}:/root/.ssh/id_rsa_ansible

* In %post, add the cobbler public key (id_rsa_cobbler.pub) to /root/.ssh/authorized_keys and only give it permission to scp to /root/.ssh/id_rsa_ansible:

cat >> /root/.ssh/authorized_keys <<EOF
command="scp -p -t /root/.ssh/id_rsa_ansible",no-pty,no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-rsa AAAAB...==
EOF

* In %post, start up the sshd server so that cobbler can copy over the ssh key during the post install trigger:

/usr/sbin/sshd-keygen
/usr/sbin/sshd
 
* In %post, configure ansible-pull to run at each boot:

cat > /etc/systemd/system/ansible-pull.service <<EOF
[Unit]
Description=Run ansible-pull on boot
After=network-online.target
Wants=network-online.target
 
[Install]
WantedBy=multi-user.target
 
[Service]
Type=oneshot
ExecStart=/usr/bin/ansible-pull --url ssh://git@git.server.com/ansible-pull.git --key-file /root/.ssh/id_rsa_ansible
EOF
systemctl enable ansible-pull.service
echo localhost ansible_connection=local > /etc/ansible/inventory
 
* In %post, teach the machine about our git host:

echo [git.server.com]:51424,[10.10.10.10]:51424 ssh-rsa AAAA...== >> /root/.ssh/known_hosts

    This assumes we're using a local.yml playbook that has:

- hosts: localhost
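A minimal local.yml along those lines might look like the following; the role name is only a placeholder:

- hosts: localhost
  roles:
    - base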

Thursday, August 13, 2015

Skype hanging on login on Linux

I just spent a couple of days trying to figure out why Skype would hang after entering the login name and password.  Skype would just sit there and spin.  This only happened in a specific user's environment.  After comparing strace output from a successful login and a failed one, I noticed mmap2() failing:

clone(child_stack=0xe21ff364, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xe21ffba8, tls={entry_number:12, base_addr:0xe21ffb40, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xe21ffba8) = 5793
mmap2(NULL, 201330688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 <unfinished ...>
<... mmap2 resumed> )             = -1 ENOMEM (Cannot allocate memory)
 
I finally determined that this was pthread_create() spawning a new thread and then failing to allocate stack space for it.  However, it was trying to allocate 192MB for that stack, which is quite a lot!  Since Skype is a 32-bit application, its address space is limited, so stacks that size exhaust it quickly.  Sure enough, that user had:

  limit stacksize 192 megabytes

in their .cshrc file, a leftover from many years ago.  Removing that solved the problem.
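If you run into something similar, the effective limit is easy to check from the affected user's shell before reaching for strace:

limit stacksize    # csh/tcsh
ulimit -s          # bash/sh, reported in kB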

Tuesday, May 12, 2015

Compiling numpy 1.9.2 with Intel compiler on EL7

While trying to compile numpy 1.9.2 with the Intel compiler on EL7 I got the following error:

+ /opt/pyvenv/nwra-1.0/bin/python -c 'import pkg_resources, numpy ; numpy.test()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/__init__.py", line 170, in <module>
    from . import add_newdocs
  File "/builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/core/__init__.py", line 6, in <module>
    from . import multiarray
ImportError: /builddir/build/BUILDROOT/pyvenv-nwra-numpy-1.9.2-1.el7.x86_64/opt/pyvenv/nwra-1.0/lib64/python2.7/site-packages/numpy/core/multiarray.so: undefined symbol: __builtin_ia32_storeups

It turns out this was due to an install issue.  I had neglected to install the intel-compilerproc-vars-187 and intel-compilerprof-vars-187 packages.  I had assumed that they only contained the iccvars.*sh and ifortvars.*sh files, which I did not need.  However, they also contain crucial system headers, including /opt/intel/composer_xe_2015.3.187/compiler/include/xmmintrin.h, which would have superseded the gcc one and prevented the gcc implementation from sneaking in.
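A quick way to sanity-check the fix after installing those packages (paths taken from the build above) is to confirm the Intel header exists and see which rpm owns it:

ls /opt/intel/composer_xe_2015.3.187/compiler/include/xmmintrin.h
rpm -qf /opt/intel/composer_xe_2015.3.187/compiler/include/xmmintrin.h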

Thursday, May 7, 2015

Deploying A Scientific Python Environment

Rationale

I've been trying to figure out how to deploy a consistent set of scientific python applications and modules across different operating systems.  I wanted to try to satisfy the following goals:
  • Use rpm for package deployment
  • Build directly from Fedora packages
  • Re-use as much of the base OS environment as possible
What I'm testing out now is building a base python virtual environment that will be the location into which all the other packages I build will install.  Here's how it works:

Base environment package

I'm calling my base package 'pyvenv-nwra'.  This indicates that it is a python virtualenv, and specifies the name (NWRA).  The spec file is:

%global envname nwra

Name:           pyvenv-%{envname}
Version:        1.0
Release:        2%{?dist}
Summary:        NWRA python environment
License:        GPLv3+

BuildRequires:  python-virtualenv
%if 0%{?rhel} && 0%{?rhel} == 6
Requires:       environment-modules
%else
Requires:       environment(modules)
%endif

%description
NWRA python environment.

%package devel
Summary:        NWRA python environment development files
Requires:       %{name}%{?_isa} = %{version}-%{release}

%description devel
NWRA python environment development files.

%install
mkdir -p %{buildroot}/opt/pyvenv/%{envname}-%{version}
%if 0%{?rhel} && 0%{?rhel} == 6
source /opt/rh/python27/enable
%endif
virtualenv --system-site-packages --no-setuptools %{buildroot}/opt/pyvenv/%{envname}-%{version}
virtualenv --relocatable %{buildroot}/opt/pyvenv/%{envname}-%{version}
# Fixup buildroot
find %{buildroot} -type f -exec sed -i -e 's|%{buildroot}||g' '{}' +

# Needs to be in /etc to override EL python macros
mkdir -p %{buildroot}/etc/rpm
cat > %{buildroot}/etc/rpm/macros.zz-nwra-pyvenv << EOF
%%__python2 /opt/pyvenv/%{envname}-%{version}/bin/python2
%%pyvenv_name_prefix pyvenv-%{envname}-
EOF

mkdir -p %{buildroot}/usr/share/Modules/modulefiles/pyvenv/%{envname}
cat > %{buildroot}/usr/share/Modules/modulefiles/pyvenv/%{envname}/%{version} << EOF
#%Module 1.0

set prefix /opt/pyvenv/%{envname}-%{version}

prepend-path    PATH    \$prefix/bin
prepend-path    PYTHONPATH    \$prefix/lib/python2.7/site-packages
EOF


%files
/usr/share/Modules/modulefiles/pyvenv/%{envname}/%{version}
/opt/pyvenv/%{envname}-%{version}

%files devel
/etc/rpm/macros.zz-nwra-pyvenv

We install a virtual environment with --system-site-packages enabled and then strip out the buildroot paths.  We also create a -devel package with an rpm macro file to override the %{__python2} macro for when we build dependent packages.  Finally, we create an environment module file so our users can do:

module load pyvenv/nwra

to load the python environment.
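A quick sanity check that the module does what we expect (the path is what the 1.0 package above installs):

module load pyvenv/nwra
which python2    # should report /opt/pyvenv/nwra-1.0/bin/python2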

Mock configuration

We use mock to build our packages.  In order to have our environment in place we add to our config:

  config_opts['chroot_setup_cmd'] = 'install @buildsys-build pyvenv-nwra-devel'

so we automatically pull in pyvenv-nwra-devel.

If you use COPR, you can add it to the "Additional chroot packages" list.

Building dependent packages

We want to modify the packages we build so that they use the "pyvenv-nwra-" namespace.  This involves prefixing the old package name with "pyvenv-nwra-".  We also need to change the name of any dependencies we have already built.

This does involve more modifications than I would have liked, but without it system packages and newer pyvenv-nwra- (or other pyvenv-*) packages could not co-exist.  Using conditional macros, though, allows the packages to build normally in the normal build environments.

Example diff:
--- a/python-tornado.spec
+++ b/python-tornado.spec
@@ -4,7 +4,7 @@

 %global pkgname tornado

-Name:           python-%{pkgname}
+Name:           %{?pyvenv_name_prefix}python-%{pkgname}
 Version:        4.1
 Release:        1%{?dist}
 Summary:        Scalable, non-blocking web server and tools
@@ -38,16 +38,16 @@ ideal for real-time web services.
 %package doc
 Summary:        Examples for python-tornado
 Group:          Documentation
-Requires:       python-tornado = %{version}-%{release}
+Requires:       %{?pyvenv_name_prefix}python-tornado = %{version}-%{release}

 %description doc
 Tornado is an open source version of the scalable, non-blocking web
 server and and tools. This package contains some example applications.

 %if 0%{?with_python3}
-%package -n python3-tornado
+%package -n %{?pyvenv_name_prefix}python3-tornado
 Summary:        Scalable, non-blocking web server and tools
-%description -n python3-tornado
+%description -n %{?pyvenv_name_prefix}python3-tornado
 Tornado is an open source version of the scalable, non-blocking web
 server and tools.

@@ -57,12 +57,12 @@ reasonably fast. Because it is non-blocking and uses epoll, it can
 handle thousands of simultaneous standing connections, which means it is
 ideal for real-time web services.

-%package -n python3-tornado-doc
+%package -n %{?pyvenv_name_prefix}python3-tornado-doc
 Summary:        Examples for python-tornado
 Group:          Documentation
-Requires:       python3-tornado = %{version}-%{release}
+Requires:       %{?pyvenv_name_prefix}python3-tornado = %{version}-%{release}

-%description -n python3-tornado-doc
+%description -n %{?pyvenv_name_prefix}python3-tornado-doc
 Tornado is an open source version of the scalable, non-blocking web
 server and and tools. This package contains some example applications.

@@ -132,13 +132,13 @@ popd
 %doc python2/demos

 %if 0%{?with_python3}
-%files -n python3-tornado
+%files -n %{?pyvenv_name_prefix}python3-tornado
 %doc python3/README.rst python3/PKG-INFO

 %{python3_sitearch}/%{pkgname}/
 %{python3_sitearch}/%{pkgname}-%{version}-*.egg-info

-%files -n python3-tornado-doc
+%files -n %{?pyvenv_name_prefix}python3-tornado-doc
 %doc python3/demos
 %endif

I'm hoping this can be maintained fairly easily in a separate git branch.

SCL rat's nest

I took a bad turn trying to use the python27 SCL for EL6, since ipython requires it.  However, this completely isolates the python install from the system one and so completely changes the rpm namespace.  Everything needs to be built for the new python, and every rpm Requires needs to be changed.  The solution there may be the quick and dirty approach below:

Quick and dirty packages

I started out with the following approach, but then decided that I wanted something more modular.  But if you just want to build up a quick pyvenv you could do something like the following.  The major drawback to this approach is that you must do everything in one go - you can't build other packages later on top of this.

%global envname ipython

Name:           pyvenv-%{envname}
Version:        3.1.0
Release:        1%{?dist}
Summary:        An enhanced interactive Python shell

License:        (BSD and MIT and Python) and GPLv2+
URL:            http://ipython.org/

BuildRequires:  python-virtualenv

%install
mkdir -p %{buildroot}/opt/pyvenv/%{envname}-%{version}
virtualenv -v --system-site-packages %{buildroot}/opt/pyvenv/%{envname}-%{version}
source %{buildroot}/opt/pyvenv/%{envname}-%{version}/bin/activate
pip install --upgrade pip
pip install -v --no-use-wheel '%{envname}[all]'
echo y | pip -q uninstall setuptools
echo y | pip -q uninstall pip
deactivate
virtualenv --relocatable %{buildroot}/opt/pyvenv/%{envname}-%{version}
# Fixup buildroot
find %{buildroot} -type f -exec sed -i -e 's|%{buildroot}||g' '{}' +

%files
/opt/pyvenv/%{envname}-%{version}

and if you really wanted to try to leverage system packages for requirements you could add:

BuildRequires:  gcc-c++
BuildRequires:  python-jinja2
BuildRequires:  python-jsonschema >= 2.0
BuildRequires:  python-mistune >= 0.5
BuildRequires:  python-mock
BuildRequires:  python-nose >= 0.10.1
%if 0%{?fedora}
BuildRequires:  python-numpydoc
%endif
BuildRequires:  python-pygments
BuildRequires:  python-requests
BuildRequires:  python-sphinx >= 1.1
# Not in Fedora
#BuildRequires:  python-terminado >= 0.3.3
%if 0%{?fedora} >= 23
BuildRequires:  python-tornado >= 4.0
%endif
BuildRequires:  python-zmq >= 13
# Need to specify requires satisfied by the system
Requires:       python-jinja2
Requires:       python-jsonschema >= 2.0
Requires:       python-mistune >= 0.5
Requires:       python-mock
Requires:       python-nose >= 0.10.1
%if 0%{?fedora}
Requires:       python-numpydoc
%endif
Requires:       python-pygments
Requires:       python-requests
Requires:       python-sphinx >= 1.1
# Not in Fedora
#Requires:       python-terminado >= 0.3.3
%if 0%{?fedora} >= 23
Requires:       python-tornado >= 4.0
%endif
Requires:       python-zmq >= 13

Thursday, February 6, 2014

Running MPI tests in Fedora packages

Getting Started

I'm excited about getting started building my packages for Fedora EPEL 7.  Most of them are scientific in nature, and near the bottom of the stack is the HDF5 library.  Unfortunately, I quickly ran into a problem with running the mpich MPI tests in the package - the test would hang immediately:


make[4]: Entering directory `/builddir/build/BUILD/hdf5-1.8.12/mpich/testpar'
============================
Testing  t_mpi 


Time to reproduce this on my local mock builder...

More Problems...

So I fired off a local mock build, and lo and behold, a different issue.  It made it past the above hang (making it impossible to debug that locally), but then hung later:

Testing   t_cache
===================================
Parallel metadata cache tests
        mpi_size     = 4
        express_test = 1
===================================
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test
files in a
different directory or to add file type prefix. E.g.,
   HDF5_PARAPREFIX=pfs:/PFS/user/me
   export HDF5_PARAPREFIX
*** End of Hint ***
0:setup_rand(): seed = 138071.
3:setup_rand(): seed = 149196.
2:setup_rand(): seed = 160135.
1:setup_rand(): seed = 180134.
Testing server smoke check
PASSED
Testing smoke check #1 -- process 0 only md write strategy


I spent a lot (too much) time poking around with strace, gdb, etc. trying to pin this down.  Finally though I sent off a message to the helpful folks at help@hdfgroup.org.  Word came back about a known issue when running with more MPI processes than physical cores.  And this was indeed the case here.  So I'm now working around this by limiting the number of MPI processes in the tests to 4, which I'm told is the minimum number of cores on the Fedora builders.

But I was back to where I started.  So I decided to see if this was just an HDF5 issue, or if it would show up elsewhere.  So I picked on the poor, hapless BLACS package...

BLACS

BLACS is fairly simple MPI code.  But the package didn't (yet) have a %check section.  However the code did have a couple of tests that could be built and run directly, which would prove helpful later.  So starting with the rawhide package I built and ran the tests, and lo - a hang!  But this time only on the 32-bit build, which didn't match my original issue.

gdb fairly quickly pinpointed it getting stuck in a BLACS test routine - specifically a subroutine copied from the LAPACK library.   BLACS is pretty old (1997!) and I noticed that the current LAPACK library had a pretty reworked version of that routine, so I built against the system LAPACK and that fixed it.  Yet another strike against bundled libraries!

Now the mpich tests were completing and we were on to the openmpi version.  But that failed with:

 It looks like orte_init failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during orte_init; some of which are due to configuration or
 environment problems.  This failure appears to be an internal failure;
 here's some additional information (which may only be relevant to an
 Open MPI developer):

   orte_plm_base_select failed
   --> Returned value Not found (-13) instead of ORTE_SUCCESS

What a lovely error message!  Luckily, Google quickly pointed to the lack of an ssh or rsh binary - which is the case in the minimal build roots on Fedora.  A BR on openssh-clients took care of that for now, though I've filed an issue to see if we can remove this requirement.

EPEL7

We were building okay on rawhide; now for the smoke test again on EPEL7.  And there it was again, an immediate hang in the mpich test!  I still couldn't reproduce it locally, so I added a BR on strace and did a scratch build:

+ strace -f mpirun -np 4 ./xCbtest_MPI-LINUX-0
execve("/usr/lib64/mpich/bin/mpirun", ["mpirun", "-np", "4",
"./xCbtest_MPI-LINUX-0"], [/* 46 vars */]) = 0
....
[pid  7662] execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x",
"buildvm-11.phx2.fedoraproject.or"..., "\"/usr/lib64/mpich/bin/hydra_pmi_"...,
"--control-port", "buildvm-11.phx2.fedoraproject.or"..., "--rmk", "user",
"--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10",
...], [/* 46 vars */]) = 0
....
[pid  7662] write(2, "ssh: Could not resolve hostname "..., 94) = 94
[pid  7661] <... poll resumed> )        = 1 ([{fd=10, revents=POLLIN}])
[pid  7662] exit_group(255)             = ?
[pid  7661] fcntl(10, F_GETFL)          = 0 (flags O_RDONLY)
[pid  7661] fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid  7661] fcntl(2, F_GETFL)           = 0x1 (flags O_WRONLY)
[pid  7661] fcntl(2, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid  7661] read(10, "ssh: Could not resolve hostname "..., 65536) = 94
[pid  7661] write(2, "ssh: Could not resolve hostname "..., 94ssh: Could not
resolve hostname buildvm-11.phx2.fedoraproject.org: Name or service not known
) = 94
[pid  7661] gettimeofday({1391730785, 32416}, NULL) = 0
[pid  7661] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
events=POLLIN}, {fd=10, events=POLLIN}], 4, 4294967295 <unfinished ...>
[pid  7662] +++ exited with 255 +++
<... poll resumed> )                    = 2 ([{fd=8, revents=POLLHUP}, {fd=10,
revents=POLLHUP}])
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7662, si_status=255,
si_utime=0, si_stime=0} ---
brk(0)                                  = 0x1e0b000
brk(0x1e3a000)                          = 0x1e3a000
fcntl(8, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
fcntl(1, F_SETFL, O_WRONLY|O_NONBLOCK)  = 0
read(8, "", 65536)                      = 0
close(8)                                = 0
read(10, "", 65536)                     = 0
close(10)                               = 0
gettimeofday({1391730785, 33070}, NULL) = 0

And there we're stuck.  So it looks like mpich is making use of the ssh binary now provided for the openmpi tests, but because networking is completely disabled on the Fedora builders it seems to trigger a bug in mpich.  But it suggested a workaround - specifying "-host localhost", which worked!

Summary

So, to summarize necessary steps for running MPI tests in Fedora packages:

  • Add BuildRequires: openssh-clients for openmpi tests.
  • Add "-host localhost" to mpich test runs.
  • Limit MPI processes to 4 or less. 
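Putting those together, the relevant spec snippets might look roughly like this; the test binary name is illustrative, and each mpirun assumes the matching MPI environment is loaded for that build:

BuildRequires:  openssh-clients

%check
# mpich: force localhost and cap the MPI process count to dodge the hangs above
mpirun -host localhost -np 4 ./some_mpi_test
# openmpi: just needs the ssh binary from openssh-clients in the buildroot
mpirun -np 4 ./some_mpi_test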
HTH - Orion