
Storage Systems Administrator

Facilities
Full-time
On-site
Nashville, Tennessee, United States
Description

The Storage Systems Administrator is part of the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University. As a key individual contributor, this position serves as a systems administrator on the ACCRE cluster team, which helps manage the >10,000-core Linux cluster and its various parallel distributed filesystems. This position reports to the Director of Research Computing Operations.


Computing is emerging as a third paradigm for discovery, complementing theory and experiment. To quote from a recent National Research Council report: "The exploding technology of computers and networks promises profound changes in the fabric of our world... As seekers of knowledge, researchers will be among those whose lives change the most... Researchers themselves will build this New World largely from the bottom up, by following their curiosity down the various paths of investigation that the new tools have opened. It is unexplored territory." The Advanced Computing Center for Research and Education (ACCRE) is being built and operated by Vanderbilt faculty. Its mission is to allow Vanderbilt researchers to define, benefit from, and explore the "New World" described above. Towards this aim, the center has established the following goals:



  • Low Barriers: Provide computational services with low barriers to participation, working with researchers to develop and adapt HPC tools to their avenues of inquiry.

  • Expand the Paradigm: Work with members of the Vanderbilt community to find new and innovative ways to use computing in the humanities, arts, and education.

  • Promote Community: Foster an interacting community of researchers and develop a campus culture that promotes and supports the use of HPC tools.


The center manages a Linux cluster of more than 10,000 processors, comprising multiple computer architectures, along with over 15 PB of disk storage.



Duties and Responsibilities
This position will be responsible for the critical aspects of the following:

ACCRE Storage Systems Administration



  • Maintain, administer, and improve ACCRE’s storage services

  • Set and implement user access controls and identity and access management systems

  • Aid in operational security implementations

  • Aid in triaging user support tickets

  • Troubleshoot hardware and software problems related to the storage

  • Be the primary support for remote user access to the storage systems

  • Support tape archive and data backup

  • Be a member of the team developing, deploying, and supporting a distributed NAS system

  • Work on adapting existing software tools to support the transport and management of research data between various storage pools both on and off campus. This will require additional work packaging and adapting tools for the various research communities



ACCRE Compute Cluster Administration



  • Set up/configure cluster hardware, including gateways, compute nodes, and cluster management infrastructure

  • Install operating system and related utility software

  • Monitor the status of the cluster utilizing tools such as Nagios, including customizing the tools for ACCRE-specific needs

  • Compile/install application software packages needed by researchers

  • Assist with the administration of the cluster job scheduler, including modifying user limits, creating/modifying/deleting node reservations, and diagnosing issues with the job scheduler 

  • Serve as a technical resource to users and other ACCRE staff members

  • Plan work for other team members to meet project guidelines

  • Train and lead other team members, as needed, and act as internal technical consultant to ACCRE staff, particularly related to projects on which this position is serving as the lead systems administrator


Other Responsibilities



  • Respond to help desk tickets to solve user problems and to educate users on cluster usage

  • On a rotating basis, serve as the on-call person for evening and weekend hours (for example, on a rotating 4-week schedule, or every other week in a Level 2 support rotation)

  • Work nights and weekends as needed for scheduled or unscheduled downtimes

  • Compile documentation in a timely manner for all ACCRE projects and tasks, both for new projects and for changes to ongoing projects

  • Physically move and lift hardware when needed

  • Actively identify and participate in training, education, and development activities to improve knowledge and performance and to sustain and enhance professional development

  • Keep up-to-date on software systems, operation procedures, and technological developments in systems, high performance computing, and programming

  • Research, design, and evaluate new technologies/concepts that could potentially improve ACCRE’s capabilities and/or services

  • Attend meetings, conferences, and seminars in systems, high performance computing, and programming, and in particular regularly attend and participate in OSG meetings. Give presentations on ACCRE services at conferences when requested

  • Work with outside companies to improve ACCRE services. Develop partnerships with vendors and service providers. Work with both software and hardware developers to implement needed customizations specific to our site requirements


Qualifications





  • Vanderbilt Export Compliance regulations designate that this position is limited to US citizens and permanent residents only.

  • A Bachelor’s degree from an accredited institution of higher education is required. 

  • The ability to physically move and lift hardware up to 50 pounds is required. 

  • Five years of experience in system administration with UNIX/Linux-based operating systems is required.

  • Three years of experience with parallel clustered storage solutions, including at least one of IBM Spectrum Scale (GPFS), AuriStor, PanFS, or OpenAFS, is required.

  • Demonstrated experience with Bash and/or Python scripting of moderate complexity is required.

  • Knowledge of and experience with Git version control is required.

  • Knowledge of and experience with configuration management tools such as Ansible is required.

  • Demonstrated self-driven, inquisitive, and productive troubleshooting abilities are required.

  • Strong ability to work individually and in a team environment is required.

  • Ability to adapt to new technological dynamics is required.

  • Strong ability to share knowledge coherently with others, both verbally and in writing, is required.

  • Demonstrated success in taking initiative, meeting deadlines, adapting to changing priorities, and managing multiple projects simultaneously is required.

  • Experience with Red Hat-based systems is preferred.

  • Experience in an HPC environment is preferred.

  • Experience with disk storage hardware (SAS, JBOD, RAID, HBA, RAID controllers, etc.) is preferred.

  • Elasticsearch experience is preferred.


Commitment to Equity, Diversity, and Inclusion


At Vanderbilt University, we are intentional about and assume accountability for fostering advancement and respect for equity, diversity, and inclusion for all students, faculty, and staff. Our commitment to diversity makes us who we are. We have created a community that celebrates differences and lets individuality thrive. As part of this commitment, we actively value diversity in our workplace and learning environments as we seek to take advantage of the rich backgrounds and abilities of everyone. The diverse voices of Vanderbilt represent an invaluable resource for the University in its efforts to fulfill its mission and strive to be an example of excellence in higher education.


 


Vanderbilt University is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran, or any other characteristic protected by law.
 


Please note that all candidates selected for an offer of employment are subject to pre-employment background checks. Depending on the role for which they have been selected, these checks may include, but are not limited to, criminal history, education verification, social media review, motor vehicle records, credit history, and professional license verification.