# Ab initio solution of macromolecular crystal structures without direct methods

^{a}Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 0XY, United Kingdom; ^{b}Department of Clinical Biochemistry, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 0XY, United Kingdom; ^{c}Division of Matrix Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, 171 77 Stockholm, Sweden; ^{d}Cardiovascular and Metabolic Disorders Program, Duke-NUS (National University of Singapore) Medical School, 16957 Singapore; ^{e}Division of Molecular Structural Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, 171 77 Stockholm, Sweden


Edited by Axel T. Brunger, Stanford University, Stanford, CA, and approved February 27, 2017 (received for review January 30, 2017)

## Significance

It is now possible to make an accurate prediction of whether or not a molecular replacement solution of a macromolecular crystal structure will succeed, given the quality of the model, its size, and the resolution of the diffraction data. This understanding allows the development of powerful structure-solution strategies, and leads to the unexpected finding that, with data to sufficiently high resolution, fragments as small as single atoms can be placed as the basis for ab initio structure solutions.

## Abstract

The majority of macromolecular crystal structures are determined using the method of molecular replacement, in which known related structures are rotated and translated to provide an initial atomic model for the new structure. A theoretical understanding of the signal-to-noise ratio in likelihood-based molecular replacement searches has been developed to account for the influence of model quality and completeness, as well as the resolution of the diffraction data. Here we show that, contrary to current belief, molecular replacement need not be restricted to the use of models comprising a substantial fraction of the unknown structure. Instead, likelihood-based methods allow a continuum of applications depending predictably on the quality of the model and the resolution of the data. Unexpectedly, our understanding of the signal-to-noise ratio in molecular replacement leads to the finding that, with data to sufficiently high resolution, fragments as small as single atoms of elements usually found in proteins can yield ab initio solutions of macromolecular structures, including some that elude traditional direct methods.

Over the past century, determination of novel crystal structures has evolved from an exercise in logic identifying the locations of single atoms by inspecting diffraction patterns (1) or vector maps (2), through the development of direct methods for small molecules (3) and of isomorphous replacement (4, 5) or anomalous diffraction (6, 7) phasing for molecules as large as proteins.

Currently, about 80% of protein structures are solved by the method of molecular replacement (8), exploiting prior structural knowledge of related proteins. In principle, molecular replacement (MR) involves rotational and translational searches over many possible placements of a molecular model within the unit cell of an unknown structure. The most sensitive method of evaluating the fit to the observed data is a likelihood function (9, 10) that accounts for the effect of measurement errors in the observed diffraction intensities (11). Potential solutions are scored by the log-likelihood-gain on intensities (*LLGI*), the sum of the log-likelihoods for individual reflections minus the log-likelihoods for an uninformative model (*Methods*).

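Success in MR depends on the signal-to-noise of the search, which varies according to two parameters in the likelihood function: *D*_{obs}, which accounts for the effect of measurement error in the observed intensity, and σ_{A}, the fraction of the calculated structure factor that is correlated with the true structure factor. The value of σ_{A} can be estimated as a function of the fraction (*f*_{P}) of the X-ray scattering power accounted for by the model (where the total scattering power is the sum of the squares of the scattering factors for the atoms in the crystal), its estimated accuracy (rms error Δ), and the resolution (*d*) of the reflection (9), with (optionally) a correction for the effect of disordered solvent described by the parameters *f*_{sol} and *B*_{sol}:

$$\sigma_A = \sqrt{f_P\left[1 - f_{sol}\exp\left(-\frac{B_{sol}}{4d^2}\right)\right]}\,\exp\left(-\frac{2\pi^2}{3}\frac{\Delta^2}{d^2}\right) \qquad [\mathbf{1a}]$$

$$\sigma_A \approx \sqrt{f_P}\,\exp\left(-\frac{2\pi^2}{3}\frac{\Delta^2}{d^2}\right) \qquad [\mathbf{1b}]$$

The approximation in Eq. **1b** neglects the effect of disordered solvent at low resolution.

The signal for an MR search can be estimated before the calculation as the expected value, or probability-weighted average, of the *LLGI* for a correctly placed model. The expected value of the contribution of one reflection, ⟨*LLGI*⟩, can be approximated as (*D*_{obs}σ_{A})^{4}/2 (*Methods*), an approximation that is particularly good for the low values of *D*_{obs}σ_{A} that characterize difficult MR problems. We refer to the expected *LLGI*, summed over all reflections, as the *eLLG*.

The variance of *eLLG* can similarly be approximated as the sum over all reflections of (*D*_{obs}σ_{A})^{4} (*Methods*), which is twice the *eLLG*, so the expected signal-to-noise ratio of a search is proportional to √*eLLG*. By the same reasoning, the signal-to-noise ratio achieved in a particular search will be proportional to √*LLGI*. This interpretation of the *LLGI* value has been validated by analyzing a database of nearly 22,000 MR calculations, where an *LLGI* of 60 or more in a 6-dimensional rotation/translation search typically indicates a correct solution (Fig. 1, which also shows that the required signal scales with the number of degrees of freedom in the search). The database of test calculations also reveals that the translation function Z score (TFZ: the number of SDs by which the translation function peak exceeds its mean) is roughly on the same scale as √*LLGI*.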
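The estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not Phaser's implementation: it assumes the form σ_A = √(f_P[1 − f_sol exp(−B_sol/4d²)]) exp(−2π²Δ²/3d²), a per-reflection contribution of (D_obs σ_A)⁴/2, and a single D_obs shared by all reflections. The function names and the example numbers are hypothetical.

```python
import math

def sigma_a(f_p, delta, d, f_sol=0.0, b_sol=0.0):
    """Estimate sigma_A for one reflection at resolution d (Å).

    f_p   : fraction of the scattering power accounted for by the model
    delta : estimated rms coordinate error of the model (Å)
    f_sol, b_sol : optional disordered-solvent correction (0 switches it off)
    """
    solvent = 1.0 - f_sol * math.exp(-b_sol / (4.0 * d * d))
    accuracy = math.exp(-(2.0 * math.pi ** 2 / 3.0) * (delta / d) ** 2)
    return math.sqrt(f_p * solvent) * accuracy

def expected_llg(resolutions, f_p, delta, d_obs=1.0, f_sol=0.0, b_sol=0.0):
    """Sum the approximate per-reflection contribution (D_obs * sigma_A)**4 / 2."""
    total = 0.0
    for d in resolutions:
        x = d_obs * sigma_a(f_p, delta, d, f_sol, b_sol)
        total += 0.5 * x ** 4
    return total

# Hypothetical data set: 1,000 perfectly measured reflections, all at 2 Å.
reflections = [2.0] * 1000
ellg = expected_llg(reflections, f_p=0.25, delta=0.0)
print(ellg, math.sqrt(ellg))  # eLLG and the quantity that sets the signal-to-noise scale
```

With *f*_{P} = 0.25 and a perfect model (Δ = 0), σ_{A} is 0.5 at any resolution, so the 1,000 reflections contribute 1,000 × 0.5^{4}/2 = 31.25 to the *eLLG*. Increasing Δ or moving to higher resolution with an inaccurate model reduces each term rapidly.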

An *LLGI* at the level required to distinguish the correct solution from up to millions of alternatives can be achieved by predictable trade-offs among model quality, completeness, and resolution of the data used. For example, this theoretical insight explains why it is possible to place individual α-helices with better than random success in the Arcimboldo pipeline (12), but also why it is a great advantage to have data extending beyond 2-Å resolution: helices are preserved very well, so that Δ is small and data to the highest resolution will contribute to the signal. The theory also predicts, correctly, that calculations limited to around 10-Å resolution can give unambiguous MR solutions for ribosome structures, because of the large numbers of diffraction observations available to that resolution with the large ribosomal unit cell. Importantly, it also allows researchers to anticipate when MR is unlikely to succeed, so that they avoid fruitless calculations.

This insight led us to consider the most extreme example of a small fragment, i.e., a single atom. A single atom is a perfect partial model (Δ=0), for which *eLLG* can rise to a substantial number. This is particularly true for atoms that are somewhat heavier than average. For instance, the square of the scattering power of a sulfur atom (i.e., the fourth power of its scattering factor) is about 50× greater than that of a carbon atom at a very low resolution such as 10–20 Å; because scattering drops off less rapidly for sulfur, that ratio increases to about 300 at 1-Å resolution. This effect is amplified if a sulfur atom is better ordered than the average atom in the structure, because its relative scattering power becomes even greater. Furthermore, only half as much signal should be required to place a single atom with 3 degrees of freedom compared with a molecule with 6 degrees of freedom (Fig. 1). Our insights predict that, for crystals that contain up to a few thousand unique ordered atoms and diffract beyond about 1-Å resolution, there should be a significant signal in a likelihood search carried out by translating a single sulfur atom over all of its possible positions. Even if the placement of the first atom is ambiguous, the signal will increase quadratically with the number of atoms placed (Fig. 2), allowing the ambiguity to be resolved.
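The arithmetic behind these statements can be illustrated with a short Python sketch. It rests on two simplifying assumptions that are not part of the analysis above: at very low resolution an atom's scattering factor is taken to be its number of electrons (16 for sulfur, 6 for carbon), and the *eLLG* for *n* identical, correctly placed atoms is taken to be *n*^{2} times the single-atom value. The function names and the example values are hypothetical.

```python
def low_resolution_ellg_ratio(z_heavy, z_light):
    """Ratio of squared scattering powers, i.e. the fourth power of the
    scattering-factor ratio, using electron counts as low-resolution factors."""
    return (z_heavy / z_light) ** 4

def ellg_for_n_atoms(single_atom_ellg, n_atoms):
    """Quadratic growth of the expected signal with the number of equal atoms."""
    return single_atom_ellg * n_atoms ** 2

def atoms_needed(single_atom_ellg, target_ellg):
    """Smallest number of equal atoms whose combined eLLG reaches the target."""
    n = 1
    while ellg_for_n_atoms(single_atom_ellg, n) < target_ellg:
        n += 1
    return n

print(round(low_resolution_ellg_ratio(16, 6), 1))  # sulfur vs. carbon
print(atoms_needed(4.0, 60.0))                     # hypothetical single-atom eLLG of 4.0
```

The first value, (16/6)^{4} ≈ 50.6, reproduces the factor of about 50 quoted for low resolution. The second shows that, under these assumptions, a single-atom *eLLG* of 4.0 would reach 60 once four atoms are placed (4.0 × 4^{2} = 64).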

## Results

Test calculations on a number of systems proved the principle of single-atom MR: it was indeed possible to find sulfur atoms in a variety of protein crystals, as well as phosphorus atoms in one RNA crystal tested (Table S1). The largest structure that yielded to this approach was that of aldose reductase [Protein Data Bank (PDB) ID code 3bcj] (13). The protein has a mass of 36 kDa with 2,525 nonhydrogen atoms (2,606 including ligands) and no atom heavier than sulfur, and the deposited data extend to 0.78-Å resolution. The *eLLG* for a sulfur atom with a B factor equal to the average in the crystal is 4.0, or 12.6 for a well-ordered sulfur atom with a B factor reduced by only 1 Å^{2}. MR implemented in Phaser was able to locate up to 10 atoms with clear signal (Table 1).

A structure comprising a few atoms can then serve as a seed for structure completion by using log-likelihood-gradient maps to select locations for new nitrogen atoms (as a surrogate for other types) that improve the MR likelihood score (14) (*Methods*). Starting from as few as the first two atoms placed by MR, the structure of aldose reductase was extended successfully by log-likelihood-gradient completion. The result was a model with 3,051 atoms (some accounting for solvent molecules and for static disorder) that yields an *LLGI* of 483,292 and an R value of 12.9% (Fig. 3). In contrast, all attempts to solve this structure by direct methods or their dual-space variants (15, 16) have failed. As far as we can determine, it is the largest reported ab initio structure containing nothing heavier than the sulfur atoms found in natural protein sequences, although larger ab initio structures containing metal ions have been solved (17).

The formulation predicts that it should also be possible to place sulfur atoms in smaller structures at lower resolution. This was crucial in solving a previously unknown structure, the N-terminal domain (residues 22–95) of Shisa3, which crystallized in space group *P*4_{3}2_{1}2 and diffracted to 1.39-Å resolution. The protein did not have detectable sequence identity with any protein in the PDB, so there was no template structure for traditional MR. The *eLLG* calculations predict that there should be some signal for placing well-ordered sulfur atoms, giving an *eLLG* of 4.0 for a sulfur atom with a B factor reduced by 1.5 Å^{2} from the average. Indeed, up to seven of the eight sulfur atoms in this protein could be placed with good signal (Table 1).

Log-likelihood-gradient completion is expected to work more poorly at resolutions where atomic peaks are not resolved. Nonetheless, this succeeded in expanding the Shisa3 structure to a total of 56 atoms, with the additional atoms largely corresponding to well-ordered main-chain oxygen and nitrogen atoms. At this point, the phase information was sufficient to enable phase improvement by density modification in Parrot (18), and the resulting map could be interpreted in terms of an atomic model in ARP/wARP (19). A hybrid approach exploiting direct methods algorithms implemented in ACORN (17, 20) or in SHELXE (21) was also able to expand a partial structure obtained by single-atom MR. This succeeded when starting from as little as one pair of sulfur atoms (Fig. 4). The structure, which contains no α-helices and represents a protein fold with no detectable similarity to other structures in the PDB, was refined to an R value of 11.5% and has been deposited in the PDB with accession code 5m0w. Details of the structure will be discussed elsewhere.

## Discussion

This work brings together high-resolution ab initio phasing and low-resolution MR in one unified framework that spans the continuum of data and model quality, with the *eLLG* directing the tailoring of structure solution to the optimal path for the data available. It demonstrates the considerable practical impact, compared with traditional direct methods, of accounting rigorously for the effects of sources of error in a likelihood target. It is also important to note that these results have been obtained by a deterministic algorithm. Direct methods, in contrast, are invariably implemented within a random multisolution framework, an approach that should also improve the outcome of single-atom MR. Finally, the results were obtained without taking advantage of any other information that would typically be present, e.g., from single-wavelength anomalous diffraction (SAD) effects in crystals with intrinsic anomalous scatterers such as sulfur, or even from isomorphous replacement experiments. A proper accounting for the effects of uncertainty, as demonstrated here, should allow us to extend our approach to use even weak information from these other sources.

## Methods

### Formalism for the *eLLG* and Its Approximation.

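The likelihood function used to score MR solutions is based on the Rice distribution (9, 10), modified to account for the effect of measurement errors in the observed intensities (11). For acentric reflections, this is given by

$$p(E_e; E_C) = \frac{2E_e}{1 - D_{obs}^2\sigma_A^2}\exp\left[-\frac{E_e^2 + D_{obs}^2\sigma_A^2E_C^2}{1 - D_{obs}^2\sigma_A^2}\right] I_0\left[\frac{2E_eD_{obs}\sigma_AE_C}{1 - D_{obs}^2\sigma_A^2}\right] \qquad [\mathbf{2}]$$

where *E*_{e} and *E*_{C} are the normalized structure-factor amplitudes derived from the observed intensity and calculated from the model, respectively, and *I*_{0} is a modified Bessel function of order 0.

The *eLLG* is defined as the probability-weighted average of the logarithm of the likelihood ratio, integrated over all pairs of observed and calculated normalized structure factors. The contribution of a single reflection to the *eLLG* is defined in Eq. **3**:

$$\langle LLGI\rangle = \int_0^\infty\!\!\int_0^\infty p(E_e, E_C)\,\ln\left[\frac{p(E_e; E_C)}{p(E_e)}\right] dE_e\,dE_C \qquad [\mathbf{3a}]$$

where, for the acentric case,

$$p(E_e, E_C) = p(E_e; E_C)\,2E_C\exp\left(-E_C^2\right) \qquad [\mathbf{3b}]$$

and

$$p(E_e) = 2E_e\exp\left(-E_e^2\right) \qquad [\mathbf{3c}]$$

The Maclaurin series expansion of the integrand of Eq. **3a** for the acentric case, to fourth order in *D*_{obs}σ_{A}, is given in Eq. **4**: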

where

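The double integrals over *a* and *b* both evaluate to zero, whereas the double integral over *c* yields 1/2, so that ⟨*LLGI*⟩ ≈ (*D*_{obs}σ_{A})^{4}/2. Fig. S1 shows that this approximation is particularly good for low values of *D*_{obs}σ_{A}.

The variance of the contribution of a single reflection is defined in Eq. **5**:

$$\mathrm{var}(LLGI) = \langle LLGI^2\rangle - \langle LLGI\rangle^2 \qquad [\mathbf{5a}]$$

where

$$\langle LLGI^2\rangle = \int_0^\infty\!\!\int_0^\infty p(E_e, E_C)\left\{\ln\left[\frac{p(E_e; E_C)}{p(E_e)}\right]\right\}^2 dE_e\,dE_C \qquad [\mathbf{5b}]$$

For small values of *D*_{obs}σ_{A}, Eq. **5a** will be dominated by the first term (as the second term will have a value of the order of (*D*_{obs}σ_{A})^{8}). The Maclaurin series expansion of the integrand of Eq. **5b** for the acentric case, to fourth order in *D*_{obs}σ_{A}, is given in Eq. **6**:

$$4E_eE_C\left(1 - E_e^2\right)^2\left(1 - E_C^2\right)^2\exp\left(-E_e^2 - E_C^2\right)\left(D_{obs}\sigma_A\right)^4 \qquad [\mathbf{6}]$$

The double integral over this single term yields simply (*D*_{obs}σ_{A})^{4}, which is twice the corresponding contribution to the *eLLG*.

Because the variance of the contribution from each reflection is proportional to its contribution to the *eLLG*, the variance of the *LLGI*, summed over all reflections, is also proportional to the total *eLLG*. Therefore, the signal-to-noise ratio for any *eLLG* is proportional to √*eLLG*, regardless of how that *eLLG* is achieved through a combination of model quality, completeness, data quality, and data resolution. Similarly, the value of *LLGI* obtained in an MR search will indicate the confidence that can be placed in the corresponding solution, regardless of how the *LLGI* was achieved. Indeed the translation function Z score, which is used as a measure of confidence in an MR solution (10), is seen to be roughly proportional to the square root of the *LLGI* in the database of MR calculations.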
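As a cross-check on the values 1/2 and 1 quoted above, the fourth-order expansion can be organized as in the following sketch. The grouping is chosen here purely for illustration: the symbols *x*, *L*, *A*, *B*, and *p*_{0} are introduced for this purpose and need not correspond to the terms *a*, *b*, and *c* of Eq. **4**. The sketch assumes the acentric Rice form of the likelihood with *x* = *D*_{obs}σ_{A} and Wilson-distributed amplitudes, for which ⟨*E*^{2}⟩ = 1 and ⟨*E*^{4}⟩ = 2; angle brackets with subscript 0 denote averages over *p*_{0}.

```latex
\begin{align}
x &= D_{obs}\sigma_A, \qquad
L = \ln\frac{p(E_e;E_C)}{p(E_e)} = x^2 A + x^4 B + O(x^6), \\
A &= (1-E_e^2)(1-E_C^2), \qquad
B = \tfrac{1}{2} - E_e^2 - E_C^2 + 2E_e^2E_C^2 - \tfrac{1}{4}E_e^4E_C^4, \\
p(E_e,E_C) &= p_0\,e^{L}, \qquad
p_0 = 4E_eE_C\,e^{-E_e^2-E_C^2}, \\
\langle L\rangle &= \iint p_0\,e^{L}\,L \;\approx\;
x^2\langle A\rangle_0 + x^4\left(\langle A^2\rangle_0 + \langle B\rangle_0\right)
= 0 + x^4\left(1 - \tfrac{1}{2}\right) = \tfrac{1}{2}x^4, \\
\langle L^2\rangle &= \iint p_0\,e^{L}\,L^2 \;\approx\; x^4\langle A^2\rangle_0 = x^4 .
\end{align}
```

Here ⟨*A*⟩_{0} = 0 because ⟨1 − *E*^{2}⟩ = 0, ⟨*A*^{2}⟩_{0} = 1 because ⟨(1 − *E*^{2})^{2}⟩ = 1 − 2 + 2 = 1, and ⟨*B*⟩_{0} = 1/2 − 1 − 1 + 2 − 1 = −1/2.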

### Mathematical Derivations.

Series approximations and integrals used in the derivation of Eqs. **3**–**6** were computed with Mathematica (22), which was also used to prepare Fig. 2 and Fig. S1.
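The two leading-order results can also be checked numerically without a computer algebra system. The sketch below is not the Mathematica calculation used for this work; it is a plain-Python midpoint-rule integration that assumes the acentric Rice form of the likelihood written in terms of *x* = *D*_{obs}σ_{A}, with Wilson-distributed amplitudes. All function names are hypothetical.

```python
import math

def bessel_i0(z, terms=80):
    """Modified Bessel function I0 from its power series (adequate for moderate z)."""
    total, term = 1.0, 1.0
    q = 0.25 * z * z
    for k in range(1, terms):
        term *= q / (k * k)
        total += term
        if term < 1e-17 * total:
            break
    return total

def log_ratio(e_obs, e_calc, x):
    """ln[p(E_e; E_C) / p(E_e)] for an acentric reflection, with x = D_obs * sigma_A."""
    v = 1.0 - x * x
    return (-math.log(v)
            - (e_obs ** 2 + x * x * e_calc ** 2) / v + e_obs ** 2
            + math.log(bessel_i0(2.0 * x * e_obs * e_calc / v)))

def moments(x, e_max=6.0, n=240):
    """Midpoint-rule estimates of the mean and variance of the log-likelihood gain."""
    h = e_max / n
    mean = second = 0.0
    for i in range(n):
        e_obs = (i + 0.5) * h
        for j in range(n):
            e_calc = (j + 0.5) * h
            gain = log_ratio(e_obs, e_calc, x)
            # Joint density = Wilson(E_e) * Wilson(E_C) * exp(gain).
            weight = 4.0 * e_obs * e_calc * math.exp(gain - e_obs ** 2 - e_calc ** 2) * h * h
            mean += weight * gain
            second += weight * gain * gain
    return mean, second - mean * mean

if __name__ == "__main__":
    x = 0.1
    mean, variance = moments(x)
    print(mean / (0.5 * x ** 4), variance / x ** 4)  # both ratios should be close to 1
```

For small *x* the two printed ratios should both be close to 1, in line with a mean of about *x*^{4}/2 and a variance of about *x*^{4}; the agreement is expected to degrade as *x* grows and higher-order terms become significant.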

### Single-Atom MR Protocol.

In the single-atom MR protocol, the first step is to carry out translation searches for a specified number of the heavier atoms expected in the structure. For the trials summarized in Table S1, the search looked for four atoms unless fewer sufficiently heavy atoms were expected. In the next step, log-likelihood-gradient completion (described in the next section) was used to complete each of the potential few-atom solutions by adding nitrogen atoms as surrogates for all remaining atom types. Refinement, at each step, of the occupancies of the nitrogen atoms compensates for the difference in scattering power compared with other atom types, such as carbon or oxygen. The log-likelihood-gradient completion continues to convergence, when no further peaks are identified.
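The control flow of the completion stage can be sketched as a simple loop. This is only an outline of the iteration described above, not Phaser code: the callback `propose_atoms` is a hypothetical stand-in for computing a log-likelihood-gradient map from the current model, picking peaks, and refining occupancies, and the toy callback exists only to make the example runnable.

```python
def complete_to_convergence(seed_atoms, propose_atoms, max_cycles=1000):
    """Add atoms proposed from the current model until no further peaks are found."""
    atoms = list(seed_atoms)
    for _ in range(max_cycles):
        new_atoms = propose_atoms(atoms)  # stands in for map calculation + peak picking
        if not new_atoms:
            break  # convergence: no further peaks identified
        atoms.extend(new_atoms)
    return atoms

def toy_propose(atoms, total=10):
    """Toy stand-in: each cycle reveals up to two more nitrogen atoms until ten are placed."""
    missing = total - len(atoms)
    return [("N", len(atoms) + k) for k in range(min(2, missing))]

# Start from two sulfur atoms placed by the translation searches.
model = complete_to_convergence([("S", 0), ("S", 1)], toy_propose)
print(len(model))
```

The `max_cycles` guard is an addition for safety in the sketch; the protocol itself is described as running until no further peaks are identified.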

The test cases in Table S1 were chosen from the PDB based initially on the criteria that data extending to atomic resolution (1.2 Å or better) were deposited in the form of intensities rather than amplitudes, and that there were no atoms heavier than S in the structure. The initial set was supplemented with several cases at lower than 1-Å resolution in which there are atoms heavier than S, as the success rate was otherwise low in this resolution range. Note that the *LLGI* per atom after the initial search for individual heavier atoms provides a reasonable diagnostic indication of success. For the cases where the protocol succeeded, *LLGI* per atom ranged from 21.5 to 272.2 with a mean of 88.3, whereas for cases where the protocol failed, *LLGI* per atom ranged from 19.0 to 43.5 with a mean of 28.4. The difference in *LLGI* per atom distributions for the data from Table S1 is illustrated in Fig. S2 by a box plot, generated with BoxPlotR (23).
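Because the two ranges overlap, the *LLGI* per atom separates the outcomes only partially. A minimal sketch of how the reported extremes could be used as a rough triage is given below; the three-way classification and the function name are illustrative assumptions based solely on the ranges quoted above, not a decision rule used in this work.

```python
# Extremes of LLGI per atom reported for the Table S1 test cases.
SUCCESS_RANGE = (21.5, 272.2)
FAILURE_RANGE = (19.0, 43.5)

def triage(llgi_per_atom):
    """Classify a value relative to the ranges observed in the test set."""
    if llgi_per_atom > FAILURE_RANGE[1]:
        return "above all failed cases"
    if llgi_per_atom < SUCCESS_RANGE[0]:
        return "below all successful cases"
    return "overlap region"

for value in (88.3, 28.4, 20.0):
    print(value, triage(value))
```

The mean for the successful cases (88.3) lies above every failed case, whereas the mean for the failed cases (28.4) falls in the overlap region, so values between 21.5 and 43.5 are not diagnostic on their own.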

### Log-Likelihood-Gradient Completion.

In a log-likelihood-gradient map, peaks show positions where the addition of atoms of a specified type would tend to increase the corresponding likelihood target. The single-atom MR algorithm implemented in Phaser computes a log-likelihood-gradient map corresponding to the MR likelihood function, but does so by using the equivalent functionality required for handling singletons (reflections with only one member of a Friedel pair, hence no anomalous scattering-phase information) in the SAD likelihood target (14). Peak picking is carried out using the same defaults as for log-likelihood-gradient SAD completion, i.e., peaks above 6× the rms value of the map are selected, unless the deepest hole in the map has a greater magnitude. Log-likelihood-gradient completion is iterative, with the addition of atoms increasing the signal in subsequent log-likelihood-gradient maps.
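The peak-selection rule can be sketched as follows. The sketch assumes one reading of the rule, namely that the cutoff is raised from 6× the rms value to the magnitude of the deepest hole whenever that hole is deeper than the default cutoff. The function and parameter names are hypothetical and do not reflect Phaser's interface, and the peaks are assumed to have been located already.

```python
def select_peaks(peak_heights, map_rms, deepest_hole, z_cutoff=6.0):
    """Keep peaks above max(z_cutoff * rms, |deepest hole|).

    peak_heights : heights of candidate peaks in the log-likelihood-gradient map
    map_rms      : rms value of the map
    deepest_hole : most negative value in the map (its magnitude is used)
    """
    threshold = max(z_cutoff * map_rms, abs(deepest_hole))
    return [height for height in peak_heights if height > threshold]

peaks = [5.0, 7.0, 9.5]
print(select_peaks(peaks, map_rms=1.0, deepest_hole=-4.0))  # default cutoff of 6 applies
print(select_peaks(peaks, map_rms=1.0, deepest_hole=-8.0))  # hole of magnitude 8 raises the cutoff
```

Using the deepest hole as a floor for the cutoff is a guard against noise: a map whose negative excursions are as large as its positive peaks gives little reason to trust peaks of the same size.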

## Acknowledgments

We are grateful to the Local Contact at the European Synchrotron Radiation Facility (ESRF) for providing assistance in using beamline ID14-3, as well as Doreen Dobritzsch for help with the data collection. The diffraction data were collected on beamline ID14-3 at the ESRF, Grenoble, France. This research was supported by a Principal Research Fellowship from the Wellcome Trust (082961/Z/07/Z to R.J.R.), and grants from the NIH (Grant P01GM063210 to R.J.R.), the Swedish Research Council (Grant 521-2014-1833 to K.T. and Grant 2007-5648 to B.L.), the Knut and Alice Wallenberg Foundation (K.T.), the Novo Nordisk Foundation (K.T.), and the Röntgen Ångström Cluster (Grant 349-2013-597 to B.L.). The research was facilitated by Wellcome Trust Strategic Award 100140 to the Cambridge Institute for Medical Research.

## Footnotes

^{1}To whom correspondence should be addressed. Email: rjr27@cam.ac.uk.

Author contributions: B.L. and R.J.R. designed research; A.J.M., R.D.O., A.G.W., J.R.M.O., K.T., B.L., and R.J.R. performed research; A.J.M., R.D.O., A.G.W., B.L., and R.J.R. analyzed data; A.J.M. and R.J.R. wrote the paper; and all authors contributed to revisions.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The atomic coordinates and structure factors have been deposited in the Protein Data Bank, www.pdb.org (PDB ID code 5m0w).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1701640114/-/DCSupplemental.

## References

- (1) Bragg WL
- (3) Hauptman H, Karle J. *I. The Centrosymmetric Crystal*, American Crystallographic Association Monograph No. 3 (Edwards Brothers, Ann Arbor, MI).
- (4) Cork JM
- (6) Bijvoet JM
- (7) Hendrickson WA
- (11) Read RJ, McCoy AJ
- (13) Zhao HT, et al.
- (16) Sheldrick GM, Hauptman HA, Weeks CM, Miller M, Usón I, eds Arnold E, Rossmann M
- (22) Wolfram Research

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Biophysics and Computational Biology