batteryforpc: Inside Intel's Secret Overclocking Lab: The Tools and Team Pushing CPUs to New Limits

Thursday, 2 January 2020

We get an exclusive tour of the facility where Intel pushes chips to their absolute limit.
Intel says it loves overclocking, but saying it and proving it are two different things entirely. We here at Tom's Hardware certainly love overclocking, so we visited Intel's Jones Farm campus in Portland, Oregon, recently for an exclusive tour to see how the company designs and architects its chips for maximum overclockability.
Intel's overclocking lab has silently steered the company's overclocking efforts for years and, understandably, the company has been hesitant to open it up to the media. To our knowledge, we are the first tech news outlet to visit and report on the facility. As we can see below, the lab is packed with next-gen gear that the company isn't ready to divulge, some of it hidden under Intel-blue lab coats for our visit.
Security is an everyday concern at Intel, and as part of its standard operating procedures, sensitive gear is kept in locked Plexiglas cases in most of the building when it isn't actively being tested. The cases are large enough to make it difficult, if not impossible, to ferret hardware out of the building unnoticed. Inside of the confines of the OC lab, which is protected by a keypad entry system, the technicians can operate with a bit more freedom.
Intel is certainly a company at a crossroads: After more than a decade of dominance, it is pressured by AMD with its Ryzen processors that offer leading performance in key metrics, and at ultra-competitive pricing that challenges Intel's desktop PC dominance.
Much of AMD's advantage lies in TSMC's 7nm process, which is denser and more efficient than Intel's go-to 14nm node. But overclocking headroom remains an advantage for Intel, as its chips have much more margin between the rated boost frequencies and overclocked clock rates than AMD's processors.
Make no mistake, AMD has done a great job of extracting the utmost performance from the new 7nm node, but the fact remains: The industry is grappling with stagnant, if not declining, clock rates from process nodes as they shrink further, leading many to question if overclocking is dead.
During our visit, we discussed the lab's role and procedures, examined the gear, asked Intel's team about the future of overclocking, and peppered them with questions about some of the important concerns on the minds of overclockers, like safe voltage guidelines and the impact of overclocking on chip longevity.
Move the Bits Faster
Overclocking is all about pushing the bits inside the processor faster by altering it to operate beyond the rated specifications. It sounds simple enough, but as enthusiasts know, what appears to be a rather straightforward exercise can be extremely complex, especially if you're pushing the extreme edge of performance.
As challenging as wringing the utmost performance out of silicon can be, it's nothing compared to the complexity of designing and integrating that functionality into products that begin their life as grains of sand. Consider this: At 5 GHz, the nanometer-scale transistors inside today's modern chips switch on and off at a rate of five billion times per second, and that isn't even the pinnacle of performance – it isn't uncommon to see speeds over 7 GHz with liquid nitrogen cooling.
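To put those frequencies in perspective, the time available for each clock cycle is simply the reciprocal of the clock rate. Here is a quick back-of-the-envelope sketch; the function name is ours and purely illustrative:

```python
def cycle_time_ps(freq_ghz: float) -> float:
    """Return the duration of one clock cycle in picoseconds."""
    # One cycle lasts 1 / (f * 1e9) seconds; converting seconds to
    # picoseconds (x 1e12) simplifies the expression to 1000 / f.
    return 1000.0 / freq_ghz


for ghz in (5.0, 7.0):
    print(f"{ghz:.1f} GHz -> {cycle_time_ps(ghz):.1f} ps per cycle")
```

At 5 GHz each cycle lasts 200 picoseconds, and at 7 GHz the budget shrinks to roughly 143 picoseconds.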
Exposing that hidden performance to enthusiasts and casual users alike presents tremendous challenges for the engineering teams tasked with ensuring the billions of transistors on your chip can survive the rigors of elevated voltages and temperatures that come as a byproduct of overclocking.
That's one of the fundamental reasons why Intel created its Overclocking Lab, a team of engineers led by Dan Ragland, the Principal Engineer of Performance Tuning & Overclocking Architecture. The team is tasked with not only exposing new overclocking features to users but also tracking silicon quality, voltage scaling and conducting long-term stress testing, all of which help the company improve its processors for all users – even those that aren't interested in overclocking.
The overclocking lab operates in relative secrecy as far as the public is concerned, but it is just down the hallway (and to the right) from where we attend Intel's annual data center briefings, so it isn't necessarily hidden inside Intel's Jones Farm campus in Portland, Oregon.
For our tour, Ragland was joined by four members of his eight-person engineering team that he handpicked for their experience in multiple disciplines, from mechanical/thermal engineering to software. Ragland selects from among the best the company has to offer, but aptitude and enthusiasm are key traits he looks for when selecting team members. A passion for overclocking is a ground-floor requirement to make the team.
Eight engineers seems like a small team to dedicate to overclocking performance for a company with a market cap of ~$250 billion, but this team interacts with multiple other groups, from Intel's Israel-based IDC team during the design phases of processor architectures to the teams responsible for power code management and overclocking software, among many others.
The team also interfaces with motherboard vendors to help them optimize their platforms for overclocking, part of which includes overclocking workshops that we'll cover shortly. Intel also regularly brings in leading overclockers from around the world to work with them in the lab to help the company better understand the challenges of overclocking, and address issues.
We asked during our tour if the lab also does competitive performance analysis of competing products, like testing AMD's chips for comparison. "We're very aware of our competition, but we can't say much beyond that," Ragland responded.
All of Intel's first processors were overclockable, but they weren't specifically designed to offer much frequency headroom, or with features that improved overclockability. Instead, overclocking was truly opportunistic: Any overclocking headroom you received was down to the luck of the draw and your ability to work around the internal limits of the chip. Even with the limited tunable parameters available, enterprising enthusiasts found ways to push chips to higher speeds, thus wringing out more performance from cheaper models.
That eventually created a problem for the company, though. Intel tells us that back in 1999, counterfeit Pentium 3 processors began popping up in the grey market in Asia. The counterfeiters simply overclocked the cheaper models and then sanded off the product identifiers/frequency ratings, replacing them with the overclocked frequency and model numbers that matched the more expensive models. To put a stop to the practice, Intel locked all processors to the rated frequencies in an attempt to prevent counterfeiting.
Due to feedback from its customers (and media), Intel brought back overclockability in ~2003 and unveiled its new Extreme Edition processors. These chips allowed overclocking in exchange for a pricing premium, just like we see with Intel's K-Series processors today, but the remainder of Intel's processors remained locked.
And thus, Intel created a multi-billion dollar business centered on chip overclockability. However, because these early unlocked chips still weren't designed specifically for overclocking, there were shortfalls. For instance, Sandy Bridge chips could clock memory higher than allowed via the maximum programmed multiplier. As a result, Intel now assigns what could be considered ludicrous ratios when new tech rolls out, but years later, those ratios could come in handy. For instance, DDR4 ratios were assigned at a maximum of 8000 MHz years ago, but it wasn't until recently that the world record was set at over 6000 MHz, and due to foresight, there is still multiplier headroom left over.
Around the Haswell generation, Intel began adding in features and knobs to ease overclocking, which required a transition from 'opportunistic overclocking' to designing and architecting for it. The initial overclocking team consisted of engineers that volunteered for the task, like Francois Piednoel, a famed former Principal Engineer at the company. This loosely-banded team pushed the limits of silicon for the sake of furthering overclocking capabilities, but Intel didn't establish a formal team and lab until 2016. That team now consists of eight members who work on OC full time. There are still several volunteers throughout the company that help out, many simply because they are, like us, chip enthusiasts that enjoy overclocking.
Some of the notable advances include working with the Israel Design Center to design in BCLK overclocking from Skylake's initial design phases, along with addressing the Haswell cold bug issue (FIVR doesn't respond well to subzero temperatures) and designing in a workaround to ensure that future chips didn't suffer the same fate. Other advances include the introduction of the AVX offset, an incredibly useful tool for overclocking that the team patented. That offset also plays a crucial role in the auto-overclocking Intel Performance Maximizer (IPM) feature, which the team developed the algorithms for and worked with the software team to implement.
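For readers who haven't used the AVX offset, the arithmetic behind it is straightforward: the offset is a number of multiplier bins subtracted from the core ratio while AVX workloads are running, which lets the chip hold a higher clock for lighter code. Here is a minimal sketch, assuming a 100 MHz base clock (BCLK) and example ratios that we chose for illustration; none of these values come from Intel's lab:

```python
def effective_clock_mhz(core_ratio: int, bclk_mhz: float = 100.0,
                        avx_offset: int = 0) -> float:
    """Core frequency after subtracting the AVX offset from the multiplier."""
    if avx_offset < 0 or avx_offset >= core_ratio:
        raise ValueError("offset must be non-negative and below the core ratio")
    return (core_ratio - avx_offset) * bclk_mhz


# Illustrative settings: a 50x core ratio with a three-bin AVX offset.
non_avx = effective_clock_mhz(50)
avx = effective_clock_mhz(50, avx_offset=3)
print(f"non-AVX: {non_avx:.0f} MHz, AVX: {avx:.0f} MHz")
```

With these example settings, the chip would run at 5,000 MHz in non-AVX code and drop to 4,700 MHz under AVX load.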
And Intel's forthcoming discrete Xe GPUs? We're told that the overclocking lab will play a role in those products, too, including the GPU overclocking software. There are a few 'organizational boundaries,' so the efforts won't be identical, but the team can define a wish list and also work with the GPU team to ensure those features make their way into the IPM software. That means the same one-button overclocking approach will come to Intel's GPUs.
We fully expect that overclocking Intel's discrete GPUs will be part of the lab's mission, and when asked, we were told with a smile that "some things" had been removed from the lab prior to our arrival. We suspect that 3D benchmarking of the new Xe cards is already underway.
The OC team also helped pioneer per-core CPU overclocking to expose extra performance from the diverse range of CPU core capabilities, and that feature eventually evolved into Turbo Max 3.0.
As it does with several other groups inside Intel, the team also has an integral role in developing software, like key modules inside XTU, including developing the XTU benchmark and adding in application profile pairing. The team also ensures that all of the necessary OC'ing knobs are exposed and working correctly in Intel's various software utilities.
As you can imagine, there are likely other areas where the team has developed, or is developing, new tech that can't be disclosed because it will worm its way into future products. As Ragland refers to it, the "OC pipeline of innovation" never stops.
Even with the focus on exposing features and assuring overclockability, the Silicon Lottery still applies, but this team's mission is to increase your odds of getting a tunable chip. In fact, the team now compiles overclockability reports for high-ranking Intel executives as a key checkpoint in the standard chip design flow, a process that started with the Devil's Canyon processors. To compile those reports, the team culls 50 randomly-selected processors from the hundreds of chips it tests for each new design, then charts out several key metrics, such as average OC frequencies, so the company can assure new chips are worthy of the "K-series" badge.
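Intel didn't share the format of those reports, but the sampling step is easy to picture. Below is a rough sketch of drawing 50 chips at random from a larger test pool and summarizing their maximum stable frequencies. The function name, the choice of metrics, and the synthetic data are all our own assumptions, not Intel's method or results:

```python
import random
import statistics


def overclock_report(max_stable_mhz, sample_size=50, seed=None):
    """Summarize a random sample of per-chip maximum stable frequencies."""
    rng = random.Random(seed)
    # Draw sample_size distinct chips from the pool without replacement.
    sample = rng.sample(max_stable_mhz, sample_size)
    return {
        "sample_size": len(sample),
        "mean_mhz": statistics.mean(sample),
        "median_mhz": statistics.median(sample),
        "min_mhz": min(sample),
        "max_mhz": max(sample),
    }


# Synthetic data standing in for a few hundred tested chips (not real results).
data_rng = random.Random(0)
tested = [5000 + 100 * data_rng.randint(-2, 3) for _ in range(300)]
report = overclock_report(tested, seed=1)
print(report)
```

A fixed seed makes the draw repeatable, which would matter if the same sample had to be re-examined later.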
Enthusiast and Vendor Engagement
One of the biggest steps forward came from merely engaging the community. In the past, Intel didn't directly promote overclocking with sub-zero cooling solutions, but that changed with the company's first subzero demo with cascade cooling, done by legendary overclocker Fugger.
Intel made its first live LN2 demonstration at IDF 2015 with overclockers Allen "Splave" Golibersuch, Fugger, and L0UD_SIL3NC3. The overclockers set an XTU world record live on stage in under two minutes, and now LN2 demonstrations at Intel's events are a common occurrence.
Intel also regularly engages leading overclockers, like Der8auer (among many others). Intel says it works with ten to 15 of the top overclockers under NDA so they can visit its lab to test new chips. These overclockers test chips and give Intel feedback on several characteristics that are important for overclocking, like features and cooling methods.
Intel's team has also begun interfacing directly with the HWBot team that maintains the database of official world overclocking records, but that isn't strictly an "Intel initiative." Rather, the team does this to keep a healthy relationship with the enthusiast community. Case in point: The first-gen XTU benchmark had scalability issues beyond ten cores and could be 'hacked' to give out erroneous results, but the team sought feedback from the HWBot team to ensure that the second-gen XTU benchmark (which is already rolled into XTU) met their expectations.
All of these actions are possible because of a shift in Intel's thinking in regards to interfacing with the overclocking community: In the past, the PR and marketing teams dealt with these 'external customers,' but Intel has opened up the process to allow the team cooking up the silicon to work directly with the community.
But the work doesn't stop there. In fact, that's only the beginning. Intel's internal teams spend a vast amount of time overclocking chips themselves, often breaking world records that will never see the light of day, and work with motherboard vendors and other ecosystem partners to assure that not only the chips, but the platforms, too, are optimized for overclocking.
While we're accustomed to seeing overclockers bathed in liquid nitrogen (LN2) smoke with no protective gear whatsoever, and even guys like competitive overclocker Joe Stepongzi pouring LN2 in their mouths and spewing it everywhere for kicks, that runs counter to just about every hazardous materials warning known to man that's associated with liquid nitrogen.
In contrast, Intel places strict emphasis on safety throughout its entire company, and those same principles obviously apply to the work it does in the overclocking lab. Stepongzi is probably shaking his head somewhere in disappointment, but that means Intel's employees have to wear the full OSHA-approved line of gear to pour liquid nitrogen. In fact, it took the lab three months just to get the equipment approved internally, and all LN2-pouring employees have to attend dedicated training sessions to attain all of the requisite certifications.
Intel's team spends a lot of time doing sub-zero overclocking (remember, they're binning and testing chips en masse), and here we can see its lineup of LN2 tanks. Interestingly, the small silver 50-liter LN2 tank to the right is Intel's first tank that it purchased specifically for overclocking (back in the Haswell/Ivy Bridge timeframe - 2010/2012). The tank no longer works and is in dire need of maintenance, but the team keeps it around for nostalgia's sake. Now the company rents the two big 180L silver tanks. They're replenished regularly.
During heavy use, the knob on an LN2 tank will freeze up, potentially causing someone's hand to stick to it, so in keeping with the strict safety regulations, Intel has pairs of thermally-insulated gloves specifically used to turn the knob on the LN2 tanks. Given the rips, this glove has seen plenty of pours. Ragland tells us the company has to keep a fresh supply of LN2 and gloves on hand, not only for daily use but also for the company's overclockathons with motherboard vendors and top overclockers, which we'll cover shortly.
Here we can see Overclocking Performance Engineer Navya Pramod decked out in the Intel-approved gear for pouring and using liquid nitrogen, including an insulated smock, insulated gloves, and a full face shield. We're surprised there's no respirator involved, but we're told the lab is certified as having adequate ventilation for LN2 use.
As you can imagine, the cumbersome safety gear makes pouring LN2 less accurate, so we're not sure how often they follow these rules on a daily basis. We'll pretend it's 100%.
As expected, Intel's lab has plenty of LN2 flasks, and the model in the foreground is an Intel-certified vacuum flask. It has sandpaper on the outside of the flask to prevent slippage, various safety warnings, and a glass interior that shatters (loudly) even if the flask is tipped over on a table. The lab started out with six of these flasks, but now only has one remaining with an intact glass cover.
As we can see, the lab also has several Thermos containers that it uses regularly, but it found its preferred solution from an unlikely source.
The world's leading overclocker, Splave, brought several of these Thermos-brand containers, which he purchased at a local Target, to an Intel overclocking bootcamp. The lab crew now prefers these pots because of the broad base, big capacity, and "nice pour handle." As Ragland says, "we use what works best."
We regularly do large scale CPU testing in our own labs, particularly when we spin up CPU reviews with upwards of 15 chips in a test pool, so we know that selecting the right equipment, and having a lot of it, is imperative. That makes the OC lab tour all the more interesting, as we had the opportunity to poke around Intel's labs and compare notes.
And, of course, there's the eternal question that has spawned perhaps millions of heated forum exchanges: What's the best TIM (thermal interface material)? Intel probably wouldn't wade into that debate, but we asked the company what it uses in its own lab, and it turns out the company uses multiple types of TIM, largely dependent upon the testing it is doing.
Intel's lab has a $2,200 "tub" of Vince "Kingpin" Lucido's KPX, so it's obviously one of the go-to solutions, along with plenty of Thermal Grizzly's Kryonaut and Conductonaut. The company also uses an Intel-designed formula that is made by the third-party company Shin-Etsu. The blend isn't available to the public, but it doesn't seem to be the lab favorite and was mentioned as an afterthought.
Surprisingly, but not really too surprisingly given that this author thinks they're great coolers, Intel uses Corsair H115i coolers for testing in its lab, and there are plenty of them interspersed throughout. We also use these same coolers for CPU testing, and like Intel, we've found them to be very durable and hold up well to constant processor re-mountings. Intel uses the H115i for standard water-cooled overclocking testing but steps up to custom loops for overclocking HEDT models.
Intel's building has its own cool water supply for its various labs, but the overclocking lab uses a more powerful water chiller of its own. The chiller distributes water throughout the lab via two large tubes that extend along both sides of test benches, which you can see behind the monitors. Lab technicians can then simply plug into these chilled water supply lines at multiple stations inside the lab.
Interestingly, the overclocking lab's chiller is cooled by the building's cold water supply, meaning that there are two separate cooling loops in the lab. That allows the chiller, which creates its own waste heat, to be cooled via the central cooled water supply instead of exhausting that heat into the lab.
The company also has a piece of gear named The Medusa that we would absolutely kill for. The unit, pictured above, is a custom-made Peltier cooler that allows the company to keep a processor at a set temperature regardless of load. Intel designed this thermo-electric cooler and has used it worldwide in its facilities for roughly 15 years.
The unit has a Peltier thermocouple that, once attached, will keep a processor at the specified temperature (say, 60C) under all conditions. This unit is also cooled via the building's cool water supply. Among many other uses, this type of cooling solution comes in handy for testing systems that may not be easily cooled due to a lack of conventional cooler mounts. For instance, it was used extensively for Hades Canyon testing.
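Intel didn't describe how The Medusa regulates temperature, but holding a setpoint under changing load is a classic closed-loop control problem. The sketch below uses a simple PI (proportional-integral) loop against a toy thermal model. The controller type, the gains, and every thermal constant here are our own assumptions for illustration, not details of Intel's design:

```python
def simulate_hold(setpoint_c=60.0, seconds=120.0, dt=0.1):
    """Hold a toy 'CPU' at setpoint_c with a PI loop while the load changes."""
    kp, ki = 40.0, 10.0            # hand-picked gains: W/K and W/(K*s)
    q_min, q_max = -50.0, 300.0    # heat-pump limits in watts; negative = heating
    heat_capacity = 50.0           # J/K, lumped thermal mass (assumed)
    leak = 0.5                     # W/K, passive loss to ambient (assumed)
    ambient_c = 25.0

    temp_c, integral = ambient_c, 0.0
    history = []
    steps = round(seconds / dt)
    for step in range(steps):
        load_w = 20.0 if step < steps // 2 else 150.0  # idle, then heavy load
        error = temp_c - setpoint_c
        wanted = kp * error + ki * integral
        pumped = max(q_min, min(q_max, wanted))
        if pumped == wanted:       # anti-windup: integrate only when unsaturated
            integral += error * dt
        # Energy balance: the load adds heat; the pump and ambient leak remove it.
        net_w = load_w - pumped - leak * (temp_c - ambient_c)
        temp_c += net_w * dt / heat_capacity
        history.append(temp_c)
    return history


history = simulate_hold()
print(f"final temperature: {history[-1]:.2f} C")
```

In this toy model the die warms from ambient to the setpoint, then the load jumps from 20 W to 150 W halfway through the run, and the integral term pulls the temperature back to the target after a brief excursion.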
If you're swapping motherboards constantly, a solid test bench is a must. Again, like us, Intel uses the Open Benchtable (OBT) as its go-to mount for motherboards.
After years of searching for the best open-air test bench, even going back to the heady days of Danger Den torture racks, we can attest that the OBT is the best open-bench option for testing CPUs. You mount motherboards with a peg-like system, so there are no fasteners required when you swap out gear: You just pop the motherboard on and off when needed, but it remains firmly in place without fasteners.
The OBT also folds flat and can be thrown in a suitcase for travel, which is handy for Intel because its technicians have to set up for demos and often run tests in hotel rooms and remote locations. It's also handy if you're a CPU reviewer because we often test in hotel rooms to meet NDA deadlines while covering trade shows or events (you'd be surprised how often that happens).
We can also see one of Intel's many thermal imaging units that it uses to spot hot spots and diagnose various issues with motherboards.
We counted four oscilloscopes in the lab, but this model stands out. The Tektronix MSO 70404C mixed signal oscilloscope is the most powerful model in the lab and runs about $68,000. This scope can cover up to roughly 8 GHz, which allows the engineers to examine and debug really fast interfaces, like the overclocked memory bus. The rest of the scopes are a bit more pedestrian ($20,000 range).
Interestingly, this model runs Windows, so Intel's security team had to work it over and install software to keep things tidy from a security perspective.
Intel makes plenty of alterations and modifications to motherboards to further its overclocking pursuits, both to its own internal designs and those from third-party motherboard vendors. As such, the lab has a rework station and an expert technician that can fix "just about anything." That also comes in handy during the company's overclocking workshops with motherboard vendors.
The overclocking lab's primary remit is to ensure that Intel's processors are designed and optimized for overclocking, but that same initiative extends to the platforms (i.e., motherboards) that play a huge role in determining how well you can overclock your chip.
That means the team has to work with vendors to make sure their motherboards are fully optimized for overclocking, and we found boards from every major vendor, like MSI, ASRock, Gigabyte, EVGA, and ASUS, in various stages of overclocking testing, and with various types of cooling solutions, scattered throughout the lab. Intel tests these boards and gives feedback and advice to vendors.
But the team is also responsible for making sure that Intel's own internal teams are testing new chips on platforms that are representative of the end products enthusiasts and overclockers will purchase. As we can see in the first image in the album above, Intel's previous-gen reference validation platform (RVP - left) for mainstream processors is a rather simple affair that lacks the generous accommodations we can see on MSI's Godlike motherboard on the right.
Sure, there are accommodations for some of Intel's internal testing tools and swappable PCHs that we'll cover on the following page, but features like the power delivery subsystem, not to mention finer aspects like optimized memory trace routing to facilitate higher overclocks, are far behind even most garden-variety Z390 motherboards on the market.
That's one of the first disconnects the team worked to solve, and it's an ongoing effort.
The lab team worked with Intel's reference validation platform (RVP) team to design newer platforms that look, and act, a lot more like what you would expect from an enthusiast-class motherboard. The boards are still quite spartan and lack the RGB goodness, OLED screens, and other fanciful trimmings of today's flagship boards, but Intel beefed up the power delivery subsystem for the chip and memory, along with adding meaningful VRM cooling. You can see they also stepped up from a single eight-pin EPS by adding an additional four-pin connection for CPU power.
The team also made other adjustments, like fine-tuning the memory traces to improve memory overclockability. We also see they added some useful power and reset buttons for the lab crew, and metal reinforcements for the PCIe slots. We're sure the new digital debug display also comes in handy.
Those are just our external observations, though: the team tells us that it asked for roughly 100 new features to bulk up the existing validation platform. Many of these features will probably never be explained to us fully, but we're told there are plenty of new hooks and features for measuring performance and debugging issues. We'll cover some of those features on the following page.
As mentioned, this is still a board whose primary mission is validation, so it isn't the fanciest-looking affair. In either case, the team demonstrated some solid overclocking action with the new board under LN2, so it's clear that it's succeeding in its mission to bring overclocking closer to the chip design and validation process.
As a side note, you'll notice Intel uses pretty basic video adapters for testing, and I didn't notice any flagship AMD or Nvidia GPUs scattered about the lab. That's primarily because these slim adapters are more suitable for overclocking testing, as they allow easier access to bolt on various cooling solutions.
The OC lab team has broken many world records in its lab with the OC RVP boards, and the first world record that fell with the new RVP board was a big moment for them: That told them the design was ready.
But you won't ever see those records posted to HWBot: Intel has a policy of not competing with its customers, so Intel employees can't submit benchmark runs. The team has access to god-like tricks that aren't available to us regular users, so that's a good policy.
That doesn't preclude internal competitions, though, and there is a running competition among Intel employees for overclocking records. The competition extends to enthusiasts in other Intel labs, too. (For the record, Navya Pramod is currently 'spanking everyone').
High End Desktop (HEDT) Reference Validation Platform
Here we see a few permutations of Intel's internal HEDT RVPs. The HEDT chips are designed for high-end enthusiasts, so we're told they've had plenty of overclocking-capable features from the onset. The team continues to drive more features into the platform and is also working on a beefier OC RVP solution to sidestep some of the challenges associated with HEDT overclocking. As you can see, there's a massive VRM cooler attached to the board next to the CPU cooler, which we'll cover on the following page.
Overclocking Bootcamps
The team continuously works closely with motherboard vendors to optimize their boards, but that work intensifies as they prepare new generations of CPUs for launch. Intel holds two-day overclocking bootcamps for motherboard vendors to help speed the process.
Teams from motherboard vendors descend from around the globe on the overclocking lab for the event. Intel also flies in its own experts from around the world, like its power management team (p-code engineers) from Israel, debug teams, memory reference teams, and BIOS teams.
The goal is to help the motherboard vendors understand the new chips and optimize their platforms for overclocking, so the motherboard vendors send their top engineers, like legendary board designers and overclockers Nick Shih (ASRock), Shamino (Peter Tan, ASUS), Toppc (MSI), and HiCookie (Gigabyte), along with teams of their BIOS and hardware engineers. Many of the motherboard vendors also have teams on standby in Taiwan that they communicate with throughout the event to address issues quickly, often working around the clock to maximize their time at the event.
The event originally ran from 8am to 11pm until Intel instituted a 'mercy' rule that constrained the official working hours from 8am to 8pm. That doesn't stop the motherboard vendors from taking gear back to their hotel rooms, though.
These overclockathons result in higher-quality overclocking gear for enthusiasts, but those same features and learnings bleed over to mainstream products, too, highlighting that the work the overclocking lab does impacts all facets of Intel's consumer desktop business.
Make no mistake, overclocking is a big business for motherboard vendors, too. They regularly work with leading overclockers, like L0UD_SIL3NC3, mllrkllr88, and Splave, among many others, to optimize their platforms and set world records on them. Everyone wants to be on top, and as a result, some motherboard vendors also bring their sponsored overclockers to the event. Intel often gives the professional overclockers trays of processors to bench during the event, which we're told becomes a sideshow all its own.
Here we can see the VRM cooling solution that Intel uses for its HEDT reference validation platform (RVP). After the issues we found with VRM cooling on Skylake-X motherboards, this certainly struck a chord.
Intel also ran into issues with VRM cooling while overclocking on its HEDT RVP (though the company didn't specify the exact generation of the chips under test), so the lab worked with its thermal engineering group to design a new heatsink (for internal use only) to address the issue. The lab pitched the project to its thermomechanical team, which includes a half-dozen PhDs, who then attacked the design with gusto. We're told that an amazing amount of design work and simulation went into the final heatsink design shown above. Unfortunately, there are only a few of these "insanely-overbuilt" heatsinks, and they are designed specifically for Intel's RVP boards.
On the topic of VRM cooling, Intel defines specs for chip power delivery but doesn't have a specific cooling recommendation for those subsystems. Instead, it's up to the motherboard vendors to ensure that the cooling solutions meet the ratings of the various VRM componentry so it operates at full capability, particularly under high load during overclocking. While Intel doesn't set requirements for VRM cooling, it does test VRM cooling on retail motherboards as part of its normal flow and gives advice and feedback to the vendors.
How to Swap Your PCH
Intel's RVP boards have an interesting feature: sockets for platform controller hubs (PCH). For motherboards purchased at retail, these chips are soldered onto the board to provide the necessary I/O functions, but Intel's lab team has to test and validate multiple generations and new steppings of the various PCH chips during development. This socket allows for fast and flexible swapping of new PCH revisions.
These elastomer test sockets are fairly simple. The techs assemble the black retention mechanism around the bare BGA mounts on the motherboard, then drop in an interposer. These interposers are only rated for 15 insertions, though we're told they typically last much longer. The tech then drops the BGA PCH chip/substrate into the socket and tightens down the retention mechanism, which assures proper mating with the interposer and the underlying BGA pads on the motherboard. The housing provides enough thermal dissipation for the PCH, but Intel can also attach other cooling solutions to the top of the mounting mechanism if needed.
Intel Test Tools: XDP, XTU, PTU, TAT
Our demos of Intel's internal testing tools were informative, but the company closely guards the details of some of them, like the ITP-XDP box, so we can't share screenshots, or even descriptions, of some of the interfaces. This is definitely among the most secret of Intel's tech in the lab, so there was quite a bit of trepidation from the lab team about exactly what they could show us, and what we could show you. After several clarifying conversations between the lab crew and the PR team assigned to our tour, and some negotiation on our part, the team allowed us to share at least a broad outline of this box and its capabilities.
Intel's RVP platforms have an XDP socket that gives the company unprecedented insight into, and control over, its chips in real time.
The company also provides ITP-XDP units to its ecosystem partners for their own testing and debug use, and motherboard vendors also add XDP ports to their test boards. Intel also has scripts that it allows the motherboard vendors to run on their hardware during its overclocking workshops. However, the Intel-proprietary box has multiple layers of security that enforce tiered access levels, with only Intel having full access to the features. The final layer of security is so strictly controlled that Intel's OC lab technicians have to log each and every use of the unrestricted feature layer to a central database.
At the unrestricted access level, in the lab's own words, the ITP-XDP enables a connection to the chip that is "like having a direct connection to your brain." The ITP-XDP connects to a host system, which is then connected to the target (the system being observed/tested) and allows Intel to monitor and change internal parameters, MSRs, and literally every configurable option inside of a processor, in real-time. It doesn't just monitor the CPU, either: the interface also monitors every component connected to the chip.
This tool allows the team to identify overclocking bottlenecks and issues, and then change settings on the fly to circumvent those limitations. The lab then relays that information back to other relevant teams inside of Intel to optimize the processor design for overclocking.
The real-time changes, paired with exclusive hooks, unlock possibilities that Intel will never expose to normal users. For instance, theoretically, you could change cache timings and internal fabric settings in real time after the operating system is running, among many other possibilities. This allows the chip to operate in ways that wouldn't make it past boot up. The lab engineers can overclock and test all the different parameters of the chip in ways that we won't ever have access to, which is probably part of the reason why the techs aren't allowed to submit HWBot world records. We're told world records fall easily with the capabilities enabled by the system.
Just for kicks, we requested a sample unit. The odds of that request being approved are somewhere south of zero.
Intel's overclocking team played an integral part in the development of the publicly available XTU software, and also works on the ongoing updates. The team designed the software to be useful, and they eat their own dog food. The team often uses the tool to test overclocking with settings that normal customers have access to.
Intel also has other restricted-use software utilities, like the power/thermal utility (PTU) that we've used for testing a few times, and the thermal analysis tool (TAT), which is used to monitor and diagnose conditions that impact boost activity. Because overclocking really boils down to running in a heightened boost state, TAT also proves useful for debugging OC issues, such as identifying which internal power limits are restricting higher clock speeds. Intel uses these utilities heavily, but also provides them to motherboard vendors for qualification work.
How to Void Your Warranty as Safely as Possible
Intel spends a tremendous amount of time and treasure ensuring that its chips will run beyond the rated speed, thus delivering some value for the extra dollars you plunk down for an overclockable K-Series chip. However, in spite of what some might think given Intel's spate of overclocking-friendly features and software, we have to remember that, unless you pay an additional fee for an insurance policy, overclocking voids your warranty.
The reason behind that is simple, but the physics are mindbogglingly complex. Every semiconductor process has a point on its voltage/frequency curve beyond which a processor will wear out at an untenable rate. A key wear mechanism is electromigration (the gradual displacement of metal atoms in the chip's electrical pathways by the flow of electrons), which eventually leads to premature chip death. Some factors are known to increase the rate of wear, such as the higher current and thermal density that come as a result of overclocking.
All this means that, like the carton of milk in your refrigerator, your chip has an expiration date. It's the job of semiconductor engineers to predict that expiration date and control it with some accuracy, but Intel specs lifespan at out-of-the-box settings. Because increasing frequency through overclocking requires pumping more power through the chip, thus generating more heat, higher frequencies typically result in faster aging, and thus a shorter lifespan.
In other words, all bets are off for Intel's failure rate predictions once you start bumping up the voltage. But there are settings and techniques that overclockers can use to minimize the impact of overclocking, and if done correctly, premature chip death from overclocking isn't a common occurrence.
Because Intel doesn’t cover overclocking with its warranty, the company doesn't specify what it would consider 'safe' voltages or settings.
But we're in a lab with what are arguably some of the smartest overclockers in the world, and these engineers spend their time analyzing failure rate data (and its relationship to the voltage/frequency curve) that will never be shared with the public.
We're aware that, due to company policy, the engineers couldn't give us an official answer to the basic question of what is considered a safe voltage, but that didn't stop us from asking what voltages and settings the lab members use in their own home machines. Given that, for a living, they study data that quantifies life expectancy at given temperatures and voltages... Well, connect the dots.
Speaking as enthusiasts, the engineers told us they feel perfectly fine running their Coffee Lake chips at home at 1.4V with conventional cooling, which is higher than the 1.35V we typically recommend as the 'safe' ceiling in our reviews. For Skylake-X, the team says they run their personal machines anywhere from 1.4V to 1.425V if they can keep it cool enough, with the latter portion of the statement being strongly emphasized.
At home, the lab engineers consider a load temperature above 80C to be a red alert, meaning that's the no-fly zone, but temps that remain steady in the mid-70s are considered safe. The team also strongly recommends using adaptive voltage targets for overclocking, leaving C-States enabled, and using AVX offsets to keep temperatures in check during AVX-heavy workloads.
As Ragland explained, the amount of time a processor stays in elevated temperature and voltage states has the biggest impact on lifespan. You can control the temperature of your chip with better cooling, which then increases lifespan. Assuming voltage remains constant, each successive drop in temperature results in a non-linear increase in life expectancy, so the 'first drop' in temps from 90C to 80C yields a huge increase in chip longevity. In turn, colder chips run faster at lower voltages, so dropping the temperature significantly by using a beefier cooling solution also allows you to drop the voltage further, which then helps control the voltage axis.
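The non-linear temperature relationship described above is commonly modeled in reliability engineering with an Arrhenius-style acceleration factor. The sketch below illustrates only that general model, not Intel's internal data: the function name is hypothetical, and the 0.7 eV activation energy is a placeholder assumption rather than a figure from the lab.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def relative_lifetime(temp_hot_c, temp_cool_c, activation_energy_ev=0.7):
    """Arrhenius ratio of lifetime at temp_cool_c to lifetime at temp_hot_c.

    Assumes constant voltage and a single thermally activated wear
    mechanism. The default activation energy is an assumed placeholder.
    """
    t_hot_k = temp_hot_c + 273.15
    t_cool_k = temp_cool_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


# Each 10C drop multiplies the modeled lifetime, so the gains compound
# rather than adding up linearly.
for hot_c, cool_c in [(90, 80), (80, 70), (90, 70)]:
    ratio = relative_lifetime(hot_c, cool_c)
    print(f"{hot_c}C -> {cool_c}C: modeled lifetime x{ratio:.2f}")
```

Because the factor is exponential in inverse temperature, the 90C to 70C ratio equals the product of the two 10C steps, which is the sense in which the improvement is non-linear under this model.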
In the end, though, voltage is the hardest variable to contain. Ragland pointed out that voltages are really the main limiter that prevents Intel from warrantying overclocked processors, as higher voltages definitely reduce the lifespan of a processor.
But Ragland has some advice: "As an overclocker, if you manage these two [voltage and temperature], but especially think about 'time in state' or 'time at high voltage,' you can make your part last quite a while if you just think about that. It's the person that sets their system up at elevated voltages and just leaves it there 24/7 [static overclock], that's the person that is going to burn that system out faster than someone who uses the normal turbo algorithms to do their overclocking, so that when the system is idle your frequency drops and your voltage drops with it. So, there's a reason we don't warranty it, but there's also a way that overclockers can manage it and be a little safer."
That means manipulating the turbo boost ratio is much safer than assigning a static clock ratio via multipliers. As an additional note, you should shoot for idle temperatures below 30C, though that isn't much of a problem if you overclock via the normal turbo algorithms as described by Ragland.
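To make the "time in state" idea concrete, here is a minimal sketch that compares how much of a day two setups spend at elevated voltage. The daily profiles, the 0.80V idle figure, and the 1.35V threshold are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def fraction_at_high_voltage(samples, threshold_v):
    """Return the share of total time spent at or above threshold_v.

    samples is a list of (hours, volts) pairs describing a usage profile.
    """
    total_hours = sum(hours for hours, _ in samples)
    if total_hours == 0:
        return 0.0
    high_hours = sum(hours for hours, volts in samples if volts >= threshold_v)
    return high_hours / total_hours


# Assumed profiles: a static overclock holds 1.40V all day, while a
# turbo-based overclock only reaches 1.40V during 4 hours of load.
static_day = [(24.0, 1.40)]
adaptive_day = [(4.0, 1.40), (20.0, 0.80)]

for name, profile in [("static", static_day), ("adaptive", adaptive_day)]:
    share = fraction_at_high_voltage(profile, threshold_v=1.35)
    print(f"{name}: {share:.0%} of the day at elevated voltage")
```

Under these assumed profiles the static setup spends the whole day at high voltage, while the turbo-based one spends only its load hours there, which is the difference Ragland's advice turns on.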
Feedback
Hanging out with Intel's OC lab team was certainly a learning experience. The engineers have a passion for their work that's impossible to fake: Once you start talking shop you can get a real sense of a person's passion, or lack thereof, for their craft. From our meeting, we get the sense that Intel's OC lab crew members measure up to any definition of true tech enthusiasts, and we got the very real impression this is more than "just a job" to them.
Like employees at any company, there are certain things the engineers simply aren't allowed to answer, but they were forthright with what they could share and what they couldn't. We're accustomed to slippery non-answer answers to our questions from media-trained representatives (from pretty much every company) when a simple "I can't answer that" would suffice. There were plenty of "I can't answer that" responses during our visit, but we appreciate the honesty.
We peppered the team with questions, but they also asked us plenty of questions. The team was almost as interested in our observations and our take on the state of the enthusiast market as we were interested in their work, which is refreshing. We had a Q&A session where we were free to give feedback, and while Ragland obviously can't make the big C-suite-level decisions, his team sits at the nexus of the company's overclocking efforts, so we hope some of our feedback is taken upstream.
In our minds, overclocking began all those years ago as a way for enthusiasts to get more for their dollar. Sure, it's a wonderful hobby, but the underlying concept is simple: Buy a cheaper chip and spend some time tuning to unlock the performance of a more expensive model. Unfortunately, over the years, Intel's segmentation practices have turned overclocking into a premium-driven affair, with prices for overclockable chips taking some of the shine off the extra value. Those same practices have filtered out to motherboard makers, too. We should expect to pay extra for the premium components required to unlock the best of overclocking, but in many cases, the "overclocking tax" has reached untenable levels.
While segmentation is good for profits, it also leaves Intel ripe for disruption. That disruptor is AMD, which freely allows overclocking on every one of its new chips.
Unfortunately, we can't rationally expect Intel to suddenly unlock every chip and abandon a segmentation policy that has generated billions of dollars in revenue (the odds of that happening are close to zero), but there are reasonable steps it could take to improve its value proposition.
Case in point: Intel's policy of restricting overclocking to Z-series motherboards. AMD allows overclocking on nearly all of its more value-centric platforms (A-series excluded), which makes overclocking more accessible to mainstream users. We feel very strongly that Intel should unlock overclocking on its downstream B- and H-series chipsets to open up overclocking to a broader audience.
Intel's team expressed concern that the power delivery subsystems on many of these downstream motherboards aren't suitable for overclocking, which is a fair observation. However, there is surely at least some headroom for tuning, and we bandied about our suggestion of opening up some level of restricted overclocking on those platforms. Remember, back in the Sandy Bridge era, Intel restricted overclocking to four bins (400 MHz), so there is an established method to expose at least some headroom. We're told our feedback will be shared upstream, and hopefully it is considered. While AMD doesn't hold the overclocking performance advantage, it certainly holds the overclocking value advantage. We'd like to see Intel become more competitive in this area, as that would benefit enthusiasts on both sides of the ball.
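As a rough illustration of what a bin-limited overclock looks like, the sketch below treats one bin as a single 100 MHz multiplier step, consistent with the "four bins (400 MHz)" figure above. The stock ratio of 34 is a hypothetical example, not a specific chip, and the function name is our own.

```python
BIN_MHZ = 100  # one bin = one multiplier step at an assumed 100 MHz base clock


def limited_oc_ceiling_mhz(max_turbo_ratio, extra_bins=4):
    """Highest frequency allowed when overclocking is capped at extra_bins."""
    return (max_turbo_ratio + extra_bins) * BIN_MHZ


stock_ratio = 34  # hypothetical chip with a 3.4 GHz maximum turbo
ceiling_mhz = limited_oc_ceiling_mhz(stock_ratio)
print(f"Stock turbo: {stock_ratio * BIN_MHZ} MHz, capped overclock: {ceiling_mhz} MHz")
```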
We also suggested that Intel work on a more dynamic approach to its auto-overclocking mechanisms. AMD's Precision Boost Overdrive (PBO) opened up overclocking to less-knowledgeable users by creating a one-click tool to auto-overclock your system. Intel's relatively new IPM is also a great one-click tool that accomplishes many of the same goals, but it is based on static overclock settings that don't automatically adapt in real time to the chip's properties or changes in thermal conditions. Instead, you have to re-run the utility. We'd like to see a more dynamic approach taken, and we're told that Intel is already evaluating that type of methodology.
You've seen the "overclocking is dead" stories crop up from time to time, but they have certainly become more frequent as both Intel and AMD grapple with the challenges of scaling transistors down in the waning light of the traditional interpretation of Moore's Law. The advent of "Moore's Law 2.0" involves using advanced packaging techniques to tie multiple chips into heterogeneous packages, but the impact of that approach on overclocking is unclear, and there are still several hurdles along the path.
One of them has proven especially problematic for Intel: As per usual in the semiconductor industry, it all starts with the process technology. It's no secret that Intel has struggled to shrink down to a 10nm process node. Given the company's struggles with producing smaller nodes at volume, and the lower clock speeds we've seen with the new process, it's logical to assume that this doesn't bode well for overclocking headroom either.
However, Ragland and his team are already working with Intel's future process nodes, so his words carry weight. We posed the question to Ragland: Is overclocking going to "die," particularly as we shrink to smaller process nodes?
"It will not. Even when you talk 7nm and into the future, it will not," Ragland said with a certainty and finality that can't be conveyed with text. "What the other guys are experiencing limit-wise, we will not."
"I can tell you that, and feel confident in telling you that now: People who think this [is] the end of the world for overclocking because our competitors' 7nm has very little headroom, that's not true."
We pressed further on the question. Ragland's observation that TSMC's 7nm process has very little overclocking headroom is accurate, but some of that stems from AMD's practice of exposing nearly all of its frequency headroom at stock performance levels. This practice leaves very little overclocking headroom available, but Intel has also slowly folded more and more of its own 14nm overclocking headroom into its stock performance levels as the process node has matured.
Consider this: Intel's 14nm process debuted on the desktop back in 2015 with stock frequencies of 4.2 GHz and the capability to overclock to ~4.9 GHz with conventional cooling, but the 9900K released in 2018 with a stock ~5.0 GHz frequency and a ~5.1 GHz maximum overclock. That's a shrinkage of 600 MHz off the top-line overclocking margin.
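The 600 MHz figure follows directly from the frequencies quoted above. As a quick check of the arithmetic, using only those approximate numbers (the function name is ours):

```python
def oc_margin_mhz(stock_ghz, max_oc_ghz):
    """Overclocking margin in MHz between stock and maximum overclock."""
    # round() absorbs floating-point noise from the GHz subtraction
    return round((max_oc_ghz - stock_ghz) * 1000)


margin_2015 = oc_margin_mhz(4.2, 4.9)  # first 14nm desktop chips
margin_2018 = oc_margin_mhz(5.0, 5.1)  # 9900K
shrinkage = margin_2015 - margin_2018

print(f"2015 margin: {margin_2015} MHz")
print(f"2018 margin: {margin_2018} MHz")
print(f"Shrinkage:   {shrinkage} MHz")
```

The margin falls from about 700 MHz to about 100 MHz, which is where the 600 MHz reduction comes from.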
To be clear, the matrix is much more complicated when we take into account multi-core frequencies and voltages, but the margin between stock and overclocked frequencies is definitely shrinking. On the surface, that looks like the slow "death" of overclocking.
"The decreasing margin is a concern," Ragland responded, "but if you look at this over the last 15 years, you have a cycle where you've got massive margin, and then that margin erodes, then you get more margin, and it erodes again. If we talk just CPU core, there are cycles where we've had a larger margin than others, and it's true that you get paid more for POR (i.e., stock) performance more than overclocking performance, so that tends to win."
"At Intel, when you make the leap from engineer or manager to a Principal Engineer, you've committed to a path, that's pretty much your specialty for your career. I've committed to that path. I bet my career on the idea that overclocking will exist forever. Margin is a challenge. We will always give overclocking headroom back to POR if we're asked to, but there's just so much margin out there."
"A company like Intel is all about rock-solid reliability; our parts aren't going to fail. You've got this time window where you can count on your part running at spec, so there's so much inherent margin that we will always have overclocking headroom. Margin, you're right, that will be the thing we will have to watch, but I think users will be happy with the margin we can offer in the future."
It was incredibly enlightening to see firsthand the work Intel does as it continues to bring overclocking closer to its design process. The company's work in assisting its ecosystem partners, like motherboard vendors, ensures that the learnings and advances made in the overclocking realm filter out to untold millions of users who never tune their chips. That's good for everyone because a competitive industry leads to more value for end users, regardless of what chip vendor they choose at checkout.
It's also encouraging to hear the company say it is confident it can continue to offer enthusiasts meaningful overclocking headroom even as it grapples with more complex process tech.
