A floating point conversion algorithm for mixed precision computations

Choon Lih Hoo, Sallehuddin Mohamed Haris, Nik Abdullah Nik Mohamed

Research output: Contribution to journal (Article)

Abstract

Floating point is the most commonly used representation of real numbers in digital computation because of its high precision. It is used on general-purpose computers and in single-chip applications such as DSP chips. Double precision (64-bit) representations can denote a wider range of real numbers, whereas single precision (32-bit) operations are more efficient. Interest has recently grown in mixed precision computations, which exploit single precision efficiency when working with 64-bit numbers. This requires the ability to convert between the two formats. This paper presents an algorithm that converts floating point numbers from the 64-bit to the 32-bit representation. The algorithm was implemented in Verilog and tested on a field programmable gate array (FPGA) using the Quartus II DE2 board and an Agilent 16821A portable logic analyzer. Results indicate that the algorithm performs the conversion reliably and accurately within a constant execution time of 25 ns at a 20 MHz clock frequency, regardless of the number being converted.
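To make the 64-bit to 32-bit conversion concrete, the following is a minimal software sketch of a generic IEEE 754 narrowing step. It is not the HooHar algorithm from the paper, whose internals are not described in this record, and it is written in Python rather than Verilog. The function names, the choice of round-to-nearest-even, and the handling of NaN payloads are all assumptions made for illustration.

```python
import struct


def double_bits_to_single_bits(d: int) -> int:
    """Narrow a 64-bit IEEE 754 bit pattern to 32 bits.

    Assumption: round to nearest, ties to even. The paper's own
    rounding rule is not stated in this record.
    """
    sign = (d >> 63) << 31              # sign bit moved to bit 31
    exp = (d >> 52) & 0x7FF             # 11-bit biased exponent
    frac = d & ((1 << 52) - 1)          # 52-bit fraction field

    if exp == 0x7FF:                    # infinity or NaN
        if frac == 0:
            return sign | 0x7F800000
        # Quiet NaN; the top payload bits are carried over (assumed policy).
        return sign | 0x7FC00000 | (frac >> 29)
    if exp == 0:                        # zero or 64-bit subnormal
        return sign                     # far below the 32-bit range: signed zero

    e = exp - 1023 + 127                # rebias the exponent (1023 -> 127)
    if e >= 255:                        # too large for 32 bits: infinity
        return sign | 0x7F800000

    sig = (1 << 52) | frac              # 53-bit significand with hidden bit
    e_eff = max(e, 1)                   # 32-bit subnormals use exponent field 0
    shift = 29 + (e_eff - e)            # 52 - 23 = 29, plus extra for subnormals
    q = sig >> shift                    # kept bits
    rem = sig & ((1 << shift) - 1)      # discarded bits
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1                          # round to nearest, ties to even
    # Adding (not OR-ing) lets a rounding carry bump the exponent field.
    return sign | (((e_eff - 1) << 23) + q)


def double_to_single_bits(x: float) -> int:
    """Convenience wrapper: take a Python float, return the 32-bit pattern."""
    d = struct.unpack("<Q", struct.pack("<d", x))[0]
    return double_bits_to_single_bits(d)
```

For example, `hex(double_to_single_bits(0.1))` gives `'0x3dcccccd'`. The sketch shows the data path such a converter needs (sign copy, exponent rebias, fraction shortening, and special-case handling); a hardware version would presumably implement these steps as combinational logic.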

Original language: English
Pages (from-to): 711-718
Number of pages: 8
Journal: Journal of Zhejiang University: Science C
Volume: 13
Issue number: 9
DOI: 10.1631/jzus.C1200043
Publication status: Published - Sep 2012

Fingerprint

Computer hardware description languages
Digital signal processors
Interchanges
Field programmable gate arrays (FPGA)
Clocks

Keywords

  • Double precision
  • FPGA
  • HooHar algorithm
  • Single precision
  • Verilog

ASJC Scopus subject areas

  • Engineering(all)

Cite this

A floating point conversion algorithm for mixed precision computations. / Hoo, Choon Lih; Mohamed Haris, Sallehuddin; Mohamed, Nik Abdullah Nik.

In: Journal of Zhejiang University: Science C, Vol. 13, No. 9, 09.2012, p. 711-718.

@article{0556c9993d1545fda8be60b35765f63f,
title = "A floating point conversion algorithm for mixed precision computations",
keywords = "Double precision, FPGA, HooHar algorithm, Single precision, Verilog",
author = "Hoo, {Choon Lih} and {Mohamed Haris}, Sallehuddin and Mohamed, {Nik Abdullah Nik}",
year = "2012",
month = "9",
doi = "10.1631/jzus.C1200043",
language = "English",
volume = "13",
pages = "711--718",
journal = "Frontiers of Information Technology and Electronic Engineering",
issn = "2095-9184",
publisher = "Springer Science + Business Media",
number = "9",

}

TY - JOUR

T1 - A floating point conversion algorithm for mixed precision computations

AU - Hoo, Choon Lih

AU - Mohamed Haris, Sallehuddin

AU - Mohamed, Nik Abdullah Nik

PY - 2012/9

Y1 - 2012/9

KW - Double precision

KW - FPGA

KW - HooHar algorithm

KW - Single precision

KW - Verilog

UR - http://www.scopus.com/inward/record.url?scp=84866485915&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84866485915&partnerID=8YFLogxK

U2 - 10.1631/jzus.C1200043

DO - 10.1631/jzus.C1200043

M3 - Article

VL - 13

SP - 711

EP - 718

JO - Frontiers of Information Technology and Electronic Engineering

JF - Frontiers of Information Technology and Electronic Engineering

SN - 2095-9184

IS - 9

ER -