There are many specialized software packages for building a data warehouse. The most popular is Oracle WMS. A warehouse management system (WMS) often uses automatic identification and data capture technology, such as barcode scanners, mobile computers, and potentially RFID, to efficiently monitor the flow of products. Once data has been collected, it is sent to a central database, either by batch synchronization or by real-time wireless transmission. The database can then provide useful reports about the status of goods in the warehouse.
An advanced Microsoft Excel user can also manage warehouse data manually.
Some newer paradigms for building a data warehouse:
1. Build a logical data warehouse using data virtualization and data federation technologies such as Composite Software. Leave the data where it is, i.e. do not ETL. Trying to arrive at one schema that can represent the needs of the whole enterprise can be a 3-5 year project.
2. Use a Hadoop + NoSQL approach, i.e. dump all your data into a good folder structure, use Hive for once-in-a-while queries, and use NoSQL-based data marts for fast queries. Now
Better Performance through Parallelism: Three Common Approaches
There are three widely used approaches for parallelizing work over additional hardware:
• shared memory
• shared disk
• shared nothing
Shared memory: In a shared-memory approach, as implemented on many symmetric multiprocessor
machines, all of the CPUs share a single memory and a single collection of disks.
This approach is relatively easy to program: complex distributed locking and commit protocols
are not needed, since the lock manager and buffer pool are both stored in the memory system
where they can be easily accessed by all the processors.
Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and
memory requests have to be transferred over the same bus that all of the processors share,
causing the bandwidth of this bus to rapidly become a bottleneck. In addition, shared-memory
multiprocessors require complex, customized hardware to keep their L2 data caches consistent.
Hence, it is unusual to see shared-memory machines with more than 8 or 16 processors unless
they are custom-built from non-commodity parts, in which case they are very expensive.
As a result, shared-memory systems offer very limited ability to scale.
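
The shared-bus bottleneck can be illustrated with a back-of-the-envelope model. This is only a sketch: the function name and the figures used (500 MB/s of demand per CPU, a 4000 MB/s bus) are hypothetical values chosen for illustration, not measurements from any real machine.

```python
def effective_throughput(num_cpus, per_cpu_demand_mb_s, bus_bandwidth_mb_s):
    """Aggregate throughput when every CPU sends its requests over one shared bus.

    Simplifying assumption: each CPU wants a fixed bandwidth, and the bus
    caps the total. Cache-coherency traffic is ignored.
    """
    requested = num_cpus * per_cpu_demand_mb_s
    return min(requested, bus_bandwidth_mb_s)


if __name__ == "__main__":
    # Hypothetical numbers: 500 MB/s per CPU, 4000 MB/s shared bus.
    for n in (1, 2, 4, 8, 16, 32):
        total = effective_throughput(n, 500, 4000)
        print(f"{n:>2} CPUs: total {total} MB/s, per CPU {total / n} MB/s")
```

With these assumed numbers, total throughput stops growing at 8 CPUs, and every processor added after that only reduces the share each one gets.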
Shared disk: Shared-disk systems suffer from similar scalability limitations. In a shared-disk
architecture, there are a number of independent processor nodes, each with its own memory.
These nodes all access a single collection of disks, typically in the form of a storage area
network (SAN) system or a network-attached storage (NAS) system. This architecture
originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been
widely used by Sun Microsystems and Hewlett-Packard.
Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the
interconnection network that connects each of the CPUs to the shared-disk subsystem can
become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the
processors, there is no obvious place for the lock table or buffer pool to reside. To set locks,
one must either centralize the lock manager on one processor or resort to a complex distributed
locking protocol. This protocol must use messages to implement in software the same sort of
cache-consistency protocol implemented by shared-memory multiprocessors in hardware.
Either of these approaches to locking is likely to become a bottleneck as the system is scaled.
To make shared-disk technology work better, vendors typically implement a “shared-cache”
design. Shared cache works much like shared disk, except that, when a node in a parallel
cluster needs to access a disk page, it:
1) First checks to see if the page is in its local buffer pool (“cache”)
2) If not, checks to see if the page is in the cache of any other node in the cluster
3) If not, reads the page from disk
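
The three-step lookup above can be sketched as follows. The `Node` class, the `read_page` function, and the dictionary standing in for the disk are all hypothetical names used for illustration. A real shared-cache system also needs locking and a cache-consistency protocol between nodes, which this sketch leaves out.

```python
class Node:
    """One processor node with its own local buffer pool."""

    def __init__(self, name):
        self.name = name
        self.buffer_pool = {}  # page_id -> page contents


def read_page(node, page_id, cluster, disk):
    """Return (page, source), where source says where the page was found."""
    # 1) Check the local buffer pool first.
    if page_id in node.buffer_pool:
        return node.buffer_pool[page_id], "local"

    # 2) Otherwise check the cache of every other node in the cluster.
    for peer in cluster:
        if peer is not node and page_id in peer.buffer_pool:
            page = peer.buffer_pool[page_id]
            node.buffer_pool[page_id] = page  # keep a local copy
            return page, "remote"

    # 3) Otherwise read the page from disk.
    page = disk[page_id]
    node.buffer_pool[page_id] = page
    return page, "disk"


if __name__ == "__main__":
    a, b = Node("a"), Node("b")
    cluster = [a, b]
    disk = {1: "page-1", 2: "page-2"}
    print(read_page(a, 1, cluster, disk))  # first access goes to disk
    print(read_page(a, 1, cluster, disk))  # now served from a's own pool
    print(read_page(b, 1, cluster, disk))  # b finds it in a's cache
```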
Such a cache appears to work fairly well on OLTP, but has big problems with data warehousing
workloads. The problem with the shared-cache design is that cache hits are unlikely to happen,
since warehouse queries are typically answered through sequential scans of the fact table (or
via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster,
sequential scans do not typically benefit from large amounts of cache, thus placing the entire
burden of answering such queries on the disk subsystem. As a result, a shared cache just
creates overhead and limits scalability.
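
A small simulation shows why repeated sequential scans get so little out of a cache. It assumes a least-recently-used (LRU) eviction policy, which the text above does not specify, and the function name and page counts are hypothetical.

```python
from collections import OrderedDict


def scan_hit_rate(table_pages, cache_pages, num_scans):
    """Fraction of page reads served from an LRU cache over repeated full scans."""
    cache = OrderedDict()  # page number -> True, oldest entry first
    hits = 0
    accesses = 0
    for _ in range(num_scans):
        for page in range(table_pages):
            accesses += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                cache[page] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses


if __name__ == "__main__":
    # Table one page larger than the cache: every scan evicts pages
    # just before they are needed again.
    print(scan_hit_rate(table_pages=1000, cache_pages=999, num_scans=5))
    # Table fits entirely: only the first scan misses.
    print(scan_hit_rate(table_pages=1000, cache_pages=1000, num_scans=5))
```

Under this LRU assumption, a table that is even one page larger than the cache gets no hits at all on repeated scans, while a table that fits completely misses only on the first pass.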
In addition, the same scalability problems that exist in the shared-memory model also occur in
the shared-disk architecture: the bus between the disks and the processors will likely become a
bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs
increases, can be a problem. To reduce bus contention, customers frequently configure their
large clusters with many Fibre Channel controllers (disk buses), but this complicates system
design because now administrators must partition data across the disks attached to the different
controllers.
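
As a rough illustration of the partitioning work this pushes onto administrators, the sketch below spreads data blocks round-robin across a set of controllers. The function name and the controller names (`fc0`, `fc1`, `fc2`) are hypothetical. Real layouts usually also have to account for data skew, disk sizes, and failures, which this ignores.

```python
def assign_blocks(num_blocks, controllers):
    """Assign block numbers to controllers round-robin.

    Returns a dict mapping each controller name to its list of block numbers.
    """
    layout = {name: [] for name in controllers}
    for block in range(num_blocks):
        target = controllers[block % len(controllers)]
        layout[target].append(block)
    return layout


if __name__ == "__main__":
    for controller, blocks in assign_blocks(10, ["fc0", "fc1", "fc2"]).items():
        print(controller, blocks)
```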