BIG DATA & HADOOP
It is good to know that 73% of online adults now use a social networking site of some kind. In addition, Instagram users are nearly as likely as Facebook users to check in to the site on a daily basis. Want to know more? Here is the list:
- 2.2 billion – Number of email users worldwide.
- 61% – Share of emails that were considered non-essential.
- 4.3 billion – Number of email clients worldwide in 2012.
- 425 million – Number of active Gmail users globally, making it the leading email provider worldwide.
- 85,962 – Number of monthly posts by Facebook Pages in Brazil, making it the most active country on Facebook.
- 47% – Percentage of Facebook users who are female. (Wooh!!)
- 40.5 years – Average age of a Facebook user.
- 200 million – Monthly active users on Twitter, a milestone passed in December 2012.
- 37.3 years – Average age of a Twitter user.
- 123 – Number of heads of state that have a Twitter account.
- 44.2 years – Average age of a LinkedIn user.
Where does this information come from? Even more than the information itself, it is the insight drawn from it that makes it useful and immensely important to organizations. Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This is big data.
Why Big Data?
Big data is a buzzword used to describe a massive volume of both structured and unstructured data that is difficult to process using traditional database and software techniques. Big data is the recipe for doing business in today’s world. With huge volumes of data, both structured and unstructured, flooding into every organization on a daily basis, proper management and meaningful insights have become a necessity. Wikipedia describes big data as data sets too large and complex to be handled by traditional data processing applications.
Big data analytics examines this huge amount of data to uncover hidden patterns, unknown correlations and other useful information. This helps companies make strategic business decisions with the help of data scientists, who analyze large sets of data and extract meaningful insights that conventional business intelligence programs often leave untapped. However, more than 70% of data all over the world is unstructured, which traditional tools handle poorly. As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
Apache Hadoop is an open source software framework that supports the processing of large data sets across clusters of computers.
Origin of Hadoop
Hadoop is named after a stuffed toy elephant that belonged to a young boy! Is that all there is to Hadoop? The answer is definitely no!
In the early 2000s, Google faced a serious challenge in handling the exploding volume of data coming from an ever-increasing number of websites. Google’s engineers designed a new data processing infrastructure with two parts: the Google File System, or GFS, which provided fault-tolerant, reliable, and scalable storage, and MapReduce, a data processing system that allowed work to be split among large numbers of servers and carried out in parallel.
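The split-and-parallelize idea behind MapReduce can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's or Google's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a thread pool on one machine stands in for the many servers of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the map calls can run in parallel.
    # Assumption: threads stand in for the separate servers of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Illustrative input only; each string plays the role of one data block.
    print(word_count(["big data is big", "hadoop stores big data"]))
```

Because no map call depends on another chunk, adding more workers (or, in a real cluster, more servers) lets the same job cover more data without changing the logic.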
In 2004, a well-known open source software developer named Doug Cutting adopted the technique. He rebuilt the data collection and processing infrastructure of his web crawler project, Nutch, on MapReduce and named the new software Hadoop, after a stuffed toy elephant that belonged to his young son. Three trends make Hadoop a critical new platform for data-driven enterprises: a shift to scalable, elastic computing infrastructure; suitability for the most complex and varied data available; and the power of making sense of disparate data for comprehensive analysis.
Since Hadoop scales linearly on low-cost commodity hardware, it removes storage and compute limits from the data analytics equation. Instead of pre-optimizing data as in the traditional ETL, data warehouse, and BI architecture, Hadoop stores all of the raw data and applies transformations and analytics on demand. The platform is now used to support an enormous variety of applications and has three key properties.
Hadoop is a single, consolidated storage platform for all kinds of data. It complements the numerous file storage products available in the market today by delivering a new repository where structured data and complex data can be combined easily. For organizations that need to store huge volumes of data, Hadoop is an excellent alternative to redundant and time-consuming ERP systems.
Being open source software, Hadoop provides more storage at a much lower cost. Because it relies on an internally redundant data structure and is deployed on industry-standard servers rather than expensive specialized data storage systems, you can afford to store data that was not previously viable to keep.
Hadoop can consolidate all data types on a low-cost, reliable storage platform that delivers fast parallel execution of powerful analytical algorithms. Hadoop offers data-driven organizations ways to exploit data that they have simply never had before.
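The contrast with pre-optimized warehouses, where raw data is kept as-is and transformed only when a question is asked, can be sketched in a few lines of Python. This is a simplified illustration and not Hadoop code: the sample records, field names and function names are all hypothetical.

```python
import json

# Raw records are kept exactly as they arrived, with no upfront schema.
# Assumption: these sample events are invented for illustration.
RAW_EVENTS = [
    '{"user": "a", "action": "login", "ms": 120}',
    '{"user": "b", "action": "purchase", "amount": 30.0}',
    '{"user": "a", "action": "purchase", "amount": 12.5}',
]


def read_events(raw_lines):
    """Parse each raw line only when an analysis asks for it."""
    for line in raw_lines:
        yield json.loads(line)


def total_spend_by_user(raw_lines):
    """One on-demand analysis: sum purchase amounts per user."""
    totals = {}
    for event in read_events(raw_lines):
        if event.get("action") == "purchase":
            user = event["user"]
            totals[user] = totals.get(user, 0.0) + event["amount"]
    return totals


def action_counts(raw_lines):
    """A different question asked later of the same untouched raw data."""
    counts = {}
    for event in read_events(raw_lines):
        counts[event["action"]] = counts.get(event["action"], 0) + 1
    return counts
```

Both functions read the same stored lines, so a new question needs only a new function and no reloading or reshaping of the data.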
Hadoop has become one of the most widely accepted data storage and processing hubs because it overcomes various bottlenecks of traditional analytical solutions.
Career as a Data Scientist
Job opportunities for data scientists and Hadoop specialists are emerging across industries, from web companies and e-retailers to financial services, healthcare, energy, utilities and media. A big data scientist is a business employee responsible for handling and statistically evaluating large amounts of data. The success of a data scientist lies in presenting that data in an impactful and comprehensible way. A data scientist needs technical skills such as Hadoop; visualization skills with tools such as PowerPoint, Excel and Tableau; and business domain expertise, including understanding and meeting business needs and knowledge of risk analysis.
Career Prospects with Hadoop
Hadoop is mentioned in 612 of 83,122 job listings on Dice.com. Among the companies looking to hire Hadoop software engineers and big data scientists are AT&T Interactive, Sears, PayPal, AOL and Deloitte. Hadoop “is an emerging skill,” says Alice Hill, managing director of Dice.com. Hill says Hadoop is also a good skill for IT professionals with relational database management experience to pursue. “If you really understand data structure and queries, there’s going to be a lot of job opportunities,” she adds.