Being a data scientist


I decided to write a series of articles that could be useful for those who are taking their first steps in data science.

I also use these articles as a summary and a quick reminder of how I did something in the past (we cannot remember everything).

So, here is my syllabus:

1 – Introduction to Data Science

  • Introduction to Data Science
    • What is a Data Scientist
  • Problems Solved by Data Science

2 – Data

  • Data wrangling (a.k.a. data munging)
  • Acquiring data
  • Normalization
  • Data Scrubbing
  • Missing Values
    • Easy Imputation
    • Impute using Linear Regression
    • Tip of the Imputation Iceberg
  • Common Data Formats: JSON, XML, CSV
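
As a small taste of the missing-values topics above, here is a minimal sketch of the two simplest approaches: filling gaps with the mean, and filling gaps with a value predicted by a straight line fitted to the complete rows. The function names `impute_mean` and `impute_linear` are my own for illustration, missing values are assumed to be stored as `None`, and only the Python standard library is used.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_linear(xs, ys):
    """Fill None entries in ys from a least-squares line fitted on complete (x, y) pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs to fit a line")
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sxx
    intercept = y_bar - slope * x_bar
    # Keep observed values, predict only where y is missing.
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]
```

Mean imputation is quick but flattens the variance of the column; the regression version uses a second variable to make a better guess, at the cost of assuming a linear relationship.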

3 – Databases

  • DB Basics
  • Relational Databases
  • SQL and NoSQL

4 – Programming

  • Languages for Data Science: Java, R, Python, Scala, MATLAB, Perl, Fortran, C++
  • Basics:
    • Expressions
    • Variables
    • Vectors
    • Matrices
    • Arrays
    • Factors
    • Lists
    • Regex
  • Reading raw data
  • Subsetting Data
  • Manipulate Data frames
  • Functions
  • Factor analysis
  • Algorithms

5 – Machine Learning

  • ML – Introduction
  • Approaches
    • Decision tree learning
    • Association rule learning
    • Artificial neural networks
    • Inductive logic programming
    • Support vector machines
    • Clustering
    • Bayesian networks
    • Reinforcement learning
    • Representation learning
    • Similarity and metric learning
    • Sparse dictionary learning
    • Genetic algorithms

6 – Big Data

  • Basics of MapReduce
  • Mapper
  • Reducer
  • MapReduce Ecosystem (Cloudera, Hortonworks, Pig, Hive)
  • HDFS
  • Mahout
  • Apache Spark
  • ZooKeeper, Avro
  • Cassandra, Impala, MongoDB
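
To give an idea of what the mapper and reducer items above refer to, here is a minimal single-machine sketch of the classic word-count job. It only imitates the MapReduce pattern in plain Python; it does not use Hadoop or any real cluster API, and the names `mapper`, `shuffle`, `reducer`, and `word_count` are my own choices for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

In a real framework the map and reduce steps run in parallel on many machines and the shuffle is handled for you; the logic of each step stays this simple.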

7 – Maths and Statistics

7.1 – Maths
  • Matrices and linear algebra fundamentals
  • Relational Algebra
  • Binary trees, Big O notation (e.g. O(n))
  • Dataframes and series
7.2 – Statistics
  • Descriptive Statistics (mean, median, range, SD, Var)
  • Histograms
  • Percentiles and Outliers
  • Probability Theory
  • Bayes' Theorem
  • Random variables
  • Probability Distributions (Normal/Gaussian, Poisson)
  • Skewness
  • ANOVA
  • Central limit theorem
  • Cost Function
    • How to Minimize Cost Function
    • Monte Carlo Method
  • p-value
  • Chi-squared Test
  • t-Test
  • Welch's t-Test
  • Non-Parametric Tests
  • Non-Normal Data
  • Prediction with Regression
  • Coefficient of Determination (R²)
  • Covariance
  • Correlation
  • Euclidean Distance
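
A few of the statistics items above fit in a handful of lines, so here is a minimal sketch using only the Python standard library. The helper names `describe` and `euclidean_distance` are hypothetical, chosen for this example, and the SD and variance are the sample versions (dividing by n - 1).

```python
import math
from statistics import mean, median, stdev, variance


def describe(values):
    """Return the basic descriptive statistics of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": stdev(values),      # sample standard deviation
        "var": variance(values),  # sample variance
    }


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For example, `euclidean_distance((0, 0), (3, 4))` gives the familiar 3-4-5 triangle result of 5.0.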

8 – Communication and Data Visualization

8.1 – Visualization
  • Effective Information Visualization
  • Visual Encodings
  • Perception of Visual Cues
  • Plotting in Python, R, D3
  • Reporting with BIRT, Tableau
  • Data Scales
  • Visualizing Time Series Data

8.2 – Communication
  • Engaging with senior management
  • Storytelling skills
  • Translating data-driven insights into decisions and actions
  • Visual art design

9 – Business Acumen

  • Project Management
  • Financial Knowledge
  • “Selling Things” Skills
  • Industry Knowledge
