Different Types of Data File Formats
| Term | Definition |
|---|---|
| Comma-separated values (CSV) / Tab-separated values (TSV) | Commonly used format for storing tabular data as plain text where either the comma or the tab separates each value. |
| Data file types | A computer file configuration designed to store data in a specific way. |
| Data format | How data is encoded so it can be stored within a data file type. |
| Data visualization | A visual representation of data, such as a graph, presented in a readily understandable way that makes it easier to see trends in the data. |
| Delimited text file | A plain text file where a specific character separates the data values. |
| Extensible Markup Language (XML) | A language designed to structure, store, and enable data exchange between various technologies. |
| Hadoop | An open-source framework designed to store and process large datasets across clusters of computers. |
| JavaScript Object Notation (JSON) | A data format compatible with various programming languages for two applications to exchange structured data. |
| Jupyter notebooks | A computational environment that allows users to create and share documents containing code, equations, visualizations, and explanatory text. See Python notebooks. |
| Nearest neighbor | A machine learning algorithm that predicts a target variable based on its similarity to other values in the dataset. |
| Neural networks | A computational model used in deep learning that mimics the structure and functioning of the human brain’s neural pathways. It takes an input, processes it using previous learning, and produces an output. |
| Pandas | An open-source Python library that provides tools for working with structured data; it is often used for data manipulation and analysis. |
| Python notebooks | Also known as Jupyter notebooks, this computational environment allows users to create and share documents containing code, equations, visualizations, and explanatory text. |
| R | An open-source programming language used for statistical computing, data analysis, and data visualization. |
| Recommendation engine | A computer program that analyzes user input, such as behaviors or preferences, and makes personalized recommendations based on that analysis. |
| Regression | A statistical model that shows a relationship between one or more predictor variables and a response variable. |
| Tabular data | Data that is organized into rows and columns. |
| XLSX | The Microsoft Excel spreadsheet file format. |
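To make a few of these terms concrete, here is a minimal, hedged sketch that uses the Pandas library to read a CSV file of tabular data and re-encode it as JSON. The file name and columns are hypothetical.

```python
import pandas as pd

# Read a comma-separated values (CSV) file into a tabular DataFrame.
# "sales.csv" is a hypothetical file with columns such as date, region, revenue.
df = pd.read_csv("sales.csv")

# Inspect the first few rows of the tabular data.
print(df.head())

# Re-encode the same data in another data format (JSON) so a different
# application can consume it as structured data.
df.to_json("sales.json", orient="records")
```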
Cloud computing
Cloud computing is the delivery of on-demand computing resources over the Internet on a pay-for-use basis. Cloud computing is composed of five essential characteristics, three deployment models, and three service models. The five essential characteristics are on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. The three deployment models are public, private, and hybrid. The three service models correspond to the three layers in a computing stack (infrastructure, platform, and application): Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
In an IaaS model, you can access the infrastructure and physical computing resources such as servers, networking, storage, and data center space without the need to manage or operate them. In a PaaS model, you can access the platform that comprises the hardware and software tools usually needed to develop and deploy applications to users over the Internet. SaaS is a software licensing and delivery model in which software and applications are centrally hosted and licensed on a subscription basis; it is sometimes referred to as “on-demand software.”
Deep Learning & Machine Learning
| Term | Definition | Video where the term is introduced |
|---|---|---|
| Artificial Neural Networks | Collections of small computing units (neurons) that process data and learn to make decisions over time. | Artificial Intelligence and Data Science |
| Bayesian Analysis | A statistical technique that uses Bayes’ theorem to update probabilities based on new evidence. | Applications of Machine Learning |
| Business Insights | Accurate insights and reports generated by generative AI can be updated as data evolves, enhancing decision-making and uncovering hidden patterns. | Generative AI and Data Science |
| Cluster Analysis | The process of grouping similar data points together based on certain features or attributes. | Neural Networks and Deep Learning |
| Coding Automation | Using generative AI to automatically generate and test software code for constructing analytical models, freeing data scientists to focus on higher-level tasks. | Generative AI and Data Science |
| Data Mining | The process of automatically searching and analyzing data to discover patterns and insights that were previously unknown. | Artificial Intelligence and Data Science |
| Decision Trees | A type of machine learning algorithm used for decision-making by creating a tree-like structure of decisions. | Applications of Machine Learning |
| Deep Learning Models | Models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) that create new data instances by learning patterns from large datasets. | Generative AI and Data Science |
| Five V’s of Big Data | Characteristics used to describe big data: Velocity, volume, variety, veracity, and value. | Neural Networks and Deep Learning |
| Generative AI | A subset of AI that focuses on creating new data, such as images, music, text, or code, rather than just analyzing existing data. | Generative AI and Data Science |
| Market Basket Analysis | The analysis of which goods tend to be bought together; often used for marketing insights. | Neural Networks and Deep Learning |
| Naive Bayes | A simple probabilistic classification algorithm based on Bayes’ theorem. | Applications of Machine Learning |
| Natural Language Processing (NLP) | A field of AI that enables machines to understand, generate, and interact with human language, revolutionizing content creation and chatbots. | Generative AI and Data Science |
| Precision vs. Recall | Metrics used to evaluate the performance of classification models. | Applications of Machine Learning |
| Predictive Analytics | Using machine learning techniques to predict future outcomes or events. | Neural Networks and Deep Learning |
| Synthetic Data | Artificially generated data with properties similar to real data, used by data scientists to augment their datasets and improve model training. | Generative AI and Data Science |
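As a rough illustration of a few of these terms (Naive Bayes, precision vs. recall, synthetic data), the sketch below trains a classifier with scikit-learn on an artificially generated dataset; everything about the data is invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

# Synthetic data: 500 samples, 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Naive Bayes: a simple probabilistic classifier based on Bayes' theorem.
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision vs. recall: two metrics for evaluating a classification model.
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```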
Data Science Tools Breakdown
Data Management
- What it is: Storing, organizing, and retrieving data. Imagine data as clues for a detective; you need a place to keep them organized! (A minimal sketch follows this list.)
- Open-Source Options:
- MySQL & PostgreSQL: Organize data in tables like spreadsheets (relational databases).
- MongoDB: Flexible storage for various data formats (NoSQL database).
- Commercial Options (industry standard):
- Oracle Database, Microsoft SQL Server, IBM Db2 (relational databases).
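The relational idea behind the databases above (data organized in tables and queried with SQL) can be sketched with Python's built-in sqlite3 module; this is only an illustration, not one of the listed products, and the table and column names are invented.

```python
import sqlite3

# In-memory relational database, for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Organize data in a table, like a spreadsheet with typed columns.
cur.execute(
    "CREATE TABLE clues (id INTEGER PRIMARY KEY, description TEXT, found_at TEXT)"
)
cur.executemany(
    "INSERT INTO clues (description, found_at) VALUES (?, ?)",
    [("muddy footprint", "garden"), ("torn note", "library")],
)
conn.commit()

# Retrieve the stored rows with a SQL query.
for row in cur.execute("SELECT id, description, found_at FROM clues"):
    print(row)
conn.close()
```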
Data Integration & Transformation
- What it is: Streamlining data pipelines and automating data processing tasks. Think of it as cleaning and organizing the clues before you analyze them (see the sketch after this list).
- Open-Source Options:
- Apache Airflow: Automates arranging data for easier analysis.
- Apache Kafka: Moves data around quickly between different tools.
- Commercial Options:
- Informatica PowerCenter & IBM InfoSphere DataStage: Graphical user interface (GUI) for designing data processing pipelines.
- Watson Studio Desktop – Data Refinery: Defines data integration processes in a spreadsheet-like way.
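None of the pipeline tools above is required to see what an integration and transformation step does. The hypothetical pandas sketch below combines two small sources, fixes a data type, and fills a missing value.

```python
import pandas as pd

# Two hypothetical sources with overlapping records.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount": ["5.0", "7.5", None],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["east", "west"]})

# Transformation: coerce types and handle missing values.
orders["amount"] = pd.to_numeric(orders["amount"]).fillna(0.0)

# Integration: combine the two sources into one unified table.
combined = orders.merge(customers, on="customer_id", how="left")
print(combined)
```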
Data Visualization
- What it is: Creating graphical representations of data to understand it better. Imagine turning your detective work into pictures and charts for clear storytelling! (A short example follows this list.)
- Open-Source Options:
- Libraries requiring code: PixieDust (Python), Hue (SQL)
- Commercial Options (business intelligence – BI tools):
- Tableau, Microsoft Power BI, IBM Cognos Analytics: Create interactive dashboards to visualize data.
- Watson Studio Desktop: Offers data exploration and visualization functionalities.
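The BI tools above are GUI-driven, but the same idea can be sketched in a few lines of Python with the widely used Matplotlib library (not listed above); the data values here are invented.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

# A simple bar chart: one visual way to spot trends in tabular data.
plt.bar(months, sales)
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales (illustrative data)")
plt.show()
```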
Model Deployment
- What it is: Making machine learning models (think of them as detective hunches) usable by others. Imagine sharing your conclusions from the clues with a team. (A rough sketch follows this list.)
- Open-Source Options:
- Apache PredictionIO (limited model support)
- Seldon (supports various frameworks)
- MLeap (SparkML models)
- TensorFlow Serving (TensorFlow models)
- Commercial Options: Integrate model deployment into the model-building process (various vendors).
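As a rough stand-in for the serving frameworks above, the sketch below wraps a previously trained model in a small Flask web service so other applications can request predictions over HTTP. The endpoint, model file, and feature layout are hypothetical.

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Hypothetical: a model trained and saved earlier, e.g. with scikit-learn.
model = joblib.load("churn_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [0.1, 3, 42.0]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```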
Model Monitoring
- What it is: Tracking the performance of deployed models over time. Imagine checking whether your detective hunches are still accurate as new clues come in. (A brief sketch follows this list.)
- Open-Source Options:
- ModelDB (stores and queries model information)
- Prometheus (generic tool, also used for model monitoring)
- Commercial Options: Limited options; open source is currently preferred.
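Neither ModelDB nor Prometheus is shown here; the hedged sketch below only illustrates the underlying idea of tracking a deployed model's accuracy batch by batch, using invented labels and predictions.

```python
from sklearn.metrics import accuracy_score

# Hypothetical: predictions and true labels collected in weekly batches
# after the model was deployed.
weekly_batches = [
    {"week": "2024-W01", "y_true": [1, 0, 1, 1], "y_pred": [1, 0, 1, 0]},
    {"week": "2024-W02", "y_true": [0, 0, 1, 1], "y_pred": [1, 0, 0, 1]},
]

# Track accuracy over time; a sustained drop would trigger retraining.
for batch in weekly_batches:
    acc = accuracy_score(batch["y_true"], batch["y_pred"])
    print(batch["week"], f"accuracy={acc:.2f}")
```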
Code Asset Management
- What it is: Storing & managing code, tracking changes, and allowing collaboration. Imagine keeping track of different versions of your detective notes and sharing them with your partner.
- Open-Source Options:
- Git (version control system)
- GitHub (web-based platform to store and share code)
- Alternatives: GitLab, Bitbucket (similar functionalities)
Data Asset Management
- What it is: Organizing and managing data, access control, and backups. Think of keeping your detective work organized, with access permissions for your team, and backups in case something gets lost.
- Open-Source Options:
- Apache Atlas
- ODPi Egeria
- Kylo
- Commercial Options:
- Informatica Enterprise Data Governance
- IBM Cloud Pak for Data
Development Environments
- What it is: Workspaces for data scientists to write, test, and run code. Imagine your detective headquarters where you analyze the clues.
- Open-Source Options:
- Jupyter Notebook (interactive coding environment)
- Apache Zeppelin (inspired by Jupyter Notebook)
- RStudio (for R programming)
- Spyder (Python, similar to RStudio)
- Alternatives: Watson Studio Desktop (combines Jupyter Notebooks with graphical tools)
Execution Environments
- What it is: Provides computational resources to run data science code. Imagine having a powerful computer to analyze all the detective work. (A minimal PySpark sketch follows this list.)
- Open-Source Options:
- Apache Spark (large-scale data processing)
- Apache Flink (real-time data processing)
- Ray (large-scale deep learning model training)
- Alternatives: Watson Studio (fully integrated environment)
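A minimal PySpark sketch of the kind of distributed processing Apache Spark provides, assuming Spark is installed and a hypothetical events.csv file exists.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; in production this would run on a cluster.
spark = SparkSession.builder.appName("example").getOrCreate()

# "events.csv" is a hypothetical file; Spark splits the work across executors.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple aggregation distributed across the cluster.
df.groupBy("event_type").count().show()

spark.stop()
```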
Fully Integrated Visual Tools
- What it is: Tools with drag-and-drop functionalities for data science tasks. Imagine having a visual detective board to organize your clues and analysis.
- Open-Source Options:
- KNIME (with R & Python programming extension)
- Orange
Cloud vs Desktop Tools
- Cloud-based: Accessed through a web browser, convenient and easy to use from anywhere (becoming more popular). No need to install software on your own computer, and updates happen automatically. Great for collaboration, as multiple people can access and work on projects simultaneously. However, cloud-based tools may rely on a strong internet connection for smooth operation, and might have limitations on processing large datasets locally compared to desktop software.
- Desktop-based: Installed directly on your computer, offering more control over processing power and data privacy. Useful for working offline or with very large datasets. However, installation and updates require manual intervention, and collaboration might require additional file sharing steps.
Choosing the Right Programming Language for Data Science
The world of data science offers a variety of programming languages, each with its strengths and weaknesses. Selecting the right one depends on several factors:
- Your Needs: Are you focused on data analysis, machine learning, web development, or a combination?
- Problem Domain: What types of problems are you trying to solve? (e.g., Natural Language Processing, image recognition, financial modeling)
- Target Audience: Who are you building solutions for? (e.g., data analysts, business users, web applications)
Here’s a breakdown of some popular languages and their suitability for data science tasks:
Python:
- Pros: Widely used, beginner-friendly syntax, vast ecosystem of data science libraries (NumPy, Pandas, SciPy, Matplotlib)
- Cons: Can be slower for large-scale data processing compared to compiled languages
- Applications: Excellent for data analysis, machine learning, web development, natural language processing (NLTK library)
- Example: Building a machine learning model to predict customer churn using historical data and libraries like Pandas and Scikit-learn.
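A hedged sketch of the churn example above, assuming a hypothetical churn.csv with numeric feature columns and a binary "churned" column; logistic regression stands in for whatever model you might actually choose.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical historical data with a binary "churned" column.
df = pd.read_csv("churn.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Logistic regression is one common choice for churn prediction.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```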
R:
- Pros: Open-source, strong focus on statistics and data visualization, large community and extensive libraries for specific statistical analyses
- Cons: Can be challenging to learn for those without a strong statistical background, less versatile than Python for broader software development
- Applications: Primarily used for data analysis, statistics, and creating high-quality visualizations
- Example: Analyzing survey data to identify trends and relationships between variables using libraries like ggplot2.
SQL:
- Pros: Essential for interacting with relational databases, standardized language across various database platforms
- Cons: Not a general-purpose programming language, limited in data manipulation capabilities outside databases
- Applications: Extracting and manipulating data from relational databases for analysis and reporting
- Example: Writing SQL queries to retrieve customer information from a database for marketing campaigns.
Scala:
- Pros: Powerful and scalable, designed for large-scale data processing, integrates well with Apache Spark, a popular big data framework
- Cons: Steeper learning curve compared to Python or R, less common outside big data applications
- Applications: Big data processing, machine learning with large datasets using Apache Spark libraries like MLlib
- Example: Building a recommendation engine for a streaming service using Spark and Scala to analyze user behavior data.
Java:
- Pros: Mature and widely used language, large developer community, good for building enterprise-grade applications
- Cons: Can be verbose compared to other languages, not as popular as Python for data science specifically
- Applications: Building data pipelines and backend systems for data science projects, libraries like Weka and Apache Mahout for machine learning
- Example: Developing a data processing pipeline to clean and prepare data for analysis using Java libraries.
Julia:
- Pros: Relatively new language, designed for high performance and scientific computing, gaining traction in data science due to its speed
- Cons: Smaller community and library ecosystem compared to established languages
- Applications: Scientific computing, large-scale simulations, potential for future growth in data science
- Example: Using Julia for complex scientific calculations or simulations that require high performance.
JavaScript:
- Pros: Widely used for web development, some libraries allow for data science tasks within web applications
- Cons: Limited data science capabilities compared to dedicated languages, can be challenging for complex data analysis
- Applications: Building interactive data visualizations and dashboards for web applications using libraries like TensorFlow.js
- Example: Creating a real-time chart that displays stock price fluctuations using a JavaScript library.
Concepts
- Data Management: This involves organizing, storing, protecting, and retrieving data throughout its lifecycle. It ensures data accuracy, consistency, and accessibility.
- Examples: Implementing data governance policies, designing data storage solutions (databases, data warehouses), creating data backups and recovery plans.
- Data Integration & Transformation: This combines data from various sources into a unified format for analysis. It involves cleaning, transforming, and preparing data for use in models.
- Examples: Extracting data from different databases and APIs, transforming data formats (CSV to JSON), handling missing values and inconsistencies.
- Data Visualization: This translates data into visual representations like charts and graphs to communicate insights effectively.
- Examples: Creating bar charts to compare sales figures, using scatter plots to show relationships between variables, developing interactive dashboards for data exploration.
- Model Building: This involves designing and developing machine learning models that can learn from data and make predictions.
- Examples: Choosing an appropriate machine learning algorithm (e.g., linear regression, decision tree), training the model on historical data, evaluating the model’s performance on new data.
- Model Deployment: This makes the trained model accessible for real-world use. It involves integrating the model into applications or systems.
- Examples: Deploying a fraud detection model into a financial transaction system, making a product recommendation model available on an e-commerce website.
- Model Monitoring: This tracks the performance of deployed models over time and ensures they continue to function effectively.
- Examples: Monitoring the accuracy of a fraud detection model as new data becomes available, retraining the model if its performance degrades.
- Data Asset Management: This involves organizing, cataloging, securing, and controlling access to data assets within an organization.
- Examples: Creating data dictionaries to document data definitions, setting access permissions for different user groups, ensuring data security and compliance with regulations.
- Code Asset Management: This ensures code used for data analysis and machine learning is organized, version controlled, and accessible for collaboration.
- Examples: Using Git for version control of code, storing code in repositories like GitHub, documenting code for maintainability and collaboration.
- Execution Environments: These provide the computational resources to run data science code. They can be local machines, cloud platforms, or distributed computing clusters.
- Examples: Using cloud platforms like Google Colab or Amazon SageMaker for data processing and model training, leveraging high-performance computing clusters for large-scale data analysis.
- Development Environments: These offer workspaces for data scientists to write, test, and run code. They often integrate tools for data manipulation, visualization, and model building.
- Examples: Using Jupyter Notebooks for interactive coding and analysis, working in RStudio for statistical computing.



